digital vlsi design with verilog: a textbook from silicon valley technical institute

Digital VLSI Design with Verilog

John Williams

Digital VLSI Designwith Verilog

A Textbook from Silicon ValleyTechnical Institute

Foreword by Don Thomas

123

Dr. John WilliamsSVTI Inc.Silicon Valley Technical Institute1762 Technology DriveSan Jose CA 95110Suite [email protected]

ISBN: 978-1-4020-8445-4 e-ISBN: 978-1-4020-8446-1

Library of Congress Control Number: 2008925050

c© 2008 John Michael WilliamsAll rights reserved.No part of this work may be reproduced, stored in a retrieval system, or transmittedin any form or by any means, electronic, mechanical, photocopying, microfilming, recordingor otherwise, without written permission from the Publisher, with the exceptionof any material supplied specifically for the purpose of being enteredand executed on a computer system, for exclusive use by the purchaser of the work.

Design Compiler, Design Vision, Liberty, ModelSim, PrimeTime, QuestaSim, Silos, VCS, Verilog(capitalized), and VirSim are trademarks of their respective owners.

Printed on acid-free paper

9 8 7 6 5 4 3 2 1

springer.com

To my loving grandparents,William Joseph Young (ne Jung) andMary Elizabeth Young (nee Egan)who cared for my brother Kevin and mewhen they didn’t have to.

Foreword

Verilog and its usage has come a long way since its original invention in the mid-80sby Phil Moorby. At the time the average design size was around ten thousand gates,and simulation to validate the design was its primary usage. But between then andnow designs have increased dramatically in size, and automatic logic synthesis fromRTL has become the standard design flow for most design. Indeed, the language hasevolved and been re-standardized too.

Over the years, many books have been written about Verilog. My own, coauthoredwith Phil Moorby, had the goal of defining the language and its usage, providing ex-amples along the way. It has been updated with five new editions as the languageand its usage evolved.

However this new book takes a very different and unique view; that of thedesigner. John Michael Williams has a long history of working and teaching in thefield of IC and ASIC design. He brings an indepth presentation of Verilog and howto use it with logic synthesis tools; no other Verilog book has dealt with this topicas deeply as he has.

If you need to learn Verilog and get up to speed quickly to use it for synthesis,this book is for you. It is sectioned around a set of lessons including presentationand explanation of new concepts and approaches to design, along with lab sessions.The reader can follow these at his/her own rate. It is a very practical method to learnto use the language, one that many of us have probably followed ourselves. Namely,learn some basics, try them out, and continually move on to more advanced topicswhich will also be tried out. This book, based on the author’s experience in teachingVerilog, is well organized to support you in learning Verilog.

Pittsburgh, PA Don Thomas2008

vii

Preface

This book is based on the lab exercises and order of presentation of a coursedeveloped and given by the author over a period of years at Silicon Valley Tech-nical Institute, San Jose, California.

At the time, to the author’s best knowledge, this course was the only one evergiven which (a) presented the entire verilog language; (b) involved implementationof a full-duplex serdes simulation model; or (c) included design of a synthesizabledigital PLL.

The author wishes to thank the owner and CEO of Silicon Valley TechnicalInstitute, Dr. Ali Iranmanesh, for his patience and encouragement during the coursedevelopment and in the preparation of this book.

ix

Contents

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix1 Course Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix2 Using this Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xx

2.1 Contents of the CD-ROM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xx2.2 Performing the Lab Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . xx2.3 Proprietary Information and Licensing Limitations . . . . . . . . xxi

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxi

1 Week 1 Class 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Introductory Lab 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1.1 Lab 1 Postmortem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111.2 Verilog Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131.3 Operator Lab 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

1.3.1 Lab Postmortem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171.4 First-Day Wrapup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

1.4.1 VCD File Dump . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171.4.2 The Importance of Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 181.4.3 SDF File Dump. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181.4.4 Additional Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2 Week 1 Class 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212.1 More Language Constructs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212.2 Parameter and Conversion Lab 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

2.2.1 Lab Postmortem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302.3 Procedural Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

2.3.1 Procedural Control in Verilog . . . . . . . . . . . . . . . . . . . . . . . . . . 302.3.2 Combinational and Sequential Logic . . . . . . . . . . . . . . . . . . . . 312.3.3 Verilog Strings and Messages . . . . . . . . . . . . . . . . . . . . . . . . . . 332.3.4 Shift Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352.3.5 Reconvergence Design Note . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

2.4 Nonblocking Control Lab 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372.4.1 Lab Postmortem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412.4.2 Additional Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

xi

xii Contents

3 Week 2 Class 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433.1 Net Types, Simulation, and Scan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.1.1 Variables and Constants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433.1.2 Identifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443.1.3 Concurrent vs. Procedural Blocks . . . . . . . . . . . . . . . . . . . . . . 443.1.4 Miscellaneous Other Verilog Features . . . . . . . . . . . . . . . . . . . 453.1.5 Backus-Naur Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453.1.6 Verilog Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463.1.7 Modelling Sequential Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . 483.1.8 Design for Test (DFT): Scan Lab Introduction . . . . . . . . . . . . 50

3.2 Simple Scan Lab 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533.2.1 Lab Postmortem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593.2.2 Additional Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4 Week 2 Class 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614.1 PLLs and the SerDes Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

4.1.1 Phase-Locked Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614.1.2 A 1 × Digital PLL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614.1.3 Introduction to SerDes and PCI Express . . . . . . . . . . . . . . . . . 674.1.4 The SerDes of this Course . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 694.1.5 A 32 × Digital PLL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

4.2 PLL Clock Lab 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 714.2.1 Lab Postmortem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 814.2.2 Additional Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

5 Week 3 Class 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 835.1 Data Storage and Verilog Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

5.1.1 Memory: Hardware and Software Description . . . . . . . . . . . . 835.1.2 Verilog Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 845.1.3 A Simple RAM Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 875.1.4 Verilog Concatenation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 875.1.5 Memory Data Integrity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 885.1.6 Error Checking and Correcting (ECC) . . . . . . . . . . . . . . . . . . . 905.1.7 Parity for SerDes Frame Boundaries . . . . . . . . . . . . . . . . . . . . 93

5.2 Memory Lab 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 955.2.1 Lab Postmortem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 995.2.2 Additional Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

6 Week 3 Class 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1016.1 Counter Types and Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

6.1.1 Introduction to Counters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1016.1.2 Terminology: Behavioral, Procedural, RTL, Structural . . . . . 1026.1.3 Adder Expression vs. Counter Statement . . . . . . . . . . . . . . . . . 1046.1.4 Counter Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

Contents xiii

6.2 Counter Lab 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1086.2.1 Lab Postmortem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1116.2.2 Additional Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

7 Week 4 Class 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1137.1 Contention and Operator Precedence . . . . . . . . . . . . . . . . . . . . . . . . . . 113

7.1.1 Verilog Net Types and Strengths . . . . . . . . . . . . . . . . . . . . . . . . 1137.1.2 Race Conditions, Again . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1167.1.3 Unknowns in Relational Expressions . . . . . . . . . . . . . . . . . . . . 1197.1.4 Verilog Operators and Precedence . . . . . . . . . . . . . . . . . . . . . . 120

7.2 Digital Basics: Three-State Buffer and Decoder . . . . . . . . . . . . . . . . . 1227.3 Strength and Contention Lab 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

7.3.1 Strength Lab postmortem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1297.4 Back to the PLL and the SerDes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

7.4.1 Named Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1297.4.2 The PLL in a SerDes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1307.4.3 The SerDes Packet Format Revisited . . . . . . . . . . . . . . . . . . . . 1317.4.4 Behavioral PLL Synchronization (language digression) . . . . 1327.4.5 Synthesis of Behavioral Code . . . . . . . . . . . . . . . . . . . . . . . . . . 1407.4.6 Synthesizable, Pattern-Based PLL Synchronization . . . . . . . . 140

7.5 PLL Behavioral Lock-In Lab 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1417.5.1 Lock-in Lab Postmortem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1447.5.2 Additional Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

8 Week 4 Class 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1458.1 State Machine and FIFO design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

8.1.1 Verilog Tasks and Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 1458.1.2 A Function for Synthesizable PLL Synchronization . . . . . . . 1488.1.3 Concurrency by fork-join . . . . . . . . . . . . . . . . . . . . . . . . . . 1498.1.4 Verilog State Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1508.1.5 FIFO Functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1518.1.6 FIFO Operational Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1548.1.7 A Verilog FIFO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158

8.2 FIFO Lab 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1648.2.1 Lab Postmortem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1678.2.2 Additional Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168

9 Week 5 Class 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1699.1 Rise-Fall Delays and Event Scheduling . . . . . . . . . . . . . . . . . . . . . . . . 169

9.1.1 Types of Delay Expression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1699.1.2 Verilog Simulation Event Queue . . . . . . . . . . . . . . . . . . . . . . . 1729.1.3 Simple Stratified Queue Example . . . . . . . . . . . . . . . . . . . . . . . 1749.1.4 Event Controls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1779.1.5 Event Queue Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

xiv Contents

9.2 Scheduling Lab 12. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1799.2.1 Lab Postmortem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1849.2.2 Additional Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184

10 Week 5 Class 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18510.1 Built-in Gates and Net Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185

10.1.1 Verilog Built-in Gates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18510.1.2 Implied Wire Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18610.1.3 Net Types and their Default . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18610.1.4 Structural Use of Wire vs. Reg . . . . . . . . . . . . . . . . . . . . . . . . . 18710.1.5 Port and Parameter Syntax Note . . . . . . . . . . . . . . . . . . . . . . . . 18810.1.6 A D Flip-flop from SR Latches . . . . . . . . . . . . . . . . . . . . . . . . . 189

10.2 Netlist Lab 13 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19210.2.1 Lab Postmortem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19410.2.2 Additional Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194

11 Week 6 Class 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19511.1 Procedural Control and Concurrency . . . . . . . . . . . . . . . . . . . . . . . . . . 195

11.1.1 Verilog Procedural Control Statements . . . . . . . . . . . . . . . . . . 19511.1.2 Verilog case Variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19911.1.3 Procedural Concurrency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20211.1.4 Verilog Name Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204

11.2 Concurrency Lab 14 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20711.2.1 Lab Postmortem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20911.2.2 Additional Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209

12 Week 6 Class 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21112.1 Hierarchical Names and generate Blocks . . . . . . . . . . . . . . . . . . . . 211

12.1.1 Hierarchical Name Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21112.1.2 Verilog Arrayed Instances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21312.1.3 generate Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21412.1.4 Conditional Macroes and Conditional generates . . . . . . . . 21412.1.5 Looping Generate Statements . . . . . . . . . . . . . . . . . . . . . . . . . . 21612.1.6 generate Blocks and Instance Names . . . . . . . . . . . . . . . . . 21612.1.7 A Decoding Tree with Generate . . . . . . . . . . . . . . . . . . . . . . . . 22012.1.8 Scope of the generate Loop . . . . . . . . . . . . . . . . . . . . . . . . . 224

12.2 Generate Lab 15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22412.2.1 Lab Postmortem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22912.2.2 Additional Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230

13 Week 7 Class 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23113.1 Serial-Parallel Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231

13.1.1 Simple Serial-Parallel Converter . . . . . . . . . . . . . . . . . . . . . . . . 23113.1.2 Deserialization by Function and Task . . . . . . . . . . . . . . . . 232

13.2 Lab Preface: The Deserialization Decoder . . . . . . . . . . . . . . . . . . . . . . 23413.2.1 Some Deserializer Redesign – An Early ECO . . . . . . . . . . . . 23613.2.2 A Partitioning Question . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237

Contents xv

13.3 Serial-Parallel Lab 16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23813.3.1 Lab Postmortem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24213.3.2 Additional Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242

14 Week 7 Class 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24314.1 UDPs, Timing Triplets, and Switch-level Models . . . . . . . . . . . . . . . . 243

14.1.1 User-Defined Primitives (UDPs) . . . . . . . . . . . . . . . . . . . . . . . . 24314.1.2 Delay Pessimism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24614.1.3 Gate-Level Timing Triplets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24714.1.4 Switch-Level Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24914.1.5 Switch-Level Net: The Trireg . . . . . . . . . . . . . . . . . . . . . . . . 253

14.2 Component Lab 17 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25414.2.1 Lab Postmortem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25714.2.2 Additional Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258

15 Week 8 Class 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25915.1 Parameter Types and Module Connection . . . . . . . . . . . . . . . . . . . . . . 259

15.1.1 Summary of Parameter Characteristics . . . . . . . . . . . . . . . . . . 25915.1.2 ANSI Header Declaration Format . . . . . . . . . . . . . . . . . . . . . . 25915.1.3 Traditional Header Declaration Format . . . . . . . . . . . . . . . . . . 26015.1.4 Instantiation Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26015.1.5 ANSI Port and Parameter Options . . . . . . . . . . . . . . . . . . . . . . 26115.1.6 Traditional Module Header Format and Options . . . . . . . . . . . 26115.1.7 Defparam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262

15.2 Connection Lab 18 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26315.2.1 Connection Lab Postmortem . . . . . . . . . . . . . . . . . . . . . . . . . . . 267

15.3 Hierarchical Names and Design Partitions . . . . . . . . . . . . . . . . . . . . . . 26815.3.1 Hierarchical Name References . . . . . . . . . . . . . . . . . . . . . . . . . 26815.3.2 Scope of Declarations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26815.3.3 Design Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26915.3.4 Synchronization Across Clock Domains . . . . . . . . . . . . . . . . . 271

15.4 Hierarchy Lab 19 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27315.4.1 Lab Postmortem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27615.4.2 Additional Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277

16 Week 8 Class 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27916.1 Verilog Configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279

16.1.1 Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27916.1.2 Verilog Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279

16.2 Timing Arcs and specify Delays . . . . . . . . . . . . . . . . . . . . . . . . . . . 28116.2.1 Arcs and Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28116.2.2 Distributed and Lumped Delays . . . . . . . . . . . . . . . . . . . . . . . . 28216.2.3 specify Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28516.2.4 specparams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28616.2.5 Parallel vs. Full Path Delays . . . . . . . . . . . . . . . . . . . . . . . . . . . 28716.2.6 Conditional and Edge-Dependent Delays . . . . . . . . . . . . . . . . 288

xvi Contents

16.2.7 Conflicts of specify with Other Delays . . . . . . . . . . . . . . . . 28916.2.8 Conflicts Among specify Delays . . . . . . . . . . . . . . . . . . . . . 289

16.3 Timing Lab 20 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28916.3.1 Lab Postmortem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29316.3.2 Additional Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293

17 Week 9 Class 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29517.1 Timing Checks and Pulse Controls . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295

17.1.1 Timing Checks and Assertions . . . . . . . . . . . . . . . . . . . . . . . . . 29517.1.2 Timing Check Rationale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29617.1.3 The Twelve Verilog Timing Checks . . . . . . . . . . . . . . . . . . . . . 29717.1.4 Negative Time Limits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30017.1.5 Timing Check Conditioned Events . . . . . . . . . . . . . . . . . . . . . . 30217.1.6 Timing Check Notifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30217.1.7 Pulse Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30317.1.8 Improved Pessimism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30517.1.9 Miscellaneous time-Related Types . . . . . . . . . . . . . . . . . . . . . . 305

17.2 Timing Check Lab 21 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30617.2.1 Additional Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310

18 Week 9 Class 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31118.1 The Sequential Deserializer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31118.2 PLL Redesign . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312

18.2.1 Improved VFO Clock Sampler . . . . . . . . . . . . . . . . . . . . . . . . . 31318.2.2 Synthesizable Variable-Frequency Oscillator . . . . . . . . . . . . . 31418.2.3 Synthesizable Frequency Comparator . . . . . . . . . . . . . . . . . . . 31618.2.4 Modifications for a 400 MHz 1 ××× PLL . . . . . . . . . . . . . . . . . . 31818.2.5 Wrapper Modules for Portability . . . . . . . . . . . . . . . . . . . . . . . 321

18.3 Sequential Deserializer I Lab 22 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32218.3.1 Lab Postmortem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33518.3.2 Additional Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335

19 Week 10 Class 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33719.1 The Concurrent Deserializer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337

19.1.1 Dual-porting the Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33819.1.2 Dual-clocking the FIFO State Machine . . . . . . . . . . . . . . . . . . 33819.1.3 Upgrading the FIFO for Synthesis . . . . . . . . . . . . . . . . . . . . . . 33819.1.4 Upgrading the Deserialization Decoder for Synthesis . . . . . . 339

19.2 Concurrent Deserializer II Lab 23 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33919.2.1 Lab Postmortem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36019.2.2 Additional Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360

20 Week 10 Class 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36120.1 The Serializer and The SerDes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361

20.1.1 The SerEncoder Module . . . . . . . . . . . . . . . . . . . . . . . . . . . 362

Contents xvii

20.1.2 The SerialTx Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36220.1.3 The SerDes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362

20.2 SerDes Lab 24 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36220.2.1 Lab Postmortem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37320.2.2 Additional Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373

21 Week 11 Class 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37521.1 Design for Test (DFT) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375

21.1.1 Design for Test Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 37521.1.2 Assertions and Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37621.1.3 Observability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37621.1.4 Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37721.1.5 Corner-Case vs. Exhaustive Testing . . . . . . . . . . . . . . . . . . . . . 37821.1.6 Boundary Scan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37921.1.7 Internal Scan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38021.1.8 BIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382

21.2 Scan and BIST Lab 25 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38321.2.1 Lab postmortem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392

21.3 DFT for a Full-Duplex SerDes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39221.3.1 Full-Duplex SerDes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39221.3.2 Adding Test Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393

21.4 Tested SerDes Lab 26 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39321.4.1 Lab Postmortem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40321.4.2 Additional Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403

22 Week 11 Class 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40522.1 SDF Back-Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405

22.1.1 Back-Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40522.1.2 SDF Files in Verilog Design Flow . . . . . . . . . . . . . . . . . . . . . . 40522.1.3 Verilog Simulation Back-Annotation . . . . . . . . . . . . . . . . . . . . 406

22.2 SDF Lab 27 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40722.2.1 Lab Postmortem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41122.2.2 Additional Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411

23 Week 12 Class 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41323.1 Wrap-up: The Verilog Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413

23.1.1 Verilog-1995 vs. 2001 (or 2005) Differences . . . . . . . . . . . . . 41323.1.2 Verilog Synthesizable Subset Review . . . . . . . . . . . . . . . . . . . 41323.1.3 Constructs Not Exercised in this Course . . . . . . . . . . . . . . . . . 41423.1.4 List of all Verilog System Tasks and Functions . . . . . . . . . . . . 41523.1.5 List of all Verilog Compiler Directives . . . . . . . . . . . . . . . . . . 41723.1.6 Verilog PLI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417

23.2 Continued Lab Work (Lab 23 or later) . . . . . . . . . . . . . . . . . . . . . . . . . 41823.2.1 Additional Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418

xviii Contents

24 Week 12 Class 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42124.1 Deep-Submicron Problems and Verification . . . . . . . . . . . . . . . . . . . . . 421

24.1.1 Deep Submicron Design Problems . . . . . . . . . . . . . . . . . . . . . . 42124.1.2 The Bigger Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42424.1.3 Modern Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42424.1.4 Formal Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42524.1.5 Nonlogical Factors on the Chip . . . . . . . . . . . . . . . . . . . . . . . . 42624.1.6 System Verilog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427

24.2 Continued Lab Work (Lab 23 or later) . . . . . . . . . . . . . . . . . . . . . . . . . 42824.2.1 Additional Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429

Introduction

1 Course Description

This book may be used as a combined textbook and workbook for a 12-week,2-day/week interactive course of study. The book is divided into chapters, eachchapter being named for the week and day in the anticipated course schedule.

The course was developed for attendees with a bachelor’s degree in electricalengineering, or the equivalent, and with digital design experience. In addition to thiskind of background, an attendee was assumed familiar with software programmingin a modern language such as C.

Someone fulfilling the course requirements would expect to spend approximately12 hours per week for 12 weeks to learn the content and do all the labs. Of course,now that it is a book, a reader can expect to be able to proceed more at his orher own pace; therefore, the required preparation can be more flexible than for theprogrammed classroom presentation.

Topic List (partial):Discussion: Modules and hierarchy; Blocking/nonblocking assignment;

Combinational logic; Sequential logic; Behavioral modelling; RTL mod-elling; Gate-level modelling; Hardware timing and delays; Verilog parame-ters; Basic system tasks; Timing checks; Generate statement; Simulation eventscheduling; Race conditions; Synthesizer operation; Synthesizable constructs;Netlist optimization; Synthesis control directives; Verilog influence on opti-mization; Use of SDF files; Test structures; Error correction basics.

Lab Projects: Shift and scan registers; counters; memory and FIFO mod-els; digital phase-locked loop (PLL); serial-parallel (and υ-υ) converter;serializer-deserializer (serdes); primitive gates; switch-level design; netlistback-annotation.

xix

xx Introduction

2 Using this Book

The reader is encouraged to read the chapters in order but always to assume thattopics may be covered in a multiple-pass approach: Often, a new idea or languagefeature will be mentioned briefly and explained only incompletely; later, perhapsmany pages later, the presentation will return to the idea or feature to fill in details.

Each chapter ends with supplementary readings from two highly recommendedworks, textbooks by Thomas and Moorby and by Palnitkar. When a concept remainsdifficult, after discussion and even a lab exercise, it is possible that these, or other,publications listed in the References may help.

2.1 Contents of the CD-ROM

The CD-ROM contains problem files and complete solutions to all the lab exercisesin the book. A redundant backup of everything is stored on the CD-ROM in a tarfile, for easy copying to disc in a Linux or Unix working environment.

Be sure to read the ReadMe.txt file on the CD-ROM before using it.The misc directory on the CD-ROM contains an include file required for Lab 1,

plus some other side files. It contains PDF instructions for basic operation of theVCS or QuestSim simulators.

The misc directory also contains nonproprietary verilog library files, writtenby the author, which allow approximately correct simulation of a verilog netlist.These netlist models are not correct for design work in TSMC libraries, but theywill permit simulation for training purposes. DO NOT USE THESE VERILOGLIBRARIES FOR DESIGN WORK. If you are designing for a TSMC tape-out,use the TSMC libraries provided by Synopsys, with properly back-annotated timinginformation.

2.2 Performing the Lab Exercises

The book contains step-by-step lab instructions. Start by setting up a working en-vironment in a new directory and copying everything on the CD-ROM into it, pre-serving directory structure.

The reader must have access to simulation software to simulate any of the ver-ilog source provided in the book or the lab solutions. Readers without access toEDA tools can do almost all the source verilog lab simulations (except the serdesproject) with the demo version of the Silos simulator delivered on CD-ROM withthe Thomas and Moorby or Palnitkar texts. A more functional simulator is availablefrom Aldec (student version). Verilog simulators also often are supplied free withFPGA hardware programming kits. Netlist simulations of designs based on ASIClibraries generally will require higher-capacity tools such as VCS or QuestaSim.

The Synopsys Design Compiler synthesizer and VCS simulator should be usedfor best performance of the labs. Almost all verilog simulator waveform displays

Introduction xxi

were created using the Synopsys VCS simulator, which is designed for large, ASIC-oriented designs.

Keep in mind that all verilog simulators are incomplete works: Different simu-lators emphasize different features, but all have some verilog language limitations.Attempts have been made in the text to guide the user through the most importantof these limitations.

For the professional reader’s interest, the CD-ROM labs in this edition were per-formed using Synopsys Design Compiler (Z-2007.03 SP2) and VCS (MX 2007) ona 1 GHz x86 machine with 384 megs of RAM, running Red Hat Enterprise Linux 3.TSMC 90-nm front-end libraries (typical PVT) from Synopsys were used forsynthesis.

2.3 Proprietary Information and Licensing Limitations

Publishing of operational performance details of VCS, Design Compiler, or Ques-taSim may require written permission from Synopsys or Mentor, and readers usingthe recommended EDA tools to perform the labs in this book are advised not tocopy, duplicate, post, or publish any of their work showing specific tool propertieswithout verifying first with the manufacturer that trade secrets and other proprietaryinformation have been removed. This is a licensing issue unrelated to copyright, fairuse, or patent ownership. Know your tools’ license provisions!

Also, the same applies to the TSMC library files which may be available tolicensed readers. The front-end libraries are designed for synthesis, floorplanning,and timing verification only, but they may contain trade secrets of TSMC. Do notmake copies of anything from the TSMC libraries, including TSMC documentation,without special permission from TSMC and Synopsys.

The verilog simulation library files delivered on the accompanying CD-ROMresemble those in tool releases; however, they are not produced by Synopsys orTSMC. These files should be considered copyrighted but not otherwise proprietary;they are available for copying, study, or modification by the purchaser of this book,for individual learning purposes.

Verilog netlists produced by the synthesizer are not proprietary, although the Lib-erty models used by the synthesizer are proprietary and owned by Synopsys. Spe-cific details of synthesized netlist quality may be considered proprietary and shouldnot be published or distributed without special permission from Synopsys.

References

(the date after a link is the latest date the link was used by the author)Accellera Organization. System Verilog Language Reference Manual v. 3.1a. Draft standard avail-

able free for download from Accellera web site at http://www.accellera.org/home.Anonymous. “Design for Test (DFT)”, Chapter 3 of The NASA ASIC Guide: Assuring

ASICs for Space. http://klabs.org/DEI/References/design guidelines/content/guides/nasa asic guide/Sect.3.3.html (updated 2003-12-30).

xxii Introduction

Anonymous. SerDes Transceivers. Freescale Semiconductor, Inc. http://www.freescale.com/webapp/sps/site/overview.jsp?nodeId=01HGpJ2350NbkQ (2004-11-16).

Barrett, C. (Ed.) Fractional/Integer-N PLL Basics. Texas Instruments Technical Brief SWRA029,August, 1999. http://focus.ti.com/lit/an/swra029/swra029.pdf (2004-12-09).

Bertrand, R. “The Basics of PLL Frequency Synthesis”, in the Online Radio and ElectronicsCourse. http://www.radioelectronicschool.com/reading/pll.pdf (2004-12-09).

Bhasker, J. A Verilog HDL Primer (3rd ed.). Allentown, Pennsylvania: Star Galaxy Publishing,2005.

Cipra, B. A. “The Ubiquitous Reed Solomon Codes”. SIAM News, 26(1), 1993. http://www.eccpage.com/reed solomon codes.html (2007-09-18).

Cummings, C. E. “Simulation and Synthesis Techniques for Asynchronous FIFO Design” (rev.1.1). Originally presented at San Jose, California: The Synopsys Users Group Conference,2002. http://www.sunburst-design.com/papers/CummingsSNUG2002SJFIFO1 rev1 1.pdf (2004-11-22).

Cummings, C. E. and Alfke, P. “Simulation and Synthesis Techniques for AsynchronousFIFO Design with Asynchronous Pointer Comparisons” (rev. 1.1). Originally presented atSan Jose, CA: The Synopsys Users Group Conference, 2002. http://www.sunburst-design.com/papers/CummingsSNUG2002SJ FIFO2 rev1 1.pdf (2004-11-22).

IEEE Std 1364-2005. Verilog Hardware Description Language. Piscataway, New Jersey: The IEEEComputer Society, 2005. Revised in 2001 and reaffirmed, with some System Verilog addedcompatibility, in 2005. If you plan to do any serious work in verilog, you should have a copy ofthe standard. It is not only normative, but it includes numerous examples and explanatory notesconcerning every detail of the language. In this text, we refer to “verilog 2001” syntax where itis the same as in the 2005 standard.

Keating, M., et al. Low Power Methodology Manual for System-on-Chip Design. Springer Scienceand Business Solutions, 2007. Available from Synopsys as a free PDF for personal use only:http://www.synopsys.com/lpmm (2007-11-06).

Knowlton, S. “Understanding the Fundamentals of PCI Express”. Synopsys White Paper,2007. Available at the Technical Bulletin, Technical Papers page at http://www.synopsys.com/products/designware/dwtb/dwtb.php. Free registration andlogin (2007-10-17).

Koeter, J. What’s an LFSR?, at http://focus.ti.com/lit/an/scta036a/scta036a.pdf (2007-01-29).

Mead, C. and Conway, L. Introduction to VLSI Systems. Menlo Park, CA: Addison-Wesley,1980. Excellent but old introduction to switch-level reality and digital transistor design andfabrication.

Palnitkar, S. Verilog HDL (2nd ed.). Palo Alto, CA: Sun Microsystems Press, 2003. A good ba-sic textbook useful for supplementary perspective. Also includes a demo version of the Silossimulator on CD-ROM. Our daily Additional Study recommendations include many optionalreadings and exercises from this book.

Seat, S. “Gearing Up Serdes for High-Speed Operation”, posted at http://www.commsdesign.com/design corner/showArticle.jhtml?articleID=16504769(2004-11-16).

Suckow, E. H. “Basics of High-Performance SerDes Design: Part I” at http://www.analogzone.com/iot 0414.pdf; and, Part II at http://www.analogzone.com/iot 0428.pdf (2004-11-16).

Sutherland, S. “The IEEE Verilog 1364-2001 Standard: What’s New, and WhyYou Need it”. Based on an HDLCon 2000 presentation. http://www.sutherland-hdl.com/papers/2000-HDLCon-paper Verilog-2000.pdf(2005-02-03).

Thomas, D. E. and Moorby, P. R. The Verilog Hardware Description Language (5th ed.).New York: Springer, 2002. A very good textbook which was used in the past as the textbook

Introduction xxiii

for this course. Includes a demo version of the Silos simulator on CD-ROM. Our AdditionalStudy recommendations include many optional readings and exercises from this book

Wallace, H. “Error Detection and Correction Using the BCH Code” (2001). http://www.aqdi.com/bch.pdf (2007-10-04).

Wang, D. T. “Error Correcting Memory - Part I”. http://www.realworldtech.com/page.cfm?ArticleID=RWT121603153445&p=1 (2004-12-15: there doesn’t seem tobe any Part II).

Weste, N. and Eshraghian, K. Principles of CMOS VLSI Design: A Systems Perspective. MenloPark, CA: Addison-Wesley, 1985. Old, but overlaps and picks up where Mead and Conwayleave off, especially on the CMOS technology per se.

Zarrineh, K., Upadhyaya, S. J., and Chickermane, V. “System-on-Chip Testability Using LSSDScan Structures”, IEEE Design & Test of Computers, May–June 2001 issue, pp. 83–97.

Ziegler, J. F. and Puchner, H. SER – History, Trends, and Challenges. San Jose,Cypress Semiconductor Corporation, 2004. (stock number 1-0704SERMAN). Contact:[email protected].

Zimmer, P. “Working with PLLs in PrimeTime – avoiding the ‘phase locked oops”’. Draftedfor a Synopsys User Group presentation in 2005. Downloadable at http://www.zimmerdesignservices.com (2007-04-12).

Recommended Interactive Verilog Reference Guide

Evita Verilog Tutorial. Available in MS Windows environment. This is a freedownload for individual use. Sponsored by Aldec at http://www.aldec.com/Downloads. The language level is verilog-1995 only.

This interactive tutorial includes animations, true-false quizzes, language refer-ence, search utility, and a few minor bugs and content errors. It is very easy to runand may be useful especially for understanding the content of the first few Chaptersof the present book from a different perspective.

Chapter 1Week 1 Class 1

1.1 Introductory Lab 1

Lab ProcedureThis is a stereotyped, automated lab. Just do everything by rote; the lab will beexplained in a postmortem lecture which will follow it immediately. Verilog sourcecode files have a .v extension. The files in the design have been listed in the textfile, Intro Top.vcs, for your convenience in using the VCS simulator.

An include file, Extras.inc, provided in the CD-ROM misc directory, willhave to be referenced properly in the TestBench.v file or copied to a loca-tion usable by the simulator of your choice. To do this, you may will have toedit TestBench.v, changing the path in ‘include "../../VCS/Extras.inc". Notice that there is no semicolon at the end of a ‘include. If you com-ment out the ‘include, the simulation will work, but no VCD file will be createdduring simulation.

Step 1. In your Lab01 directory, use a text editor to look at the top level of thedesign in Intro Top.v. The top level structure may be represented schematicallyas in Fig. 1.1.

Fig. 1.1 Schematic representation of Intro Top.v. The blocks are labelled with their modulenames

J. Williams, Digital VLSI Design with Verilog, 1c© Springer Science+Business Media B.V. 2008

2 1 Week 1 Class 1

In the verilog file Intro Top.v, documentation and code comments begin with“//”. A copy of the contents is reproduced below for your convenience.

Notice that almost everything is contained in a “module”. The top mod-ule in Intro Top.v includes port declarations (I/O’s), wire declarations, oneassignment statement (a continuous assignment wire-connection statement), andthree component instances. This module, combined with the component instances,represents the structure of the design.

// ===========================================================// Intro Top: Top level of a simple design using verilog// continuous assignment statements.//// This module contains the top structure of the design, which// is made up of three lower-level modules and one inverter gate.// The structure is represented by module instances.//// All ports are wire types, because this is the default; also,// there is no storage of state in combinational statements.//// ANSI module header.// ------------------------------------------------------------// 2004-11-25 jmw: v. 1.0 implemented.// ============================================================module Intro Top (output X, Y, Z, input A, B, C, D);//wire ab, bc, q, qn; // Wires for internal connectivity.//// Implied wires may be assumed in this combinational// design, when connecting declared ports to instance ports.// The #1 is a delay time, in ‘timescale units://

assign #1 Z =∼qn; // Inverter by continuous assignment statement.

//AndOr InputCombo01 (.X(ab), .Y(bc), .A(A), .B(B), .C(C));

SR SRLatch01 (.Q(q), .Qn(qn), .S(bc), _R(D));XorNor OutputCombo01 (.X(X), .Y(Y), .A(ab), .B(q), .C(qn));//

endmodule // Intro Top.

Step 2. Look at the verilog which will be used to run logic simulation on thisdesign; it is in TestBench.v. The contents of this file are reproduced below. Anequivalent schematic is given in Fig. 1.2.

The verilog simulator will apply stimuli to the inputs on the left, and it will readresults from the outputs on the right. The testbench module (in TestBench.v)will be omitted when synthesizing the design, because it is not part of thedesign.

1.1 Introductory Lab 1 3

Fig. 1.2 Schematic view of the Lab01 TestBench module

// ===========================================================// TestBench: Simulation driver module (stimulus block)// for the top-level block instance of Intro Top.//// This module includes an initial block which assigns various// values to top-level inputs for simulation. initial blocks// are ignored in logic synthesis.//// No module port declaration.// ------------------------------------------------------------// 2004-11-25 jmw: v. 1.0 implemented.// ============================================================//‘timescale 1 ns/100ps // No semicolon after ‘anything.//module TestBench; // Stimulus blocks have no port.//wire Xwatch, Ywatch, Zwatch; // To connect to design instance.reg Astim, Bstim, Cstim, Dstim; // To accept initialization.

//initialbegin//// Each ‘#’ precedes a delay time increment, here in 1 ns units://#1 Astim = 1’b0; // For Astim, 1 bit, representing a binary 0.#1 Bstim = 1’b0;#1 Cstim = 1’b0;(other stimuli omitted here)#50 Dstim = 1’b1;#50 Astim = 1’b0;#50 Cstim = 1’b0;#50 Dstim = 1’b0;#50 $finish; // Terminates simulation 50 ns after the last stimulus.end // No semicolon after end.

//// The instance of the design is named Topper01, and its// ports are associated by name with stimulus input and simulation// output wires://Intro Top Topper01 ( .X(Xwatch), .Y(Ywatch), .Z(Zwatch)

, .A(Astim), .B(Bstim), .C(Cstim), .D(Dstim));

//endmodule // TestBench.

4 1 Week 1 Class 1

Notice the use of two different keyboard characters which may appear verysimilar:

• In the timescale specifier, ‘timescale, the ‘‘’ is a backquote. This char-acter also is used in macroes and compiler directives: ‘define, ‘include,‘ifdef, ‘else, ‘endif, etc.

• In the width specifier for literal constants, 1’b1 etc., the ‘’’ is a single-quote.

Step 3. Look briefly at the verilog for the three other modules in the design; this isin files AndOr.v, SR.v, and XorNor.v:

Fig. 1.3 Schematic forAndOr.v

// ===========================================================// AndOr: Combinational logic using & and |.// This module represents simple combinational logic including// an AND and an OR expression.//// ANSI module header.// ------------------------------------------------------------// 2004-11-25 jmw: v. 1.0 implemented.// ============================================================module AndOr (output X, Y, input A, B, C);

//assign #10 X = A & B;assign #10 Y = B | C;//

endmodule // AndOr.

The only assignment statements are continuous assignments, recognizable by theassign verilog keyword. These are wire connection statements, mapping directlyto Fig. 1.3. The ‘=’ sign merely connects the left side to the right side. This connec-tion is permanent and can not be modified during simulation.

assign means: “connect to this wire”

The “#10” literals are simulator programmed delays.


Fig. 1.4 Schematic for SR.v

// ===========================================================

// SR: An S-R Latch using ∼ and &.

// This module represents the functionality of a simple latch,

// which is a sequential logic device, using combinational

// ∼AND expressions connected to feed back on each other.

//

// ANSI module header.

// ------------------------------------------------------------

// 2005-04-09 jmw: v. 1.1 modified comment on wire declarations.

// 2004-11-25 jmw: v. 1.0 implemented.

// ============================================================

module SR (output Q, Qn, input S, R);

wire q, qn; // For internal wiring.

//

assign #1 Q = q;

assign #1 Qn = qn;

//

assign #10 q = ∼(S & qn);

assign #10 qn = ∼(R & q );

//

endmodule // SR.

Four more continuous assignments in the SR.v verilog code again map directlyto the schematic wiring in Fig. 1.4.

Fig. 1.5 Schematic forXorNor.v

6 1 Week 1 Class 1

// ===========================================================// XorNor: Combinational logic using ˆ and ∼|.// This module represents simple combinational logic including// an XOR and a NOR expression.//// ANSI module header.// ------------------------------------------------------------// 2005-04-09 jmw: v. 1.1 modified comment on wire declarations.// 2004-11-25 jmw: v. 1.0 implemented.// ============================================================module XorNor (output X, Y, input A, B, C);wire x; // To illustrate use of internal wiring.//assign #1 X = x; // Verilog is case-sensitive; ‘X’ and ‘x’ are different.//assign #10 x = A ˆ B;assign #10 Y = ∼(x | C);//

endmodule // XorNor.

Three more continuous assignments map the Fig. 1.5 wiring to the verilog.

Step 4. Load TestBench.v into the simulator and simulate it, if necessary usingthe handout sheet (VCS Simulator Summary or QuestaSim Simulator Summary)provided in PDF format on the CD-ROM.

Don’t ponder the result; just be sure that the simulator runs without issuing anerror message, and that you can see the resulting waveforms displayed, as in Fig. 1.6through 1.8, depending on the simulator of your choice.

Fig. 1.6 Intro Top simulation waveforms in the Synopsys VCS simulator

Fig. 1.7 Intro Top simulation waveforms in the Mentor QuestSim simulator


Fig. 1.8 Intro Top simulation waveforms in the Aldec Active-HDL simulator

Step 5. The Design Compiler (DC) text interface, dc shell-t, is the standardinterface used by designers to synthesize, and especially optimize, digital logic.Typically, various dc shell options, and directives embedded in the code, aretweaked repeatedly before the resulting netlist is satisfactory, meeting all timing andarea requirements. These are text-editing exercises. There is a great advantage oftext scripting in these activities. Graphical manipulations of synthesizer commandscan be inefficient, undocumented, irreproducible, and error-prone when tweaking asynthesis netlist.

Our graphical interface to DC is design vision, which we shall use in thiscourse very occasionally, just to display schematics. We use the TcL (Tool ControlLanguage) scripting interface, indicated by “-t”, for dc shell-t; this also is thedefault scripting interface for design vision-t.

There isn’t much to synthesize in Intro Top, it’s already described almost on agate level. But, continuous assignment statements aren’t gates; so, their expressionswill be replaced visibly by gates during synthesis.

Load the design into the logic synthesizer and synthesize it to a gate-level netlist,using the instructions below.

The gates we shall be using for synthesis in this course are provided in a 90-nmlibrary from the TSMC fab house.

Lab 1 Synthesis Instructions

A. Invoke the synthesizer this way:

dc shell-t -f Intro Top.sct

There will be some text messages; then, you should see the dc shell-tprompt. A .scr (Synopsys script) file holds a list of commands to be run beforedc shell presents its interactive prompt. Instead of .scr, we use .sct in thiscourse to indicate Tcl syntax.

We’ll have plenty of time later to dwell on these messages. For now, youshould scroll back and find where you invoked dc shell-t, and then justquickly read forward, making sure there was no error message. The warningsare OK.

8 1 Week 1 Class 1

What the commands in Intro Top.sct did, was this:

• They specified the technology and graphical symbol libraries for the synthe-sizer to use;

• they associated the design logical library (in memory) with a directory;• they then analyzed and elaborated the design as disc files into that library; they

also established operating conditions (NCCOM = nominal-case, commercialtemperature range) and generic delay parameters (the wire-load model).

After this setup, the commands then set several specific constraints (max al-lowed area on chip, max allowed delays, estimated loads and drives) and thencame to a stop in interactive mode. This design is named Intro Top.

B. To the dc shell prompt, enter these commands, waiting for each to finish:

dc shell-xg-t> compiledc shell-xg-t> report areadc shell-xg-t> report timing

C. Briefly look over the results which were printed to the terminal:The negative slack message means that the timing requirements were not met;

we won’t worry about this for now. Areas are approximate transistor counts, forthis library.

D. Now save the result to disc:

dc shell-xg-t> write -hierarchy -format ddc

The write command saved the results to a set of Synopsys .ddc files. Thesebinary files are extremely efficient when saving a large design. But, we want ver-ilog, so, write out a verilog netlist file and exit:

dc shell-xg-t> write -hierarchy -format verilog -output Intro Netlist.vdc shell-xg-t> exit

E. Take a look at the synthesized netlist schematic.The netlist is verilog, so it may be examined in a text editor. However, be-

cause it is small and simple, we can convert it to a schematic and view it usingthe synthesis GUI interface. At the shell prompt, enter

design vision

When the GUI has appeared, use the File/Read command to read inIntro Netlist.v, which you just wrote out from DC.

Then, view the schematic by picking the “and gate” icon near the middle ofthe upper menu bar.


Notice that there isn’t much change from the original design schematic above,because by default DC preserves design structure and leaves hierarchy intact.

Double-click a block to see its substructure; pick the up-arrow to return.After viewing the schematic, use design vision File/Exit.

F. Next, repeat the run of DC shell to flatten the netlist hierarchy.Flattening typically is done to improve the timing optimization and the area.

Again,

dc shell-t -f Intro Top.sct

When the setup commands have run, again scroll up briefly to check themessages.

Assuming no error, next,

dc shell-xg-t> ungroup -all -flattendc shell-xg-t> compile -map effort highdc shell-xg-t> report timing

The timing has been improved, and it now fulfills all constraints in the .sctfile. Save the synthesized and ungrouped netlist, which no longer has any hier-archy, but don’t exit yet:

dc shell-xg-t> write -hierarchy -format verilog -output Intro TopFlat.v

We’ll do one more thing, while we have the synthesized netlist in memory:Write out a Standard Delay Format (SDF) file which contains the netlist timingcomputed from the library used by the synthesizer. We’ll look at this file later inthe day.

Be sure to give the new file a name with an SDF extension:

dc shell-xg-t> write sdf Intro TopFlat.sdfdc shell-xg-t> exit

Then, after exiting DC, view the flattened and optimized netlist schematic:Invoke the synthesis GUI again. Use the File/Read command indesign vision again to read in the new Intro TopFlat.v.

Examine the resulting synthetic schematic display, which should appear aboutthe same as in Fig. 1.9. Each gate in this netlist could be mapped to a mask pat-tern taken from a physical layout library, layed out, and fabricated in workingsilicon.

10 1 Week 1 Class 1

Q

Qn

qn

q

S

R

X

Y

Z

XX

YY

X

Y

A

B

C

A

ab

bc

A

BB

C

A

B

C

C

ZLo

D

DSR

SRLatch01

OutputCombo01

XorNor

AndOrInputCombo01

Schematic.1 Intro_Top

Fig. 1.9 The design vision representation of the hierarchical top level of the synthesizedIntro top netlist

After viewing the schematic, pick File/Exit to finish the lab.

A note on “flatten” terminology: Although “flatten” often means to removehierarchy, this word has a special meaning to Design Compiler. In DC, “flat-ten” refers to logical expressions, not to the design. To flatten the hierarchy ina design, the DC command is ungroup.

In DC, one may use a set flatten command to flatten combinationallayers of logic to a two-layer boolean sum-of-products; the set structurecommand factorizes the logic and in a sense is the inverse of set flatten.


1.1.1 Lab 1 Postmortem

Keywords. There were several verilog keywords used in the Lab 1 files:module, endmodule, assign, wire, reg. All were lower case.

Verilog keywords are lower case; but, the language is case-sensitive. Thus, onecan avoid any possible keyword name conflict by declaring ones own names al-ways with at least one upper-case character. So declaring, module Module . . .endmodule, is perfectly legal, although a bit vague in practice.

Comments. In verilog, the comment tokens are the same as in C++: “//” fora line comment; “/*” and “*/” for a block comment. Block comments are allowedto include multiple lines or to be located in the middle of a line.

Modules. The only user-declared object type in verilog is the module; every-thing else is predeclared and predetermined. Designers just decide how to nameand use the other data types available. This makes verilog a simple language, and itpermits a verilog compiler to be implemented bug-free.

A module is delimited by the keywords module and endmodule. Everythingbetween these keywords defines the functionality of the module. A module is thesmallest verilog object which can be compiled or simulated. It corresponds to adesign block on a schematic. In the synthesizer, a verilog module is referred to as a“design”.

Initial Blocks. The TestBench module contained an initial block. Aninitial block includes one or more procedural statements which are read andexecuted by the simulator, beginning just before simulation time 0. Statements inan initial block are delayed by various amounts and are used to define the sequenceof test-vector application in a testbench. Our Lab 1 testbench initial block in-cluded many statements; blocks of statements in verilog are grouped by the key-words, begin and end. Only statements end with a semicolon; blocks do notrequire a semicolon.

To terminate the simulation, add a $finish command at the end of the test-bench initial block. Otherwise, the simulation may hang (depending on simu-lator configuration) and may never stop.

Module Header. A module begins with a module header. The header startswith the declared module name and ends with the first semicolon in the module.Except testbenches, which usually have no I/O, module headers are used to declarethe I/O ports of the module. There are two different formats used for module head-ers in verilog: The older, traditional verilog-1995 format, and the newer, ANSI-C,verilog-2001 format. We shall use only the newer format, because it is less repetitivethan the older one and is less prone to user error.

In the module header, ANSI port declarations each begin with a direction key-word, output, input, or inout; the keyword may be followed with bus indiceswhen the port is more than 1 bit wide. Ports may be declared in any order, but usu-ally it is best to declare output ports first (we’ll see why later). Each directionkeyword declaration is followed by one or more names of the I/O ports (wires). Todeclare additional port names, one uses a comma (‘,’), another direction keyword,and another list of one or more names.

12 1 Week 1 Class 1

For example, here is a header with one 32-bit output, two 16-bit inputs, and two1-bit inputs:

module ALU (output[31:0] Z, input[15:0] A, B, input Clock, Ena);

Assignment Statements. Our Lab 1 exercise contained two different kinds ofverilog statement, continuous assignment statements to wires, and blocking as-signment statements to regs (in the TestBench initial block). In verilog, wiretypes only can be connected and can not be assigned procedurally. One declares adifferent kind of data object, a reg, to assign procedurally. While procedural state-ments are being executed, regs can be changed multiple times, in different ways.To get a procedural result into the design, one uses continuous assignment to assignthe reg to a wire or port.

For example,

module And2 (output Z, input A, B);reg Zreg;// A connection statement:assign #1 Z = Zreg; // Put the value of Zreg on the output port.// Procedural statements:initial

beginZreg = 1’b0;

#2 Zreg = A && B;#5 $finish;end

endmodule

This and gate model is rather useless, (the initial block terminates the sim-ulation at time 2 + 5 = 7); but, it shows how a reg and a wire are used together.Zreg is set to 0 before the simulation starts; the delay on the assignment to Z causesZ to go from unknown to 0 at time 1. At time 2, Zreg is updated from the moduleinputs, and Z may change then, for the last time, at time 3.

The blocking assignments in the TestBench and in the model above are in-dicated by ‘=’. There is another kind of procedural assignment, the nonblockingassignment, which is indicated by “<=”. We shall study nonblocking assignmentslater.

Simple Boolean operators. Verilog has about the same operator set as C orC++. In particular, the logical operators, ‘!’, “&&” and “||”, return a ‘1’ or a ‘0’,representing true or false, respectively. There also are bitwise operators, ‘&’, ‘|’, ‘ˆ’(exclusive-or), and ‘∼’ which are the same as logical operators for 1-bit operands.

Four Logic Levels. In addition to the logically defined levels ‘1’ and ‘0’, ver-ilog has two special levels to accommodate simulation states: ‘x’ means either ‘1’ or

1.2 Verilog Vectors 13

‘0’, but that the simulator can not determine which; ‘z’ means “turned off” (three-state off) and generally means the same as ‘x’ except when multiple drivers on asingle wire are in contention.

It usually is good practice to specify the width of every verilog literal expression,even when the width is just 1. So, we usually try to write “1‘b0” instead of just ‘0’.The width in bits precedes the quote (’), the number base (‘b’, ‘h’, or ‘d’) followsthe quote, and the literal value is last.

Thus, 4’b000z means a 4-bit value with least-significant bit turned off.12’b101 and 12’h5 have the same value, 5, and both are 12 bits wide.

Delay. A delay value is preceded by ‘#’ and always is in decimal base. Themeaning of a delay value is determined by a timescale macro, usually at the begin-ning of the first file compiled for simulation. ‘timescale 1ns/100ps meansthat #1 equals a 1 ns delay, and that the simulator should calculate all delays to amaximum precision of 100 ps.

Component Instantiation. Module instances are the structural basis of all bigdesigns. A module must be declared before it can be instantiated. A module instancealso may be referred to as a component instance, especially when the instance isfrom a library of relatively small gates. Instantiations are statements and, like otherstatements in verilog, end with a semicolon.

The basic syntax of instantiation is: module name instance name port map;for example, from Lab 1, the device under test was instantiated in TestBenchthis way:

Intro Top Topper01 ( .X(Xwatch), .Y(Ywatch), .Z(Zwatch), .A(Astim), .B(Bstim), .C(Cstim), .D(Dstim));

The declared name of the instantiated module was Intro Top; this instancewas named Topper01, and the port map followed in the outer parentheses.

Port Map. The preceding instantiation shows how to map the port names ofthe module, Intro Top, to the wires in TestBench: Each port name is precededby a period (.), and the mapped wire, connected to that port, is immediately after itin parentheses.

1.2 Verilog Vectors

Verilog vector notation is used for ordered sets, or busses, of logical bits.Verilog has similar vector and array constructs. We’ll look at arrays later.A declared vector can be assigned to another one without explicit indexing; or,

alternatively, a bit or part select may be made on either side of the assignment.A select means that part of the vector is being named explicitly. All these arelegal:

14 1 Week 1 Class 1

reg[7:0] HighByte, LowByte1, LowByte2;reg Bit;...HighByte = LowByte1; // Assigns one vector to another....Bit = LowByte2[7]; // This is called a "bit select"....LowByte2[3:0] = LowByte2[7:4]; // Two examples of "part select"....

All the assignments here are procedural, because a reg must be assigned proce-durally. However, the module declaration and the blocks containing the proceduralstatements have been omitted for simplicity.

Notice the difference between verilog and C language declarations: The indexrange appears after the type name (reg or wire), not after the name being declared!

“Register” means “regular line-up”. Registered data is the defining characteristicof Register-Transfer Logic (RTL), the level of abstraction most frequently used indigital simulation and synthesis.

Don’t read verilog “reg” as register! Perhaps transistor would be better, but itstill would be wrong. The “reg” declaration represents localization or storage ofinformation, even if of just one bit. In verilog, a reg localizes a logic state; wiresnever are considered regs, even when regularly lined up in busses. Verilog wiresjust connect logic from one place to another; with multiple endpoints, they localizenothing. Both nets (wires) and reg’s are called variables in verilog, because theirvalues can be changed by events during simulation.

The value of a wire can be changed only if the logic driving it changes value.The value of a reg can be changed by assigning it a different value in an initialor always block. An assignment to a reg is called a procedural assignment, be-cause it can be done by a procedure of individual statements inside an initial oran always block.

Continuous assignments are done by continuous assignment statements(“assign” keyword) and represent permanent connections of something to a net.For example,

module ModuleName(...);wire x;...assign x = a & b | c; // LHS must be wire; RHS wire or reg.endmodule

All the assignments in the first lab’s Intro Top design were continuous as-signments to wires; there were no reg’s in that design (excluding the testbench);therefore, there were no procedural assignments.

1.2 Verilog Vectors 15

Procedural assignments are done in procedural blocks (“always” or“initial” keyword) and represent logical changes in a certain order. Forexample,

module ModuleName(..., input a, b, c);reg x;wire y;//assign y = a & b;assign y = y | c; // y wired to two drivers probably is an error!...always @(a,b,c) // LHS must be reg; RHS may be wire or reg.

beginx = a & b;x = x | c; // x gets c | (a & b). No problem.end

endmodule

In any kind of assignment statement, binary, octal, decimal, and hexadecimalexpressions are allowed for constant values (literals):

1’b1, 1’b0, 1’bx, 1’bz. 8’b00zz xx11. 64’h33b4 1223 1112 af01.

8’b1010 0010 is the same as 8’ha2 or 8’hA2 or 8’o242 or 8’d162.

During logic simulation, limitations on displayed expressions tend to fail safe,in the sense that ambiguity is propagated throughout the values represented bya hex, decimal, or octal numeral. If any bit in a hex digit is ‘z’ or ‘x’, allfour bits are treated as ‘z’ or ‘x’, respectively, when hex notation is used torepresent the value. The same holds for decimal or octal numeral representa-tions. Thus, assigning a variable 8’b000z xx11 means that it will be displayedin hex as 8’hzx. The ‘x’ is stronger than the ‘z’, so 4’b0z0x is displayedas 4’hx.

Timing expressions such as #10 are ignored by the logic synthesizer; unknownsare treated as known by the synthesizer, but in special ways. Neither ‘x’ nor ‘z’ ismeaningful as an actual value for synthesis: ‘x’ means the simulator can’t assign a‘1’ or ‘0’; ‘z’ refers to three-state gate functionality, not to a logic state. The synthe-sizer has to apply its own interpretation to assignments of ‘x’ or ‘z’, so as to makethe synthesized gates simulate the same way (except timing) as the presynthesisdesign.

The significance of the bits in a vector is unalterable. If 8’ha2 is assigned toan 8-bit variable declared reg[7:0], bit [7] is the MSB, which gets a ‘1’; if itis assigned to a reg[0:7], bit [0] is the MSB, and it still gets a ‘1’. The MSBalways is on the left of a verilog vector.

16 1 Week 1 Class 1

// Some example vector operations:‘timescale 1ns/100psmodule Vector;

reg [0:15] MyBus; // A vector of 16 bits of storage.// The verilog MSB always is on the left.

//wire[15:0] mybus; // Another vector, a 16-bit bus....MyBus[0:2] = mybus[10:8]; // This is called a "part select".MyBus = mybus;MyBus[0:15] = mybus[15:0];...

endmodule

‘timescale detail: It allows use of fractional delay expressions such as #0.1.For example,

‘timescale 1ns/10ps...#0.05 X = Y & Z; // The delay is 50 ps.

1.3 Operator Lab 2

Log in and change to the Lab02 directory; do all your work here for this lab.

Lab Procedure

Step 1. Vector exercise:

A. Create a file named Vector.v, and copy the “Vector” module declaration intoit from the preceding lecture example (above). Fill in the missing parts (“. . .”)of this module any way you want: But, use continuous assignment to drivesome value onto mybus, and use blocking assignments in an initial blockto schedule the assignments shown in the lecture example. Supply your owntime delays. Check your work by simulating in VCS.

B. Now try to reverse bus directions by adding MyBus[0:2] = mybus[8:10]or MyBus[15:0] = mybus[15:0]. Compile in the simulator. What hap-pens?

C. If you tried mybus = MyBus in your initial block, what kind of problemmight occur?

1.4 First-Day Wrapup 17

D. Declare a 48-bit reg named My48bits and assign it in an initial block asfollows: #5 My48bits = ’bz; #5 My48bits = ’bx;#5 My48bits =’b0; #5 My48bits = ’b1; Notice that no width specification is givenin these literal values. Simulate after adding #5 $finish. What happens?Suppose you used 1’bz, 1’bx, etc.?

Step 2. Boolean operator exercise:

A. Declare a reg[4:0] named X and two others named A and B. Try simulatingassigning A and B to X as follows: Initialize A = 5’b01010;B = 5’b11100; then, run these in order: X = A & B; X = A | B;X = A ˆ B; X = ∼A | ∼B.

B. After that, try X[0] = &A; X[1] = |A; X[2] = &B;X[3] = ˆA & ∼ˆB; X[4] = A[4] ˆ B[4]; X = ˆX.

C. Parentheses may be used for grouping: Try X = ((∼A &ˆB)|(A & B))ˆA;

These operators also work with unequal-sized vectors; we’ll look at this later inthe course.

1.3.1 Lab Postmortem

Things to have noticed:

Vector extension for logic 0, x, or z, but not 1Binary vs. reduction operatorsWhat happens with #0.1 and ‘timescale 1ns/1ns?

1.4 First-Day Wrapup

1.4.1 VCD File Dump

The logic simulator can create a VCD file. The VCD (Value-Change Dump) fileformat is specified in IEEE Std 1364, section 18. Briefly, this file provides an ASCIIfile record of a simulation in an extremely compact format. Because it is plain,ASCII text, such a file is accessible to homemade scripts or other utilities for variouspurposes, for example, estimation of test coverage or detection of timing violationsof rules not in force when the simulation was run.

The VCD file consists of a header with information such as a design modulename, and a table of abbreviations of variable names. Ports can be dumped option-ally as though they were variables in the module. Variable names are represented byone-character ASCII symbols such as ‘!’, ‘#’, or ‘(’. The body of the file is a listof simulation times written as #time (absolute; not a delay), followed immediately

18 1 Week 1 Class 1

by a list of variable values which were new at that time. For this reason, simulatoroutput files of this kind often are called “time-value” files. Example:

...$scope module TestBench $end...$var reg 1 # Cstim $end$var reg 1 $ Dstim $end$var wire 1 % Xwatch $end...#300##400$...

The values dumped can be selected as level-only (4-state) or level plus strength.Because the VCD file format is an IEEE standard, it is portable across all toolsreading or writing this format. One important use for it in a design flow is as asample of simulation activity for an estimate of design dynamic power dissipation;this is beyond the present course.

1.4.2 The Importance of Synthesis

Although simulation might be considered enough for a front-end designer, this isquite incorrect. True, the simulator may be used to validate functionally the originaldesign (source verilog) as well as any synthesized netlist; but, the timing used bythe simulator is entirely artificial – it is entered manually by the designer, basedon estimates and expectations. Only after a netlist has been synthesized, can thepropagation delays of the gates, the setup, hold, and other clocking constraints, andthe (back-annotated) trace delays be estimated with any accuracy.

More importantly, a netlist can be fabricated; it is a product of value in the mar-ketplace. Simulator waveforms can’t be marketted or sold, so they are strictly forthe benefit of the designer. Thus, a good netlist is a designer’s contribution to thesuccess of the company, and the logic synthesizer is the tool which almost always isthe means of creating that netlist.

1.4.3 SDF File Dump

An SDF file can be used to back-annotate delays into a netlist. The logic synthesizercan create this kind of file, but usually it is most useful when created by a tool with

1.4 First-Day Wrapup 19

access to a chip floorplan or layout. The Standard Delay File format resembles thatof lisp or EDIF, with punctuation supplied almost solely by parentheses.

For example, the SDF delays for a noninverting buffer in a netlist might be:

...(CELL

(CELLTYPE "BUFFD0")(INSTANCE U3)(DELAY

(ABSOLUTE(IOPATH I Z (0.064:0.064:0.064) (0.074:0.074:0.074)))

))...(INTERCONNECT U3/Z U4/A (0.002:0.002:0.002) (0.003:0.003:0.003)...

An SDF file contains a complete timing representation of a design netlist, withdelay information derived from a library and a static timing analysis or other delayextraction process. We shall look at this file again toward the end of the course, whenwe discuss back-annotation of layout timing into a simulation or layout netlist.

1.4.4 Additional Study

Read Thomas and Moorby (2002) chapter 1 on verilog basics.Read Thomas and Moorby (2002) chapter 3 on synthesizable verilog through

section 3.4. Ignore 3.4.4 (casez and casex).Do Thomas and Moorby (2002) 1.6 Ex. 1.2 (different adders).Find the .VCD file in the Lab01 directory and open it in a text editor. When

you ran your Lab01 simulation, this .VCD file was created because of a directive,‘include ‘‘../../VCS/Extras.inc’’, which was present in the Lab01testbench file (TestBench.v). Examine the VCD file, just to see what it lookslike. We won’t be making use of VCD in this course, but it is good to know about.Detailed discussion is in Bhasker (2005), section 10.10.

Optional Readings in Palnitkar (2003)

Read through the exercises in chapters 1–4, and try a few if you want.Study the S-R latch model in section 4.1; there is a runnable example on the

Palnitkar CD as Chap04/Chap EG/SR Latch.v. We shall do further work withthe S-R latch in a later lab.

20 1 Week 1 Class 1

Study section 2.6, a ripple-carry counter example. The code is available on thePalnitkar CD as Chap02/Chap EG/ripple.v. We shall study various counterdesigns later in the course.

Read section 9.5.6 for some general information on VCD files and section 10.4on SDF files. We’ll do a lab on SDF format in one of our last chapters.


2.1 More Language Constructs

Today we’ll present enough of the verilog language to be able to design somethingmeaningful. Also, we’ll be cementing our understanding of certain basic constructswhich we shall use again and again in the rest of the course. After an initial lab onconversion basics, we’ll look into the essential concept of a shift register and thendesign one in lab.

Traditional module header format. We never shall use the traditional formatfor our lab designs in this course, and it is deprecated for all new design entry;however, many tools (and older textbooks) still write headers in the older format, sothe 1995 format should be understood.

The traditional format follows that of K&R C, whereas the newer format fol-lows that of ANSI C. In both verilog formats, a port name automatically is as-sociated with an implied net of the same name, making all ports wire types bydefault.

The main difference is that in the 2001 format, header declarations are entirelycontained in the header, which occupies the first statement in the module. The firstsemicolon in a 2001 module always terminates the header.

In the 1995 format, only the names of the ports are in the first statement (firstparentheses); declarations of width or directionality follow and may be located any-where in the body of the module.

This implies that, in the traditional format, it is necessary to enter the name ofevery port at least twice: Once in the first parentheses and once in a subsequentdirectionality and width declaration. In a small module, this is only a small burden;however, in a module with dozens or hundreds of I/O’s, it greatly increases the riskof error.

In the traditional format, because the name of every output port had to betyped twice anyway, it was allowed, and actually common practice, to type it yet athird time, declaring the port to be a reg type if the design intent was to assign itprocedurally. This saved the (trivial) effort of declaring a separate internal reg andusing a continuous assignment statement to wire the internal reg to the outputport, as is almost always done in modern design.


22 2 Week 1 Class 2

However, this apparently saved effort caused three problems: First, we nowhad three declarations of the same name, a condition which was confusing todesigners who wanted their declarations each to reserve a different name forevery different object. Second, it increased the likelihood of a maintenance error:The three different naming statements always had to be coordinated whenever anyI/O was changed in the design. Third, it required different output ports to betreated fundamentally differently – with no evident difference in the initial paren-theses naming those ports: Ports later named a reg had to be assigned procedurally,only in always blocks, and ports not later named a reg had to be assigned only inwiring statements such as continuous assignments.

Finally, in the traditional format, the “header” never had a well-defined locationin the module; in principle, the header went on and on interminably, until finally,somewhere inside the module, the last I/O named in the initial parentheses wasgiven a width and direction.

All these problems were overcome in the 2001 format, in which all I/O declara-tions were confined only to the header, and only were allowed to occur once.

Two design management problems with the traditional format also arose:First, because module contents could modify the header’s meaning, especially

the I/O widths, at the start of a project, 1995 headers could not be sketched usefullyand distributed to the design team. In 2001 format, it is easy to distribute com-plete headers and require that no designer modify a header without managementapproval.

Second, a parameter (= a named configuration constant; see below) declaredand defaulted after the header in the 1995 format was fundamentally ambiguous:It could not be made clear which parameters should be overridden in a mod-ule instantiation, and which were meant to be strictly internal to the module. In the2001 format, although parameters not declared in the header still can be overrid-den in an instantiation, design-team procedures easily can be stated so that instanceoverrides are restricted to parameters in the header.

Here are two alternative module declarations illustrating the header differences.The module functionality has been omitted. In 1995 header format:

module MyCounter(CountOut, CountReady, StartCount, Load, Clock, Reset);

output[15:0] CountOut;output CountReady;input[15:0] StartCount;input Load, Clock, Reset;reg[15:0] CountOut; // Could be anywhere in the module....endmodule

The 1995-format header never really ends.

2.1 More Language Constructs 23

In 2001 header format:

module MyCounter(output[15:0] CountOut, output CountReady, input[15:0] StartCount, input Load, Clock, Reset);

// Header has ended; module itself now begins:reg[15:0] CountOutR;assign CountOut = CountOutR;...endmodule

One additional aspect of the difference is that the 2001 continuous assignmentto the CountOut net allows for specification of different delay values for rise andfall. This is a minor advantage but a real one. Procedural delays can have only onedelay value, implying that 1995-style direct declaration of CountOut as a regwould mean that it could be assigned only one value for rise and fall. The differentways of assigning verilog delays will be presented later.

In 2001 format, it still is possible to declare an output port to be a reg,

module My2001Module (output reg[15:0] CountOut, ...);

however, this should be avoided in a modern, manually-entered design. Althoughreg output port declarations in the 2001 format create only a minor inconvenience,we shall not allow them in our coursework.

Side note: An input or inout port never can be a reg, because a reg on aninput only could be changed procedurally, and there are no header procedures.Therefore, a reg input always would be uninitialized and would contend with any-thing wired to that input when the module was instantiated, creating a perpetual(and illegal) ‘x’ on that input.

Verilog comments. There are two different formats, the same as for C ++:

• Line comment: “/ /” Starts anywhere on a line.• Block comment: “/ *” begins a comment which doesn’t end until “* /”.

Comment examples:

assign X = a && b; // Put the AND of a and b on X.//assign Y = a /*left operand*/ && b /*right operand*/;//assign Z = (a>5)? a&&b : a||b;/* The preceding assignment puts the AND of a and b

on Z if a is greater than 5; otherwise, it putsthe OR of a and b on Z. */

24 2 Week 1 Class 2

Comments also may be effected by nonexistent-macro regions: Refer to anundefined macro variable name in “‘ifdef undefined name”; this will causeeverything until “‘endif” to be ignored by a verilog-conforming compiler (sim-ulation or synthesis). Thus, an undefined macro name may be used effectively tocommented out code. Later, the designer may ‘define the name and thereby un-comment the code and put it back into the verilog. This is discussed more fullybelow.

Comments can be used by tools to impose constraints not part of the veriloglanguage: Synthesis directives or other attributes referring locally to verilog codeconstructs can be embedded in line comments if the comment containing them in-cludes certain tool-specific keywords. A tool looking for keywords will read at leastpart of every verilog comment in the code. A typical such comment would be in theform of “//rtl synthesis ...”, with a command or constraint following thekeyword “rtl synthesis” on the same line.

Always blocks. A change of value in the sensitivity list (event control expression)of an always block during simulation can cause the statements in the block tobe reread and reexecuted. A logic synthesizer usually will synthesize logic for ev-ery always block. Omission of a variable from the always block sensitivity listmeans that the state of the logic does not change when that variable does, possiblyimplying latched data of some sort. We shall discuss this idea in detail later in thecourse.

Thomas and Moorby (2002) includes examples of always blocks without sen-sitivity lists. There is nothing wrong with this usage; but, in this course, we avoidit for reasons of style. Omission of an event control (sensitivity list) is not commonin design practice, because location of an event control at the top of a block makesthe functionality of the block more easily understood. In this course, we recom-mend not to use “always” except when immediately followed by an event control(“always @(variable list)”).

...always // Avoid this style.#10 ClockIn <= !ClockIn;

...always@(ClockIn) // Recommended style.#10 ClockIn <= !ClockIn;

Initial blocks. An initial block has no sensitivity list and is read only once,beginning before simulation time 0, when no event control can be expressed. Allinitial blocks are ignored by synthesis tools. Because initial blocks can notbe activated and disabled concurrently but only can schedule events after variousdelays from time 0, it usually makes good sense only to use one initial block ina simulation; more than one would just make it hard to determine the order in timeof the events to be scheduled.

Exceptions to limiting the design to one initial block may be useful whena second initial block is reserved to implement a simulation clock (using


forever) or is reserved for actions unrelated to design functionality, such as todefine an SDF file to load, or to set up a $monitor task.

Continuous assignments. We should mention these again here, for completeness.The three most common concurrent blocks in verilog are always blocks, initialblocks, and continuous assignments – in addition to structural (hierarchical) designinstances. These three blocks, and hierarchical instances, are the designer’s mainwork when entering a design in verilog. A continuous assignment is sensitive to achange of anything on its right-hand side. A few examples follow.

A wire assignment (continuous assign) is sensitive to the expression on the RHS:

assign #2 X = A[0] && B[0] || !Clock;

A procedural assignment is sensitive to changes in its event control list:Xreg is not updated on changes in Clock:

always@ (A[0], B[0]) Xreg = A[0] && B[0] || !Clock;

Yreg is updated only when Clock goes to 1’b1:

always@ (posedge Clock) Yreg = a && b;

Vectors and vector values. All vectors are read as numerical values. Most ver-ilog vector types (reg or net) are read as unsigned by default. Exceptions areintegers and reals, which are read as signed. reals generally are not syn-thesizable.

Assuming no overflow, the bit-pattern representing a specific number is identical,whether the storage is signed or unsigned; the difference is when the value is beingused in a comparison of some sort. When the value is used, the MSB of a signednumber is used to determine whether the number is positive or negative (‘1’ meansnegative, as usual in 2’s complement arithmetic). The MSB of an unsigned numbersimply is the most significant bit representing its value; an unsigned number can notbe negative, no matter what its bit pattern.

Example of the meaning of the sign bit:

reg[31:0] A;integer I;...A = -1; // Default is to assume decimal integers.I = -1; // Now, both A and I hold 32’hffff ffff.//if ( I > 32’h0 )

$display("I is positive.");else $display("I is not positive."); //Prints "I is not positive."if ( A > 32’h0 )

$display("A is positive.");else $display("A is not positive."); //Prints "A is positive."

26 2 Week 1 Class 2

Verilog logical operators !, &&, and || treat operands as true when any bitin a vector operand is nonzero and false for all-zero, only.

Verilog binary bitwise operators &, |, and ∧ operate bit-by-bit. The unary ∼operator inverts each bit in a vector. Every binary bitwise operator can be used as areduction operator: &A expresses the and of all bits in vector A.

Unnamed constant values are called literals. A literal expression consists of awidth specification, a delimiting tick symbol (’), a numerical base specifier, andfinally a number expressed in that base: width’base(value). For example,5’h1d defines a 5-bit wide vector literal with hex value 1d (= 16*1 + 1*d =16 + 13 = decimal 29). The expression, 16’h1d, also evaluates to decimal 29,but it is much wider.

If width and base are omitted, a literal number is assumed to be in decimal for-mat; if a whole number, it is assumed to be a 32-bit integer (signed).

Vector type conversions are performed consistently and simply, avoiding theneed for explicit conversion operators. Verilog actually has no user-defined type, sothe rules can be quite simple:

1. The significance of vector bits is predefined; the MSB is always on the left.2. All values in an expression are aligned to the rightmost bit (LSB).3. All expressions on the right-hand side of an assignment statement are adjusted to

the width of the destination before operators are applied.4. Width adjustment is by simple truncation of excess bits, or by filling with 0 bits.

When an operand is signed, the widening fill is by the value of the original MSB(sign bit).

Examples of type conversions for vector operations:

reg X;reg[3:0] A, B, Z;...X = A && B; // X gets 1’b0 only if either A or B is 4’b0.Z = A && B; // Z gets 4’b0 only if either A or B is 4’b0.

// Otherwise, Z gets 4’b0001Z = A & B; // Each bit of Z get the and of the

// corresponding bits of A and B.X = !Z; // X gets 1’b1 only if all bits of Z are 0.X = ∼Z; // Z narrows to its LSB and is inverted; X gets !Z[0].

Truncation and widening examples (unsigned only, for now). Notice how a vari-able can be initialized when it is declared; this is for simulation, only:

reg A = 1’b1, B = 1’b0;reg[3:0] C = 4’b1001, D = 4’b1100;reg[15:0] E = 16’hfffe;...C = A | B; // C gets A=4’b0001 | B=4’b0000 --> 4’b0001.C = E & D; // C gets E=4’b1110 & D=4’b1100 --> 4’b1100.E = A + C; // E gets A=16’h1 + C=16’h9 --> 16’ha.


Verilog includes parameter declarations. Parameters are named constants whichmust be assigned a value when declared. If declared in a module header, the valueassigned is a default; this value may be overridden when the module is instantiated.

Inside a module,

module ModuleName ( ... I/O’s ... ); // <-- semicolon ends header....parameter Awid = 32, Bwid = 16;...reg[Awid-1:0] Abus, Zbus;wire[Bwid-1:0] Bwire;...endmodule

In a module header,

module ModuleName#(parameter Awid = 32, Bwid = 16) // <-- no comma!( ... I/O’s ... ); // <-- semicolon ends header.

...reg[Awid-1:0] Abus, Zbus;wire[Bwid-1:0] Bwire;...endmodule

Parameters are unsigned by default, and are as wide as required to hold the valueassigned.

Conditional commenting with macroes. To comment out a block of statementsor other design data, one way is to use the verilog block comment tokens, /∗ and ∗/.

This can be useful when unsynthesizable code would cause the synthesizer toissue an error. But, often, a better way is to know that when (our) synthesizer readsa verilog file, it defines a global macro DC before it starts. This macro is empty,but its name is always declared. This means that you can comment out code forthe synthesizer but not for the simulator by surrounding the offending lines with‘ifndef DC (“if DC not defined”) and ‘endif.

For example, the synthesizer rejects verilog system tasks. So, use the DC macroto make verilog system-task assertions invisible to the synthesizer but visible to asimulation compiler (which actually can use them):

always@(posedge Clk)beginXreg = NewValue;‘ifndef DC$display("time=%05d: Xreg=[%h]", $time, Xreg);‘endifYreg = ...end

28 2 Week 1 Class 2

The Silos demo simulator included with the books named in the References doesnot accept ‘ifndef, but it does accept ‘ifdef and ‘else; so, a more portableway of writing the example above would be as follows:

always@(posedge Clk)beginXreg = NewValue;‘ifdef DC‘else$display("time=%05d: Xreg=[%h]", $time, Xreg);‘endif...

A good coding rule is to use ‘define and other macro definitions sparingly:They are global, and more than one of the same name may exist inadvertently ina large design. Then, a new value during compilation may replace an older one,possibly causing design errors.

To avoid errors because of compilation order, a defined macro should be unde-fined as soon as it has served its purpose, preferably in the same file as the one inwhich it was defined. Exception: ‘timescale.

Example:

‘define BLOCKING‘define Awid 32module ModuleName ( ... I/O’s ... );

...reg[‘Awid-1:0] Abus, Zbus; // The ‘ is required for value!...always@( ... )begin‘ifdef BLOCKINGQreg = Areg | Breg;‘elseQreg <= Areg | Breg;‘endifend

endmodule‘undef BLOCKING // Use ‘undef unless you want these to‘undef Awid // be seen in subsequently compiled files.

2.2 Parameter and Conversion Lab 3 29

2.2 Parameter and Conversion Lab 3

Work in the Lab03 directory for these lab exercises.

Lab Procedure

Step 1. Synthesis of a design with parameterized widths. The design in the Lab03directory consists of a top-level verilog module in ParamCounter Top.v andtwo submodules in Counter.v and Converter.v. A schematic is given inFig. 2.1, with only those pins shown which entail a name change in the port map-ping. For example, the port OutEnable is connected to an Enable pin:

Fig. 2.1 Schematic for Lab03, Step 1

The design is a resettable up-counter which puts its result in the lower-orderbits of an output bus of configurable width. The design is intended to show howparameters are used to configure a design. In this exercise, only bus width is param-eterized.

Don’t bother simulating, unless you want to do it after finishing this lab.Use the DC macro to comment out the entire TestBench module beforesynthesizing.

When you are ready, change PADWIDTH to 8, and synthesize the design; opti-mize it for area and then for speed. To synthesize for area, just comment out allspeed-related constraints (but leave the design rules); to synthesize for speed, leavethese constraints in, but keep the area goal at 0. Area and speed are somewhat cor-related in practice.

After experimenting with PADWIDTH set to 8, increase the value of the pPadparameter and resynthesize to see what happens.

Step 2. Creation of a design to simulate arithmetic. Next, try some signed andunsigned arithmetic, using the simulator to view the results:

Create a verilog module in a new file Integers.v and check the results of thefollowing (you may change the variable names if you wish to do them all in oneinitial block in one module, simultaneously):

30 2 Week 1 Class 2

A. integer A = 16, B = -8, X; X = A + B; X = A - B;X = B - A;

B. Same as above with all declared as reg[31:0] (use 32’d to assign to thisreg).

C. (optional) Same as above with the following successive changes: Just X declaredas integer; just A declared integer, and just B declared integer.

Step 3. To see the effect of truncation and widening of various types, create a newverilog module in a file Truncate.v and try the following in the simulator: De-clare data objects as follows: integer Int; reg[7:0] Byte; reg[31:0]Word; reg[63:0] Long; and reg[127:0] Dlong. Initialize each one with’b0. Then,

A. Assign 6’d1 and then later -6’d1 to each of the above objects and notice whathappens, both with hex and decimal radix display:

B. Assign 36’d1 and then later -36’d1 to each as in A.C. Assign 1 and then -1 to Int, and use Int, not a literal, to assign each of the

others as in A and B above.D. Assign 32’h7eee 777f to Word and then use Word to assign each of the

others as above. This value has a 0 MSB and so should represent a positivenumber.

E. Finally, assign 32’hf777 eee7 to Word and assign to each of the others. The1 MSB should represent a negative number.


Side note: Vector negative indices are allowed, although not used in this course. Thewidth always is given by the difference, plus 1.

For example, a declaration of,

reg[3:-7] CoeffA;

may be interpreted by a tool to represent an 11-bit decimal fraction with LSB equalto 2−7 = 1/27 = 1/128. Used in DSP modelling.

2.3 Procedural Control

2.3.1 Procedural Control in Verilog

• We shall study three procedural control constructs at this point: if, case, andfor. These only are allowed in a procedural block – initial or always.

2.3 Procedural Control 31

Use if for explicit priority or for ranges of values. The if often is easier to usethan the other control constructs when describing a short list of mutually exclusiveevents such as a clock and an asynchronous control.

if (expr) statement1;else if (expr) statement2;else statement3;

Use case when alternatives are specific values, are numerous, or are conceptuallyunprioritized, for example to implement a table lookup or small memory addressingscheme. The verilog case breaks automatically and does not “fall through” on amatch the way the C language case (in a switch statement) does. Good practiceis never to omit the default of a case statement; we’ll return to this issue, andthe case statement, later in the course.

case (expr)alt1: statement1;alt2: statement2;... ...

default: default stmt;endcase

The case alternatives usually are constant expressions but may be variables; thecase expression usually is a variable.

for is the preferred looping construct in verilog; it works about the same way asthe C for. However, the C language unary increment and decrement expressions,i++, ++i, i--, and --i, are not allowed. In verilog, one must use i = i + 1or i = i - 1.

for (loop var init; loop reentry expression; loop var update) statement (s);

2.3.2 Combinational and Sequential Logic

• The conditional expression operator, control expr ? True expr : False expr. Thisis used like a C function call returned value – in an expression. It is an expres-sion, not a statement. The expression always is interpreted by the current synthe-sizer as combinational logic. If the control expression evaluates to true (non-0),the True expr expression is its value; otherwise, it evaluates to the False expr.Example:

wire[31:0] X;integer A, B;...// Put the greater of A or B into X; A if they are equal:assign #2 X = (A>=B)? A : B;

32 2 Week 1 Class 2

This is very useful for a mux connection to a wire, because if is not allowedin a continuous assignment (if may be used only in a procedural block).

• Simple always@ block syntax: Use ‘,’ in the event control (sensitivity list);the traditional but inconsistent “or” is deprecated. If any change causes theblock to respond, the logic is combinational. When a posedge or negedgeexpression causes insensitivity to the opposite edge, the result is sequentiallogic:always@(posedge Clk, posedge Reset) means the same as the tra-ditional verilog always@(posedge Clk or posedge Reset).It also is possible to embed event controls inside an always block:

always@(posedge clk)beginxbus[1] <= 1’b1;@(posedge Enable) // Execution stops here until Enable goes to ’1’.

beginDbus <= 8’haa;xbus[7:4] <= ’b0;end

xbus[2] <= 1’b0;...end

• For complicated combinational constructs , consider using blocking assignmentsin an always block, with the result put on the right side of a continuous assign-ment statement. Example:

always@(Ain, Bin, Cin, Din, temp1, temp2)begintemp1 = AinˆBin;temp2 = CinˆDin;ComboOut = (temp1 & temp2) | AinˆDin;end

assign #5 OutBit = ComboOut | OtherComboOut; // Collect the delays here.

• nonblocking assignments. These are statements within a procedural block. Theydiffer from blocking assignments in two ways: (a) during simulation, they do notblock reading of the next statement; and, (b) at any given simulation time, theyare evaluated along with blocking statements, but they are executed only after allblocking assignments and net updates are complete.

• Use nonblocking assignments in always@ block for clocked sequential logic.Nonblocking evaluation occurs when blocking evaluations do, but assignmentis scheduled after all blocking assignments on a given clock event, thus ensur-ing that combinational input values will not have been altered by nonblockingupdates before being clocked in.


Avoid: Blocking assignments for clocked sequential logic, nonblocking as-signments for combinational logic.

Avoid: Mixed blocking and nonblocking assignments in one procedural block.Avoid: Delay of #0 to fix up race conditions caused by failure to avoid!

• Implementation of clocks and storage registers. Clocks imply sequential ele-ments (storage), because the clocked values remain constant in value betweenedges. Let’s ignore level-sensitive latches for now; if we do so, sequential ele-ments are synthesized by use of the edge specifiers, posedge or negedge, inan event control. Sensitivity only to an edge implies storage on the opposite edge.

A clock generator usually appears only in a testbench and may be defined by meansof a single, delayed nonblocking assignment in a level-sensitive always block:

reg Clock;...always@(Clock)#10 Clock <= !Clock; // ∼Clock also OK.

This makes use of the concurrent always statement. The assignment must benonblocking, so that the always block event control will be sensitive to the inver-sion. The clock reg must be initialized somewhere else.

Another way to define a clock is to use the procedural forever statement in analways or (more usually) initial block. Being procedural, a forever mustbe enclosed in a concurrent block in its module. For example,

reg Clock;...initial // Only for clock generation.beginClock <= 1’b0;forever#10 Clock <= !Clock;

end

This clock would work with blocking assignments. However, with blocking as-signments, anything clocked by it would require special treatment to ensure properdata setup.

Be careful not to use an initial block to initialize things that should be synthe-sized or used to represent hardware initialization: This is a very important differencebetween coding of simulation software and coding of hardware.

2.3.3 Verilog Strings and Messages

• Verilog string type. This is for literal strings, only. We shall not use strings ex-cept in system task messages, only. String values may be stored in a reg vector,

34 2 Week 1 Class 2

or in a memory object (array of bytes); but, be careful, especially with unicodetext or other nonASCII character encodings. Messaging system tasks all printto a simulator (console) text screen; they resemble C language printf functions.See Thomas and Moorby (2002) sections B.4 and F.1 for more information onmessaging during simulation.

The three most useful messaging tasks :

$display (format, args for display) for a simulation info printout when it isencountered procedurally but before RHS evaluations (before nonblocking as-signments) at that simulation time.$strobe (format, args for display) for a simulation info printout when it isencountered procedurally but after RHS evaluations (after nonblocking assign-ments) at that simulation time.$monitor (format, args for display) for print-on-change procedural simula-tion info after RHS evaluations and nonblocking assignments whenever one ofits args changes. Usually invoked in an initial block, possibly under a con-dition or after a delay.

Assertions are routines embedded in the design, by the designer’s foresight, tocheck that a condition holds and to announce a warning or error during simula-tion. Like simulation itself, they are a design activity as much as a verificationactivity.

Assertions assert that something should be true, remaining silent when it is; and,they report when it is false. Assertions bring attention to conditions which otherwisemight be overlooked after a design change, or under unusual or complex simulationconditions.

Whereas the primary purpose of an assertion is to warn the designer when theasserted condition has failed, assertions also may be used to generate simulationerrors and cause a running simulation to pause or terminate. We shall deal only withassertion messages for now.

A simple, homemade assertion statement using the $time system task:

reg X, Y;...if (X!=Y)$strobe("\n***Time=%04d. X=%1b == Y=%1b failed.\n", $time, X, Y);

Messages not only contain text strings, but they almost always also should reportvalues of design variables; so, they must provide readable and useful formats forthose values.

The most useful format specifiers are these (* = most common):* %h hex integer* %d decimal integer

%o octal integer* %b binary integer


%v strength level%c single ASCII character

* %s character string* %t simulation time in timescale units

%u 2-value data of 0, 1%z 4-value data of 0, 1, x, z

* %m module instance name (see below)

Any of the integer formats except %t may be preceded by “n”, in which n isan integer, to constrain the minimum number of unused leading places displayed.In addition, “0n” fills remaining leading displayed places with zeroes. Forexample,

integer x; ...; x = 5;$display("%s = [%04b] = [%4b].", "The value", x, x);

prints to the simulator console, “The value = [0101] = [ 101].”. Other-wise, using “%b” with no n, the integer would be printed with a default of 29 leadingblanks, because integers are 32 bits wide.

There is a special string replacement option, %m, which appears to be a formatspecifier but which actually does not format anything; instead, it is replaced in theoutput with the full hierarchical name of the instance in which the system task wasexecuted. Hierarchical names are explained in detail in Week 6 Class 2.

2.3.4 Shift Registers

A shift register is a register of bits which shifts the bit-pattern up or down in theregister. Often, “shift right” is used to describe shifting down, and “shift left” isused for shifting up. Up vs. down refers to the unsigned numerical value inter-preted from the register contents. Counters count up or down by similar reason-ing. The same terminology is used in software assembly-language programming, inwhich the contents of a register in the CPU or memory are shifted one way or theother.

Binary representation is most useful in understanding what happens during ashift. For example, suppose an 8-bit register with this content: 8’b0001 1001. Ashift up changes the contents to 8’b0011 0010. The leftmost bit is lost, and a new‘0’ appears on the right (because a ‘0’ would be shifted into the register by default).A shift down changes the original contents to, 8’b0000 1100. If this register, inhardware, was hooked up so its MSB input connected to its LSB output during ashift down, then the original contents, shifted down, would be, 8’b1000 1100:The LSB ‘1’ would be shifted into the MSB of the register.

36 2 Week 1 Class 2

When a shift register is connected to other design elements, the new bit appearingon one end or the other depends on however the register has been connected. Whathappens to the bit shifted out also depends on the design.

2.3.5 Reconvergence Design Note

A shift register with programmable storage is a good way to introduce the problemof reconvergent fanout – in this case, of a clock.

To use a shift register as something other than a 1-bit FIFO or a latency control,the shifting should be capable of being disabled; this allows the shift register to storeits current value for one or more clocks. A simple way to achieve this is to gate theshift clock, as shown in Fig. 2.2.

Fig. 2.2 Shift register which stores data when clock is gated off. The clock gate may cause areconvergence error

However, in the application shown, the clock which shifts the data also clocks anindependent flip-flop as shown on the far right. This clock has reconvergent fanout,because its logical effect, after traversing the shift register, reconverges on the exter-nal flip-flop. From the schematic alone, the problem in this case is likely to be setup:The and gate on the shift clock causes it to arrive later than the clock on the externalflip-flop. Thus, even with a design allowing for setup on the individual components,the external flip-flop may be clocked too soon to capture the current value on thelast shift flop.

This is a general problem in any design with a gated clock. It is overcome byusing specially designed shift flops with an input-enable control pin, or by addingdelay to the clock on all components which are independent of the shift register andwhich receive data from it.

Possible reconvergence is something always to keep in mind; however, thesynthesizer is designed to accommodate reconvergence and can be expected to cre-ate a netlist from the (gated-clock) design in Fig. 2.2 without the setup error de-scribed. Of course, if the designer wants the external logic to receive late data, thenthe synthesizer must be constrained to create it that way. We shall not discuss thisproblem further at this point in the course. In the next lab, we avoid reconvergenceby adding shift-enable logic manually to each of our shift-register flip-flops.

2.4 Nonblocking Control Lab 4 37

2.4 Nonblocking Control Lab 4

Work in the Lab04 directory. Create subdirectories under Lab04 if you want.

Lab Procedure

Step 1. Use the following examples to model three different D flip-flops in a singleverilog module. Add an assertion that warns when preset and clear are assertedsimultaneously. Check your design by simulating it; then, synthesize it, optimizingfirst for area and then for speed.Simple D flip-flop:

always@(posedge clk1) Q <= D;

D flip-flop with asynchronous clear:

always@(negedge clk1, posedge clr)beginif (clr == 1’b1)

Q <= 1’b0;else Q <= D;end

D flip-flop with asynchronous preset and clear:

always@(posedge clk2, negedge pre n, negedge clr n)

begin

if (clr n == 1’b0) Q <= 1’b0; // clear has priority over preset.

else if (pre n == 1’b0) Q <= 1’b1;

else Q <= D;

end

Step 2. Write a verilog model of a D flip-flop with synchronous clear. Simulate itto check your design.

Step 3. Use the following examples to model three different D latches correspond-ing to the flip-flops in Step 1. You may wish to copy your design from Step 1 andmodify the flip-flop code to represent latches. Decide which “simple D latch” to usefrom the examples below. As in Step 1, add an assertion that warns when preset andclear are asserted simultaneously. Check your design by simulating it; then, synthe-size it, optimizing first for area and then for speed. Check this netlist by inspectionof the schematic to be sure that it indeed models three latches.

Simple D latch:

always@(D) if (ena1==1’b1) Q = D;

38 2 Week 1 Class 2

The ena1 variable is declared somewhere and controlled externally; for exam-ple, ena1 might be a module input, as probably would be D.

What would happen if the sensitivity list was always@(ena1)?What kind of simple D latch would be the following?

always@(D, ena1) if (ena1==1’b1) Q = D;

D latch with asynchronous clear and with enable asserted low:

always@(D, clr, ena n)beginif (clr == 1’b1)

Q = 1’b0;else if (ena1 n==1’b0) Q = D;end

D latch with asynchronous preset and clear asserted low:

always@(D, pre n, clr n, ena2)beginif (clr n == 1’b0) Q = 1’b0; // clear has priority over preset.else if (pre n == 1’b0) Q = 1’b1;else if ( ena2 == 1’b1) Q = D;end

Step 4. Serial-load shift register. A shift register shifts on every clock if shifting isenabled; otherwise, it holds previous data. See the schematic in Fig. 2.3.

Fig. 2.3 Shift register with serial load

Write a verilog model of a D flip-flop with asynchronous clear, in its own file,and use it to construct a 5-bit shift register with serial load. Your flip-flop shouldhave Q and Qn outputs. This serially loaded shift register should have one D input;


its contents can be loaded by shifting data for 5 clocks; or, it can be cleared asyn-chronously. To allow the shift register to hold its data while the clock still is running,use a multiplexor (“mux”) to supply each flip-flip D input with one of these two pos-sible inputs: (a) With shift enabled, the mux should connect the previous Q to thenext D; or, (b) with shift disabled, the mux should connect each flip-flop’s D inputwith its own Q output.

The schematic representation of a 2-input mux is shown in Fig. 2.4.

Fig. 2.4 Schematic muxsymbol

Your mux model may be written this way:

always@(sel, in1, in2)if (sel==1’b0)

outbit = in1;else outbit = in2;

A mux is combinational, not sequential, logic, so blocking assignments are fine.Notice that if either of in1 or in2 had been omitted from the sensitivity list,a latched state would exist, and we would not have a mux but rather a latch ofsome kind.

Another way to write a mux (not in this Step, please):

assign outbitWire = (sel==1’b0)? in1 : in2;

The D flip-flop model should be in a separate file from the shift-register, and soshould be the mux model.

Simulate your design briefly to check it for functionality.Optional: Synthesize your design twice, optimized for area first and and then for

speed.

Step 5(optional). Parallel-load shift register. Modify the design from Step 4 so thatall 5 bits of the shift register can be loaded with new data on one clock.

The easiest way to do this is to add a third selection to your Step 4 muxes: Thethird choice should connect each flip-flop D input to its respective bit on a new, 5-bit,parallel-load input data bus. See Fig. 2.5.

40 2 Week 1 Class 2

Fig. 2.5 Shift register with parallel load

In this parallel-load design, when shift is not enabled, each flip-flop Q is con-nected to its own D; when shift is enabled, each flip-flop Q is connected to the Dof the next flip-flop; when parallel-load is selected, each flip-flop Q is unconnected(except the last one), and each flip-flop D is connected to a bit on the parallel datainput bus.

Be sure that your new muxes always select exactly one of the three intendedconnections no matter how they are operated.

Step 6. Rewrite the serial-load shift operation of Step 4 as a procedural model ina single always block like this:

always@(posedge ShiftClock)beginQReg[0] <= Din;QReg[1] <= QReg[0];QReg[2] <= QReg[1];QReg[3] <= QReg[2];QReg[4] <= QReg[3];end

This kind of design will not require a DFF model at all. To add the Step 4 muxes,you may use conditional expressions on the right instead of unconditional Qregexpressions:

always@(posedge ShiftClock)beginQReg[0] <= (ShiftEna==1’b1)? Din : QReg[0];QReg[1] <= (ShiftEna==1’b1)? QReg[0] : QReg[1];QReg[2] <= (ShiftEna==1’b1)? QReg[1] : QReg[2];QReg[3] <= (ShiftEna==1’b1)? QReg[2] : QReg[3];QReg[4] <= (ShiftEna==1’b1)? QReg[3] : QReg[4];end


This shift register model can be implemented using just one always block andno design hierarchy. Incidentally, putting the same delay, say, “#1”, in front of eachassignment statement in the code above should delay each shift by that delay butnot otherwise affect the logic. However, some simulators will not simulate the de-layed statements correctly – another good reason not to use delays in proceduralblocks.

Fig. 2.6 Simulation of the behavioral shift register model above, with a 0.5 ns lumped output delaynot shown in the verilog

Simulate to verify (see Fig. 2.6); then, try to synthesize the procedural modelfor area and speed. As it happens, the current version of the synthesizer may notsynthesize nonblocking assignments correctly if they are delayed, but you shouldtry anyway, to see what happens.

Then, just remove all the delays above and try synthesizing again.What happens in simulation if the nonblocking assignments are replaced by

blocking assignments in the code fragment immediately above? In the code frag-ment above, how much total time does it take for one shift?

You can synthesize a procedural shift register by removing the delays (above)or by changing the assignments to blocking ones. However, if you use blockingassignments, the statements then have to be ordered in reverse of that shown (withnonblocking assignments and equal delays as above, order is irrelevant, because theold value on the right will not be updated before it has been assigned).

To study the special features of nonblocking assignments, simulate theScheduler.v model which you will find in your Lab04 directory.


Latches are a problem for the synthesizer. This is not unintentional; guess why?Do you think the synthesizer can create better logic with a structural model than

with a procedural one? Why?How might the flip-flop models be improved? What about ‘x’ or ‘z’ states?In the serial-load shift-register design, how might an assertion be added to check

to make sure that shift was not disabled before at least five valid data had beenloaded?

42 2 Week 1 Class 2


As previously assigned, read Thomas and Moorby (2002) chapter 1 on basics andchapter 3 (on synthesizable verilog) through section 3.4. Do Thomas and Moorby(2002) 1.6 Ex. 1.2.

Read Thomas and Moorby (2002) chapter 2 on logic synthesis. We shall studyfinite-state machine design in verilog later, so read through the FSM sections lightly.

For inferred latches, study Thomas and Moorby (2002) Example 2.7 insection 2.3.2 and do 2.9 Ex. 2.2.

Optional: Read Thomas and Moorby (2002) section 5.2 on parameters, but ignoredefparam.


Section 9.2.2 on parameters.Sections 3.2.9, 3.3.1, and 9.5.3 for details on strings and messages.Sections 14.4–14.6 on logic optimization,

If you are puzzled by procedural controls, read through chapter 7 to see if ex-planations there might help. You may find the Palnitkar CD answers to exercisesin section 7.11 beneficial; however, use of forever or while in the coursework,like use of always without an event control, generally is discouraged.

Read section 15.15.2 for an overview of assertion-based verification, which isvery important in modern, large, complex designs.


3.1 Net Types, Simulation, and Scan

3.1.1 Variables and Constants

As previously mentioned, variables are of two different kinds, reg and net.However, the situation is more complicated.

A reg, an unsigned type, is about the same as the corresponding signed types,integer and real. Any of these either of reg-like types can be assigned a valueprocedurally and retains the value assigned until the value is changed by a subse-quent assignment. Both integers and reals are 32 bits wide; a reg may be anywidth, from one bit up to the limit the tool in use can accept. For flexibility, a regmay be declared signed in verilog-2001, in effect allowing declaration of integers ofany width. A real is not synthesizable in the digital design tools we use, and weshall ignore it in most of the rest of this course.

The word net does not refer to a specific type but rather a generic characteristic ofconnectivity. The type of a net may be wire, tri, wand, or wor – or other typesto be studied later. A wire and a tri are functionally identical, and the differentnames are just mnemonics. Multiple drivers on a wire may be a design error; ona tri, they probably are three-state drivers. A wand almost always is multiplydriven, and it drives inputs to which it is attached with the wired and of its drivers;a wor drives with the wired or. Thus, a wand or wor effects a logic operation inaddition to a connection.

Multiple drivers on wand or wor are effected by using two or more concurrentassignments. This can be done (a) by wiring the net to two or more instance outputpins, or (b) by two or more continuous assignment statements to the net. It shouldbe mentioned that neither wand nor wor is used often in modern cmos design.

Constants may be literal numerical values, parameter values, or string values.Because integers already are exactly 32 bits wide, they may be assigned con-

stant values without specifying width. However, it usually is wise to provide a widthfor every literal constant; this means that operations on the literal will have unam-biguous width, making the results well defined. Examples are the four fundamental1-bit states,


44 3 Week 2 Class 1

1’b1, 1’b0, 1’bx, and 1’bz; or, typical hex or binary expressions such as7’h15 and 16’b0001 0101 1100 1111.

A parameter value is by default unsigned. When initialized by a decimal inte-ger literal, a parameter becomes 32 bits wide by default; however, a parametermay be declared of any width. Examples are

parameter x = 5, y = 11; // unsigned; 32 bits wide.parameter[4:0] z = 5’b11000; // 5 bits wide.

There is no string type in verilog, but string literals enclosed in quotes may beused in simulation screen messages or assertions. A string of bytes also may be as-signed to a reg of adequate width, but this usage is not common in digital designexcept when programming error messages into a ROM or RAM. A typical simula-tion example is,

$display("Error at time=[%04d].", $time);

There are no “global” variables or constants in verilog; all must be declaredsomewhere within a module.

3.1.2 Identifiers

Identifiers in verilog, which is to say, declared names of variables or constants, ornames of modules, blocks or other objects, are:

• made of ASCII alphanumeric characters and ‘ ’• case-sensitive• of any length allowed by the tool in use• begin with a letter (alpha) or ‘\’.

Identifiers beginning with ‘\’ are called escaped identifiers and may contain anyASCII character except a blank. These identifiers are terminated by a blank (‘ ’),which is a delimiter and not part of the name. Escaped identifiers are used by toolsto avoid name conflicts for translation or portability and usually are not writtenmanually in verilog design source code.

3.1.3 Concurrent vs. Procedural Blocks

Concurrent blocks. These are blocks of code which simulate in no well-definedorder relative to one another. Verilog source files, for example, may be com-piled in a particular order, but the compilation order does not define the order ofsimulation. The most important concurrent block is the module instance. Like-wise, continuous assignment statements, initial blocks, and always blocks areconcurrent within a module. Other kinds of concurrent block, includingprimitives and the specify block, will be introduced later in the course.

3.1 Net Types, Simulation, and Scan 45

Procedural blocks. These are blocks of code within a concurrent block whichare read in order (sequence) during simulation and, if executed, are executed in theorder read. Procedural blocks may contain:

• blocking assignment statements• nonblocking assignment statements• procedural control statements (if; for; case)• function or task calls (later)• event controls (‘@’)• fork-join “parallel” statements (later)• nested procedural blocks enclosed in begin . . . end.

3.1.4 Miscellaneous Other Verilog Features

Macroes are like C preprocessor directives but begin with ‘‘’ instead of ‘#’. Theyare not strictly language elements, because they do not relate to other specific con-structs and may appear anywhere on a new line in the code. For example,

‘define, ‘timescale, ‘ifdef, ‘include

System tasks and system functions will be covered lightly in this course. Theyare simulation constructs and may appear in procedural blocks, only. For example,$strobe, $display, $sdf annotate, $stop, $finish.

Timing checks will be covered in detail later. They are predefined assertionswhich execute concurrently. They may appear in specify blocks only. Examplesare $width, $setup, and $hold.

The PLI (Programming-Language Interface) is a library of C routines whichallow a user to define new system tasks or functions which extend the functionalityof a verilog simulator. We shall discuss this toward the end of the course but shallnot use it at all.

3.1.5 Backus-Naur Format

(BNF)This is a way of defining syntax of a language and is used widely in verilogand other standards contexts. It gives an alternative view to text specifications oflanguage syntax and helps to resolve ambiguity. We shall not use it in this course, butit is nice to know. The Thomas and Moorby (2002) appendix G treats it extensively.

The BNF rationale is extremely simple and essentially hierarchical: The allowedsubstructure of any primary element of syntax is provided in a list following “::=”;subelements are broken down following “:=”. For example, suppose for simplicity’ssake the BNF for a variable was,

variable ::= reg | netnet := wire | tri | wor

46 3 Week 2 Class 1

Then, from this, we may deduce that a variable always will be either a reg,wire, tri, or wor. An attempt to use a parameter as a variable would be anerror easily recognized from the BNF given in this example.

3.1.6 Verilog Semantics

Verilog is an HDL, and its meaning is hardware. So, in a VLSI context, verilogmeans logic gates and wires or traces. The logic gates correspond to regs, moduleinstances, integers, and so forth; the wires to nets of various types.

Like any programming language, verilog works by expressions and statements.An expression is just something that can be evaluated (represents a value). An ex-ample of an expression is a sum or logical product, for example “5+7” or “A&&B”.An expression just evaluates to something; it doesn’t change anything.

However, a statement assigns a value to something and thus changes it. Thechanged value either is stored locally, or it is routed somewhere else. For example,in a procedural block, “Zab = A && B;” assigns the logical and of A and B (anexpression) to a reg named Zab. The two equivalent concurrent statements, con-tinuous assignment “assign Z = A && B;” or component instantiation “andand01(Z, A, B);” change the value of a net type named Z.

Statements have to be controlled somehow, and the usual way is by simulation ofa change in a variable. For example, “assign Z = A&B;” will be reexecuted ev-ery time the simulator simulates a change in A or B. An always block may containnumerous expressions and statements, and the designer must provide such a blockwith an event control (“sensitivity list”); the always block statements all will bereread only when a variable in the sensitivity list changes. For example, a block con-trolled by “always@(A)” will be reread only when A changes, even if one of itsstatements is “Z reg = A&B;”. However, a block controlled by “always@(�)”will be reread whenever any variable in an expression in the block is changed.

Latches and Muxes. Assuming only level sensitivity (not edge sensitivity in-troduced by keywords posedge or negedge), the meaning of an incomplete sen-sitivity list in an always block is a latch of some kind; a complete sensitivity listusually means combinational logic such as a collection of combinational gates ormux. However, if a control construct such as an if or case causes a value to beignored, a latched state can be created even with a complete sensitivity list.

For example, suppose in a module we have declarations of “wire a, b,sel;” and “reg z;”. Consider the following four different always blocks:

always@(a, b, sel) always@(*) always@(sel) always@(a, b)if (sel==1’b1) if (sel==1’b1) if (sel==1’b1) if (sel==1’b1)

z = a; z = a; z = a; z = a;else z = b; else z = b; else z = b; else z = b;

The leftmost two then represent muxes, because all the variables in expressionsare in the sensitivity list, and every alternative in the if is assigned to the one


output (z). The other two always blocks above represent nonstandard latches ofsome kind which are disabled when the variable(s) in the sensitivity list does notchange. The rightmost one is sensitive to every variable on the right-hand side ofa statement; thus, it represents a mux which is latched in an abstract sense: Theselection is latched, not the output value.

Here is a nonstandard latch which contains no procedural control construct:

always@(a,v) // typo!Z = a | b;

The value of z is latched against changes in b. This is the kind of latch oftencreated by mistake by a typo in the sensitivity list.

A simple transparent latch can be modelled correctly in an always block byomission of the sel==1’b0 alternative in what otherwise would be a mux. Thecode is next, with the equivalent component schematic symbol in Fig. 3.1:

// verilog simple transparent latch:always@(D,Sel)if (Sel==1’b1)Q = D;

Fig. 3.1 Simple transparentlatch by procedural omission.Sel renamed to E(nable)

The always block above is the preferred style for a synthesizable latch. Thesynthesizer will find a single component in the library for it, if the library has one.

Another way to write a synthesizable latch would be by continuous assignment:

assign Z = (Sel==1’b1)? D : Z;

Although the previous always-block model generally will synthesize to a latchlibrary component, a continuous assignment latch typically would synthesize to ex-plicit combinational logic with feedback.

The continuous-assignment latch might be synthesized as in Fig. 3.2:

Fig. 3.2 Simple transparentlatch by continuousassignment

48 3 Week 2 Class 1

A continuous assignment statement is treated as combinational logic by the syn-thesizer, so it does not attempt to find a library sequential element for the netlist.For this reason, it usually is not a good idea to use a combinational representationfor a latch (although we shall use combinational S-R latches in our course work forinstructional reasons).

When a latch is described functionally in combinational code, the precise gatestructure is up to the synthesizer, and this structure may be changed unpredic-tably during optimization. Thus, the timing (including possible glitching) of acombinationally-implemented latch is more uncertain than when it has been de-scribed more directly in a simple sequential construct such as the always blockabove (represented by Fig. 3.1). It is best to avoid the whole problem, whereverpossible, and write clocked constructs implying flip-flops instead of enabled con-structs implying transparent latches.

3.1.7 Modelling Sequential Logic

We now turn to some aspects of model building in verilog which are intended to pro-duce accurate synthesis results. For reasons to be expanded later, we avoid latchesper se here entirely and discuss only clocked latching elements – which is to say,flip-flops.

Clocks. A clocked block in verilog must be an always block with an edgeexpression, posedge or negedge. A clocked block never should include morethan one clock, and it should not include any level sensitivity. However, multipleedge expressions may appear if only one of them is applied to a clock. For example,

always@(posedge clk) ...always@(negedge clk) ...always@(posedge clk, negedge clear) ...

The following is not recommended and will not be synthesized:

always@(posedge clk, clear, preset) ...

Asynchronous controls. These usually are a single preset or clear but may beboth a preset and a clear. If verilog is written to represent both preset and clear, thecode can not avoid implying a priority for one of the controls. For example,

always@(posedge Clk, negedge Preset n, negedge Clear n)if (Preset n==1’b0)

Q <= 1’b1;else if (Clear n==1’b0)

Q <= 1’b0;else Q <= D;

In this model, priority is given to Preset over clear n if both are asserted inthe same time interval. Thus, a correct netlist should include logic creating this pri-ority, even though assertion of both probably would represent an operational error.


However, the designer’s intent usually will be to synthesize a single gate with twopins for the controls, and either no priority (random priority) or some sort of internalgate structure effecting a priority.

Avoid more than one asynchronous control if possible: The block may not besynthesizable if the library does not include a component with both a preset and aclear pin; and, if it does, the simulation may not match the synthesized netlist if bothcontrols ever should be asserted at once. It may be possible to pass a synthesizera constraint or other directive which would control its library access so that cellswould be chosen with specific preset-clear priority for particular instances. In theabsence of such control, the synthesizer might insert logic to implement exactlythe verilog simulation priority; or, it simply might ignore the verilog asynchronouscontrol priority entirely.

Race conditions. A race condition results from code which allows an ambigu-ous value (? ‘1’ or ‘0’ ?) to be scheduled during simulation. This generally meansthat the corresponding hardware will not be functional.

For example, within one always block, delayed nonblocking assignments canallow concurrency and therefore a race:

always@(posedge Clk)begin#2 X <= 1’b1;#2 X = 1’b0;#3 Y = X; // Ambiguous value used.end

The same concurrency can occur between different always blocks:

always@(posedge Clk) #1 X = a;always@(posedge Clk) #1 X = b;

To avoid the majority of race conditions, never mix nonblocking assignmentswith blocking assignments; and, never assign to the same variable from more thanone always block. Incidentally, the synthesizer will refuse to synthesize thesestyles, although the simulator will permit them.

In addition, it is good design to latch all outputs on major design components(large blocks or modules): Flip-flops on all outputs guarantee that when the clockedge occurs, current values only, and not possibly conflicting ones, will be suppliedto inputs of other components.

Finally, never schedule a delay of #0, except maybe in a testbench initial block.We shall study reasons for this later in the course. For now, just don’t do this:

always@(posedge Clk)begin#0 Q1 <= a; // Likely error! Never use #0!#0 Q2 <= b; // Likely error!...end

50 3 Week 2 Class 1

Synthesizable language subset. Not all the language is synthesizable. To be surethat what you simulate also will synthesize, here some rules of thumb:

• Don’t mix edge expressions and level-change expressions in a sensitivity list.• If you write a delay expression in an always block, use it with blocking assign-

ments, only. The synthesizer will object to any delayed nonblocking assignment,either by reporting an error or by issuing a severe warning. We shall see why later.

• When you code delays for simulation, try to do it this way: Avoid delays inalways blocks; instead, estimate the total delay(s) on the output of each suchblock, and move that estimated total to a continuous assignment statement. Forexample, code the way shown as follows:

module MyModule (output X, Y, rest of sensitivity list);local declarations. . .assign #5 X = Xreg; // estimated total delay = 5.assign #7 Y = Yreg; // estimate = 7....always@(posedge Clock) // A very strange flip-flop!beginx1 = (a && b) ˆ c;Xreg = x1 | x2;Yreg = &(y1 + y2);end

endmodule

3.1.8 Design for Test (DFT): Scan Lab Introduction

Modern tools automatically will insert scan into a design; this usually is done latein the design cycle, after the unscanned design has been well simulated. However,understanding the mechanics of scan insertion will help you to recognize and correctinsertion errors or inefficiencies.

Scan is called “scan” because scan registers allow logic states at hardware testpoints to be scanned (shifted serially) in and out of a design.

The purpose of scan is to be able to observe changes inside a design or a wholechip. Scan registers are hardware test points; they allow a designer or test engineer tosee what is going on in the hardware. Scan registers are not just simulation devices;they stay in the design after it is taped out and implemented in silicon.

There are two major kinds of scan, internal scan and boundary scan.In internal scan, the inputs and outputs of all, or some selected, blocks of com-

binational logic in the design are changed into scan elements. This is done byreplacing every flip-flop or latch with a scan flip-flop or scan latch. Obviously, ev-ery I/O of any combinational block must be either a chip pin or a sequential elementsuch as a flip-flop or latch.

A special test port is used for controlling scan operation; this port is called aJTAG port (“Joint Test Activity Group”) after the standards group that defined it.


The new scan components are the same as the ones they replace, except thatthey are muxed together into a serial scan chain, which is just a giant shift registerdistributed over all or part of the design. When the scan muxes are in the operateor normal mode, the design components operate as they were designed to do; whenthe muxes are in the scan mode, the design doesn’t operate any more, but all logicvalues on the inputs or outputs of scanned blocks can be shifted out of the designand inspected for correctness. Thus, by scanning in new inputs, operating, and thenscanning out the result, every combinational block in the design can be tested forcorrect operation, revealing design errors or (random) hardware failures, if any.

Figs. 3.3 through 3.5 illustrate how internal scan insertion is done in a designwith preexisting sequential elements, in this case flip-flops.

Fig. 3.3 Unscanned design

Fig. 3.4 Muxes inserted forinternal scan

Fig. 3.5 Library scancomponents. Mode selectnot shown

52 3 Week 2 Class 1

In our lab exercise, we shall do this kind of scan insertion, but with some modi-fication to allow for the fact that the original design does not include any sequentialelement to replace.

Boundary scan mostly is used for board-level hardware testing; it monitors theboundary of a chip. In boundary scan, the pins on a chip are connected to addi-tional scan components which in turn connect to the gates inside the chip. Thisallows chip test patterns to be applied by shifting in, and the results observed byshifting out, after the chip has been mounted on the circuit board. Boundary scanthus eliminates the need for a ”bed of nails”, or other external hardware probe,on every test point on the board. A modern ball-grid chip package may have wellover 1,000 pins spaced apart by a fraction of a millimeter; this makes a bed-of-nails approach impractical and possibly harmful to the tiny, delicate contactpoints.

Boundary scan usually is combined with some sort of built-in self-test protocol.In addition to a JTAG port, boundary scan typically requires a Test Activity Portcontroller (TAP controller), a built-in state machine, which is a complexity we shallignore for now.

In the next lab, we’ll use what we have learned about shift registers to insertscan flip-flops into the Intro Top design of the first lab of the course. It shouldbe emphasized that we shall be using two muxes per flip-flop, instead of the onerequired for normal internal scan. This is because the original design has no flip-flops, and we want to use the extra mux to remove the flip-flops functionally, notjust scan them. So, we are using scan insertion as an excuse for an exercise inverilog.

Our JTAG test port will consist of just five special one-bit ports: A scan-in port,a scan-out port, a scan clock port, a scan clear port, and a scan mode port.

We want to insert scan into our old Lab 1 design just to see how it is done. Issuesof clocking and settling of logic will be addressed again later in the course.

Before we start this lab, here is an outline of what we shall do:First, we’ll add a JTAG port and install flip-flops around the combinational logic

in Intro Top. Because Intro Top was purely combinational, this means we’llinstall a flip-flop on every I/O. After installing the flip-flops, we’ll restore function-ality to the Intro Top design by clocking the flip-flops. The only clock availableis the JTAG scan clock, so we’ll use that one. The rest of the JTAG port will re-main unused at this point. The Intro Top design then will become a synchronousdesign, clocked by the scan clock and with its original functionality intact, exceptfor synchronizing delays.

Second, we’ll install muxes, two for each flip-flop. Each mux at this pointwill be held in one select state (the operate or normal mode). For example, twoIntro Top outputs with muxes added to the new flip-flops would look as shownin Fig. 3.6.

3.2 Simple Scan Lab 5 53

Fig. 3.6 Two outputs in Intro Top after connection of muxes to the new flip-flops. Clock andclear logic omitted for clarity; ‘0’ on select is assumed to select normal mode

If we put the muxes in the scan mode, the design would not do anything, becausethe scan inputs to the input muxes are unconnected at this point. So, to allow im-mediate verification of our verilog by simulation, we shall install the muxes withthe flip-flops out of the design. This means that the Intro Top design again be-comes purely combinational, with no synchronizing delays, but with some smallpropagation delay added by the mux logic.

Notice that the design fragment in Fig. 3.6, held its normal mode, completelyignores the states of the flip-flops and does not respond to the scan clock at all.

Third, we’ll connect the dangling mux scan inputs to each other in a serial chain;as will be seen in the lab, this will chain the flip-flops together and allow them toact as a shift register when the design is clocked in scan mode. All mux outputs willhave been fully connected already; so, they need not be modified in this final step.

Now we begin the lab.

3.2 Simple Scan Lab 5

Lab ProcedureLog in and change to the Lab05 directory.You’ll find there a copy of the original Intro Top Lab01 design, with mi-

nor changes. Notice the maintenance log that has been kept in each module file.This kind of commenting is recommended, not only to pass on information to otherdesigners, but to remind the busy original designer, after a lapse of time, of whatwas done. The details to be logged will vary with your project coding style. Therealso is a synthesis script file there for use only in the final, optional Step of the lab.

Step 1. Add a JTAG test port in Intro Top. Do this just by adding five newone-bit ports connected to nothing: Call them ScanMode, ScanIn, ScanOut,

54 3 Week 2 Class 1

ScanClr, and ScanClk. All should be inputs except ScanOut. Your moduleheader now should be something like this:

module Intro_Top (output X, Y, Z, input A, B, C, D, output ScanOut, input ScanMode, ScanIn, ScanClr, ScanClk);

Step 2. Insert D flip-flops (FFs) into the path of every I/O at the top of the design(in Intro Top.v, not TestBench.v), except the JTAG I/O’s.

You may wish to use your FF design from Lab04; it should have a posedgeclock and positive-asserted asynchronous clear.

To do this, work with the module ports and wires in Intro Top; there is no needto change anything in the submodules which made up the structure of the originalLab 1 design.

Connect every preexisting design input to the D of a new FF; then, reconnect theQ output of that FF to whatever the preexisting input pin was driving. You shoulddeclare a new wire for each FF Q connection.

You may find it confusing and easy to make mistakes if you do not name yourwires and component instances mnemonically. For example, suppose your FF mod-ule was named DFFC (“D f lip-f lop with clear”). For the top-level A input port,which connected in Lab 1 to pin A of a block with instance name InputCombo,you might do this:

wire toInputCombo A;...DFFC Ain FF(.Q(toInputCombo A), .D(A), ...);

After the inputs, do the same for the outputs. In Intro Top, connect the Qoutput of a new FF to every preexisting design output; then, reconnect whatever wasdriving that preexisting design output to the D input of the new FF. Again, declare anew wire for each reconnection.

After wiring the data for all FFs, connect the ScanClk input to the clock pin ofeach FF; connect the ScanClr input to the asynchronous clear pin of each FF. Allyour FFs then should look something like this:

DFFC myFFinstanceName (..., .Clk(ScanClk), .Clr(ScanClr));

Add drivers for the new ScanClk and ScanClr inputs in the testbench inTestBench.v; be sure to connect them to the top-level design instance there.

A good testbench convention throughout this course is to declare a reg forevery design input to be driven; name this reg �stim, where ‘�’ is the designinput name. For example, Intro Top input A would be driven by a testbenchreg named Astim. Similarly, a good convention for design outputs is to declarea testbench wire for each one, named �watch. For example, the design output Xwould be connected to a testbench wire named Xwatch. You stimulate the inputs


and watch the outputs. Being consistent with testbench names makes it easy to recallwhich testbench variable was connected with each design variable without havingto examine the testbench code itself.

After this, clocking the FFs after every testbench stimulus change should applythe testbench �stim simulation inputs to the original design and should clock outthe current (not new) �watch outputs. After waiting enough simulation time forthe combinational logic to settle, clocking the FFs a second time without changinginputs should clock out the new, correct �watch outputs determined by the currentinputs and the design’s combinational logic.

Compile your changed design for simulation now, from testbench on down, tocheck your FF and clock wiring.

Step 3. Refine the timing, if necessary. Looking at the subblocks of the Lab05design, there are several delays of 10 units; so, be sure to have defined the clock inyour testbench with a generous period of, say, 50 units. This should allow plenty ofsettling time:

...always@(ClockStim) #25 ClockStim <= !ClockStim;//initialbegin#0 ClockStim = 1’b0;...end

The ClockStim in the testbench of course will be connected to the ScanClkport of the modified design. And you will want a new ClearStim in the testbenchto drive the design ScanClr. Just set ClearStim to 0; there is no need to resetthe flip-flops for this lab. Check your clocking scheme by simulating the designbriefly. See Fig. 3.7.

You now have a new, synchronous design consisting of the Lab 1 combinationallogic surrounded by sequential logic. It is synchronous because its operation issynchronized to ScanClk. Inputs are clocked in on every positive clock edge; theresult is clocked out on the next positive edge.

Fig. 3.7 The Step 3 synchronous implementation of Intro Top

56 3 Week 2 Class 1

Optionally, synthesize this design, just to see what the netlist schematic lookslike.

Step 4. Install multiplexers to remove all synchronous behavior again. Just themuxes; don’t chain anything yet. Use an always block or a continuous assignmentin Intro Top for a mux, not a (structural) component.

Each FF will require a mux driving its D input, to connect its D either to the nor-mal D input or to the scan chain. A second mux will be used to connect or disconnectthe FF to its normal output. The second mux is required for this design; otherwise,the FF’s Q output would contend with whatever normally is driving the next D input.

Fig. 3.8 Multiplexers arounda flip-flop in Intro Top tomake a scan element. Usually,scan requires one mux perflip-flop

At this point, the leftmost “(previous scan)” input as shown in Fig. 3.8 shouldbe driven with a constant such as 1’bx; the “(next scan)” wire shown should notbe added yet. It may be helpful to declare a wire with a noticeable name, such asTHIS IS X, assign a 1’bx to it, and use it in turn to assign the 1’bx to the severalmux inputs which will be set unknown at this stage in the lab.

As a result of all this, the FF shown in Fig. 3.8 is bypassed entirely in normalmode by its two muxes; so, in normal mode, we end up with our original, purelycombinational, Intro Lab 1 design.

While adding code for the muxes, make the select input of every mux work sothat when ScanMode is ‘1’, the FF is in the scan chain; when ScanMode is ‘0’,the FF is in the normal operating mode. Assign a small delay to each mux statement;say, #1 or so.

Simulate the design with ScanMode = 0 constantly (assign it in the test-bench); the result should be the same as it was in Lab 1 without muxes or FFs,except for the delay added by the muxes.

We now have Lab 1 working again, but with extra, nonfunctional FFs and muxeswhich merely add a delay.

Notice that we have made no change at all in any submodule. Our entire originaldesign was combinational, so we inserted the scan elements only at the top of thedesign. The result looks the same as a boundary scan, but it actually is an internalscan on a design which doesn’t have internal sequential gates.

Optionally, synthesize this design, just to see what the netlist schematic lookslike.

Step 5. Create a scan chain. At this point in the lab, the FFs are short-circuited outof the design by muxes, and they have no function. The scan inputs to all muxes are


unconnected to anything in the design. Therefore, the FFs can be connected togetherinto a shift register, using the mux scan inputs, without affecting the design.

Refer to Fig. 3.8. Pick any FF and connect its Q to the design ScanOut port.After that, connect the Q of any remaining FF to the scan input pin of the muxdriving the D input of the FF the Q of which just was connected. Repeat this until allFFs are chained together. The net result of this Step should be that the mux inputswhich were unknown in the previous Step now are connected to FF Q drivers, exceptthe first mux, which still is unknown. At this point, the ScanOut port should bewired to the Q output of the last FF in the scan chain.

Finally, connect the scan input pin of the mux driving the D input of the lastremaining FF to the ScanIn port, making this the first FF in the scan chain. Theresult of all this may be represented as in Fig. 3.9, which omits the ScanMode andScanClr wiring for simplicity:

Fig. 3.9 Logical representation of Lab 5 scan insertion. In normal mode, each Din-Dout logi-cally is a wire, and Sin-Sout is an open. In scan mode, Sin is a FF D input, Sout is a FF Qoutput, and Din-Dout is open. Dout and Sout are wired together, but this is irrelevant to themode differences

Recognize the scan chain on the dotted line of Fig. 3.9? It’s just a shift register!

Step 6. Assign useful instance names. Rename the flip-flop instances so that theirorder in the scan chain is represented numerically: For example, "A FF s01"might be a good name for the FF on the design A input, if it were set up as thefirst one ("s01") in the scan chain; "FF X s04" might be the FF driving the Xoutput, if it were fourth in the scan chain.

Optionally, synthesize this design; then, try to trace the scan chain in the netlist.

Step 7. Add a simulation safety net. It takes 7 scan clock cycles to scan in orout the contents of the entire chain; generally, depending on FF ordering, it takes

58 3 Week 2 Class 1

fewer clocks to scan in new stimuli or scan out new test results. Assume, then, thatoperation in scan mode for more than, say, 8 clocks in a row may represent an error.Write an assertion in TestBenchwhich triggers a warning if scan mode is assertedfor more than 8 scan clock cycles in a row.

Fig. 3.10 Simulation of the completed Step 7 scanned Intro Top design, with assertionsafety net

Figure 3.10 shows a simulation with a testbench which invokes the safety-netassertion. This assertion, being in the testbench, will not be seen by the synthesizerand will not have any effect on a synthesized netlist.

Step 8. Simulate your design to be sure you understand it. Synthesize it (from thetop level, not the testbench) for area. Examine the optimized netlist carefully; try tofind and count the multiplexers.

Optionally, after synthesizing for area, without exitting the synthesizer, un-group (flatten) the design and compile (synthesize) it again with an incremental-mapping option. Finally, still without exitting the synthesizer, do another compileon the flattened netlist for best area. The areas should improve progressively.

Step 9. (optional) The synthesizer can insert scan automatically in a design byreplacing sequential component instances with scan instances, for example replac-ing plain flip-flops with muxed flip-flops. However, to use this feature, there mustbe preexisting sequential components. We shall start by discarding the manuallyscanned Intro Top design and going back to the original combinational one.

Keep your current TestBench.v, but copy in a new Intro Top.v fromLab01. Interpose a D flip-flop as an output latch between each Intro Top outputdriver and its top-level output port (X, Y, and Z); use instances of your own DFFCmodel, but modified so that the Qn outputs are removed. These three FFs will be therequired sequential elements. Add a clock input port named Clk and a clear inputport named Clr to operate the FFs. See Fig. 3.11.


Fig. 3.11 The Intro Top design with latched outputs permits automated scan insertion

Then, add a set of three unconnected JTAG ports with these standard names,tms (test-mode select), tdi (test-data in), and tdo (test-data out). The other tworequired JTAG ports, tck and trst, will be shared with the Clk and Clr ports,respectively.

After this, rename the top module Intro TopFF. Then, modify Testbench.vas necessary to get the design to simulate normally, with outputs latched.

You should use the script originally prepared for you in your Lab05 directory toinsert scan automatically by means of the synthesizer. After scan synthesis, examinethe synthesized netlist briefly in a text editor or design vision.


In scan mode, with our design, each shift easily might cause the inputs to the com-binational logic to change, leading to a lot of logic changes and possibly glitches. Isthis of any concern?

How might the operational protocol be defined so as to minimize combinational-logic changes during scan chain shifting?

How might the scan chain be modified so the design combinational logic remainsin a constant state whenever it is in scan mode?


Read Thomas and Moorby (2002) sections 4.1 on concurrency, and 4.2 on eventcontrols.

Read the example in 4.5, 4.6 on disabling named blocks, and (optionally) 4.9 onfork-join. In section 4.9, note that fork and join may be used to substitutefor begin and end, respectively. However, adding an explicit begin and endaround every fork-join block makes the code more readable and more easilymodified; in particular, begin and end are required to use ‘ifdef DC to makecode with fork-join synthesizable.

Do 4.10 Ex. 4.1.(Optional) Look through Thomas and Moorby (2002) appendix G on BNF until

you understand how it can be used.

60 3 Week 2 Class 1


Read section 4.6.1 on coding verilog for logic synthesis.Do Exercise 1 in section 14.9, synthesis of an RTL adder.Do Exercises 2 and 3 of section 14.9.Look through the on-disc synthesis example that comes with the Palnitkar CD-

ROM (Chapter 14, ex. 1 directory). This example includes a presynthesized gate-level netlist and a verilog library of elementary component models.


4.1 PLLs and the SerDes Project

This time, we’ll design a PLL and a parallel-serial converter.For the remainder of the course, we’ll be working off and on to complete a SerDes

(Serializer-Deserializer) design, such as is required in the PCI Express specification.

4.1.1 Phase-Locked Loops

A phase-locked loop (PLL) consists of an output ClockOut generator, a phasecomparator, and a variable-frequency oscillator (VFO), for example, a voltage-controlled oscillator. A PLL input signal Clock is provided, and the phase-differencecomparator adjusts the VFO’s frequency whenever a phase shift is detected betweenthe Clock and the VFO ClockOut. The result is that the VFO is kept phase-lockedto the Clock. In theory, a PLL could lock in to any Clock input; but, in practice,the VFO and the rest of the PLL is designed so that the lock is accurate and almostjitter-free only in some narrow frequency range.

Although a PLL always locks in to the Clock provided, if the VFO output is usedto clock a counter, and a specific counter bit or value is used for the Clockout, theVFO frequency will be higher than the counted-down ClockOut used for the lock.The direct VFO output then may be used to provide a frequency-multiplied clocknevertheless phase-locked to the original Clock input.

4.1.2 A 1 × Digital PLL

An important PLL application is clock latency cancellation. In this application, thereis no multiplication, but the PLL is locked to a clock from a terminal branch of abalanced clock tree. This subtracts away the tree latency (clock insertion delay) andmakes available a clock from the PLL at the clock-tree terminal branches which isvery close to being exactly in phase with the original clock at its entry point on chip.See Fig. 4.1.


62 4 Week 2 Class 2

Fig. 4.1 PLL used to cancel clock-tree latency

Let us consider how we might design a verilog digital PLL for this purpose.We don’t want frequency multiplication, so we may assume two major compo-

nents, a clock phase comparator and a VFO. A schematic representation is given inFig. 4.2.

Fig. 4.2 Schematic of 1x digital PLL. The control bus from Comparator to VFO is assumed 2bits wide

Happily, a variable holding a verilog delay value may be changed during simula-tion time; this makes feasible a finely-controllable VFO period derived simply fromthe value of a verilog real variable. The PLL will not be synthesizable; but, wedon’t have to worry right now about synthesis.

This kind of VFO frequency control may be represented by the following pseu-docode:

real ProgrammedDelay;...beginProgrammedDelay = some delay value;...#ProgrammedDelay PLLClock = ∼PLLClock; // The VFO oscillator....ProgrammedDelay = some new delay value;...#ProgrammedDelay PLLClock = ∼PLLClock; // Frequency varied....end

4.1 PLLs and the SerDes Project 63

Suppose now that we have decided that the Comparator adjustment code fora VFO frequency increase shall be 2’b11, and for a decrease shall be 2’b00.

Then, assuming we have defined a delay giving us the desired VFO base operatingfrequency, and assuming we have decided upon the size of a delay increment whenthe Comparator signals an adjustment, the actual verilog for the VFO would beabout like this:

module VFO (output ClockOut, input[1:0] AdjustFreq, input Reset);reg PLLClock;real VFO Delay;assign ClockOut = PLLClock;//always@(PLLClock, Reset)if (Reset==1’b1)

beginVFO Delay = ‘VFOBaseDelay;PLLClock = 1’b0;end

else begincase (AdjustFreq)2’b11: VFO Delay = VFO Delay - ‘VFO Delta;2’b00: VFO Delay = VFO Delay + ‘VFO Delta;// Otherwise, leave VFO Delay alone.endcase#VFO Delay PLLClock <= ∼∼∼PLLClock; // The oscillator.end

//endmodule // VFO.

Notice the use of blocking assignments everywhere, to ensure that new valuesare available immediately upon update; however, the VFO oscillator must use anonblocking assignment to oscillate. This was coded with great trepidation, andonly after careful consideration of all consequences. This is a rare and unsynthesiz-able exception to the rule of never to mix blocking and nonblocking assignments ina single always block.

The Comparator is a bit more complicated. We require it to issue adjustmentcodes as described above for the VFO, but we want a decision with a minimum oflogic so that we can run it very fast in 90 nm library components. A simple verilogdesign would just use one clock to count the number of highs of the other clock inone clock cycle; if there was just one such high, the frequencies would be approxi-mately matched:

64 4 Week 2 Class 2

module JerkyComparator(output[1:0] AdjustFreq, input ClockIn, PLLClock, Reset);

reg[1:0] Adjr;assign AdjustFreq = Adjr;reg[1:0] HiCount;//always@(ClockIn, Reset)if (Reset==1’b1)

beginAdjr = 2’b01; // 2’b01 or 2’b10 are no-change codes.HiCount = ’b0;end

else if (PLLClock==1’b1)HiCount = HiCount + 2’b01;

else begincase (HiCount)

2’b00: Adjr = 2’b11; // Better speed it up.2’b01: Adjr = 2’b01; // Seems matched.

default: Adjr = 2’b00; // Must be too fast.endcaseHiCount = ’b0; // Initialize for next ClockIn edge.end

endmodule // JerkyComparator.

Blocking assignments again are used for immediate update. There is no reasonto use a nonblocking assignment anywhere in this model. The various possible de-cisions by this comparator are shown as waveform relationships in Fig. 4.3.

Fig. 4.3 Representative Comparator waveforms. Up arrow for HiCount > 1; down arrowfor HiCount= 0; horizontal double-head arrow for HiCount= 1

This kind of model works in simulation, and we shall use one very much like itin our next lab. But, it is very slow to lock in, and it causes many spurious frequencyadjustments. For the sake of a better lock-in, we can make several improvements:


• First, we should worry about a HiCount overflow and consequent wrap-aroundto 2’b00, causing a spurious speed-up adjustment when the opposite is indi-cated. This is avoided easily by increasing the HiCount reg width from 2 to3 bits.

• Second, we should improve the precision of the adjustment decisions. This canbe done by averaging the high counts over several cycles; the expected valuealways should be 1 when the two clocks are synchronized.

For speed, we should take a finite-difference approach to averaging in a verilogmodel. We should declare a fairly wide reg to hold the average; and, in a sepa-rate always block, either add 1 to it or subtract 1 from it on every clock which,respectively, has an excess high or no high at all.

As a result of these improvements, we can obtain an optimized 1 × digital PLLwhich does a respectable lock-in, not too much poorer than could be achieved withan analogue PLL. The code follows in two parts. The first part is the same compara-tor as in the previous code above – but, its result is used only internally. The secondpart averages the comparisons of the first in order to determine the frequency ad-justment to be sent to the VFO:

module SmoothComparator(output[1:0] AdjustFreq, input ClockIn, PLLClock, Reset);

reg[1:0] Adjr;assign AdjustFreq = Adjr;//reg[2:0] HiCount; // Counts PLL highs per ClockIn.reg[1:0] EdgeCode; // Locally encodes edge decision.reg[3:0] AvgEdge; // Decision variable.reg[2:0] Done; // Decision trigger variable.//always@(ClockIn, Reset)

begin : CheckEdgesif (Reset==1’b1)

beginEdgeCode = 2’b01; // The value of EdgeCode will be used toHiCount = ’b0; // increment or decrement AvgEdge.end

else if (PLLClock==1’b1)//Should be 1 of these per ClockIn cycle.HiCount = HiCount + 3’b1;

else begin // Check to see how many PLL 1’s we caught:case (HiCount)3’b000: EdgeCode = 2’b00; // PLL too slow.3’b001: EdgeCode = 2’b01; // Seems matched.

default: EdgeCode = 2’b11; // PLL too fast.endcaseHiCount = ’b0; // Initialize for next ClockIn edge.end

end // CheckEdges.// (continued below)

66 4 Week 2 Class 2

The second always block is decision-oriented:

// (continued from above)always@(ClockIn, Reset)

begin : MakeDecisionif (Reset==1’b1)

beginAdjr = 2’b1; // No change code.Done = ’b0;AvgEdge = 4’h8; // 7..9 mean no adjustment of VFO freq.end

else begin // Update the AvgEdge & check for decision:case (EdgeCode)2’b11: AvgEdge = AvgEdge + 1; // Add to PLL edge count.2’b00: AvgEdge = AvgEdge - 1; // Sub from PLL edge count.// default: do nothing.endcaseDone = Done + 1;if (Done==’b0) // Wrap-around.

beginif ( AvgEdge<7 )

Adjr = 2’b11; // Better speed it up.else if ( AvgEdge>9 )

Adjr = 2’b00; // Must be too fast.else Adjr = 2’b01; // No change.AvgEdge = 4’h8; // Initialize for next average.end

endend // MakeDecision.

endmodule // SmoothComparator.

Notice the case statement in the MakeDecision always block: It lacks adefault and does not cover all possible input alternatives. Therefore, a strange kindof latch is implied, and synthesis of this model should not be expected to be correct.This model would be considered a simulation-only place-holder if we were to adoptit for our class project.

This 1× PLL design, with some minor changes, has been implemented for youin your Lab06/Lab06 Ans directory. The model was designed to be simulatedwith a 400 MHz clock input, which is close to the upper frequency limit possibleeventually for synthesis in a 90 nm ASIC library. At this speed, the clock cycle timeis 2.50 ns, with a VFO half-cycle delay of 1.25 ns. The testbench drifts the delayslowly upward.


Some waveforms are shown in Fig. 4.4 and 4.5:

Fig. 4.4 The 1x PLL. Overview of the final 3 us of a 20 us VCS simulation

Fig. 4.5 Detailed closeup of the lock-in near the C1 cursor position of the Overview waveform

More information on PLL’s for latency cancellation, and details on static timinganalysis of a design containing a PLL, may be found in the Zimmer paper in theReferences at the beginning of this book.

4.1.3 Introduction to SerDes and PCI Express

The PCI Express (“PCIe”) bus is a serial bus meant to replace the 32-bit, parallelPCI (Personal Computer Interconnect) bus common in desktop computers betweenthe years of about 1995 and 2005. The serial bus sidesteps the growing interbitskew problem which is the main limiting factor in wide parallel busses clocked athigh speed. PCI Express should not be confused with the PCIx parallel bus standardwhich merely doubles the clock speed of an ordinary PCI bus.

A PCI bus operating at 66 MHz can transfer 66∗106∗32 ∼= 2∗109 bits/s(2 Gb/s). By contrast, each PCI Express serial link (technically, called a lane) cantransfer 2.5 Gb/s in each direction simultaneously. This increase primarily is becauseof continual progress in the understanding of high-frequency digital operations insilicon. The past decade also has seen growing automation of the use of siliconmonoliths to implement what in the past were strictly discrete analogue devices.

68 4 Week 2 Class 2

The first PCI Express specification, completed in mid-2002, allowed up to 32PCI Express lanes, each at 2.5 Gb/s, for a total of 80 Gb/s in each direction. Thecurrent, second-generation PCI Express specification calls for 5 Gb/s per lane, ineach direction, doubling the speed of the original generation. PCI Express, like thePCI bus, is an on-board link meant for short-range transfers of data, for example,transfers by the CPU to and from RAM or video or I/O-port controller. PCI Expressnot only is capable of being far faster than PCI, but it is much cheaper in termsof routing area on the board. However, data management of the serialization anddeserialization makes PCI Express much more complicated to design.

For example, to see the speed advantage, a typical computer terminal has a reso-lution of 1280 × 1024 pixels. In full-color CMYK mode, there are 32 bits per pixel.Thus, the screen requires a buffer of 1280*1024*4 bytes, which totals about 5 MB.To refresh the screen at 70 Hz then requires a transfer rate of about 350 B/s A parallelvideo bus 128 bits wide, operated at 33 MHz, can transfer about 500 MB/s and thuscan handle the screen update. However, such a bus would be running more slowlythan a single lane in second-generation PCI Express and would occupy about tentimes the area on a video card. A PCI Express video bus makes a lot of sense, bothin terms of performance and economy.

There are many analogue issues involved with circuits operating in the GHzrange. For example, each serial line actually is a differential pair of wires, mak-ing for four wires per full-duplex lane; for our digital purposes, calling each paira serial line is accurate enough. The mechanical implementation of a PCI Expressserial line may be a simple transmission line, a twisted pair, or even a coaxial cable.We shall ignore these issues for now and work on a serdes composed of routings andgates with all analogue difficulties assumed to have been designed away. Our serdeswill include a complete, self-contained serial-parallel interface the components ofwhich do not map exactly one-to-one with those of the PCI Express standard. Unlikea PCI Express design, when we are done, ours will be entirely digital and therefore,to that extent, technology-independent. Some optional readings have been providedin the References on the analogue side of the problem.

A serdes transfers data between two or more systems with busses of arbitrary(parallel) width. These local parallel busses are so short-ranged, that interbit skewis not specifically a design consideration. The data to be transferred are clockedoff a parallel bus into a buffer of some kind, generally a FIFO (First-In, First-Outstack memory), serialized (converted to a serial format), and transferred one bit ata time by the serdes to their destination. At the destination, the incoming serialdata are deserialized (converted to a parallel format), buffered, and clocked ontothe destination parallel bus. A great simplification is that the required precisely-synchronized common clock may be generated by a low-jitter PLL from the clockimage encoded in the data. Clocking in this context is the same as the defining ofdata-frame boundaries. If the serdes is in a stable, on-board environment, clockingcan be accomplished by using a PLL synchronized to the serial stream to generatea usable clock for buffering and format conversion at both ends. A single PLL maybe shared if both ends of a lane are in the same clock domain; or, two independentPLL’s may be used, one at the end of each lane.


4.1.4 The SerDes of this Course

The serdes we shall study operates, like the one in PCI Express, in a full-duplexmode; this means that transfers are possible simultaneously in both directions withno interference or sharing required. This is done simply by having two dedicated(verilog) data wires, each one capable of sending data in one direction.

Our serdes will convert 32-bit parallel data on each end to serial data in 16-bit frames. A full packet of data will be 64 bits. The 16-bit frame means that theembedded serial clock will change once per 16 bits transmitted. We shall transmitonly one data byte per frame; this is less efficient than PCI Express, but it will allowthe embedded clock to be extracted in an easily visible, orderly way. To serializeor deserialize the data, which must be processed one bit at a time, a PLL will bespecified which multiplies the parallel-clock frequency by 32.

We shall do our design on the assumption of a relatively low parallel clock fre-quency of 1 MHz; our serial line then should transmit at 32 Mb/s, around 1/100of the speed of either direction of a PCI Express lane. Because we require onepad byte for each data byte, a data packet containing one 32-bit word will oc-cupy 64 bits in the serial stream. Thus, one framed data word is 64 bits wide,like this:

64’bxxxxxxxx00011000xxxxxxxx00010000xxxxxxxx00001000xxxxxxxx00000000.

The ‘x’ bits represent values in data bytes; the pad byte values are shown as-is.This means that we should expect to transmit (and receive) one 32-bit word on

every other clock. Packets on a PCI Express bus may be as large as 128 bits, but weshall not adopt this kind of formatting in our design.

After we complete our design, we’ll see how fast it can be made to run by logicsynthesis and optimization with the gate-level libraries available for course use.

Figure 4.6 gives a block diagram of the data flow in our serializer.

Fig. 4.6 Dataflow of the planned serializer

70 4 Week 2 Class 2

Our deserializer just reverses the transformations of the serializer and is shownin Fig. 4.7.

Fig. 4.7 Dataflow of the planned deserializer

In a PCI Express or other similar serdes, the input clock on the receivingend often is extracted from the serial data stream and converted to a (analogue-approximate) square-wave for input as Clock to the PLL. The ClockOut created bythe PLL is controlled to match Clock precisely in phase, even if the PLL ClockOutis at a high multiple of the frequency of the input Clock.

Just as a matter of side interest, Fig. 4.8 gives an example of a real clock wave-shape for a modern memory chip.

Fig. 4.8 A good 400 MHz clock. (From photo taken at the LSI Logic exhibit, DenaliMemCon 2004)

This is a good clock running at 400 MHz; the vertical grids in Fig. 4.8 are about300 mV/division. The waveform is the jitter envelope of many thousands of cycles.The PCI Express serial link would be a two-wire differential pair running at a peak-to-peak voltage of about 1 V. It’s easy to see how analogue issues might arise, evenat only 400 MHz.

4.1.5 A 32 × Digital PLL

In our next lab, we shall write a simple but unsynthesizable verilog model of a PLLdesigned to multiply frequency by 32.

We are constrained to keep to digital design, in this course; so, as for the 1×PLL above, we shall substitute a frequency lock for a phase lock in our serdes PLL.

4.2 PLL Clock Lab 6 71

Digital synchronization will provide the phase lock. To save coding time, we shallnot bother with averaging for an improved lock. To cut down on comparator de-cisions in response to random phase misalignments, we shall generate an externalsampling pulse; the VFO will run free and will adjust its frequency only when thispulse occurs.

We can force this design to be acceptable to the synthesizer just to see whetherwe can create of a netlist, but that netlist will not be functional. Later in the course,we shall redesign this PLL so it will be correctly synthesizable to a working netlist.In the meantime, we can use this easily-written, unsynthesizable PLL to clock theserial line while we are working on other parts of our serdes design.

The functionality of our PLL is represented by the block diagram of Fig. 4.9,with n = 5:

Fig. 4.9 PLL block diagram, showing ClockIn ×2n frequency multiplication

A simple, resettable verilog 5-bit up-counter can be done this way:

reg[4:0] Count;always@(posedge Clock, posedge Reset)beginif (Reset==1’b1)

Count <= 5’h0;else Count <= Count + 5’h1;end

The lab instructions will include a schematic and further details.

4.2 PLL Clock Lab 6

Begin this lab by changing to the Lab06 directory.Lab ProcedureIn this lab, we’ll build a PLL and a related parallel-serial converter. Because a

logic synthesizer can’t synthesize delays (or delay differences), for synthesis pur-poses we normally would replace this PLL with a pre-synthesized hard macro oranalogue IP (“Intellectual Property”) block.

We shall introduce no #delay value anywhere in the PLL, other than testbench orclock-frequency delays.

72 4 Week 2 Class 2

To be systematic, we’ll break down the PLL into three blocks, as explained aboveand shown in Fig. 4.10: VFO, Comparator, and Counter.

Fig. 4.10 Schematic of verilog PLL, showing wire names. Reset nets omitted

Step 1. Start by creating a top-level module named PLLsim (sim = simu-lation). Instantiate in it three submodules: a VFO, a ClockComparator, and aMultiCounter. As usual, put each module in a separate file named for that mod-ule. Give the top module just three inputs, ClockIn, Reset, and Sample, andone output, ClockOut.

Declare module header ports to match the blocks shown in Fig. 4.10:

• For theClockComparator, declareClockIn,Reset, and CounterClockinput ports, and a two-bit AdjustFreq output port. We could use finer tuning,but two bits is enough for our purposes.

• For the VFO, declare a two-bit AdjustFreq input port, a SampleCmd and aReset input port, and a ClockOut output port.

• For the MultiCounter, declare Clock and Reset input ports and aCarryOut output port.

Connect the ports in PLLsim with wires, consistent with the block diagram ofFig. 4.10.

You might as well add a testbench module in PLLsim.v and instantiate PLLsimin it; having a testbench early in the process will allow you to try out your submod-ules as you complete them.

Step 2. Model the PLL VFO. We shall start with a simple verilog model ofa variable-frequency clock which adjusts itself in small increments using a two-bitcorrection flag on an input bus named AdjustFreq . A value of 1 on this busmeans no change; a value of 0 means reduce the frequency; a value of 2 or 3 meansincrease the frequency of the VFO.

The parallel-bus clock of 1 MHz implies a VFO base frequency of 32 MHz;so, the VFO period should be 31.25 ns, and the half-period (edge delay) thereforeshould be 15.625 ns. These delays are fractional, but we can use only integer types(integer or maybe reg); therefore, we can’t express anything less than 1 ns, so,we can expect our frequency control to be somewhat coarse.

In this preliminary, unsynthesizable PLL, we’ll sample frequency occasionally(to be determined by the serial data in) using an external signal, Sample. We’ll


design to be able to adjust our VFO by 1/16 of the half-period per sample, whichcomes to a little less than 1 ns per sample. So, on any Sample, if AdjustFreq >2’b01, we’ll decrease VFO period by about 1 ns; if AdjustFreq < 2’b01,we’ll increase VFO period by about 1 ns.

It is allowed in verilog to declare a parameter to be real and/or signed, whichwould be useful in a PLL testbench. But, many tools won’t let us use a parameterto store a real number, and declaring a real port would be a serious digitalimplementation problem, so, we’ll have to be content to control the design base-point operating frequency with defined macro constants. To accomplish this, and forflexibility in compiling individual modules separately, you should put these lines ina separate include file, PLLsim.inc:

‘timescale 1ns/100ps‘define HalfPeriod32BitBus 500.0 // ns half-period at 1 MHz.‘define VFOBaseDelay ‘HalfPeriod32BitBus/32.0 // At 32 MHz.‘define VFO DelayDelta 1 // ns.‘define VFO MaxDelta 2 // ns.

The last one is to prevent the PLL from running away or grinding to a halt: Useit in your VFO module to limit the frequency excursion from the base frequency.

Because defined macro constants can’t be changed during simulation (actu-ally, they are removed during simulator compilation and replaced by their val-ues), these constants will have to be used to initialize verilog variables. The val-ues above in PLLsim.inc will work whether the variables are synthesizableintegers or unsynthesizable reals. Later in the course, we’ll change this in-clude file to use different constant values for reals (simulation) than for integers(synthesis).

Next, in any design file using these definitions, add this line above the moduleheader:

‘include "PLLsim.inc"

The only files requiring this are the top-level one, PLLsim.v, which propagatesthe timescale to the rest of the design during simulation, and the VFO one, VFO.v.Notice two things about the include definitions: (a) Given the 32-bit to 1-bit de-sign itself, the values depend solely on the timescale and one assigned value,‘HalfPeriod32BitBus. (b) the values involved in division include decimalpoints: This is required in verilog to force the delays to be calculated as real num-bers. Without decimal points, the divisions and rounding would be done on integers,and the result would be less accurate in simulation. However, all variables involvedhave to be declared as integers for synthesis anyway, so the intermediate float cal-culations only have a small effect.

After this, in VFO declare the following variables (except VFO ClockOut) asinteger and add this reset block:

74 4 Week 2 Class 2

always@(Reset, SampleCmd, VFO ClockOut)if (Reset==1’b1)

beginBaseDelay = ‘VFOBaseDelay;VFO Delta = ‘VFO DelayDelta;VFO MaxDelta = ‘VFO MaxDelta;VFO Delay = ‘VFOBaseDelay;VFO ClockOut = 1’b0;end

else (below)

As a minor contribution to the lab exercise, complete the delay control of thisVFO design by supplying the contents of the else half of the above alwaysblock, which is given in part below; this else applies the delay adjustment anddetermines the VFO half-cycle delay:

else // as above.if (SampleCmd == 1’b1)beginif ( AdjustFreq>2’b01

&& (BaseDelay - VFO MaxDelta < VFO Delay) )// If floor is lower than current:VFO Delay = VFO Delay - VFO Delta;

else if ( AdjustFreq<2’b01&& (fill this in)

// else, leave VFO Delay alone.

Because of the VFO MaxDelta limits, we have replaced the earlier case state-ment with two levels of if’s.

To generate the PLL clock from the delay determined above, we complete themodel:

always@(Reset, SampleCmd, VFO ClockOut)if (Reset==1’b1)

... (as above) ...else begin

... (freq. adjustments) ...‘ifdef DC// No delayed nonblocking assignments:#VFO Delay VFO ClockOut = ∼VFO ClockOut;‘else#VFO Delay VFO ClockOut <=∼VFO ClockOut;‘endifend // main else.


The oscillation inversion in the last assignment above absolutely requires anonblocking assignment for simulation; however, the synthesizer rejects delayednonblocking assignments; whence the ‘ifdef (the macro DC always is definedwhen the synthesizer runs). Some demo versions of Silos will simulate the oscilla-tion if it was written with a blocking assignment, which actually is a verilog lan-guage error.

The reason the clock-generating always block is written to be sensitive toSampleCmd is because whenever such a command is asserted, we want the PLLclock to become synchronized to it.

After completing the VFO module, simulate it from PLLsim, putting a dummyclock temporarily into the comparator module to count 0 to 3 to trigger frequencyadjustment events. Just do a simple simulation to check that the delay programmingis working.

Step 3. Model the PLL comparator. The ClockComparator module, as infigure 3.10, compares an arbitrary input clock frequency to a variable clock fre-quency and issues an adjustment to make the variable frequency more closely matchthe input frequency.

In our design, as implied in this lab for the VFO model, the comparator willcompare the frequencies continuously, but the VFO will not make any adjust-ment unless a Sample command requires it. We need not supply our compara-tor a Sample command, but we must supply a Sample command periodically tothe VFO.

However, the ClockComparator must be supplied a reset to ensure its regis-ters are in a known state.

How will our comparator work? There are several different ways to do a purelydigital frequency comparison; we shall do it in this lab by counting variable-clockedges after every input-clock edge.

There will be three possible cases:

• If more than one variable-clock edge is found, the adjustment will be set to de-crease variable-clock frequency.

• If just one variable-clock edge is found, the adjustment will be set for no variable-clock change.

• If no variable-clock edge is found, the adjustment will be set to increase variable-clock frequency.

To do this, declare a 2-bit register named VarClockCount to count incom-ing PLL clock (CounterClock) edges. The width of 2 bits allows a value of 3to be counted, which exceeds the expected number of PLL clock edges per inputclock, assuming that both clocks have very close to the same frequency and dutycycle. We wish to avoid a VarClockCount counter overflow under all reasonableconditions, but using a 32-bit verilog integer seems excessive.

It then is possible to use the following in the ClockComparator module tocount the variable clock edges as below:

76 4 Week 2 Class 2

...always@( ClockIn, Reset ) // This is the synchronizing clock.

Beginif (Reset==1’b1)

beginAdjustFreq = 2’b01;VarClockCount = 2’b01;end

else // CounterClock is the clock to synchronize. Notice that it// is not on the sensitivity list; the inferred latch may be// expected to cause synthesis problems.if (CounterClock==1’b1)

VarClockCount = VarClockCount + 2’b01;else begin

case (VarClockCount)2’b00: AdjustFreq = 2’b11; // Better speed it up.2’b01: AdjustFreq = 2’b01; // Seems matched.

default: AdjustFreq = 2’b00; // Better slow it down.endcaseVarClockCount = 2’h00; // Initialize for next ClockIn edge.end

Notice that in this always block, we are monitoring every change of logic levelof the external clock. We do not trigger a reading of this block on the clock fromour PLL counter, because the PLL counter-clock is what we are using this alwaysblock to synchronize. It would be possible to use the VarClockCounter countdirectly as a control for the VFO, but the degree of freedom added by the caseencoding allows us to modify the ClockComparator somewhat without makingchanges in the VFO, too.

We count every time a ‘1’ is found for the PLL CounterClock when thesynchronizing clock has gone to ‘1’. Whenever we find CounterClock to be‘0’, we check the count we accumulated since the last ‘0’: If this count is 0, theCounterClock is assumed to be running too fast, so we request a speed-up; if itis 1, we seem to be approximately synchronized, and we do nothing; if it is above 1,the CounterClock is assumed to be running too slowly (its positive level wassampled twice), so we request a VFO slow-down.

Finally, complete the following lab Steps to determine the adjustment: TheseSteps simulate both VFO and ClockComparator together from a testbenchwhich controls PLLsim. Be sure to reset ClockComparator to ensure correctsimulation.

Step 4. Model the PLL Multiplier-Counter. This will be just a simple counterwhich toggles its overflow bit every 32nd clock. Thomas and Moorby (2002) de-scribes counter modelling in detail (e. g., sections 2.2–2.6, 6.5.3, 6.7); we shall notdwell on the architecture until a later chapter.


Typical verilog for a simple behavioral up-counter is as follows:

reg[HiBit:0] CountReg...always@(posedge ClockIn, posedge Reset)if (Reset==1’b1)

CountReg <= ’b0;else CountReg <= CountReg + 1’b1;

Note: A parameter is not allowed in verilog to specify the width of a literal. So,“BitHi’b1” would not be a legal increment expression. If we wrote, CountReg<= CountReg + 1, that would be OK, but it implies a 32-bit default integerincrement. So, we size the increment, assuming that widening 1 bit to the width ofCountReg would be easier on the compilers than narrowing 32 bits down to thatwidth. This is speculative and really probably makes no difference.

Defining the “carry” bit as the MSB in the counter reg, to get a carry out every32nd clock, our PLL’s counter has to be exactly 5 bits wide (25 = 32). So, install a5-bit counter in the MultiCounter module, and wire its highest-order bit to theCarryOut port. The carry bit will toggle every 16 clocks, giving it a period of 32clocks. Use a behavioral counter sensitive to posedge clock and posedge reset.

Step 5. Test the complete PLL. With the counter wired into the top-level, thePLL now should be complete and functional. Use the testbench to trigger the top-level Sample with a positive pulse just after each positive edge of ClockIn. Ver-ify the PLL by simulating from PLLsim (see Figs. 4.11 and 4.12). You should beable to change the frequency of ClockIn of Fig. 4.10 and see the CarryClockand the PLL ClockOut change (coarsely) to match it.

Fig. 4.11 Simulation of PLLsim. The “lock-in” of the VFO delay is coarse and extreme, but itdoes occur

Fig. 4.12 Closeup of the PLLsim serial clock

78 4 Week 2 Class 2

This completes our work on the PLL for now; the rest of this lab is on other, butrelated, topics.

Note on the PLL VFO Block. In back-end design, a hard macro is just a block withfixed size and pin locations – for example, an IP block. This isn’t a floorplanningor layout course, so we will not dwell on hard macroes. Later, to obtain a netlist,instead of substituting a macro, we can inhibit optimization of the VFO block usinga synthesis don’t-touch directive; or, we simply can tolerate a nonfunctional PLL inthe netlist.

Any hierarchical block instance below the top level of a design can be preservedfrom flattening or optimization by the synthesizer by marking it with a don’t touchcomment directive (use your synthesizer summary pdf on the CD-ROM). Instead ofa comment directive in the verilog, the don’t touch command may be included inyour synthesis script.

When the need for a dont touch arises as a synthesis convenience, locating itin the script may be preferable to inclusion in a comment, because a directive in thescript may be modified, or coordinated with other synthesis or optimization options,without editing the verilog in a design source file. However, when a dont touchinstance or net permanently is part of the design, locating the dont touch in acomment, making it part of the design, may be the better choice.

This completes the verilog 32x PLL. We shall use it forserdes simulation until we redesign it for synthesis.

Step 6. Model a generic parallel-serial converter.

Call this converter, ParToSerial; we’ll walk through the whole design.A parallel-serial converter requires knowledge of the width of the parallel bus. Theclock input may be just a serial clock, SerClock. Assume the parallel bus con-tents are properly clocked and synchronized externally, and that there is an inputflag, ParValid, to indicate when data on the parallel bus are stable and valid. SeeFig. 4.13.

Fig. 4.13 Generic, minimalparallel-serial converter

Assume that when ParValid goes to 1, the parallel data will remain valid untilall serial data are clocked out, but don’t clock out anything when ParValid is notasserted; also, don’t clock out anything that has been already clocked out, no matterwhether the parallel data are valid or not.

The serial protocol is simple: Clock out the data high-order bit (MSB) first, onebit per serial clock, setting a SerValidFlag when the first bit is on the serial bus,and clearing it after the last bit is on the serial bus.


We have to adhere to the rule that no data object should be assigned from morethan one always block; not only is this good design practice, but it is required forsynthesis. Because we must clock out by the serial clock, our one always blocktherefore must be sensitive to the serial clock.

A good plan is to use a Done flag to hold the state of the serialization. Thismakes the model an informal state machine: As soon as the ParValid is sampledasserted, move from a Done state to a not-Done state, and begin shifting out theserial data. If ParValid should be deasserted, or if the last parallel bit should havebeen processed, move back to a Done state. Remain in this Done state until a newParValid is sampled (this means sampling a deasserted ParValid before leav-ing Done). Otherwise worded, Done must be cleared by sampling of a deassertedParValid. We shall look into state machines later in the course.

Assuming a fixed parallel width of 32 bits, one way to do the verilog for thismodel is below.

module ParToSerial (output SerOut, SerValidFlag, input SerClock, ParValid, input[31:0] BusIn);

integer ix;reg SerValid, Done, SerBit;assign #1 SerValidFlag = SerValid;assign #2 SerOut = SerBit;always@(posedge SerClock)

begin // Reset everything unless ParValid:if (ParValid==1’b1)

if (SerValid==1’b1)beginSerBit <= BusIn[ix]; // Current serial bit.if (ix==0)

beginSerValid <= 1’b0;Done <= 1’b1;end

else ix <= ix - 1;end // SerValid was asserted.

else begin // No start yet:if (Done==1’b0)

beginSerValid <= 1’b1; // Flag start on next SerClock.ix <= 31;// Ready to start on next SerClock.end

SerBit <= 1’b0; // Serial bit default.end

else // ParValid not 1; reset everything:beginSerValid <= 1’b0;Done <= 1’b0;SerBit <= 1’b0; // Serial bit default.end // if ParValid else

end // alwaysendmodule // ParToSerial.

80 4 Week 2 Class 2

Finish this up by introducing a parameter to set the parallel-bus width, and bywriting a testbench. There is a copy of this model for you to modify in the Lab06directory, named ParToSerial unfinished.v.

For this lab exercise, code the parallel-serial conversion module with a verilogparameter (ANSI style) which allows the width of the parallel input bus to bevaried. Use the parameter inside the module, as well as in declarations, so that noth-ing is hard-coded for width. Make the default width 16 bits.

Also, change the declaration of ix to a reg declaration. Use a reasonable width,but also think about how you could make the reg width the minimum possible thatwill allow the model above to work correctly with a parameterized BusIn width.Simulate your model to verify that it works. Typical simulation results are shown inFigs. 4.14 and 4.15.

Fig. 4.14 Overview simulation of the generic parallel-to-serial converter

Fig. 4.15 The generic parallel-to-serial converter: Serial data closeup

Step 7. Serialization Frame Encoder. In the final part of this lab, we shall modela parallel-to-parallel encoder which takes a generic parallel bus and converts to awider bus which includes framing and frame boundary (pad) markers. This last for-mat then clearly can be serialized to clock our class-project serdes PLL as well asto be transmitted serially.

Refer to the block diagram of the Serializer presented at the start of this chapter.On each positive edge of a do-the-encode input clock, the parallel bus is sampledand copied to its framed format. The input bus may be assumed 32 bits wide and theframed output bus 64 bits wide, because we require that after each data byte on theinput bus, we insert 8 bits of frame padding. See Fig. 4.16.


Fig. 4.16 The serdes packet format. Each data byte is framed with one identifying pad byte

Recalling the frame format for our project, to identify each byte of data, eachframe boundary (8 bits) contains an ordinal number giving its place in the original32 bits of data.

It takes 2 bits to enumerate the 4 data (bytes) in a 32-bit bus, so we shall make upeach frame boundary to include a 2-bit number padded on each side by 3 binary 0’s.Our frame boundaries then will be these binary numbers: Boundary0 (below thelowest-order data byte) = 8’b000 00 000; boundary1 = 8’b000 01 000;boundary2 = 8’b000 10 000; and, boundary3 = 8’b000 11 000.

So, one sample of our framed data will look like this in binary, with x’s repre-senting the original input data:

64’bxxxxxxxx00011000xxxxxxxx00010000xxxxxxxx00001000xxxxxxxx00000000.

Design a module named SerFrameEnc which will encode an input bus thisway on every positive edge of a sampling-clock input. When writing your model,use verilog parameter values to specify the input and output bus widths. Aftersimulating it to check the result (see Fig. 4.17), try to synthesize your model, opti-mizing first for area and then for speed.

Fig. 4.17 A SerFrameEnc simulation, netlist optimized for speed


Where should ‘timescale be set in a multimodule design?How does verilog distinguish integer from float (real) numerical constants?How could the PLL comparator be modified to compare on every clock input,

rather than on every edge? How would this change the constants ‘define’d in thetop module, PLLsim?

82 4 Week 2 Class 2

The frame encoder seems very inefficient, using 64 bits to encode 32 bits of data.How might this encoding be made more efficient? At 2 or more Gb/s for a PCIExpress serdes, does it matter? We’ll look into this question again soon.


SerDes currently (2008) is a hot design topic. Read the overview posted at theFreescale site: http://www.freescale.com/webapp/sps/site/overview.jsp?nodeId=01HGpJ2350NbkQ (2004-11-16).

For a simple introduction to PLL’s, with some history, read Ron Bertrand’s, “TheBasics of PLL Frequency Synthesis”, in the Online Radio and Electronics Course, athttp://www.radioelectronicschool.com/reading/pll.pdf(2004-12-09).

(optional) For some background on the analogue issues, try the article by S. Seat,“Gearing Up Serdes for High-Speed Operation”, posted at http://www.commsdesign.com/design corner/showArticle.jhtml?articleID=16504769 (2004-11-16).

(optional) Two very thorough articles on serdes analogue design considera-tions, especially those concerning the PLL, are by E. H. Suckow: “Basics ofHigh-Performance SerDes Design: Part I” at http://www.analogzone.com/iot 0414.pdf; and, Part II at http://www.analogzone.com/iot 0428.pdf (2004-11-16).

(optional) A very good presentation of PCI Express as it relates to system ar-chitecture was given by Kevin Edwards at EuroDesign Con 2004: “PCI Express –IP for a Next-Generation I/O Interconnect”. This paper is very IP and standardsoriented. Available from Mentor Graphics on free registration at: http://www.techonline.com/community/tech group/embedded/tech paper/37455 (2005-06-11).

(optional) The S. Knowlton paper listed in the References covers the PCI Expresssystem architecture and the analogue speed issues but is available only to Synopsystool licensees. Knowlton explains the individual PCI Express system components ina way very compatible with our class project. He also gives some detail on packetbuffering, the retry and ECC functions, and the specific PCIe component called theserdes, which is in the analogue domain of the PCIe standard.


5.1 Data Storage and Verilog Arrays

This time we’ll study data storage and integrity, and the use of verilog arrays tomodel memory.

5.1.1 Memory: Hardware and Software Description

Memory retains data or other information for future use. In a hardware devicecontext, memory may be divided into two main categories, random-access andsequential-access.

Random access memory (RAM) is most familiar as the storage in the memorychips used in computers. Any storage location can be addressed the same way, andby the same hardware process, as any other storage location. Every addressed datumis equally far from every other one, so far as access time is concerned. EPROM,DRAM, SRAM, and flash RAM or ROM are familiar names for implementations ofthis kind of memory.

Sequential-access memory includes tapes, floppy discs or hard discs, and opticaldiscs such as CD or DVD discs. Sequential access requires a process or procedurewhich varies with the address at which the storage is located. For example, a tapemay have to be rewound to reach other data after accessing data near the end ofthe tape; or, the read/write head of a disc may have to be repositioned and movedifferent distances when it changes from one address to another.

In the present course, we shall not be dealing with sequential-access mem-ory models; they have no existence in the verilog language. However, random-access memories do. Verilog has an array construct specifically intended to modelRAM.

Side issue: The description of RAM chip capacity on data sheets and in the litera-ture varies somewhat and can be confusing. A hardware databook often will describethe capacity by giving the total number of bits addressable (excluding error correct-ing or parity storage), and will give the word width, thus: (Storage in bits) × (word


84 5 Week 3 Class 1

width). So, a “256k × 16” RAM chip would store 256k = 256×1024 = (28×210) =218 bits. To determine how many address locations in this chip, one divides by theword width: Clearly, in this example, there would have to be an 18-bit address busto address by 1-bit words. This means (218/23) = 15 bits of address to address bybyte, or 14 address bits to address by the hardware-defined 16-bit word width forthis chip.

Hardware manufacturers use this method when these chips are designed for nar-row 1-bit or 4-bit words and are intended to be wired in parallel to match the com-puter word size. For example, 8 256 k × 1 DRAM chips would be required for amemory of 256k bytes. Or, 9 would be required for 256k of memory with parity.No arithmetic is necessary to know the final number of addresses; 32 of these chipswould be required for 1 MB (megabyte) of memory.

In Palnitkar (2003), there is a DRAM memory model in Appendix F. The authoruses a software description of this memory: It is given as (storage in words) × (wordwidth). This memory, interpreted as a single DRAM chip, stores (256k × 16) bits =(256× 1024)× (16) = 2(8+10)+4 = 222 = 4Mb (megabits), recalling that 1Mb =1024×1024 = 220 bits. To model a memory accurately, it is necessary to understandwhat the description is saying.

5.1.2 Verilog Arrays

We have worked with verilog vectors up until now; these objects are declaredby a range immediately following the (predefined verilog) type: For example,reg[15:0] is used to declare a 16-bit vector for storage.

A verilog array also is defined by a range expression; however, the array rangefollows the name being declared. Historically, the array range is kept separate fromthe vector range precisely because it was intended that an array of vector ob-jects should be used as a memory. For example, “reg[15:0] WideMemory[1023:0];” declares a memory named WideMemory which has an addressrange (array) totalling 1024 locations; each such location stores a 16-bit reg ob-ject (word).

Notice that the upper array index is not a bit position; it is the number of stor-age locations in the memory minus 1 if the lower index is 0. The address bus forWideMemory would be 10 bits wide.

The general syntax to declare a verilog array thus is:

reg [vector log indices] Memory Name[array location indices];

A signed reg type, such as integer, also might be used, but this would be rare.An example of memory addressing is,

5.1 Data Storage and Verilog Arrays 85

reg[7:0] Memory[HiAddr:0]; // HiAddr is a parameter >= 22.reg[7:0] ByteRegister;reg[15:0] WordRegister; // This vector is 16 bits wide....ByteRegister <= Memory[12]; // Entire memory word = 1 byte.WordRegister <= Memory[20]; // Low-order byte from the memory word.WordRegister[15:8] <= Memory[22]; // High-order byte from the memory word....

Like hardware RAM, verilog memory historically was limited in the resolutionof its addressability. A CPU can only address one word at a time, and when it does,it gets the whole word, not just a single bit or a part of the stored word. It used to beso in verilog: A memory datum ( = array object) could not be accessed by part orbit, unless the words it stored were just one bit wide. This is not true any more afterverilog-2001.

For example, suppose a memory word size was 64 bits, but the system wordwidth was 32 bits. Then, the following code would be legal in verilog-2001 andverilog-2005:

reg[63:0] Memory[HiAddr:0]; // HiAddr is a parameter > 56.reg[7:0] ByteRegister;reg[31:0] WordRegister; // This vector is 32 bits wide....ByteRegister <= Memory[57]; // Entire memory word (truncated).ByteRegister <= Memory[50][15:8]; // 2nd byte from a memory word.Memory[56][63:32] <= WordRegister; // To the high half of memory word 56....

To declare a mamory, vector and array sizes are given with ranges separated, butthe resulting objects are referenced with ranges all following the object name. Thememory address is immediately next to the declared name and references an entirearray of bits of some kind; selects follow to the right of the memory location, as inselecting from a vector.

Verilog (verilog-2001) allows multidimensional arrays. For example,

reg[7:0] MemByByte[3:0][1023:0];

declares an object interpretable as a memory storing 1024 32-bit objects, each suchobject being addressable as any of 4 bytes. Or, it might be interpreted as storing4096 8-bit objects arranged the same way. So, “ByteReg <= MemByByte[3][121];” may be used as though reading the high byte stored at location 121. The(“[3][121]”) is an address, not a select, so the declared order of the indices isused. Also, because these are addresses, variables are allowed in the address indexexpressions. Variables are not allowed in part-selects anywhere in verilog; they areallowed in vector bit-selects.

86 5 Week 3 Class 1

In a multidimensional array, any number of dimensions is allowed, but onlyrarely would more then three be useful. It is possible to reduce the required di-mensionality by one by using a part-select on the addressed word; of course, sucha part-select would be legal only if in constant indices (literals, parameters, or con-stant expressions entirely of them).

For example,

reg[7:0] Buf8;

reg[7:0] MemByByte[3:0][1023:0]; // 2-D (call byte 3 the high-order byte).

reg[31:0] MemByWord[1023:0]; // 1-D.

integer i, j;

...

i = 3;

Buf8 <= MemByByte[i][j]; // High-order byte (3,j) stored in Buf8.

Buf8 <= MemByWord[j]; // Low-order byte stored.

Buf8 <= MemByWord[j][31:24]; // Part-select; high-order byte stored.

Buf8 <= MemByWord[j][(i*8)-1:(i-1)*8]; // ILLEGAL! i is a variable!.

...

Thomas and Moorby (2002) discusses multidimensional arrays in appen-dix E.2.

Finally, it is not legal to access more than one memory storage location in a singleexpression: For reg[7:0] Memory[255:0], the reference, “HugeRegister<= Memory[57:56];” is not legal, nor, for “reg[31:0] MemByByte[1023:0];”, would be “MyIllegalByte <= MemByByte[121:122][31:28];”, which would seem to cross address boundaries to get 4 bits fromeach of two different addresses. Only one address per read or write is allowed; but,like a bit-select of a plain vector, an address may be given by a variable.

This last implies that an array object never may be assigned directly; it has to beaccessed, possibly in a loop, one address at a time.

Verilog memory access currently is associated with these limitations:

• Only one array location is addressable at a time.• Part-select and bit-select by constant are legal after verilog-2001, but implemen-

tation by tools is spotty.• Part-select or bit-select by variable is not allowed.• Neither VCS nor Silos (demo version) can display a memory storage waveform;

however, QuestaSim and Aldec can.

Thus, currently, it is best to access memory data by addressing a memory lo-cation and assigning the value to a vector; this vector value then can be dis-played as a waveform and may be subjected to constant part-select or variablebit-select as desired. This approach is portable among simulators and synthesizers.For example,


parameter HiBit = 31;reg[HiBit:0] temp; // The vector.reg[HiBit:0] Storage[1023:0]; // The memory.reg[3:0] BitNo; // Assigned elsewhere....temp = Storage[Addr];HiPart = temp[HiBit:(HiBit+1)/2]; // A parameter is a constant.LoPart = temp[((HiBit+1)/2)-1:0];HiBit = temp[BitNo]; // Bit-select by variable is allowed....

5.1.3 A Simple RAM Model

All that is necessary is a verilog memory for storage, an address, and control overread and write. For example,

module RAM (output[7:0] Obus, input[7:0] Ibus, input[3:0] Adr, input Clk, Read);

reg[7:0] Storage[15:0];reg[7:0] ObusReg;//assign #1 Obus = ObusReg;//always@(posedge Clk)if (Read==1’b0)

Storage[Adr] <= Ibus;else ObusReg <= Storage[Adr];endmodule

5.1.4 Verilog Concatenation

At this point, it may be useful to introduce one verilog construct we have not yet dis-cussed: Concatenation. To concatenate one or more bits onto an existing vector, theconcatenation operator, “{. . .}” may be used in lieu of declaring explicitly a widervector and assigning to it by part select or bit select. All this does is save declara-tions of temporary data; for permanent storage, the concatenated result would haveto be copied to a wide-enough vector somewhere.

88 5 Week 3 Class 1

For example, to concatenate a parity bit to the MSB end of a 9-bit data storagelocation,

reg[7:0] DataByte; // The 8 bit datum, without parity.reg[8:0] StoredDataByte; // High bit will be 9th (parity) bit....StoredDataByte <= {ˆDataByte, DataByte}; // A 9-bit expression.

Likewise, two bytes stored in variables could be concatenated by Word <={HiByte, LoByte};.

5.1.5 Memory Data Integrity

This topic is a huge and complex one and is an active research subject. We shall codeno more than parity checking in this course, but we shall introduce the principlesbehind error-checking and correction (ECC).

The problem is that hardware can fail at random because of intrinsic defects, orbecause of outside influences such as RF or nuclear radiation. We cannot addressthese failures in terms of verilog, but some background may be found in the supple-mentary readings referenced at the beginning of this book.

Error checking usually is done by computing parity for each storage location,by calculating various checksums, or by more elaborate schemes involving encodedparameters sensitive to the specific values of the data bits: If a bit changes from thevalue originally stored for it, a check is supposed to detect this and warn the useror initiate a correction process. The basic principle is to store the check parametersomehow so that hardware failures will change only the check or the data, but notboth in a way masking the failure.

Parity checking is commonplace for RAM: The number of binary ‘1’ values (or,maybe ‘0’ values) is counted at each address, and an extra bit is allocated to eachaddress which is set either to ‘1’ or ‘0’ depending on the count.

The parity bit makes the total number of ‘1’ (or ‘0’) values always an evennumber (for “even parity”) or always odd (“odd parity”). It usually doesn’t matterwhether even or odd is used, or whether ‘1’ or ‘0’ is counted, so we shall from hereon speak only of even parity on ‘1’. In this scheme, an even number of 1’s makesthe sum, the parity bit value, even – in other words, a 0. So, a byte now takes 9 bits(8 data + 1 parity) of storage, but any change in a bit always is detected; changes in2 bits in the same 9-bit byte will be missed.

A sum is just an xor if we ignore the carry, so parity may be computed by theverilog xor (ˆ) operator. For example,


reg[HiBit:0] DataVector;reg[HiBit+1:0] DataWithParity;...// Compute and store parity value:DataWithParity = {ˆDataVector, DataVector};// Check parity on read:DataVector = (ˆDataWithParity==1’b0)

? DataWithParity // Parity bit is discarded.: ’b0; // Assign zero on parity error.

Parity checking is adopted because it is fast; it incurs no speed cost when donein hardware; and, it is easy, because a simple xor reduction (verilog ˆ) of any setof bits yields the binary sum automatically.

Computing parity on every access to any memory location thus is modelledeasily.

Checksums usually are used with large data objects such as framed serial dataor files stored on disc. The checksum often is just the sum of all ‘1’ values (orsometimes bytes) in that object; but, unlike parity, it may be a full sum, not just abinary bit. Any change in the data which changes the sum indicates an error. If bitsare summed, a change from ASCII ‘a’ to ‘b’ in a word will flag an error; if bytesare summed, a missing ASCII character in a file will flag an error.

Unlike parity values, checksums generally are stored in a conceptually separatelocation from the data they check. They may be used for simple error detection orfor Error Checking and Correcting (ECC) code.

The checksum may be calculated any of a variety of ways:

• As a sum of bytes, frames, or packets.• As a sum of bits.• As a sum of ‘1’ or ‘0’ bits (= parity, if no carry).• As an encoded sum of some kind. For example, as a CRC (Cyclic Redundancy

Check). A Linear Feedback Shift Register (LFSR) is one way to implement CRCin hardware.

To reduce the likelihood of missing a change in, say, two bits or bytes betweenchecks, elaborate partial encodings of stored data are used. For example, for seriallytransferred data, a linear feedback shift register (LFSR) can be used to compute achecksum-like representative number for each such object. The LFSR computes arunning xor on the current contents of certain of its bits with a value fed back froma subsequent bit. See this idea in Fig. 5.1.

Fig. 5.1 Three stages of a generic LFSR, showing xor of fed-back data

90 5 Week 3 Class 1

Every time the object is accessed, it is shifted through this register, and the resultmay be compared against a saved value. The shift incurs a latency but no other delay.

Thomas and Moorby (2002) discuss CRC rationale in greater detail. An examplethey give is represented schematically in Fig. 5.2.

Fig. 5.2 LFSR characteristic polynomial in hardware. The value of Q[15:0] represents the CRCdefined in Thomas and Moorby (2002) section 11.2.5 as x16 + x12 + x5 + 1. Modulo division ofvalid stored data by this polynomial will return 0. The common clock to all bits, and the outputtaps, are omitted for clarity

5.1.6 Error Checking and Correcting (ECC)

ECC not only checks for errors, but it corrects them, within limits. All ECC pro-cesses can find and correct a single error in a data object, such as a memory storagelocation. Some can detect two or more errors and can fix more than one. All cor-rections are made by the hardware and generally incur little or no time cost. Almostall commercial computer RAM chips include builtin ECC functions. Hard discs andoptical discs always include ECC.

Basically, the ECC idea depends on a checksum: A representation of the data isstored separate from it; the data are compared with this representation whenever thedata are accessed; an error is localized to the bit which was changed, and the data ac-tually seen by the computer or other device are changed to the original, correct value.

Some of the Additional Study reading will explain the details, but a brief, con-ceptual introduction will be presented now: We shall show how to do ECC usingparity:

ECC from parity. Consider a parity bit pT representing the total, 8 bits of data,and suppose just one bit could go bad. Assume a specific parity criterion, such aseven-1. If a parity check failed, all that could be done would be to recognize that thisspecific data object (8+1 bits) was bad; no correction could be made except perhapsto avoid using the data.

But, suppose two parity bits had been computed, one (p1) for the low nybble (bits0–3) and the other (pT) for p1 and the other 8 bits. Then, if pT failed a check, but p1didn’t, the apparatus could proceed on the assumption that the error was localizedin bits 4–7 or in the pT bit. If both p1 and pT failed, an error must have occurredin bits 0–3, or p1, but not pT, If p1 failed but not pT, one could be sure at least twoerrors had occurred, which we have agreed not to consider in this example.


With three parity bits, p1 and pT as before, and a new pE calculated from alleven bits (0, 2, 4, 6), one could narrow down the error to two of the four bits in onenybble, and with fourth parity bit pL on the low half of each nybble, the erroneousbit could be identified unambiguously. The correction then would be just to flip thecurrent value of that bit and go on using the resultant, corrected datum just as thoughthere had been no error. Cost: 12 bits to represent 8 bits of data; in this simple case,a 50% overhead in size, but no cost in speed.

The process just described may be summarized this way for one byte:

Assume only one hardware failure per word, and reg[7:0] Word;

1. Define pT = ˆWord[7:0];pT toggles if any bit in Word changes; system can detect this. 8’bxxxxxxxx

2. Define low-nybble pN = ˆWord[3:0];pN and pT toggle if any bit in low nybble changes.pT toggles if any bit in Word[7:4] changes.Thus, system can determine which half of Word is reliable. 8’bxxxxxxxx

3. Define even pE = ˆ{Word[6],Word[4],Word[2],Word[0]};System can determine whether odd or even bits,

of which half, are reliable. 8’bxxxxxxxx4. Define low half-nybble pL = ˆ{Word[5:4], Word[1:0]};

Using pT, pN, pE, and pL, system can determine which bit changed andflip it back during a read. 8’bxxxxxxxx→ 8’bxxxxxxx-xThis ECC costs 4 extra bits per 8 data bits.

Realistic ECC. Usually, data objects larger than one byte are adopted for ECC;and, for them, the size overhead can be a smaller percentage of the data to bechecked. Instead of a binary search in a parity tree to recover single errors, it isstatistically more efficient to use a finite element approach in which parity is re-placed by an overdetermining set of pattern coefficients. This is done almost alwaysby using LFSR hardware and applying algebraic field theory to encode regularitiesof the stored data in the checksums. Multiple bit errors in a large block of data can berecovered somewhat independently of where the errors occur in the block, and thechecksum overhead for practical ECC of a 512-byte block of data can be less than64 bytes. This overhead is about equal to that of simple, byte-wise parity checkingwithout correction!

To illustrate the mechanics of a realistic ECC process, suppose we sidestep thecomplications of group theory and adopt a minimally complex method, again basedon parity: We shall take an 8-bit byte of data and append to it a checksum composedof a vector of 8 parity bits composed as follows:

The 8 data bits will be written MSB on the left. Concatenated to the right of the data LSBwill be a 8-bit checksum. The leftmost checksum bit will be calculated as the parity (evenparity on ‘1’) of the whole data byte; the next checksum bit will be calculated as parity ofthe 7 bits of data with the MSB excluded. The next checksum bit will be parity of the leastsignificant 6 data bits, and so on down to the 8th checksum bit, which will be equal to theLSB of the data.

92 5 Week 3 Class 1

A simple hardware implementation of this method could be a LFSR with onestorage element fed back on itself as shown in Fig. 5.3.

Fig. 5.3 A minimally complicated LFSR permitting ECC

To operate this LFSR, one initializes it with 16’b0 and shifts in the data LSBfirst, twice (16 shifts). The result will be the desired pattern of xor’s in the rightmost8 bits, and a copy of the data in the leftmost 8 bits. The result could be transmittedserially with just a latency penalty; it also could be offloaded onto a parallel bus fordirect memory storage.

To see how the ECC might work, suppose the data byte was 1010 1011; check-summed, this word would become, 1010 1011 1001 1001.

Now, suppose that a 1-bit error occurred during serial transmission; for example,suppose the data LSB flipped, making the received word, 1010 1010 1001 1001.

The Rx would calculate the checksum of the received word to be 0110 0110,clearly a gross mismatch to the received checksum. It would be unreasonableto consider the possibility that the checksum could contain an error of so manybits, making it so distant from the one calculated from the received data. Avoid-ing a closed-form solution in this example, the Rx could formulate 8 hypothesesto correct the data by calculating all possible checksums with 1 data bit flippedin each:

The 8 possible 1-bit corrections to a received word of 1010 1010 1001 1001:

Hypothesis CorrectedData

ComputedChecksum

h0 0010 1010 1110 0110h1 1110 1010 1010 0110h2 1000 1010 1000 0110h3 1011 1010 1001 0110h4 1010 0010 1001 1110h5 1010 1110 1001 1010h6 1010 1000 1001 1000h7 1010 1011 1001 1001

Hypothesis h7 generates the received checksum; so, our ECC should flip the dataLSB to correct it.

Now let us assume two errors, say the data MSB and the data LSB. Again, we donot allow that the received checksum could contain so many errors.


The 8 possible 1-bit corrections to a received word of 0010 1010 1001 1001:

Hypothesis CorrectedData

ComputedChecksum

h0’ 1010 1010 0110 0110h1’ 0110 1010 0010 0110h2’ 0000 1010 0000 0110h3’ 0011 1010 0001 0110h4’ 0010 0010 0001 1110h5’ 0010 1110 0001 1010h6’ 0010 1000 0001 1000h7’ 0010 1011 0001 1001

In this case, no 1-bit correction to the data yields the received checksum; how-ever, h7’ yields a checksum very close (only 1 bit away). It would be reasonable toaccept h7’, flip the LSB, and then try 8 more hypotheses for a second correction; thiswould result in a 2-bit ECC which would correct both data errors. In actual practice,the distances are quantified and minimized in closed form in the algebra of Galoisfields, but this simple example shows the basic properties of a checksum permittingmultibit ECC.

For more information on the algorithms and computational details of ECC check-sum encoding, see the Cipra and the Wallace articles in the References.

5.1.7 Parity for SerDes Frame Boundaries

A simple parity value might be used to improve greatly the efficiency of our plannedserdes serial data framing. However, we shall not use it in this course. We are in-terested in design in verilog, and our inefficient but obvious 64-bit packet makes itboth easy and instructive to recognize verilog design errors during simulation. Wedo not wish to obscure a possible design error to ensure hardware we never intendto manufacture.

However, let’s ignore our own project once more, for the moment. Consider thefollowing way of determining clock synchronization of a local PLL clock with theembedded clock in the serial data. Instead of padding the data with 8 bits of encodedorder information per byte, as we shall do in our project, suppose we added just aparity bit to each datum, extending it to 9 bits per byte. Then, a packet of 32 bits ofour serialized data will look something like this:

36’bXxxxxxxxPXxxxxxxxPXxxxxxxxPXxxxxxxxP

The parity for each byte follows that byte, in the sense that we are assuming thatthe MSB is sent first over the serial line. Each byte’s MSB is represented by an

94 5 Week 3 Class 1

upper-case X, and the parity by P. Compare this with the Step 8 representation inour previous lab. With underscores to emphasize byte boundaries, we may write,

36’bXxxxxxxxP XxxxxxxxP XxxxxxxxP XxxxxxxxP

Now, suppose we try to synchronize a PLL clock with a stream of such frames:If we know we are on a byte boundary and are worried about a 1-bit jitter, we cancalculate parity: If we should shift by a bit, the parity might change, and maybe wecould adjust our PLL to resynchronize. If we are using even-1 parity, a ‘1’ in thewrong frame will trigger a parity error, but a ‘0’ won’t. Detection of a framing errorthen would be about 50% accurate.

But, this isn’t good enough: We want reliable synchronization. So, assumingeven-1 parity, let’s guarantee that a parity error will occur on a framing error, atleast in one direction. We simply add another new bit, a trigger bit, which always isthe inverse of one of the bits in the frame, to the end of every frame.

Because even-1 parity always implies that the xor of the parity bit and its wordmust be 0, the receiver’s (Rx) parity hardware will verify that the parity of the ex-pression, (word xor parity bit) always is 0. So, if the MSB of the word is toggled,this must toggle the parity bit, or a parity error will occur.

So, let’s add our new trigger bit but position it in the frame at position (MSB-9)and require that it always will be ignored in the transmitter’s (Tx) parity calcula-tion. Our data packet now consists of 10 bits to represent every 8 bits of data, withparity bit P following the LSB, and an extra bit, our trigger bit, indicated by under-lined lower-case x, following the parity bit and set to the inverse of the MSB of thepreceding data byte. The trigger bit never is counted in the Tx’s calculation of thevalue of P:

40’bxXxxxxxxxPxXxxxxxxxPxXxxxxxxPxXxxxxxxxP.

The MSB of each byte is capitalized, X. The Tx’s parity bit is bit P; each P endsa 10-bit data frame. Each x is a trigger bit; the x following the first P from the left,for example, is set to the inverse of the first X from the left.

Now we have achieved some progress: The correct framing would be the follow-ing, with underscores to indicate the receiver’s (Rx) detected frame boundaries. Thefirst bit in each frame is the x inverted MSB X value and is ignored for Rx parity:

40’bxXxxxxxxxPxXxxxxxxxPxXxxxxxxxPxXxxxxxxxP.

A 1-bit Rx framing error lagging would be this,

40’bxXxxxxxxxPxXxxxxxxxPxXxxxxxxxPxXxxxxxxxP.

and, leading, it would be this,

40’bxXxxxxxxx PxXxxxxxxx PxXxxxxxxx PxXxxxxxxx P.

The lagging error clearly causes the Rx to ignore the real (Tx) MSB and replace itby its inverse; this forces incorrect parity and guarantees an error which always will

5.2 Memory Lab 7 95

be detected. The leading error can be detected reliably some of the time, wheneverthe transmitted-data LSB happens not to equal the parity value of the now-garbleddata (which has x in the place of its MSB). The resultant raw framing-error detectionrate then should be expected to be about 75%.

A PLL biased very slightly to lag exact synchronization thus can be designed toachieve a very low rate of undetected 1-bit framing errors. The approach abovewould allow a designer to adjust a receiving PLL reliably with data in a frameno larger than 10 bits per byte. The ratio of 10 bits per byte is the assump-tion usually made in actual PCI Express designs, a numerical coincidence be-cause we have ignored many serialization complexities, such as Manchesterencoding.

5.2 Memory Lab 7

Lab ProcedureWork in your Lab07 directory.

Step 1. Try the following memory access statements. Initialize the RHS vari-ables with literal constants, and then see which ones work:

reg[63:0] WordReg;reg[07:0] ByteReg;reg[15:0] DByteReg;reg[63:0] BigMem[255:0];reg[3:0] LilMem[255:0];...BigMem[31] <= WordReg;WordReg <= BigMem;LilMem[127:126] <= ByteReg;LilMem <= ByteReg[3:0];DByteReg <= ByteReg;ByteReg <= DByteReg + BigMem[31];WordReg[12:0] <= BigMem[12:0][0];

Step 2. Design a verilog 1k×32 static RAM model (32×32 bits) with parity.Call the module, “Mem1k x 32”. Check this model by simulation as you do yourwork on it.

This RAM will require a 5-bit address bus input; however, use verilogparameters for address size and total addressable storage, so that quickly, bychanging one parameter value, you could modify your design to have a workingmodel with more or fewer words of storage. Parity bits are not addressable and arenot visible outside the chip.

96 5 Week 3 Class 1

You may model your RAM after the Simple RAM Model given preceding pre-sentation. Use just one always block for read and write; but, of course, the modulewill have to be considerably more complicated than the Simple RAM.

Fig. 5.4 The Mem1k x 32 RAM schematic

Use two 32-bit data ports, one for read and the other for write. Supply a clock;also an asynchronous chip enable which causes all data outputs (read port) to go to‘z’ when it is not asserted, but which has no effect on stored data. The clock has noeffect while chip enable is not asserted.

Supply two direction-control inputs, one for read and the other for write. Changeson read or write have no effect until a positive edge of the clock occurs. If neitherread nor write is asserted, the previously read values continue to drive the read port;if both are asserted, a read takes place but data may not be valid.

Assign reasonable time delays, using delayed continuous assignments to theoutputs. Supply a data ready output pin to be used by external devices requir-ing assurance that a read is taking place, and that data which is read out isstable and valid. Don’t worry about the case in which a read is asserted con-tinuously and the address changes about the same time as the clock: Assumethat a system using your RAM will supply address changes consistent with itsspecifications.

Also supply a parity error output which goes high when a parity error has beendetected during a read and remains high until an input address is read again. Referto the block diagram in Fig. 5.4.

Because this is a static RAM, of course omit DRAM features such as ras, cas andrefresh. Design for flip-flops and not latches. Put the model in a file named after themodule.

Include an assertion to announce parity violations to the simulator screen. Ofcourse, your simulation model can’t possibly experience a hardware failure, but thismessage may tell you if you make a design error with the parity bit. You can force

5.2 Memory Lab 7 97

an error by putting a temporary blocking assignment in your model to confuse thexor producing the parity value.

Step 3. Check your RAM. Write data to an address and simulate to verify thatit is read correctly (see Fig. 5.5).

Fig. 5.5 Cursory simulation of single-port Mem1kx32 with separate read and write ports

Step 4. After completing the previous step and doing any simulation neces-sary to verify your design superficially, add a for loop in a testbench initialblock to write a data pattern (e. g., an up-count by 3) into the memory at everyaddress and then to display the stored value. Use $display() for your display.Pay special attention to the “corner cases” at address 0 and address 31. Your paritybit would be in bit 32 at each address. An example of the loop is given just below.Notice how to address a data object (“MemStorage”) in an instance, here namedMem1kx32 inst, in the current testbench module:

for (...)begin...#1 DbusIn = (some data depending on loop);#1 Write = 1’b1;#10 Write = 1’b0;

SomeReg = Mem1kx32 inst.MemStorage[j];$display(‘‘...’’, $time, addr, SomeReg[31:0], SomeReg[32]);

end

Step 5. Modify your RAM design so it has just one bidirectional data port, Dothis by copying your working model (above) into a new file named“Mem1kx32Bidir.v”. Then, declare a new, empty module in this file, namedafter the file. Use the exact same new module ports as in the old Mem1kx32 model,except for only one inout data port. See Fig. 5.6.

98 5 Week 3 Class 1

Fig. 5.6 Schematic of wrapper to provide Mem1kx32 with a bidirectional data bus

Instantiate your old RAM in the new Mem1kx32Bidir. Connect everything1-to-1, but leave the data unconnected.

All you have to do now to complete the connection is to add a continuous as-signment in the wrapper module which turns off the DataO driver when Read isnot asserted. Also, wire DataI to the new DataIO driver as shown above. Verifyyour new RAM by a simulation that does write and then read from two successiveaddresses, then reads again from the first address (see Fig. 5.7).

Fig. 5.7 Cursory simulation of single-port Mem1kx32Bidir with bidirectional read-write port

If this were part of a much larger design project, you would separate the bidi-rectional and original RAM modules into different files. However, this exercise issimpler if you allow yourself to keep as many as three modules in one file: Theoriginal RAM model, the new bidirectional “wrapper” for that RAM model, and thetestbench.

Step 6. Synthesize your bidirectional-data bus memory design, optimize forarea, and examine the resulting netlist in a text editor. Resynthesize for speed andexamine the netlist again.

5.2 Memory Lab 7 99


Concatenation: When can it be useful?How are hierarchical references made to module instances?What’s the benefit of a bidirectional data bus?How might the spec for the Dready flag be improved? What about when a read

remains asserted while the address changes? Shouldn’t the RAM be responsible forall periods during which its output can’t be predicted?

How would one change the RAM flip-flops to latches?Do we really need a ChipEna? Why not disable outputs except when a read was

asserted?


Read Thomas and Moorby (2002), Section 5.1, on verilog rules for connection toports.

Read Thomas and Moorby (2002) section 6.2.4, pp. 166 ff. to see how our parityapproach can be adapted easily to a Hamming Code ECC format (at 12 bits perbyte).

Read Thomas and Moorby (2002) appendix E.1 and E.2 on vectors, arrays, andmultidimensional arrays.

(optional) Read “The Laws of Cryptography: The Hamming Code for Error Cor-rection”, by Neal R. Wagner, a 2002 web site posting at http://www.cs.utsa.edu/∼wagner/laws/hamming.html (2004-12-15). This is a brief and verynice treatment of ECC, extending and improving the present coverage. Unhappily,the posting is flagged by the author as “obsolete”; it will become part of a bookwhich is downloadable from the web site but which mostly is irrelevant to thiscourse.


Section 4.2.3 discusses the verilog rules for connection to ports.Look through the verilog of the behavioral DRAM memory model in Appendix F

(pp. 434 ff.). It uses several verilog language features we haven’t yet mentioned,so you may wish to put it aside for a few weeks. It may not work with the Silossimulator on the CD.


6.1 Counter Types and Structures

6.1.1 Introduction to Counters

A counter is a data object which represents a value that is incremented ordecremented in uniform steps. So, a counter may contain successive values of 0,1, 2 . . .; or, 2, 4, 6, . . ., etc. A 1-bit counter alternates between ‘0’ and ‘1’. A n-bit register for a binary, unsigned up-counter would count as shown in Fig. 6.1, insuccessive count values starting from 0 on the top row (first register state):

Fig. 6.1 Binary up-count register with MSB on left and successive count values Between 1 and nbits switch on each count

The above storage of the count value is very compact spatially, n bits for 2n

different values, but every carry to a new bit causes simultaneous switching of sev-eral, often very many, bits at a time. This carrying and bit-switching can cause set-tling delays to vary from clock to clock; all the switching may generate on-chipcross-talk noise or power supply glitches, causing errors in the counter or in nearbydevices.


102 6 Week 3 Class 2

Values of, say 1, 2, 4, 8, 16, . . . don’t represent a count as such. However, ifthe value of a counter is interpreted in terms of bit position, a register content of20,21,22,23, . . .2n, may be considered a count, in that the position of the ‘1’ in aregister holding those successive values increments in uniform steps, by one locationat a time. Essentially, the log2 of the value in the register is being counted. Anexample of this kind of counter is the one-hot counter, because exactly one bitposition always is “hot” with a ‘1’.

The hardware implementation of a one-hot counter is just a shift register whichis initialized with a single ‘1’ in it. A one-hot count would look like that in Fig. 6.2,in terms of successive bit-patterns in the counter register:

Fig. 6.2 One-hot up-count. Two bits switch on every count

Notice that there can’t be any value of 0 represented this way, if the bits in theregister are to be interpreted as ordinary, verilog binary (2n) numbers. Also, thestorage capacity is very limited, requiring n bits for just n different values. This kindof count is spatially inefficient, but it is very fast and causes little (but not minimal)noise, because just two bits switch per clock. We trade storage space for time, in asense.

6.1.2 Terminology: Behavioral, Procedural, RTL, Structural

There are at least three different counter approaches possible in verilog: behavioral,procedural, and structural. Sometimes the word, RTL, “Register Transfer Level”, isused to describe design activity; RTL overlaps behavioral and procedural.

Behaviorally, the verilog language permits elementary operators to be used forcounting, with no concern for the hardware that might result. A simple count may bewritten as, Count <= Count + 1, Count <= Count − 1, Count<= Count + Incr, etc. Behaviorally, we leave it up to the synthesizer to de-cide how to put these statements into our netlist. Notice that in the preceding

6.1 Counter Types and Structures 103

statements, there is no way of knowing which Count bit might change, or whichway, because of the statement alone.

Procedural assignments are distinguished by the possibility that a statementmight be superceded by another one in simulation time, with no implication thatmultiple drivers were involved. Procedural assignments only can occur in proceduralblocks. The possibility of changing the assigned value is most obvious with blockingassignments. For example,

initial // Cycle the reset by time 10:begin#0 Reset = 1’b0;#1 Reset = 1’b1;#9 Reset = 1’b0;end

Behavioral statements may include procedural or continuous assignments. Pro-cedural statements may be bit-specific and thus not behavioral.

Procedurally, we may assign values to counter registers in procedural statements,but the level of detail may be anywhere from that of the whole counter register, as inthe behavioral example above, down to that of individual parts or bits. Composing acounter by xor expressions and bit-assignments could be procedural, but not behav-ioral. For example, in an always block, Count[2] <= Count[1]ˆCarrywould be an example of a bit-level procedural statement. Note that a continuousassignment, assign Count[2] = Count[1]ˆCarry, would make the state-ment nonprocedural. However, counters are sequential devices; and, whereas a con-tinuous assignment statement might represent an addition, it can not represent acount.

The border between procedural and behavioral coding is not crisp, and oftenthe same code may be described as RTL (usually a special case of procedural) orbehavioral, depending on context. Recall the simple counter for our PLL in Week 2Class 2: That was an RTL counter not quite behavioral because we wrote a bit-widthexpression for the incrementing literal, 5’h1.

The present author feels justified in using the term behavioral when the codedoes not explicitly assign identifiable bit values to a port or register; thus, astructural interpretation of the statement is not specified.

This is consistent with usage in the Thomas and Moorby (2002) (sec-tion 1.2) and Palnitkar (2003, chapter 7). Very little in this book is purelybehavioral, in that it could not be called RTL.

Structurally, verilog allows us complete control over the netlist if we wish toinstantiate the gates by hand from a library compatible with our intended technol-ogy. In this context, continuous assignment statements are structural connectionsrepresenting simple wiring or combinational logic. Palnitkar (2003) distinguishes


dataflow modelling as a special style including continuous assignments; however,this terminology is nonstandard. One sure thing is that continuous assignments arenot procedural (they can’t be included in a procedural block in modern verilog).

Here is another example to clarify the terminology. Suppose the goal is to rep-resent a 2-bit, binary up-count. After a few declarations, we may write verilog asshown next:

behavioral Count <= Count + 1; // Count is reg.

RTL: Count[1:0] <= Count[1:0] + 2’b01; // Count is reg.

RTL:always@(posedge Clock) Count[0] <= ∼Count[0];always@(Count[0])

if (Count[0]==1’b0) Count[1] <= ∼Count[1];// Count is net:

structural: DFF Bit0( .Q(Count[0]), .Qn(Wire0), .D(Wire0), .Clk(Clock) );DFF Bit1( .Q(Count[1]), .Qn(Wire1), .D(Wire1), .Clk(Wire0) );

Gate-level structural design entry should be avoided when using a synthesizer.Sometimes, when tuning a synthesized netlist, structural control may be necessary;but, for complex structures, the result may not be optimal. Furthermore, the syn-thesizer’s logic optimizer may not be able to improve suboptimal structures whichhave been built by hand. Finally, gates instantiated from a specific library make thedesign technology-specific; porting the design to a new technology will be morecomplicated with hand-instantiation than when the synthesizer is allowed to chooseall gates from the new technology library.

6.1.3 Adder Expression vs. Counter Statement

It is important to point out the difference here between an adder and a counter;An adder performs an expression evaluation, for example, X <= A + B. Even ifincluded in a function call, the adder is contained entirely on the right side of thisstatement. A counter makes a statement of the current value in a register. Thus, ad-dition can appear in a combinational or a sequential procedural statement; countingcan occur only in a sequential procedural statement. Although a count incrementexpression can be implemented as an addition of 1, a count can not be kept in acombinational form; it has to be stored so that the sequence of different values,count vs. next-count, can be distinguished in order of occurrence. The next countmust depend on the previous value of count.

Counting requires assignment of a sum expression in a statement, which is shownschematically in Fig. 6.3.


Fig. 6.3 Sequential summing permits counting

6.1.4 Counter Structures

Next, we discuss a few examples of different counter structures. In all the countersto be presented below, assume the count is the binary value taken from the Q portsof the flip-flops shown. We shall use only D flip-flops here, because they are themost common kind in current use, either in manual design entry or in synthesizedlogic.

The element of any binary counter based on D flip-flops is the toggle flip-flop,sometimes called a T flip-flop, which is just a D flip-flop with its ∼Q port wiredback to its D input.

For D flip-flops modelled behaviorally, toggle assignments from D to Q shouldbe nonblocking, not blocking, to ensure that updated values are based solely onassignments from the previous clock.

//// Basic toggle flip-flop://wire Qn D;...DFF Toggle01( .Q(Q), .Qn(Qn D), .D(Qn D), .Clk(ClkIn) );...

The schematic for this device hookup would be as shown in Fig. 6.4:

Fig. 6.4 Basic toggle (T)flip-flop

A toggle flip-flop component may be constructed by putting a simple wrappermodule around a DFF wired to toggle:


module TFF (output Q, input Clock, Reset);// DFFC is the DFF above, with clear (reset) added.wire Qn D;DFFC Toggler( .Q(Q), .Qn(Qn D), .D(Qn D), .Clk(Clock), .Rst(Reset) );endmodule

6.1.4.1 Ripple Counter

This kind of counter is minimal in size and easy to implement structurally. Thebits in this kind of counter are flip-flops which are not clocked but are triggeredby the data edges (carry out) of previous stages. They are wired to toggle on everyactive edge on their clock-pin input. The count must be reset to a known value;but, otherwise, no logic is required other than that of the flip-flops themselves. Forexample, a 3-bit ripple counter could be as simple as in Fig. 6.5.

Fig. 6.5 Schematic of 3-bit ripple counter constructed of D flip-flops. Outputs not shown. MSB isfarthest right. Only the LSB is clocked

Because the higher-significance stages in a ripple counter see no clock and cannot change state until all earlier bits have changed state, this is a linearly progres-sively slower device as the count register widens. A variety of glitch values willappear in the output bit pattern during the time after each clock, while the ripplestriggered by the clock settle. However, carry from all stages does not propagate im-mediately on the clock edge, so the power consumption and noise are reduced whencompared with that of a synchronous counter.

6.1.4.2 Carry Look-Ahead (Synchronous) Counter

This kind of counter completes an update of all register bits on each clock input, atthe cost of additional carry look-ahead logic. Stated otherwise, all bits always areclocked and would toggle together, all at once, on every clock, except that the togglefor each bit is disabled until all lower-order bits are ‘1’. Thus, although every bit isclocked, no bit toggles to ‘1’ until it becomes the carry from the previous bits.

A synchronous counter can operate at almost the same clock speed regardless ofregister width. However, because all bit switching is synchronous with the clock,


a synchronous counter occasionally briefly will consume power in big spurts, andsuch a spurt will create noise which might cause logic errors, as mentioned above.

Fig. 6.6 Schematic of a 3-bit synchronous counter. Every bit is clocked. Previous 1’s (inverted)are or’ed, and an xor prevents each toggle until carry overflows into that bit

Notice in the synchronous counter of Fig. 6.6 that each storage element (flip-flop)is driven by a 1-bit adder (xor gate). Thus, in a general sense, the flip-flops merelylatch, on each clock, the sum composed by the connecting combinational logic.

6.1.4.3 One-Hot and Ring Counters

Unlike the ripple or synchronous counters, these counters do not use all possible bitstates of a register.

The one-hot counter was described above.A ring counter is just a one-hot counter in which the MSB is shifted back into

the LSB. In a ring counter, a single ‘1’, or other constant pattern established byreset logic or a parallel load, circulates through the register endlessly. Ring coun-ters may be used to generate clock pulse patterns of arbitrary duty cycle. Whenused to create a clock, a ring counter in a sense is the inverse of a PLL; it createsa clock of lower frequency than its input, rather than a clock of equal or higherfrequency.

6.1.4.4 Gray Code Counter

Of all common counter hardware structures, the gray code is maximally efficient interms of switching power and noise. It is named after a designer whose name wasF. Gray.

In gray-code counting, as in the simple binary up-counter, all possible registerpatterns are used; but, successive count patterns are such that just one register bitchanges on any clock, as shown in Fig. 6.7. This efficiency is at the cost of someextra encoding logic which is necessary to sequence the register patterns properly.Also, determining the equivalent 2n binary bit count value from the gray code valuerequires some decoding logic.


Fig. 6.7 Unsigned gray-code up-count. MSB is on the left. With the bottom pattern (2n-1), onemore count restores the 0 pattern

We shall not be using gray code in this course, but it is a good pattern to keep inmind when minimal switching power or noise is a concern.

Another application of gray code is in synchronizing the value of a count acrossdifferent clock domains; for example, this would be required for generating con-sistent, successive addresses or state-machine state encodings. As explained in theCummings papers listed in the References, because only one bit changes at a time,the probability of invalidly sampling an intermediate counter state (by logic in adomain other than that of the counter) is lower for gray encoding than for any otherkind of count commonly used.

6.2 Counter Lab 8

Work in the Lab08 directory for this lab.

Lab ProcedureNote: For a counter clocked on the positive edge, a miscount in this lab is defined

as the wrong number when the count is sampled on the opposite (negative) edge ofthe clock.

Step 1. Build a ripple counter. You should have a working positive-edge flip-flop (FF) from Lab04. If necessary, modify your FF model so it has two outputs,Q and Qn. The Qn logically will be !Q (or, ∼Q), obtained perhaps by clocking Qn<= ∼D in the model. If not in your model already, assign a #3 delay (3 ns) tochanges in Q and Qn.

Then, create a verilog structural model of a 4-bit ripple counter, using yourFF component and the design in Fig. 6.5 above. Name the counter moduleRipple4DFF.

6.2 Counter Lab 8 109

Simulate the model to determine how much simulation time it takes your modelto count from 0 to 4’b1100 (hex 4’hc) at its fastest possible speed. Do this byrepeatedly simulating with shorter clock periods until there is a miscount. Recordthe time for later reference. Use a testbench in a separate file.

Note: The miscounts you find probably will depend only on the delays, andyour flip-flops may toggle in simulation no matter what your clock speed.This is because the simulator may not implement inertial delays for delayedstatements. So, although real hardware would freeze up if clocked too fast, wemay have to wait until we study path delays in specify blocks before oursimulator will reveal this kind of problem.

Synthesize Ripple4DFF and try optimization for area and speed. Save thespeed-optimized netlist for later in this lab.

Step 2. Build a synchronous counter. Use the design in Fig. 6.6 above to builda structural 4-bit synchronous counter model using the same FF module as in theprevious Step. Name the synchronous counter module Synch4DFF. Use verilogcontinuous assignments to represent the connecting or and xor “gates”, and assigna delay of #2 for each such gate assignment.

Simulate Synch4DFF with the same testbench to compare it with the ripplecounter in the previous Step; record the shortest possible simulation time to countfrom 0 to 4’b1100, as before. Use the same basic testbench file as you did for theripple counter.

Also, synthesize for area and speed and save the speed-optimized netlist.Then, instantiate both the ripple-counter speed-optimized netlist and the

synchronous-counter netlist modules in a testbench module and see how fast you canrun them in simulation. You will have to compile the verilog LibraryName v2001.v file for this simulation, because you will require verilog modelsfor the components in the synthesized netlists. If you use VCS, add -v before thelibrary file name in your .vcs file.

Step 3. Write a behavioral counter model. In a new module file namedCounter4.v, declare a 4-bit reg and model a counter behaviorally this way:

reg[3:0] CountReg;. . .always@(posedge ClockIn, posedge Clear)beginif (Clear==1’b1)

CountReg <= 0;else CountReg <= CountReg + 1;end


Assign a delay to a continuous assignment statement which transfers theCountReg value to an output port, CountOut. The count delay should matchyour #3 FF module delay (Step 1 above). Now, simulate and compare times with thetwo preceding structural counters.

Then, synthesize for area and speed.Finally, do a little experiment: Make a copy of Counter4.v and change the

code to CountReg <= CountReg − 1. Simulate; and, notice what happenswhen the count wraps below 0. A reg is an unsigned type, and the value next after0, in a down-count, is the maximum positive value expressible in the reg.

Step 4. Use a verilog special wire type for logic. Make a copy of yourSynch4DFF model in a new file named Synch4DFFWor. Replace the verilogor expressions in your synchronous counter with independent assignments to oneor more wired-or (wor) nets. Don’t assign any delay to the new or logic (we’lllook into differentiating rise and fall delays later in the course). Simulate again andsynthesize for area and speed, to compare with Synch4DFF.

Keep in mind that wor may not be available in a CMOS technology library,where pull up and pull down strengths usually are equal. When you code a wornet, then, the synthesizer either should replace its drivers with special librarywor gate buffers or other components, or should drive the net with a libraryor gate.

Step 5. Use your PLL as a counter clock. Make a new directory under Lab08and copy into it your entire PLLsim model from Lab06. Rename the PLLsimmodule and its file to “PLLTop” to avoid confusion. Instantiate the PLL andall your counters (ripple, synchronous, behavioral) in a single module namedClockedByPLL and connect the PLL output clock to them.

Clock your PLL with an input (ClockIn) at 1 MHz. See the block diagramin Fig. 6.8. Be sure to remove your timescale and ‘define code from all the oldmodule files and put them only in the new top of the design (ClockedByPLL), andin the testbench instantiating it. Simulate briefly to be sure all counters will count(see Figs. 6.9 and 6.10).

Fig. 6.8 The Lab 6 PLL adapted to clock three counters. Resets omitted

6.2 Counter Lab 8 111

Fig. 6.9 Simulation of the three ClockedByPLL counters

Fig. 6.10 Closeup of the ClockedByPLL counters, showing ripple glitches


How do synch vs. ripple counter sim times compare?If the DFF delay is around 3 ns, why won’t the ripple counter count correctly at

a clock period of 10 ns?The intermediate states which are output by a ripple counter before it settles can

be confusing during simulation. They also represent brief pulses or glitches whichin a real design might be current-amplified by bus drivers, or otherwise might causeunnecessary noise. How might these states be kept off an output bus?

The behavioral counter (Counter4) was trivial to change from an up-counter toa down-counter. How easy would it have been to change similarly the ripple counteror the synchronous counter?


Read Thomas and Moorby (2002) section 1.4.1 to see a counter model used incontext.

Read Thomas and Moorby (2002) appendix A.14–A.17 to see a counter modelledas a state machine.



Palnitkar models counters at several different places, to make various points aboutthe verilog language. If you are interested in counters per se, you may wish to workthe examples in section 6.7. Solutions are given on the Palnitkar CD.

The ripple counter is described in sections 2.2 and 6.5.3.A synchronous counter built from J-K flip-flops is described in section 12.7,

exercises 5 and 6. Note: A switch-level J-K flip-flop is modelled in section 9.4 ofThomas and Moorby (2002).


7.1 Contention and Operator Precedence

This time we shall study some new verilog: drive strengths, race conditions, operatorprecedence, and named blocks. We then shall move on to a discussion on how to lockin our PLL to a serial data stream.

7.1.1 Verilog Net Types and Strengths

Strength means nothing in verilog except when a net has more than one driver.We’ve briefly mentioned some of the special net types (tri, wand, wor) in

Week 2 Class 1. In the present lecture, we merely introduce the different kinds ofstrength; we shall study switch-level modelling later.

Except the high-impedance state (one of the four basic logic levels in verilog,implied by assignment of ‘z’). there is little use to strength in RTL or behavioralmodelling for VLSI design.

According to the verilog standard (IEEE Std 1364, 4.4 and 7.9–7.13), there aretwo kinds of strength, drive strength and charge strength.

7.1.1.1 Drive Strength

The drive strength levels, in decreasing order, are called supply, strong, pull, andweak. Recall here that ‘z’ is considered in the language to be a logic level, nota strength. However, a ‘z’ level is effectively an ‘x’ asserted at lower than weakstrength.

Drive strengths of strong (‘1’; ‘0’; ‘x’) or weaker-than-weak (‘z’) are theonly ones associated with the usual functions encountered in RTL or gate mod-els. Drive strength proper in verilog only applies to (a) nets in continuous assign-ment statements or (b) gate outputs of the builtin verilog primitive logic gates.



Drive strength as such is useless in association with reg type declarations; aftera reg value has been assigned to a net type, strength possibly may becomerelevant.

7.1.1.2 Charge Strength

The charge strength levels, in decreasing order, are called large, medium, and small;they should be considered as representing the sizes of on-chip capacitors. Switch-level models are used to represent current flow through transistors, which dependsin turn on device-element capacitance. The only data objects allowed to be declaredwith a charge strength are trireg nets. They are called trireg, because theyare viewed as charge storage elements, somewhat the way reg’s can be viewed aslogic-level storage elements.

As said already, except the use of ‘z’, strength generally is of no concern ingate-level or behavioral design; strength is intended to be invoked when simulatingat the switch level. Switch level models include pass transistors, single nmos orpmos devices, or other gate substructure useful in simulating the elements whichare combined to make up a logic gate in a technology library.

Switch-level models exist in something of a vacuum in modern verilog. Ac-curate validation of models at this level usually requires analogue simulation.Switch-level models can be functionally accurate but not timing-accurate. How-ever, the major concern in verilog modelling of individual technology-library gatesis timing, because other factors such as noise, crosstalk coupling strength, andleakage current, are beyond the scope of the language. Because timing usuallycan be expressed adequately in verilog without switch-level reference, switch-level verilog models are not commonly used in commercial library work in theindustry.

Also, library models usually are not written in verilog but in ALF (IEEE Ad-vanced Library Format) or in vendor-specific languages such as Liberty. This be-cause a complete library model has to include analogue-related simulation datasuch as slew rate under load, as well as nonsimulation data such as layout ge-ometry, placement rules, wire-loading, and other quantities beyond the veriloglanguage.

When it occurs, contention is resolved in verilog by strength level. Ambiguousresults only occur when different logic levels of ‘1’ or ‘0’ contend and are of equalstrength. In all other contentions, the resultant strength and logic level is that of thehigher strength level. Ordinary logic gates, module instances, or verilog expressionsact in this context as amplifiers in the sense that they output strong logic regardlessof the strength of their (resolved) inputs.

Here’s an ordered table of verilog strengths (see also Thomas and Moorby(2002), section 10.10.2.2):

7.1 Contention and Operator Precedence 115

Strength Level Keyword(s) Logic Level(s) Strength Number

supply supply1 7strong strong1 1, x 6pull pull1 5large large 4weak weak1 3medium medium 2small small 1highz highz1 z 0highz highz0 z 0small small 1medium medium 2weak weak0 3large large 4pull pull0 5strong strong0 0, x 6supply supply0 7

The verilog logic levels (not strengths) normally used in RTL or behavioral codeapply only to net assignments and are defaulted to the strength numbers shown. Thisdefault can be changed by verilog macroes as explained below.

The strength number in the preceding table is what is used to resolve contention:Higher numbers win over lower ones, and equal numbers at different levels result inunknowns.

Strengths may be declared in parentheses preceding the instance or net name, ormay precede the target of a continuous assignment to the net; drive strengths mustbe in pairs (high, low), and charge strengths must be single (size). Examples are,

wire OutWire;...and (strong1, weak0) and01( OutWire, in1, in2, in3);nor (pull1, pull0) nor31( OutWire, in1, in2, in3);trireg (large) LargeCapOnNet;wire (supply1, supply0) TiedTo1 = 1’b1;// ---wire UpClock;...

assign (strong1, weak0) UpClock = ClockIn;

In the example code above, a truth table for multiply-driven OutWire would be:

in1&in2&in3 and01 output nor31 output Outwire

1 strong1 (pull1 logically impossible) (strong1)1 strong1 pull0 strong10 weak0 pull1 pull10 weak0 pull0 pull0


We are interested only in the idea of strength at this point. We shall take upswitch-level modelling in detail later.

Two relevant procedural assignment statements are force and release. These aresimulator overrides, rather than drive strengths, and they should not be used in adesign – only for testbench or debug. The statement, “force object = level;”,in which object is any reg, forces the design object to the given logic level until“release object;” is executed.

There also is a similar procedural assign and a deassign; we shall not study theseand do not recommend them for design work.

7.1.2 Race Conditions, Again

In verilog, every block in a module has its inputs (or right-hand side – RHS) eval-uated concurrently in simulation time; this includes instantiations and initialblocks. Ordering of simulation events can be made meaningful (a) by cumulativedelays from sequences of events which happen to combine to cause events to oc-cur at unique simulation times; or, (b) by location of statements within proceduralblocks (always or initial blocks).

There is an important ordering of evaluations within each time step, but it is notvisible in simulation waveforms. We shall study this ordering, which is enforced inthe form of the verilog simulator event queue, later in the course.

Synthesis tools currently can not use concurrency in procedural blocks very well;they also ignore design delay expressions. To create a synthesizable sequence ofevents, it is necessary to make the later events depend on the earlier ones by defin-ing the earlier ones as inputs (RHS) and the later as outputs (LHS), or by makingthe desired sequence depend on the sequential state of design variables, primarilyreg types.

Order can be considered either unique or multiple. If an event (such as a hardwarereset) occurs only once in the simulation lifetime of the device being modelled, itmay be ordered uniquely with respect to other events. Obviously, something whichhappens on every clock, or on every number of them, can’t be ordered uniquely; but,it may perhaps be ordered within the clock cycle or relative to some other repetitiveevent.

A race condition occurs when some state depends on the relative order of twoor more preceding events, and the order of them can’t be predicted. Clearly, thisimplies that both of the preceding events can occur at the same simulation time.The preceding events are said to be in a race to determine which one comes beforethe other(s). This is a very undesirable condition for the hardware, because unpre-dictable hardware usually is not functional. If the synthesizer doesn’t reject suchconstructs as errors, it may synthesize the wrong hardware.

The simplest kind of race condition is repetitive and caused by assignments tothe same data object from two or more concurrent blocks. If such blocks were notconcurrent, the race would not occur, and the hardware would be functional. Forexample,


always@(posedge clockIn)begin#1 a = b;(other statements)end

...always@(a, b, c) #2 ena = (a & b) | c;...always@(a, b) #2 ena = a ˆ b;

In each of the three statements in the example, the delay should be interpretedthis way: If a variable in the event control expression changes to ‘1’ or ‘0’ (only to‘1’ for a posedge), the always block is read and thus the delay is initiated; afterthe delay has lapsed, the RHS is evaluated and the result is assigned.

The race in the code above is to assign ena. We assume other statements orblocks not shown are assigning to a, b, and c. This race is fairly obvious; but, it’snot hard to imagine a bigger module which might cause confusion and conceal arace, because of complicated concurrency.

One might imagine that a designer would think of the simulation of the raceexample above this way: Start time = 0 at the clock edge. Then, at time 1, a gets thevalue of b. Suppose a thus has been changed to ‘1’ or ‘0’; then, this triggers readingof the two other blocks. At time 3, ena goes to some value because of the secondalways statement – then again, depending on what was the value of c, perhapsena has its value replaced by 0 at time 3, because of the third always statement?Hmmm . . . may be a problem?

But, worse: What if c changes after time 1? Then this triggers the secondalways but not the third! Thus, ena may have its value changed arbitrarily ei-ther to the value of (a&b)| c or to aˆb. Not only that, but the clock might intrudeat any time to sample any arbitrary value of whatever is controlled by ena.

Also, what if a has been set equal to b before the clock edge? The clockedblock won’t change the value of a; then, neither of the other blocks shown will beread right after the clock; rather, they may be read at any arbitrary time because ofchanges in a or b from other blocks not shown. Yuk! This won’t work.

The way to avoid this kind of race condition within a module is never to as-sign any data object in a module from more than one always block. When or-derly execution is required, a simple solution is to have the always blocks readon opposite edges of the same clock, and to have all delays total less than halfof a clock period in every block involved. Another way would be to provide twoor more well-defined clocks derived at different phases from the same originalclock.

Returning to the example, the first step in correcting the preceding race conditionwithout using clocking schemes becomes obvious: Put all assignments to ena in asingle always block, for example like this:



...always@(a, b, c)begin#2 ena = (c==1)? 1 : a & b; // An if could be used here.#2 ena = a ˆ b;end

The sequencing of the two assignments to ena now is obvious, whereas it mighthave gone ignored or unnoticed in a big collection of independent always blocks.One here might question why the exclusive-or assignment should occur after theone above it, causing at most a glitch ignored by anything reading the outputs ofthis module. But, at least now there is no race condition. Perhaps the (inconsistent)intent of the original code would be realized better this way:


...always@(a, b, c) #2 ena = (c==1)? 1 : a ˆ b;

The initial kind of block shouldn’t be overlooked in a discussion of raceconditions. Verilog allows any number of initial blocks in a module, and thesame races can occur as with always blocks, except that usually such races willbe nonrepetitive. The synthesizer ignores initial blocks, so they can cause nosynthesis problem – except a simulation/synthesis mismatch!

initial blocks should be used only in testbenches or, in a design, for specialpurposes, such as denotation of an SDF back-annotation file. However, we shouldnevertheless point out the following:

Both always blocks and initial blocks are equally concurrent, and, attime 0, either one might be the first to cause evaluation of what should be as-signed to something. Because it is unavoidable to assign some objects both inone always block and in an initial block, brief race problems may be over-looked or tolerated in simulation as a minor nuisance. In the following exam-ple, if the always block is read before the initial block, the Reset 1’b0will be missed at time 0, and Abus may not be set to 0 until Reset is toggledlater:


initialbeginReset = 1’b0;#0 Reset = 1’b1;...end

always@(ClockIn, Reset)if (Reset==1’b0)

Abus <= ’b0;else Abus <= Inbus;

Another problem can be caused by mixing blocking and nonblocking assign-ments in a single procedural block: Don’t do it. Determine whether any of the logicwill be clocked; if so, use only nonblocking assignments. For combinational logic,we want setup time for clocking, and, any change has to be propagated throughoutthe whole block. Because nonblocking assignments will update with the earliest,not the immediately new, values, if all the logic will be combinational, use blockingassignments. When the logic is sequential but unclocked (latched), be very carefulof correct synthesis, and use either blocking or nonblocking assignments, depend-ing on setup requirements. If necessary, repartition your design (within a module)so the sequential and combinational logic is more cleanly separated.

In closing this discussion of race conditions, it should be mentioned that verilogsimulators use the same pulse-propagating process as most of the other availabledigital simulators; it is called inertial delay. An input pulse has to have a certainamount of “inertia” to get through a gate and influence the output. By default, if amodule has been assigned a propagation (or path) delay, and a new level in an inputlasts for less time than the propagation delay, the change will be ignored. This isa form of glitch filtering; only glitches wide enough will make it through the gate.Physically, it is a coarse model of energy response: A brief pulse is assumed not tohave enough energy to switch the gate; and, the path delay value being available,it is used to decide this energy. Unlike other HDL’s, verilog allows the thresholdinertial pulse width to be controlled by the designer; we shall study this later in thecourse.

Simulators designed for VLSI design often do not simulate inertial delay prop-erly, unless the delay is given in a specify block (to be covered later). So, unlessyou have tested it, do not depend on your simulator to swallow glitches based onhand-entered delays in continuous assignments or procedural blocks.

7.1.3 Unknowns in Relational Expressions

We shall return to this later in more detail, but, for now, it is enough just to say thatrelational expressions always evaluate to ‘x’, which is not true (false) when a bitneither is ‘1’ nor ‘0’. Thus, any ‘x’ or ‘z’ causes failure of true.


For example,

reg[3:0] A, B;...A = 4’b01x1;B = 4’b0001;if ( A > B )

X = 1’b1;else X = 1’b0;

The result is that X goes to 1’b0. Perhaps surprisingly, with values assigned asabove, the same result occurs for this:

if ( A == A )X = 1’b1;

else X = 1’b0;

To handle the four verilog logic levels literally and individually, one may use acase statement. With A equal to 4’b01x1 as above,

case (A)4’b0000: X = 1’b0;4’b0011: X = 1’b0;4’b01x1: X = 1’b1;default: X = 1’bx;endcase

we get X set to 1’b1.The case statement only does exact matches; it does not permit wildcards.

There exist two special wildcard variants of the case statement named casexand casez; we shall ignore them for now.

7.1.4 Verilog Operators and Precedence

We’ve already used the verilog bitwise operators most often seen in a design. Hereis a list of all the operators, as given in the Thomas and Moorby (2002) (appendixC), and in IEEE 1364 (5.1).


Symbol Type Symbol Type Symbol Type

∼ bitwise * arithmetic > relational& bitwise / arithmetic < relational∼& reduction + arithmetic >= relational| bitwise − arithmetic <= relational∼ | reduction % arithmetic == equalityˆ bitwise ** arithmetic ! = equality∼ ∧,∧ ∼ reduction ! logical === equality (case)>> shift && logical ! == equality (case)<< shift || logical ? : conditional>>> shift (arith) { } concatenation>>> shift (arith) { n{ } } replication

Any bitwise operator except ∼ also is a reduction operator. There are two alter-native ways of writing the xnor reduction operator; the present author prefers ∼ˆ.

Corresponding logical operators and bitwise operators yield the same results onone-bit operands. ‘1’ for true is the same as 1’b1, for all purposes. When manip-ulating bits, busses, or registers, or when expecting to synthesize gates to performoperations, it is best to provide an explicit width, just to keep in mind what one isdoing.

The arithmetical shift right operator (>>>) shifts in the value of the preoperationsign bit instead of a ‘0’. The other shift operators shift in a ‘0’.

Synthesizable verilog generally imposes limits on the exponentiation operator(��); for example, both operands must be constant; or, the expression must evaluateto a power of 2. As a rule, wherever practical, avoid exponentiation by writing“1<<n” to evaluate 2��n.

The case equality operator (=== or ! ==) is not used in a case statement;it may be used in an if or a conditional expression. It works like the comparisondone in a case (but not casex or casez) statement, which distinguishes ‘x’from ‘z’. It can not return a value of ‘x’, as can the other equality, the bitwise, or therelational operators. The case equality operators do not allow wildcarding and do notinterpret ‘x’ or ‘z’ as wildcards. Case equality is case equality, not casex or casezequality.

The replication operator may be used to assign a value or pattern to a part selector a whole vector. For example, “Xbus[15:0]<= {Abus[4:1], 6{1’bz,1’b0}};”, to set the 12 low-order bits to an alternating pattern of ‘z’and ‘0’.

All operators except the conditional operator associate left to right when of equalprecedence. All unary operators are of the highest precedence. Here is a table ofoperator precedence, after IEEE Std 1364:


Precedence Operator

lowest = 0 ? : (conditional) { } (concatenate)1 ||2 &&3 |4 ˆ ∼ˆ ˆ∼5 &6 == ! = === ! ===7 < <= > >=8 << >> <<< >>>9 + - (binary)10 * / %11 **highest = 12 + − ! ∼ & | ˆ ∼& ∼ | ∼ˆ ˆ∼ (unary)

This ends our discussion of verilog operators. Now you know them all . . .

7.2 Digital Basics: Three-State Buffer and Decoder

Before starting the next lab, let’s look at a couple of design basics as they are realizedin verilog.

First, let’s introduce the verilog three-state buffer component shown in Fig. 7.1:

Fig. 7.1 bufif1, with pinfunctionality labelled

The bufif1 is a verilog primitive gate which, when on, simply amplifies, orbuffers, its input logic level. When off, it outputs a ‘z’. Its port list includes oneoutput bit, one input bit, and one control bit, in that order from the left.

bufif1 optional InstName(out, in, control);

When the control bit is a logic ‘1’, the gate is on; so, it is a buffer if controlis ‘1’, and this explains its name. We shall use this gate to exercise what we havelearned about contention among different driving strengths.

Next, how do we do a verilog decoder? For a 2-to-4 decoder, the simplest wayis to use a case statement. We can use a case as though it were a lookup table toassign the one-hot ‘1’ depending on which 2-bit count is in the case expression.

7.3 Strength and Contention Lab 9 123

With a binary count in Sel, to run through all possible bit patterns, and the decodedresult in xReg, this is our case decoder for the X buffer selection in the next lab:

reg[3:0] Sel, xReg;...always@(Sel)

begincase (Sel[1:0])2’b00: xReg = 4’b0001;2’b01: xReg = 4’b0010;2’b10: xReg = 4’b0100;2’b11: xReg = 4’b1000;

default: xReg = ’bx; // e. g., to handle an ‘x’ in Sel.endcaseend

This will be our first explicit use of the case statement in a lab. It is very impor-tant not to leave open an alternative; this means always to add a default statementto cover leftover alternatives. Setting the default to assign ‘x’ values tells the logicsynthesizer you don’t care about leftover alternatives, and this permits better areaoptimization than otherwise. Also, overlooking an alternative may create a latch ofsome kind, and this may not be what you want. We’ll look later at other implicationsof this very useful verilog construct, the case statement.

7.3 Strength and Contention Lab 9

Do the nonoptional work of this lab in the Lab09 directory. If you have eitherthe Thomas and Moorby or the Palnitkar books, they come with a demo ver-sion of the Silos simulator, which may be used for the optional Steps of thislab, a wire strength exercise. The demo version of Silos is not intended for largedesigns.

The optional steps of this lab probably will not work as described in a simu-lator which is optimized for use in CMOS VLSI design; such a simulator neverwould be used to resolve strengths; instead, speed and capacity for synthesizableverilog would be its goal. Strength is not synthesizable as a netlist gate output,and contention, except for ‘z’, typically is assumed in a big design to imply onlyunknowns.

Lab ProcedureWork in your Lab09 directory.

In this lab, we shall create a module named Netter. This module will containlogic as shown in the block diagram of Fig. 7.2.


Fig. 7.2 Netter functionality. Blocks with dotted borders indicate logic clouds, not submodules

Notice that the Netter output drivers are all three-state buffers. It is illegal inverilog to drive an output with a simple wired logic net (wor, wand, etc.), for exam-ple, one connected directly between the module inputs and outputs. This probably isbecause no localized delay can be associated with the logic of such a net (as opposedto a path on a net from a gate output).

We wish to run a simulation which will demonstrate the result of contention ofeach of the four drive strengths against the others, including itself. To do this, weshall connect together some buffers and turn them on and off selectively so that onlytwo are on at a time. One of the two always will be at logic level ‘1’, and the otherat ‘0’. Thus, we can see during simulation which strength wins a contention of twoopposite levels.

This means 4×4 = 16 different pairwise connections. So, to explain the schematicof Fig. 7.2, we define four bufif1’s as the X buffers and four others as Y . All theX buffers receive a ‘0’ input, and all the Y ’s a ‘1’. Each buffer in X or Y is assigneda different one of the 4 drive strengths. Then, we may use a 4-bit binary counter andassign the lower two bits (2 bits = 4 choices) to control X and the upper 2 bits tocontrol Y . If we decode the count, we then can control each of X and Y to have justone buffer on at a time. By decode, it means here that the binary count is represented“one-hot” – in other words, a 2-bit count is represented by a logic ‘1’ in one of 4positions on a 4-bit bus.

Because a 4-bit binary counter goes through all possible combinations as itcounts, if we tie all the X and Y buffer outputs together, we shall get all possiblecombinations of one buffer on among the X buffers, and, correspondingly, one onamong the Y buffers.


Step 1. Enter two decoders. Begin the Netter module by writing the header.Use the schematic in Fig. 7.2. Declare a module named Netter with one 4-bitinput port and wire a 4-bit internal net xChoice to the xReg of the X decoder asshown in Fig. 7.2. For the decoder, be sure to use an always block with a casestatement as in the code immediately above. Then add another case, switching onSel[3:2], in the same always block, for the Y buffers. Wire that output (yReg)to a new 4-bit internal net named yChoice. Don’t bother to connect the xChoicewires to anything yet.

The incomplete Netter, at this point, should be as represented by Fig. 7.3.

Fig. 7.3 Output wiring of theNetter decoders

Because the decoded outputs will be assigned from constants, there is no reasonto make the always block sensitive to anything but the Sel bus.

To check your model, connect a counter and simulate it in VCS or QuestaSim;use a testbench which sometimes provides an ‘x’ bit in the input. These simulatorsare meant for large CMOS designs and can not resolve strengths usefully, but theywill find syntax errors for you.

Optional Step 2. Using your module from Step 1, you should add bufif1’s asshown in the code below, tying their inputs to logic ‘1’ or logic ‘0’ and their controlsto the decoder outputs.

Define the strengths by assigning them in the bufif1 instantiations; you shouldinstantiate two each of the following bufif1’s in Netter. See Step 3 for a hinton how to fill in the wire names and the instance names:

bufif1 (supply1, supply0) inst name( , , );bufif1 (strong1, strong0) inst name( , , );bufif1 (pull1, pull0) inst name( , , );bufif1 (weak1, weak0) inst name( , , );


Optional Step 3. After declaring the 8 bufif1’s, enable one buffer at a time bywiring them this way to the decoder outputs:

wire[3:0] xChoice; // Just to rename xReg....

assign Xin = 1’b0;assign Yin = 1’b1;...

assign xChoice = xReg;//bufif1 (supply1, supply0) SupplyBufX(SupplyOutX, Xin, xChoice[0]);bufif1 (strong1, strong0) StrongBufX(StrongOutX, Xin, xChoice[1]);

... (2 more X and 4 more Y)

Optional Step 4. After this, tie all the buffer outputs together; this may be doneby a set of continuous assignments all to the same wire. Such assignments of courseare concurrent, so there is no need to worry about their order in the verilog sourcecode:

...assign XYwire = SupplyOutX;assign XYwire = StrongOutX;...(6 more) ...

Then, the contention may be examined by looking at XYwire in a simulator suchas Silos, which can display strength differences; the result also may be assigned toa Netter output port, but if it is assigned to a reg type, the strength will be lostand only the logic level at strength = strong will remain.

Don’t clock Netter, because all you want is combinational (muxed and de-coded) nets. You had to declare reg’s to use the case statement, but these reg’swill be assigned to nets continuously and never will be allowed to save state whenone of their drivers changes; so, they will represent combinational logic – if youhave all case alternatives covered!

Optional Step 5. Clock a counter in your testbench to assign the value of Seland step through all contention alternatives.

After seeing the simulation, you may synthesize Netter with no area optimiza-tion and examine the netlist. What did the synthesizer do to implement the decoders?Was strength preserved? Optimize for area and examine the netlist again.

This completes our exercise in wire strength and contention, and it ends our de-pendence on the use of the Silos simulator for this lab.


Step 6. Race condition exercise. Create a new module named Racer in a newfile and put in it two always blocks as follows:

always@(DoPulse)begin#1 RaceReg = 1’b0;#1 RaceReg = 1’b1;#1 RaceReg = 1’b0;end

//always@(DoPulse) #1 RaceReg = ∼RaceReg;

Instantiate Racer in a testbench which toggles the input DoPulse a few times.Simulate: (Note: This will not synthesize). The first block should cause a 1 nspositive pulse lagging every DoPulse edge by 2 ns. But, what effect has the sec-ond always block? If it inverts after the first ‘0’ assignment, then it should ad-vance and widen the pulse; if it inverts before the first ‘0’, it should be supercededby the ‘0’ and should not be noticed. Reverse the locations of the two alwaysblocks in Racer.v. Does this change the result? It should, because concurrentblocks may be read in any order, according to the simulator developer; and, gen-erally, the order of appearance in the file will be used consistently, one way or theother.

Figures 7.4 and 7.5 show the waveforms in VCS or QuestaSim; what about Silosor Aldec?

Fig. 7.4 Case 1: Inverting always first


Fig. 7.5 Case 2: Inverting always second

Step 7. Operators and precedence. It is easy to get confused about the and and orlogical vs. bitwise operators. For one-bit variables, they produce the same results.But for multiple-bit variables, they differ importantly.

For example, consider the three different bit-masks in these expressions (the leftvalue “masks” the right one):

3’b110 & 4’b1111,3’b010 & 5’b00111,3’b101 & 3’b111.

The ‘&’ is bitwise; so, the three results, in order, numerically are, 110, 10,and 101, widened on the left by 0-extension to the width of the destination vec-tor (which is not shown).

However, “&&” is a logical operator, and it only returns a ‘1’ (= 1’b1) if bothoperands are nonzero, or a ‘0’. So, the expression

3’b010 && 6’b111000

evaluates to ‘1’, again widened by 0-extension to the width of the destination vector.By contrast, the bitwise expression

3’b010 & 6’b111000

evaluates to ‘0’, widened to the destination width.Look at the expressions above and imagine them assigned to an 8-bit bus. What

happens if one bit is ‘x’ or ‘z’?Does the logical operator return an ‘x’?Are logical and equality operators different in the way they handle an ‘x’?For this Step, test your insight by coding a small module Operators in its own

file and simulating the three bit-masked expressions at the start of this Step. Usedestination operands of 3 and 8 bits width. Then, replace the bitwise operators ‘&’with logical operators ‘&&’ and simulate again.

Optional: If you have the time, use the simulator to evaluate the following tworesults, for a and b both ‘1’, and c and d both ‘0’:

NoParens = a&!dˆb|a&∼dâ||∼aˆbˆcâ&&d;Parens = ((a&(!d)ˆb) | (a&(∼d)â)) ||((∼a)ˆbˆcâ) && d;

7.4 Back to the PLL and the SerDes 129

The point of this is that to understand someone else’s mess, break the expressionat the lowest-precedence operator(s); insert parentheses, and break again at the nextlowest, etc. To prevent your own mess, use parentheses and spacing to clarify themeaning of complicated expressions.

7.3.1 Strength Lab postmortem

How do VCS and QuestaSim handle resolution of contention?Is contention important when two concurrent statements assign the same logic

level to a net?

7.4 Back to the PLL and the SerDes

7.4.1 Named Blocks

Several verilog constructs, which we shall study more closely in the next chapter,depend on being able to name a block of code. A name is assigned to any verilogblock simply by following the begin with a colon and a valid verilog identifier.Of course, only procedural code can be in blocks, so only it can be named. A blockwithout begin can’t be named; so, if necessary, one simply inserts begin andend around anything to be named. No semicolon follows the identifier. Any rea-sonable alphanumeric string is a valid identifier if it does not begin with a decimalnumeral.

For example,

always@(negedge clk)begin : MyNegedgeAlways...end

Or,

begin : Loop 1 to 9for (j=1; j<=9; j = j + 1)

begin...end // for j.

end // Named loop block.

Naming a block has no functional effect. However, a block name, like a modulename, can be used by a tool and thus may be found in a synthesized netlist or asimulator trace list. Thus, naming blocks can help in debugging or optimizing anetlist.


Although the name itself changes nothing, it can be used to create new function-ality. For example, naming the block in a looping statement such as for, while, orforever allows the loop to be exited by the disable statement. This is similarto the C language break statement.

7.4.2 The PLL in a SerDes

Recall the SerDes design blocked out in Week 2 Class 2. Our full-duplex sys-tem will require a serial transmitter (Tx) and receiver (Rx) in each direction.Each end of the serdes will be defined in a different system clock domain, theseclocks being the clocks used to manipulate the parallel bus data within the twosystems.

To make our design reflect the most general case, there will not be any syn-chronization between the two system clock domains. A serial clock will have tobe generated for each Tx, to transmit the data over the serial line; of course, eachRx then will have to have a deserializing clock running at the same speed as theserializing Tx clock. We have decided that the system clock speed in each do-main will be 1 MHz, and that all serial clocks will run at 32 MHz, these clocksbeing generated by four independent PLLs, one for each Tx and one for each Rx. Aclock ratio of 32:1 easily is achievable by an analogue PLL currently in use in theindustry.

To avoid unnecessary complexity, we shall not address error detection or correc-tion, encryption, compression, edge detection, or packet retries, of the serial datatransferred. The protocols involved would be too time-consuming for the presentcourse.

The design on the Tx side of our serdes is straightforward: The parallel data areclocked in on a 1 MHz system clock, and the Tx PLL is used to serialize and clockthem out at 32 Mb/s. The Tx PLL can track the system clock directly, its phasebeing constrained only by that of the well-defined system clock. Because we havedecided to transmit packets 64 bits wide per 32 bit data word, the maximum speedof transmission on each serial line will be one word on every other system clock;thus, each receiver also will run at a maximum deserialization rate of one word pertwo system clocks.

The design on the Rx side will have to be more complicated. Because the twosystems are clocked independently (see Fig. 7.6), the Rx PLL will have to extractthe Tx clock from the serial data. This is equivalent to saying that the receiver willhave to identify incoming packet boundaries and use their arrival rate to provide a1 MHz clock for the Rx PLL. Given the extracted 1 MHz clock, the Rx PLL cansynchronize to it and generate the 32 MHz Rx serial clock required to clock in thedata and deserialize it, one word per two extracted 1 MHz clocks. After deserializa-tion, the incoming data words will have to be clocked out on the receiving systemclock.


Fig. 7.6 Full duplex SerDesand the two system clockdomains

To allow slack for Tx synchronization of the PLL serial clock and the trans-mitting system clock, each serializer will include a FIFO to buffer incoming data.This FIFO will reduce loss of data words by averaging out PLL-caused variationsin serialization speed to match the average rate of arrival of valid parallel data fromthe transmitting system.

Each deserializer also will include a FIFO to take up the same kind of slack.In addition, the Rx FIFO will reduce data loss because of possible delays in syn-chronization of the two system clocks, the extracted Tx system clock with the Rxsystem clock. In many systems, only the receiving end would be designed with aFIFO; however, we shall add one to the transmitting end, perhaps to allow for irreg-ularity in the sending rate.

We shall discuss more general clock-synchronizing technology later in the course.

7.4.3 The SerDes Packet Format Revisited

To construct a serial data source allowing synchronization, we shall use the datapacket format previously given as,

64’bxxxxxxxx00011000xxxxxxxx00010000xxxxxxxx00001000xxxxxxxx00000000,

in which each ‘x’ represents one data bit in a serialized 8-bit byte value. The formatshown is sent and received MSB first. To provide a synchronizing, incoming clockfor our Rx PLL, we shall look for a pattern in the input data stream of a pad-bytedown-count from 2’b11 to 2’b00.

To make things easy for a start, let’s send just the same 4 bytes repeatedly: Forthis, we shall pick the ASCII codes for ‘a’, ‘b’, ‘y’, and ‘z’, in that order. Thesecodes are, respectively, 8’h61, 8’h62, 8’h79, and 8’h7a.

At this design stage, then, our serial data source therefore repeatedly will sendthis binary data stream, left to right:

// ‘a’ pad 3 ‘b’ pad 2 ‘y’ pad 1 ‘z’ pad 064’b01100001 00011000 01100010 00010000 01111001 00001000 01111010 00000000

The alphabetical interpretation of the 8 bytes × 8 = 64 bits is shown above thestream representation.


To test our deserializer as we write the verilog, we have to generate this datastream repeatedly at about 32 Mb/s. This is trivial: We just leave the data patternabove as a constant and pretend it is being received repeatedly over a serial line.

Next, we show how a software-oriented, behavioral (or, procedural) verilogmodel can be developed to synchronize a PLL to an embedded serial clock. Thismodel will not be synthesizable; so, we shall return later to our previous frame-encoder approach to devise a synthesizable model. The next discussion is a languagedigression meant to accommodate software-oriented coding in verilog.

7.4.4 Behavioral PLL Synchronization (language digression)

A behavioral synchronization, or perhaps more accurately, procedural synchroniza-tion, is done easily in verilog, using a for statement to sample the stream, andanother, containing, while statement which has no exit criterion. The constructfor(j=j; j==j; j=j) would be preferred to a while for our learning pur-poses, but the synthesizer requires the for iteration variable to be initialized witha constant, which would not work in our application. So, we shall usewhile(1).

In our procedural approach, you will notice that the model is not purely behav-ioral; it includes bit-level RTL procedural assignments. An outline of the model’sloop would look something like this:

integer i, j;

reg CurrentSerialBit; // A design data object.

reg[63:0] Stream; // For now, this is a fixed data object.

...

// Concatenation used for documentation purposes, only:

Stream = {32’b01100001 00011000 01100010 00010000,

32’b01111001 00001000 01111010 00000000};begin : While i

while(1) // Exit control of this loop will be inside of it.

begin

for (i = 63; i >= 0; i = i - 1) // Repeats every 64 bits, if no disable.

begin

#31 CurrentSerialBit = Stream[i]; // The delay value will be explained.

( do stuff with CurrentSerialBit ... then disable While i )

end // for i.

end

end // End named block While i.

... // Pick up execution here after disable While i.

The real control of the outer while(1) is by the name on the block containingit, While i. When “disable While i;” is executed anywhere in the mod-ule containing the code above, everything scheduled between the named beginand end is terminated immediately, and execution continues below the end of theWhile i block. As mentioned before, this works like a C language break.


Use of the disable statement: In verilog, any procedural block may be placedbetween begin and end, the begin labelled as above with a verilog identifier, anddisable called on the block by name. Like goto in C, this should be used onlywhen unavoidable, because it leads to complicated execution which may becomedifficult to debug when anything goes wrong or when it is desired to modify orimprove the code.

With the example above as a start, we can decide how to synchronize the PLL tothe data stream. In the for loop data stream above, the delay of 31 ns × 64 bits =1.984µs period gives us about half of the speed we want for our approximately1 MHz embedded clock. Just about right, to handle one parallel word every otherclock.

Fig. 7.7 The embedded clock (EClock) in the serial stream can be at twice the packet frequency.MSB and LSB refer to the 2-bit pad numbers

We know from a previous lab that, by design, our PLL clock is free-running ata nominal frequency of 32 MHz and continually monitors the frequency of its in-coming 1 MHz ClockIn clock. In the present design, whenever the PLL receivesa positive-edge Sample command, it uses the ClockIn edges it has been moni-toring to make a small frequency adjustment toward that of ClockIn. So, we mustsupply a ClockIn nominally at 1 MHz (= 32 Mbs−1/32) to the PLL, and we mustissue regular Sample commands.

We therefore choose to issue a Sample command on every data packet received,and to supply the PLL with an embedded clock (EClock) defined by the toggle ofthe LSB in the 2-bit frame padding down-count. So, we shall sample on every otherEClock; EClock will be connected permanently to the PLL ClockIn input.See Fig. 7.7.

To extract the embedded clock, EClock, we can use the pad pattern in the datastream; we should look first for 3 successive ‘0’ bits, followed by any two morebits nn, which are read as a count value, followed by another 3 successive ‘0’ bits.Each of these successive, padded nn values is separated from the previous one by 8ignored (data) bits. The nn values will count down by 1 after every data byte, as thedata stream is traversed from MSB toward LSB.

To set a somewhat arbitrary synchronization criterion, we shall do nothing untilthe down-count pattern has been established at least for 4 nn values and ends in2’b00. We then set the logic level of our EClock equal to that of the nn LSB; thisinitially will be 1’b0. This should allow the PLL to determine the direction of afrequency correction, if any, and so we shall begin issuing Sample commands tothe PLL.


If we lose synchronization as defined by any nn down-count miscount, westop issuing Sample commands and leave EClock at its most recent level untilsynchronization is again established by the criterion above. Our PLL, of course,continues to provide an output clock at a frequency based on the most recent pe-riod of synchronization; it is up to other logic to decide what to do with thisclock.

Given synchronization and desynchronization criteria, the question remains ofhow we should identify the pattern, 8’b000nn000, where nn represents a 2-bitcount.

To reach an answer to this question, the preceding outline of the model’s loopmay be expanded in detail and expressed as a flowchart in Fig. 7.8.

Fig. 7.8 FindPatternBeh flowchart used to develop the code fragments below

From the flowchart of this procedure, the data are arriving serially, so we canstart the identification of a packet with a for statement that is triggered everyi-th serial bit received, with i stored in a 6 bit (64-value) reg. A verilog reg isunsigned by default, so counting up or down in this reg connects 0 with 63 in eitherdirection.

Let’s look for the pad pattern assuming the current value of i has put us on theMSB + 1 byte = SerVect[63-8] = SerVect[55], which is the first ‘0’ ina correctly identified padded count-byte. One implementation in verilog might beby the following code fragment:


// Assume SerVect is a saved, serial 64-bit vector:reg[5:0] i; // i traverses the serial stream (64-bit vector).reg n1Bit, n2Bit; // Two 1-bit regs to hold the expected nn pad count....FoundPads = 1’b0;begin : While i // Name the block to exit it.while(1)

beginif ( SerVect[i]==1’b0 && SerVect[i-1]==1’b0 && SerVect[i-2]==1’b0

&& SerVect[i-5]==1’b0 && SerVect[i-6]==1’b0 && SerVect[i-7]==1’b0)begin#1 FoundPads = 1’b1; // 1 means true here, for later use.#1 n1Bit = SerVect[i-4]; // Save the padded nn = {n2, n1} value.#1 n2Bit = SerVect[i-3];disable While i; // Exit the while block, if found.end // if.

...i = i - 1;end // While i statement.end // While i named block.

Study the code above; notice that the j of a previous example is not present. Besure you understand what this code does in relation to our framing pattern if thecurrent value of i puts it on the first ‘0’ of the MSB pad byte (bit 55 in the vectorabove). Then, move i: Assume that the current value of i now is at bit 23, forexample, in that pattern. If so, with the big && expression in the if, we are lookingat all the 0’s in the byte in the part select, SerVect[23:16], to see whether wecan capture its nn count value.

The i = i − 1 in the while(1) loop control means that the && patternmatch test will be run on the next less significant (i-1) position in the SerVectvector if a pattern match does not succeed at the current (i) position. This is becausewe assume the data stream is arriving MSB first; thus, over time, we assume that lessand less significant bits will be at the center of our attention. Of course, in the specialcase of bit 23 of our SerVect above, the match will succeed, the While i blockwill be disable’d, and no new if will be run at i−1 = bit 22, given the codefragment shown.

The assignments shown in the verilog above all are blocking; we are not mod-elling hardware but are creating behavior. The delays are just to space out edges inthe simulation waveforms, but we want every assignment to take effect before thenext sequential line is read, just as in a software program. Later, perhaps some ofthis model will be seen as clearly sequential and thus perhaps will be implementedusing undelayed nonblocking assignments.

Also, it might seem that the above pattern search would be speeded up by nestingsix if’s:

if (SerVect[i]==1’b0)if (SerVect[i-1]==1’b0)

...


In terms of the verilog simulation language, this is a false impression, becausethe multiple && expression is evaluated left-to-right; and, on the first equalityfailure, the && will go false just as quickly as would a nested if in that po-sition. However, the logic synthesizer might produce a different netlist for thenested if’s, so perhaps a rewrite in the form of nested if’s should be keptin mind.

Anyway, the code above will not set FoundPads = 1 unless what prob-ably are the six pad ‘0’ bits we are seeking have been found. Such a patternmatch might, however, be a coincidence, and we might have matched by mis-take on an i in the middle of a data byte. So, let’s elaborate on our searchfurther.

Let’s declare a vector Nkeeper which is 4×2 = 8 bits wide, and use the codeabove to collect 4 successive, 2-bit nn values (which we think are nn values, any-way). This adds some more functionality to the code above; it can be written asfollows:

reg[7:0] Nkeeper; // Stores 4 2-bit nn values.reg[5:0] i; // Indexes into a saved 64-bit SerVect vector.reg[2:0] j // Counts which of 4 assumed nn’s we are on....FoundPads = 1’b0;i = starting value;for ( j=0; j<=3 ; j=j ) // The for j increment is nonfunctional.

begin : While iwhile(1)

beginif ( six pad 0’s; same as above )begin#1 Nkeeper[2*j+1] = SerVect[i-3]; // MSB of nn.#1 Nkeeper[2*j] = SerVect[i-4]; // LSB.#1 j = j + 1;#1 i = i - 16; // Jump ahead to the assumed next pad byte.#1 disable While i;end // if.

i = i - 1;end // while.

end // While i block.

Minor point: The delays are located here as placeholders; they make the se-quence of simulation events easier to see in a waveform display. When the modelis working, these delays should be removed; ultimately, some assignments inthe containing module might be changed to nonblocking ones, too. The onlycaution in adding such delays is that, if blocking, they add up, and the deviceclock must be slow enough to let them all time out under all conditions. If not,some events might be cancelled during simulation, and the model might failto work.

The code above is the same as the preceding verilog, except that it repeats un-til j = 3. Setting of the pattern-match flag has been omitted for simplicity. The


width of j must be 3 bits or more; if it were just 2 bits, it would count past 3to 0, and there never would be a count greater than 3; so, the outer for neverwould exit.

One problem with the above code fragment is that it might get stuck in thedata and repeatedly jump past the pad bytes. Also, Nkeeper might get filledup with random values whether or not they came from pad bytes. For example,what if the stream contained no packet at all and was mostly ‘0’ and only anoccasional ‘1’?

We want to check for 4 successive nn pad values, separated by exactly 8 (data)bits each; so, we should modify the above. We shall decrement i by 1 only whilesearching for a pattern match and not finding it; when we find a match, we’ll jumpahead by 16 bits (decrement i by 16) to look for another one. If we ever fail to finda subsequent match, we shall restore j to 0, increment i by 15 to get the first bitwhich was ignored by jumping ahead, and start the search again after advancing by1 bit in the stream.

The result comes out something like the following:

reg[7:0] Nkeeper; // Stores 4 nn values.reg[5:0] i; // Indexes into a saved 64-bit SerVect vector.reg[2:0] j // Counts which of 4 assumed nn’s we are on....FoundPads = 1’b0;i = starting value;for ( j=0; j<=3; j=j ) // No increment.begin : While iwhile(1)beginif ( six pad 0’s; same as above )

begin#1 Nkeeper[2*j+1] = SerVect[i-3]; // MSB.#1 Nkeeper[2*j] = SerVect[i-4]; // LSB.if (j==3) // We’re done:begin#1 FoundPads = 1’b1;disable While i;end

// If j wasn’t 3, jump ahead for another look:#1 j = j + 1;#1 i = i - 16;end

else // No six-0 match this time.if (j==0)

#1 i = i - 1; // First nn not found yet.else begin // We found at least one nn, but now failed:

i = i + 15; // Drop back after jumping by mistake.j = 0; // Reset nn counter; we’re not in a pad byte.end

end // while(1).end // While i.


So, if the preceding block of code exits, we shall have stored 4 2-bit values inNkeeper on the assumption that they will be a sequential, binary down-count inour packet format. Really, we should have dropped back by j�16 - 1, not just by15; but, we’ll fix that next time. But, is it true that in the above we have a way tofind our pad pattern?

No, we haven’t checked the counts. To be very sure of having found the padbytes, and thus the packet boundary, we should add a check for the down-count. Ifwe don’t confirm a down-count, we have not actually found our packet framing, so,we should continue searching. We only allow the code to exit if we have found whatseems to be a properly padded packet of data.

The easiest place to check for a down-count is in the block above which setsFoundPads to 1. We should add a branch there on whether j is 0 or not; ifit is 0, there is only one nn, so, there will be no use checking it. But, if j>0,we can see whether the current 2 nn bits is exactly 1 less than the previous2 nn bits.

First, let’s break out the initialization of the above into a new always blockwhich is read on the opposite edge of StartSearch. This way, even if we as-sign to the same variables from the two always blocks, there can’t be a racecondition unless the delay during one block extends to the other edge. This won’thappen in our design, because StartSearch is not a clock but a search-enablesignal.

always@(negedge StartSearch)begin#1 Nkeeper = ’b0; // Init count keeper every search start.#1 FoundPads = 1’b0; // Move this init here.#1 i = StartI;end


Then, the final searching block is as shown here:

always@(posedge StartSearch) // Clocked, maybe sequential logic?begin : AlwaysSearchfor ( j=0; j<=3; j=j ) // No increment. ’<=’ is relational operator!

begin : While iwhile(1)

beginif ( six pad 0’s; same as above )

begin#1 Nkeeper[2*j+1] = SerVect[i-3]; // MSB of pad count.#1 Nkeeper[2*j] = SerVect[i-4]; // LSB of pad count.// Check whether done:if (j==3) // We have 4 apparent nn values; do they count down?

begin#1 CountOK = 2’b00;for (k=1; k<=3; k=k+1)

begin// Use concatenation to get 2-bit nn values:#1 nPrev = { Nkeeper[2*k-1], Nkeeper[2*k-2] };#1 nNow = { Nkeeper[2*k+1], Nkeeper[2*k] };if ((nNow+1)==nPrev) #1 CountOK = CountOK + 1;end //for k.

if (CountOK==3) // Total of 4 were OK; so,begin // issue a pulse and stop everything:#1 FoundPads = 1’b1;#1 disable AlwaysSearch;end

else begin // If not a down-count, start all over:#1 i = i + 16*j - 1; // See * below:#1 j = 0;#1 Nkeeper = ’b0;end // if CountOK.

end // if j==3else begin // j not 3:

#1 j = j + 1;#1 i = i - 16; // Jump ahead, for another padded nn.#1 disable While i;

end // else not j==3.end // if 6 zero matches.

else if (j==0) // First nn not found yet:#1 i = i - 1;

else begin // * This was not first apparent nn found.#1 i = i + 16*j - 1; // Drop back after jump by mistake.#1 j = 0; // Reset nn counter.#1 Nkeeper = ’b0; // Reinit count keeper.end

end // while(1) block.end // While i labelled block.

end // always AlwaysSearch.

The preceding is a complete RTL design which correctly locates the packet padbytes in our fake piece of serially streamed data. It would work as well, slightlyadapted, to a serially varying input formatted to conform with our required framingprotocol.


A copy of the RTL search code above, with a testbench, is provided in the Lab10directory and is in a file named FindPatternBeh.v.

7.4.5 Synthesis of Behavioral Code

The problem is that the model so far almost is a C model; it is not efficiently syn-thesizable. In fact, the code in FindPatternBeh.v probably will not synthesizeat all.

One hint of synthesis problems is the presence all over of blocking assignmentsand delays. A test: Remove the delays: If the code still simulates correctly, it prob-ably is synthesizable. For example,

if (CountOK==3)...

else begin // If not a down-count, start all over:#1 i = i + IJump*j - 1;#1 j = 0;#1 Nkeeper = ’b0;end

...

If these #1 delays were necessary to proper functioning of some other alwaysblock or module, this verilog would not synthesize to logic which was functionallycorrect.

Also, if one replaced all the blocking assignments in the code fragment withnonblocking ones, there would be a race condition, with the delays shown. If theassignments were changed to nonblocking, and selectively longer delays were pro-vided, say #2 j <= 0, it might simulate correctly; but, again, synthesis probablywould produce defective logic. For this fragment, the best solution would be to re-move all delays entirely.

Delays aside, it can be difficult to rewrite a behavioral model in synthesizableform if the model is of any complexity and requires specific delays. The reader mayconsider spending a little time thinking about how to convert the behavioral modelabove, but we shall take a different tack in writing a synthesizable version.

7.4.6 Synthesizable, Pattern-Based PLL Synchronization

Let’s go back and look again at the problem: We have a stream of serial data and wewish to synchronize a clock to the framing pattern. In our project, the pattern is 64bits wide, so it makes no sense to worry about anything less than 64 bits wide.

As an alternative to the behavioral, C-like coding presented previously, we cando this: Assume we shall have a dynamic sampling “window” of the serial stream

7.5 PLL Behavioral Lock-In Lab 10 141

which is 64 bits wide and stored in a register. We check in the window for a framing(pad) pattern; if we find it, we synchronize our clock to a boundary which is well-defined in the frame; if we don’t find the pattern, we shift the window by one newbit (= serially shift in a new bit) and check again. If our window only changes onebit at a time, we cannot fail to find the frame boundaries, if they are there.

So, all we need do is check a completely static window of data. This can be doneconcurrently by checking every pad bit, all 32 of them, in the 64-bit sample we have.There is no need to shift or change anything, so there is no sequencing or delaying ofanything at all in simulation time. All we have to do is agree not to change anythingin the register while we examine the sample for our frame boundaries.

First, we define our pad boundary patterns; then, we decide where to look forthem in the 64-bit window. And, that’s all there is to it.

Actually, there’s less to it than that: We can agree only to look for the entire 64-bitpattern centered in the window (serial packet MSB at the window MSB position).Why not? If the actual serial stream is not found to be centered on one sampling, wejust shift a new LSB serial window bit in, shift the old window MSB out, and check;shift and check; and just keep doing this until we have the framed pattern centered;then, we recognize it, and off we go!

Using this approach, all we have to do in our model is choose bits to which toattach comparator logic (xor’s in the netlist, maybe). It might go something like this:

reg[63:0] SerVect = // The current 64-bit window to scan./* 60 50 40 30 20 10 0* 32109876 54321098 76543210 98765432 10987654 32109876 54321098 76543210 */

64’b01100001 00011000 01100010 00010000 01111001 00001000 01111010 00000000;...localparam[PadHi:0] p0 = 8’b000 00 000; // The pad patterns.localparam[PadHi:0] p1 = 8’b000 01 000; // localparams can’t be overridden.localparam[PadHi:0] p2 = 8’b000 10 000;localparam[PadHi:0] p3 = 8’b000 11 000;...

if ( SerVect[55]==p3[7] && SerVect[54]==p3[6] && SerVect[53]==p3[5]&& SerVect[52]==p3[4] && SerVect[51]==p3[3]&& SerVect[50]==p3[2] && SerVect[49]==p3[1] && SerVect[48]==p3[0]&& SerVect[39]==p2[7] && ... (total of 32 compares)

) Found = 1’b1;else Found = 1’b0;... (etc.)

This will be much easier to do after we have studied functions, which will benext time. So, for now, we leave this problem as-is and shall return to it later.

7.5 PLL Behavioral Lock-In Lab 10

Work in the Lab10 directory.

Lab Procedure

In this lab, we shall exercise use of the for statement as a way of sampling a datastream. Thomas and Moorby (2002) recommends varying among for, while, and


forever, depending on the context; however, in practice, the for statement cando anything these others can; we shall use it extensively here. The others usuallyare simpler, so we shall exercise for to understand its complexity (and its benefits)better.

Step 1. Make a subdirectory in Lab10 named PLLsync. Copy the entire PLLdesign from the Lab08/Lab08 Ans/PLL directory into the new PLLsync di-rectory. This design will have had a PLL clock which is applied to three differentcounter structures. The top of the design should be a module named ClockedByPLL.

You probably have a complete design of your own for this, from your Lab 8 work,but the lab instructions will work best if you use the answer files provided.

Reorganize the files. Assume there are too many files in one place in this de-sign. Make a new subdirectory named PLL, and move the five PLL module filesinto the new PLLsync/PLL directory. These files already should be named afterthe ClockComparator, MultiCounter, and VFO submodules making up thePLL, and PLLTop.inc, a defined-constant include file. The PLLTop.v file alsoshould be in PLL and should contain the PLL top level module which connects thesesubmodules. The top level module should be named PLLTop; rename it, if it is notalready so.

Create a simulator file list file, PLLsync.vcs, one level up (in the PLLsyncdirectory), and invoke the simulator briefly so you are sure it can be run with thePLL design moved into the PLLsync/PLL directory. This is just to check the filelocations.

Step 2. Rename and revise the design top. Change the module and file names ofthe three-counter design from ClockedByPLL to PLLsync. Remove all coun-ters except the behavioral one, so that the only count output is the behavioral one.Discard the DFF.v model.

The block diagram of your new PLLsync design, which in Lab08 once wascalled ClockedByPLL, now should look like Fig. 7.9.

Fig. 7.9 The PLLsync design block diagram

Simulate PLLsync to verify its correct functionality. Use a 1 MHz ClockIn.

7.5 PLL Behavioral Lock-In Lab 10 143

Step 3. Modify a pattern-finder to extract a clock and a PLL sample command.Make a copy of FindPatternBeh.v (originally placed for you in the Lab10 di-rectory; see left side of Fig. 7.10) named EClockSample.v (“E” for extract). Themodule in FindPatternBeh.v is named FindPattern; rename the modulein the copy to EClockSample.

Fig. 7.10 FindPattern can be modified trivially to EClockSample, to provide inputs forPLLsync. T = toggle; StartSearch would be triggered on every new received serial bit

Modify the new EClockSample module so it outputs our required extractedclock and sample command, as we specified above. A D flip-flop with its Q tiedback to its D often is called a toggle flip-flop, or, T flip-flop, shown in the right sideof Fig. 7.10.

Do not try the pattern-based approach, which was a loopless, synthesizable de-sign introduced at the end of today’s serdes presentation. Do not worry about ac-tual sample speed in your test bench; just extract the EClock and the Samplecommand as you search through the serial bit pattern with the unsynthesizableFindPattern looping approach.

Fig. 7.11 Simulation of EClockSample, showing the extracted clock and the sample command

Thus, EClockSample simply should output the value of the pad-pattern counterLSB continuously, renamed to EClock (wired to EClockWatch in the testbenchused for Fig. 7.11). While Found is asserted, the EClock is good (synchronized);while Found is not asserted, the EClock can not be depended upon.

Do not try yet to attach the PLLsync ClockIn to the EclockSampleEClock. However, keep in mind that, later, we may use EClock to adjust thePLL frequency when EClock is known good; and, we can let the PLL oscillatorrun free when EClock is not known good.


7.5.1 Lock-in Lab Postmortem

How should one deal with establishing and losing synchronization?


Read Thomas and Moorby (2002) section 4.6 on disable.Thomas and Moorby (2002) section 6.5.1 gives a truth-table for the bufif1

primitive.Read Thomas and Moorby (2002) sections 10.1 and 10.2 on strength and con-

tention.Read Thomas and Moorby (2002) appendices C.1 and C.2 on verilog operators

and precedence. The operator functionality is very important to know thoroughly;precedence is less important, because it can be defined or overridden by parentheses.


Section 3.2.1 and Appendix A on verilog builtin net types and strengths.Section 6.1.2 on implicit net declarations.Look through sections 6.3 and 6.4 on verilog operators.


8.1 State Machine and FIFO design

This time we shall study FIFO design and the state machines required for it. First,though, some tools for simplifying the verilog.

8.1.1 Verilog Tasks and Functions

The structure of a design is defined by its hierarchy of module instances. Any com-mon functionality can be put into a module and then instantiated as many times asdesired in a design. However, sometimes, repetitive functionality is required whichis procedural and should not be visible as part of the design hierarchy. As we haveseen, operating on numbers or logic states by using operators such as ‘+’, ‘&’, or“==” is far more convenient than wiring up adders, and gates, or comparators.Likewise, defining a shift-register in an always block with one statement is muchquicker than wiring one from individual gates or even wiring one in as a mod-ule instance. In verilog, convenient handling of this kind of commonplace, repeti-tive, procedural functionality is made available to the user in the form of tasks andfunctions.

Both of these constructs are declared in the module in which they are intendedto be used. Although they may be called remotely by a hierarchical reference, thisis bad design practice and is discouraged. If reuse in different modules is required,the declarations may be put in a file and the file then may be introduced into amodule by means of a ‘include. Because neither tasks nor functions representdesign structure, their declarations can not contain local wires to wire their con-tents to anything; however, they may contain local reg variables for temporary datamanipulation.

The difference between tasks and functions has to do with complexity and, moreimportantly, with timing. We refer here only to user-defined tasks and functions, notto the verilog language’s builtin system tasks or system functions.



Complexity. Functions are simpler. A task can call a function or another task;a function can not call a task but may call another function. So, tasks potentiallyare more complicated than functions. In addition, a function just changes one thingwhen it is executed, its return value; in effect, a function merely defines an ex-pression which is evaluated each time the function is called. A task can modifyexternal reg objects as it executes, possibly leaving a diversity of changes after itis done.

A task may include delay expressions or nonblocking assignments, although de-layed nonblocking assignments probably will not be synthesizable. A function maycontain only undelayed blocking assignments.

Timing. Functions can not include delays. A task can include scheduling delaysand event controls; a function can not. Because tasks can be executed over some ar-bitrary simulation time period, they can be used to represent the concurrency typicalof hardware components.

Functions are just complicated substitutes for expressions; functions execute inzero simulation time and return a value, very much like a simple expression.

8.1.1.1 Task and Function Declarations

A task is declared with a port list much like a module header; its declaration be-gins with the keyword task, and it ends with the keyword endtask. Here is anexample of a task declaration and call:

task SwizzleIt (output[3:0] SwizOut, input[3:0] SwizIn, input Ena);beginif (Ena==1’b1)

#7 SwizOut = { SwizIn[2], SwizIn[3], SwizIn[0], SwizIn[1] };else #5 SwizOut = SwizIn;end

endtask...always@(posedge GetData)SwizzleIt( Bus2, Bus1, SwizzleCmd );

A function is effectively the name of a temporary reg type that expresses thevalue returned by the function. For this reason, a function is declared as an ob-ject with a specific width, and with inputs only. Even though a function is calledwith inputs only, and can not have output or inout identifiers declared forit, its input(s) must be declared with the keyword input. A function may becalled in a continuous assignment statement, although this is done rarely in prac-tice. A function may call timing-independent system tasks or functions, such as$display.

When called, a task or function must be passed real params (arguments) by posi-tion, only; port-mapping by name is not allowed.

8.1 State Machine and FIFO design 147

A function is declared between the keywords function and endfunction.For example, here is a function declaration and call:

reg[7:0] CheckSum;function[7:0] doCheckSum ( input[63:0] DataArray );

reg[15:0] temp1, temp2; // Just to illustrate local declarations.begintemp1 = DataArray[15:0] ˆ DataArray[31:16];temp2 = DataArray[63:48] ˆ DataArray[47:32];doCheckSum = temp1[7:0] + temp2[7:0] ˆ temp1[15:8] + temp2[15:8];end

endfunction...#2 CheckSum = doCheckSum(Dbus);

8.1.1.2 Task Data Sharing

Because tasks affect declared data objects as they run, a possible problem ariseswhen two calls are made to the same task during the same simulation time period:Both task calls operate on the same data (the task is declared only once; both exe-cutions operate on the same, unique objects declared in the one task declaration), soconcurrent execution may lead to unpredictable behavior which may be fatal to thehardware being simulated.

To prevent sharing of declared internal data, and to allow recursion, tasks or func-tions may be declared as automatic, which means that their data are copied andpushed on the (simulator CPU) stack as the sequence of recursive executions pro-ceeds. The automatic declaration prevents sharing of local data among differenttask or function calls; the rationale becomes the same as that of the handling of localvariables when making recursive function calls in C.

You may wish to look at the on-disc example in the Lab11 directory,Find3Mod.v, to see how the automatic keyword is used. Implementation ofautomatic is not consistent across all EDA tools, so its use should be avoided ifportability is important.

8.1.1.3 Tasks and Functions are Named Blocks

Finally, tasks and functions are considered named blocks, and a task can be disabledby name from within itself or from anywhere that it can be called. It also may con-tain named blocks. A function can’t be disabled except from within itself, becauseit executes in zero time; however, named blocks in a function can be disabled bystatements within the function.


8.1.2 A Function for Synthesizable PLL Synchronization

We outlined a synthesizable rewrite of the FindPattern module in the previouschapter; the idea was to do one giant, 32-expression if to check format in a 64-bitdata stream. However, an equally valid way would be to run four function calls, eachchecking just 8 bits at a time.

For example, below is a function, checkPad(), which uses a for loop todo such a check. A function executes procedurally in zero simulation time, so itcan be written in software style, with no worry about simulator-synthesis schedul-ing inconsistencies. No delay is allowed, of course, but we already assumed thatour synthesizable version of FindPattern would require no simulation-objectupdate.

function // 64-bit vector 8-bit pad pattern offset in the streamcheckPad ( input[VecHi:0] Stream, input[PadHi:0] pad, input[AdrHi:0] iX );reg OK; // Flags pattern match.integer i // i is the data stream offset.

, j; // j is the pad-byte pattern offset.begini = iX; // Init to stream MSB to be searched.OK = 0; // Init to failed state.begin : For // Capitalized, this is not a verilog keyword.for (j=PadHi; j>=0; j = j-1)beginif (Stream[i]==pad[j])

OK = 1;else begin

OK = 0;disable For; // Break the for loop.end

i = i - 1;end // for loop.

end // For.checkPad = OK;endendfunction

Notice the “disable For;” statement, which stops all activity inside thenamed block, requiring that the final line, “checkPad = OK;” be executed next.This last ends execution, because a verilog function returns as soon as it is assigneda value.

Two comments:

(a) the index variables are declared integer, because the for loop exit variablemust be used in a signed comparison: To exit the for loop, a value less than 0must be expressed. The synthesizer’s optimization routines will remove unusedbits from any 32-bit integer, leaving only a register big enough to do the job.

(b) The for loop is put inside a block named, perhaps humorously, “For”; thisis legal, because all verilog keywords are lower-case, and verilog is a case-sensitive language. The block is named to allow disable to trigger early exit(during simulation) on a mismatch.


If we assume pad-byte patterns declared as local (permanent) parameters,

...localparam[PadHi:0] pad 00 = 8’b000 00 000;localparam[PadHi:0] pad 01 = 8’b000 01 000;localparam[PadHi:0] pad 10 = 8’b000 10 000;localparam[PadHi:0] pad 11 = 8’b000 11 000;...

then, the function checkPad() may be called this way:

...if ( checkPad(Stream, pad 11, OffSet-(1*PadWid))

&& checkPad(Stream, pad 10, OffSet-(3*PadWid))&& checkPad(Stream, pad 01, OffSet-(5*PadWid))&& checkPad(Stream, pad 00, OffSet-(7*PadWid))

)FoundPads = 1;

else FoundPads = 0;

This is much more readable (and reusable) than 32 equality comparisons allglobbed into one if expression.

See FindPattern.v in the Lab11 directory for a full implementation of thefunction calls above.

8.1.3 Concurrency by fork-join

We have looked at isolated, concurrent, independent execution by different alwaysor initial blocks, and we have looked at procedural execution, line by line,within sequential blocks. We have seen how procedural evaluation of delayed block-ing assignments differs from concurrent evaluation of delayed nonblocking assign-ments. Verilog also provides a construct which allows controlled concurrency be-tween statements. This is called the parallel block, or fork-join, construct. It ismodelled after the fork system call in the unix operating system.

A fork-join currently is not synthesizable, but it is used in simulation. Be-cause IP blocks are not normally synthesized, the fork-join may be used repre-sent hardware in a simulation model of an IP block to be included in an otherwisesynthesizable netlist.

A fork-join block is allowed wherever a procedural block is allowed, andit may contain any number of statements which are executed concurrently by thesimulator, with the usual unknown, effectively random, order of initialization ofeach such statement relative to the others. What is special about the fork-joinblock is that it provides for a waiting point at which all its concurrent statements areallowed to catch up with one another and resynchronize.


The statements following the fork are effectively independent threads of exe-cution; they are gathered together to a single time point, indicated by join, whenthe last of them finishes its execution. The gathering applies to the next sequentialstatement following join.

For example, in the following code, DataBus will see several glitches duringsimulation. If the delay times shown can’t be modified to fix this, the glitches willhave to be tolerated:

always@(posedge Clk)begin#1 DataBus[0] <= 1’b0;#2 DataBus[1] <= 1’b1;#4 DataBus[3] = 1’b0;#3 DataBus[2] <= 1’b1;OutBus = DataBus; // Some OutBus bits change before others.end

In verilog, the delayed nonblocking assignments should be concurrent, but theymay not simulate correctly in some simulators. Regardless, by collecting the resultinside a fork-join block, the glitches can be prevented:

always@(posedge Clk)beginfork#1 DataBus[0] <= 1’b0; // These could be blocking assignments; they#2 DataBus[1] <= 1’b1; // still would be scheduled concurrently.#4 DataBus[3] = 1’b0;#3 DataBus[2] <= 1’b1;joinOutBus = DataBus; // All OutBus bits change together.end

When you want several assignments to occur simultaneously during simulation,and you have to use procedural delays, consider using a fork-join block to ac-complish this. Otherwise, avoid this construct.

8.1.4 Verilog State Machines

There are two main kinds of state machine, Mealy and Moore. In a Moore machine,the outputs depend solely on the current state; in a Mealy machine, the current statehas some control over outputs, but additional control comes directly from inputs andmay be independent of the state. For example, a state machine might cycle throughvarious conditions, but a separate device might not always be ready to accept thestate machine’s outputs; if so, the other device may put some or all of the statemachine’s outputs into a high impedance state or may switch the outputs using amultiplexer; this machine then would be a Mealy machine. We shall not consider


this distinction further here, because the essence of the state machine is in how itchanges state, not in how its output logic is controlled.

A state machine consists minimally of a state register which changes value overtime in a deterministic way. A simple state machine is a toggle flip-flop, such as theone we used in our ripple counter lab exercise. If the current state is set (‘1’), thenon the next clock the state changes to clear (‘0’); if the current state is clear, then onthe next clock the state changes to set. Any digital counter is a simple state machine;but, usually, state machines are much more complicated. To describe complicatedfunctionality, a bubble diagram or a flow chart is used. Such diagrams may be foundin abundance in the data book of a modern microprocessor.

Good verilog design of a simple state machine is to separate the state registercontrol from most of, or all, the combinational logic in the machine. This separationis not required by the language, but it simplifies understanding and maintenanceof the state machine, as well as (usually) reducing the amount of coding required.See Fig. 8.1.

Fig. 8.1 Abstraction ofa verilog state machine.Separation of the functionalityallows blocking assignmentsto be used throughout thecombinational code

The state update logic includes the state register and is not accessible to exter-nal devices. The combinational logic uses inputs and the current state to (a) assignoutputs and (b) determine the next value of the state register.

8.1.5 FIFO Functionality

FIFO is an acronym for First-In, First Out: The first value written into a FIFO is thefirst one to be read out, somewhat like a pipeline in which the fluid sent in first isthe first to be delivered.

A FIFO is a kind of register or memory stack. A stack structure is typified bystorage which is addressed by numbers calculated from recent numbers, as opposedto being calculated independently as some arbitrary pointer or address value. It isthen very reasonable that the complement of a FIFO should be a LIFO, Last-In,First Out. A LIFO is like a standard microprocessor stack: The last value pushedonto the stack is the first one popped off of it.


A LIFO can be controlled by a single register, a stack pointer, because the lo-cation used to push data into a LIFO is the same as the location from which datawould be popped. However, a FIFO is more complicated, because, like a pipeline,it has two ends to be controlled. Whereas a LIFO can have just one stack pointer, aFIFO has to have two different pointers, one to write data in, and the other to readdata out.

Fig. 8.2 First-in, first-outfunctionality gives a FIFOregister file a specific direc-tionality

As shown in Fig. 8.2, the FIFO storage consists of some number n of registers,indicated by horizontal rectangles, and a predefined direction of data flow. Data arewritten in at some rate; they are read out at some perhaps different rate. One ap-plication of a FIFO is as a buffer between two different clock domains on a chip;bursts of reading or writing are cushioned by the FIFO so that devices in both do-mains can move data at some useful rate without waiting on every clock tick forthe other domain to become ready. Other FIFO applications are in RS-422 or eth-ernet serial links, both of which usually communicate between independent clockdomains.

There are two main conventions in representing FIFO read and write pointers:(a) These pointers can be viewed as pointing to the next valid address (register)in the FIFO; or, (b) they can be viewed as pointing to the most recently used one.In this course, we shall adopt the former convention: A pointer points to the nextaddress.

Our read pointer, as shown in Fig. 8.2, then must point to valid data in the FIFO,the next datum to be read; the write pointer must point to the unused register intowhich the next new datum will be written. Thus, the write pointer points to currently


invalid data, generally data which have been copied (read) out, so that that register’scontents are of no further value.

If the data flow is as shown in Fig. 8.2, from top to bottom, it must be that the twopointers move upward after each use. Now, what happens when the write pointershown reaches register n, at the “top” of the FIFO? Simple: It wraps around andpositions itself at the bottom register shown, if that register’s contents are invalid.We have seen the same thing when one of our unsigned reg counters wraps aroundfrom its maximum value to 0 again; and, in fact, unsigned counters can be used tocontrol FIFO read and write pointers.

Fig. 8.3 FIFO componentparts

The component parts of a FIFO are as shown broken down in Fig. 8.3. Thereis a set of data storage registers; a read-address and write-address pointer, and anunsigned counter to control each pointer value. The state machine enables or dis-ables counting and monitors the current count values.

Address translation for the FIFO register file could be used to map the countsto and from Gray code addresses. Thus, the register addresses can be encoded val-ues rather than simple, sequential binary counts. The small vertical arrow on theright of each counter indicates that it is allowed to count just one way, here de-picted as “up”. If the counters could count both the way shown and the oppo-site way, which is to say both “up” and “down”, there would not be any well-defined direction of data flow in the FIFO, and it would not be a FIFO. Byencapsulating count-to-address translation in verilog tasks, the translation ratio-nale can be changed arbitrarily without need to alter anything else in the FIFOimplementation.


8.1.6 FIFO Operational Details

Before attempting to write verilog describing a FIFO, let us look more closely athow it must operate. First, suppose the FIFO was “empty”, which is to say, that alldata had been read out of it and no new data had been written since. How to describethis depends on where the write pointer was when the read pointer was used to readout the last valid datum.

Suppose, for example, the write pointer was at the second register from the topin Fig. 8.3. Then, the FIFO would look something like that in Fig. 8.4, labelledEmpty1.

Fig. 8.4 Register addressingfor an empty FIFO

The horizontal arrow sits on the register at which the pointer is pointing; the smallvertical arrow indicates the only allowed direction of change. The write pointer,labelled with a “W”, is free to move upward; the read pointer (“R”) has just readfrom register j and is not allowed to move upward to j+1, because those data areinvalid, as explicitly indicated by the position of W. Thus, we must depict R as itselfbeing invalid, pointing nowhere useful.

In the situation just described, suppose, now, that one new data value was written.The FIFO no longer would be empty. This datum would be written to register j+1,and W would move to point to the next invalid register, j+2. We are now allowedto read from j+1, so R should be pointing there. This is shown in Fig. 8.5, labelledEmpty1 +W .

Fig. 8.5 Register addressingfor an almost-empty FIFO


Let’s look at a different, but equally “empty” condition. Suppose the last validdatum had been read while the write pointer was pointing to the bottom register(n= 0, in Fig. 8.2). Then, the read pointer must be invalid and again must pointnowhere. We may indicate this by putting R right below W, because when W doesa write, R will be ready to read from that exact, same register. Therefore, this sec-ond empty FIFO may be depicted as shown in Fig. 8.6 again to the left, labelledEmpty2.

Fig. 8.6 Register addressingfor an empty FIFO

In this case, when the next write occurs, making the FIFO no longer empty, Wwill move from 0 to point to register 1, and R will become valid, pointing to register0. This is shown in Fig. 8.7, labelled Empty2 +W .

Fig. 8.7 Register addressingfor an almost-empty FIFO

Very good. We may be able to guess from this that R must be less than W at alltimes, because R never can move upward to point to the same register as W; therefore,R never can cross over W and get above it.

However, now let us look at the opposite condition of the FIFO, when it is “full”.This means that not enough has been read recently, and that all possible registershave been written, so that no more writes are allowed, and therefore the W pointermust be invalid.


Suppose this condition occurred just after W had written its last datum into someregister j, as shown in Fig. 8.8, labelled Full1.

Fig. 8.8 Register addressingfor a full FIFO

Oops! We guessed wrong. As easily seen, it is R which is now just ahead of thenext possible position of W. W is not allowed to move up past R, because writing tothe valid register j+1, pointed to by R, is forbidden.

If a read now takes place, the FIFO no longer will be full, register j+1 willbecome invalid, and the condition will change to the one shown in Fig. 8.9, labelledFull1-R.

Fig. 8.9 Register addressingfor an almost-full FIFO

After the read, R points to j+2, and W is allowed to point to j+1, where a writenow can occur.

Finally, let’s look at another full condition, where R is at the bottom of the FIFO,at register 0. If so, it must be that W is invalid but will point to register 0 as soon as aread has allowed the FIFO to contain an invalid register. The full condition is shownon the left, in Fig. 8.10, labelled Full2.


Fig. 8.10 Register addressingfor a full FIFO

The position of W in Full2 actually is nowhere, because W is invalid; but, be-cause we know its first possible valid location, it is shown ready to wrap aroundfrom the other end of the register set to register 0; all that this wrap requires isone read.

After that one read, the FIFO no longer is full and the situation is shown inFig. 8.11, labelled Full2-R.

Fig. 8.11 Register addressingfor an almost-full FIFO

We see that R now points to register 1, and that W has become valid and points toregister 0, at the bottom of the FIFO as shown.

The conditions of “full” and “empty” are of great concern to the other devicesdepending on the FIFO for data transfer. A full FIFO means that the supplier of datamust stop trying to send data; an empty FIFO means that the consumer of data muststop trying to receive data. Thus, a FIFO in general must provide output flags thatindicate when it is full or empty.


8.1.7 A Verilog FIFO

Our FIFO will be composed of a memory (register file) and a control module. Thecontrol module will be designed as a state machine. As explained above, we shallseparate the state machine sequential transition logic from the combinational logicof the rest of the state machine. A schematic of the particular design we shall use isgiven in Fig. 8.12.

Fig. 8.12 Block diagram of FIFO

In the following, we shall refer to “read” and “write” commands by the statemachine. These terms refer to commands issued by the state machine to the RAM,at the read or write address being output by the state machine.

Keep in mind that there will be an external device which is using the FIFO andwhich will be sending read and write requests to the FIFO; the state machine willhonor these requests by issuing commands to the RAM. However, when the FIFO isfull, a write request will have to be ignored; when the FIFO is empty, a read requestwill have to be ignored.

The problem in the detailed description of the FIFO above is in the read and writepointers when they are not usable; one of them is unusable in the “empty” condition,the other in the “full” condition. At face value, we can’t write verilog to handle apointer which is pointing nowhere. On the other hand, study of the preceding detailsreveals that the R and W pointers, as well as their next transitions, always are well de-fined numerically if we stay just one step away from the FIFO states of empty or full.

So, we shall design around “almost-empty” and “almost-full” and shall treat“full” and “empty” as special cases which include invalid pointers.

For brevity, let R and W indicate the numerical positions n of the FIFO registers towhich they point. Then, when R == W-1, the FIFO is one read away from empty;and, when W == R-1, the FIFO is one write away from full. These are easilydescribed arithmetic relations. The condition R==W is, of course, operationally for-bidden for two valid pointers; one must be invalid. All other values of R and W allowsimple arithmetic describing correct operation with no special concern: The pointersimply is incremented by 1 each time, after it is used.

Given this, we may describe our FIFO as a state machine with a possible statetransition only on a read or write and with the following five simple states:


normal Normal. Read may transit to almost empty; write may transit to almost full.No other transition allowed. A 4-register FIFO will move directly between a emptyand a full, with no normal state.

a empty Almost empty. Read transits to empty; write transits to normal.a full Almost full. Write transits to full; read transits to normal.empty Empty. Read is forbidden; write transits to a empty.full Full. Write is forbidden; read transits to a full.

A state transition diagram (“bubble diagram”) for this machine would be asshown in Fig. 8.13.

Fig. 8.13 FIFO state machinetransition diagram. “begin”is the power-up or resettransition. Assumes a FIFOwith five or more storageregisters and R and W theaddress of next action

We’ll start by defining our state register and its states. For 5 states, the majoralternatives would be a 3-bit binary register or a 5-bit one-hot register. Let’s use abinary encoding.

The machine must retain state, so we require a block of clocked sequential logicfor state transitions. Applying our previous rules for coding, we want nonblockingassignments separated from blocking ones, so we shall isolate state transitions in asmall, clocked block with nonblocking assignments only. Combinational logic willdetermine the next state from the current one, so we need only pass the sequen-tial block a variable with the next state encoded in it. The transition logic can beblocking assignments. We shall just code the state assignments in the state machinemodule; the FIFO registers themselves will be in a separate module.

We need two other sequential elements, the read and write counters. They haveto be clocked so that state changes take place only while the FIFO pointers are ina known, consistent state. We’ll therefore only allow the combinational logic to beread on the opposite clock edge from the one which updates the address countersand the state register. If we update the state register on the positive edge of the clock,this means that we should update the address counters on the same edge but readthe combinational block, which programs all updates, only while the clock is low.There are several ways of implementing the counter updates, but a task called in thecombinational block perhaps is the simplest:


task incrRead; // Called while clock is low.begin@(posedge Clk)ReadCount = ReadCount + 1;

endendtask

When this task is called, it stops at the event control and waits on a positive clockedge; when that edge occurs, it increments the read counter address and then exits.A similar task may be declared for the write counter.

Using tasks to increment the counters, and perhaps also to compare values, wouldmake modifying the counter easier and less error-prone during early developmentthan having to go around and edit isolated stretches of combinational code, if wedecided to change to a one-hot or gray-code address counter instead of a binarycounter.

Turning to the state encoding and ignoring state transition logic, so far we havethis:

//// Don’t allow any other module to affect the state encoding,// so, use localparams://localparam empty = 3’b000, // all 0 = empty.

a empty = 3’b010, // LSB 0 = close to empty.normal = 3’b011, // a empty < normal < a full.a full = 3’b101, // MSB 1 = close to full.full = 3’b111; // all 1 = full.

//// The sequential block controlling state transitions://always@(posedge Clk, posedge Reset)if (Reset==1’b1)

CurState <= empty;else CurState <= NextState;

// End sequential state transition block.

We cannot reset NextState in the clocked block, because the synthesizerwould object to the contention of the clocked assignment to NextState in thesequential logic combined with the inevitable other unclocked assignments in thecombinational logic; so, we shall have to plan for a reset of NextState inthe combinational logic. This synthesizer behavior is a good thing in general; ithelps prevent race conditions.

We have to process requests to read and write, so our combinational block forcontrolling state transitions should be sensitive both to state changes and to theserequests.

A first cut at the combinational block might be as follows, with variables named“xxxReg” being the reg used for procedural assignment to output wire xxx:


always@(ReadReq, WriteReq, CurState, Clk) // NOTE: Request, not Register.if (Clk==1’b0) // Only read after a negedge of clock.begincase (CurState)empty: // Combines reset unique conditions

// with a simple empty state during operation:beginif (Reset==1’b1)begin// Reset conditions:FullFIFOReg = 1’b0; // Clear full flag.WriteCount = ’b0;WriteCmdReg = 1’b0;ReadCmdReg = 1’b0;NextState = empty;

end//// Generic empty conditions:EmptyFIFOReg = 1’b1; // Set empty flag.ReadCmdReg = 1’b0; // Disable RAM read.// One transition rule:if (WriteReq==1’b1 && ReadReq==1’b0)beginReadCount = WriteCount; // Could also init to Adr 0.incrWrite; // Call task, which blocks on posedge Clk.WriteCmdReg = 1’b1; // Issue a RAM write.EmptyFIFOReg = 1’b0; // Clear empty flag.NextState = a empty;end

else ReadCount = ’bz; // Nowhere.end // empty state.

a empty: begin...end // a empty state.

normal: begin...end // normal state.

a_full: begin...end // a full state.

full: begin...end // full state.

default: NextState = empty; // Always handle the unexpected!endcaseend // always.

A problem which might be overlooked here is that WriteCount is beingread in this block, but that it is not on the sensitivity list, risking creation of anunwanted latch during logic synthesis. To avoid this, in our next approximation,we shall change always@(ReadReq, WriteReq, CurState, Clk) toalways@(∗) to avoid any possible future unpleasant surprise.

Let’s consider one more state and set up transition rules for it. The normal stateseems a good candidate. We proceed as follows:

Transitions in our state machine only occur on a read or write. When a clockoccurs, but the FIFO performs neither read nor write, it is possible to consider the


machine in an “idle” state. For some state machine problems, implementation of anexplicit idle state is useful; but, for ours, it is unnecessary, because we don’t consideroccurrence of a clock to be an event relevant to the machine state. We don’t careabout anything but read or write. The two clock edges just provide orderly executionas well as adequate time for logic to settle.

In the normal state, on any read or write, we only have to check to see whetherthe next state should be a empty or a full. There is no way in our design thatwe could go directly from normal to empty or full, barring a machine reset.We also know that

(write counter value) > (read counter value) +1

and

(read counter value) > (write counter value) +1

both are required to be true to remain in the normal state.So, the simplest way to look for a transition out of normal is to add another 1

to every new counter value and compare with the current value in the other counter,checking for an equality instead of a greater-than. If an equality has occurred, thenthe machine must exit the normal state.

Taking all this into consideration, the combinational block case alternative fornormal should be something like the following:

normal: begin// On a write:if ( {WriteReq,ReadReq}==2’b10 ) // Concatenation.

beginReadCmdReg = 1’b0; // Disable RAM read.WriteCmdReg = 1’b1; // Issue a RAM write command.incrWrite; // Call task, which blocks on posedge Clk.//// Transition rule: Check for a full:if ( ReadCount == WriteCount+1 )

NextState = a full;else NextState = normal;end

// On a read:if ( {WriteReq,ReadReq}==2’b01 )

beginWriteCmdReg = 1’b0; // Disable RAM write.ReadCmdReg = 1’b1; // Issue a RAM read.incrRead; // Call task, which blocks on posedge Clk.//// Transition rule: Check for a empty:if ( ReadCount+1 == WriteCount )

NextState = a empty;else NextState = normal;end

end // normal state.

To reduce the complexity of comparing 2 bits, we concatenate them into a single,2-bit expression. The read if statement might have been put into an else, making


it the alternative to write; for now, the else has been omitted because it seemedof limited value and would have added lines of code for the designer to read andunderstand.

Keep in mind that in a context in which the request bits might be reversed by anexternal device while the write operation was being processed, the independent if’s(write and then read) in the normal state code above would be seen as a possiblerace-condition hazard. If so, an if-else or a nested case definitely would be agood idea.

Notice that the state transition is assigned last; this is to ensure that all operationsto be completed in the current state are scheduled before the state can be changed.

Finally, notice that in the code above we are comparing someCounterValue+1 with someotherCounterValue. The “+1” might be a problem, be-cause the width of the expression becomes ambiguous: Is it the expression of aninteger (‘1’) or of a rather small reg (“someCounter”)? When a verilog variablehas been assigned, the result takes on the width and type of the destination vari-able; but, this expression will not be assigned to anything. Integers are signed types,and the last thing we would want is for a negative count to be expressed. To avoidpotential problems caused by different simulator implementations, in our next ap-proximation of this model, we shall assign the sum to a reg of the same width assomeCounter before making the equality comparison.

Delay times are omitted in the code above because blocking assignments guar-antee the evaluation sequence, and preceding each statement with, say, “#1”, wouldmake no obvious difference. Furthermore, synthesis will introduce a variety of un-predictable delay differences in each branch of the logic, and backannotated delaysfrom floorplanning and layout would supercede programmed delays in this model,anyway.

One reason to add delays in the code might be to make the simulation proto-col behaviorally accurate with regard to other devices to be connected to the statemachine controller; however, such delays far better would be added in the final con-tinuous assignments, as shown in FIFOStateM header.v:

module FIFOStateM#(parameter AdrHi=4) // 2** (AdrHi+1) registers. Default=32.( output[AdrHi:0] ReadAddr, ..., Clk, Reset);

//reg[AdrHi:0] ReadAddrReg, WriteAddrReg;...reg EmptyFIFOReg, FullFIFOReg

, ReadCmdReg, WriteCmdReg;//assign #1 ReadAddr = ReadAddrReg;assign #1 WriteAddr = WriteAddrReg;assign #1 EmptyFIFO = EmptyFIFOReg;assign #1 FullFIFO = FullFIFOReg;assign #1 ReadCmd = ReadCmdReg;assign #1 WriteCmd = WriteCmdReg;

...


A possible reason to add delays in the individual statements would be to spreadthe simulator waveform display to displace edges and perhaps make the sequenceof events easier to see; however, such modifications then would have to be markedsomehow in the simulator so that they could not be forgotten and become part of thedesign. Such delays might be justified occasionally; but, generally, the only delaysin a design should be those near the top of the design file, in continuous assignmentstatements to the module outputs.

8.2 FIFO Lab 11

Do all work for this lab in the Lab11 directory.

Lab Procedure

Step 1. Use of a task in the Lab 10 design. Copy the file, FindPatternBeh.v,from the Lab10 directory to Lab11, renaming it FindPatternTask.v. Asusual, rename the module so it corresponds to the file name. In this verilog, there isa repeated stretch of code:

#1 i = i + IJump*j - 1;#1 j = 0;#1 Nkeeper = ’b0;

Clearly, the two identical occurrences are supposed to do exactly the same thing;and, if it became necessary to change one, the other should be changed the sameway. This is an ideal situation in which a task or function should be used. Timedelays are involved, so it has to be a task.

Enclose the code above in a task, and call the task at the two places where thecode above appears. Briefly run a simulation to verify correctness.

Step 2. An assertion task.Use the ErrHandle code below as an assertion somewhere in each of the Steps

in this lab, including the previous one, to warn the user of some error or dubiouscondition. Test it for each of the possible actions (see Fig. 8.14). There is an examplefile containing this task in your Lab11 directory:

The Sts type is 4 bits, which allows for 16 different status conditions. Becausea reg is unsigned, this also means that 4’b1000 or “above” will be used to rep-resent negative status values passed to the task. There are four possible Actionvalues:

8.2 FIFO Lab 11 165

• 0: The Sts is 0 or the Msg is null. This condition is normal, and the task silentlyreturns, with no message.

• 1: The Sts is -1 (4’hf). The condition is a fatal error and the simulation isfinished ($finish).

• 2: The Sts is more negative than -1 (4’hf>Sts>=4’h8). The conditionis an error and the simulation is stopped ($stop) but may be continued.

• 3: The Sts is positive (4’h1--4’h7). The condition is a warning or an in-formative one, the Msg is printed to the simulator console, but no other action istaken.

task ErrHandle(input[3:0] Sts, input[255:0] Msg);reg[1:0] Action;

beginif (Sts==4’h0 || Msg==’b0)

Action = 2’b00; // == 0.else if (Sts==4’hf) Action = 2’b01; // Sts == -1, 2’s complement.else if (Sts>=4’h8) Action = 2’b10; // Sts < -1, 2’s complement.else Action = 2’b11; // Sts > 0.//case (Action)

2’b00: Sts = 0; // Do nothing.2’b01: begin

$display("time=%4d: FATAL ERROR. %s", $time, Msg);

$finish;end

2’b10: begin$display("time=%4d: ERROR Sts=%02d. %s"

, $time, Sts, Msg);$display("\nYou may continue the simulation now.");$stop;end

default: $display("time=%4d: Sts=%02d. NOTE: %s", $time, Sts, Msg);

endcaseend

endtask


Fig. 8.14 Test simulation of ErrHandleTask

Of course, there is no effect on simulation waveforms; however, the messagesasserted will appear in the simulator console window, as shown in Fig. 8.14.

Step 3. FIFO state machine design. A module header and partial design has beenprovided in the file, FIFOStateM header.v. A sketch of a testbench also isincluded. Copy this file to a new one named FIFOStateM.v, and complete thedesign as follows:

A. Change the combinational always block’s sensitivity list so it will be sensi-tive during simulation to any change in a variable which is read within thatblock.

B. Complete the assignments and transitions for the remaining states. Simulate themachine to check correct address generation. Use for loops in your testbenchto write the FIFO full, then read it empty, to verify correct address, state register,and flag operation.

C. Synthesize the design, optimizing first for area and then for speed.

Step 4. Attach a register file to the FIFO state machine, making a complete, func-tional FIFO. We shall improve the memory and controller functionality in a laterlab. For now, proceed as follows:

8.2 FIFO Lab 11 167

Copy your Mem1kx32.v static RAM design from the Lab07 directory intothe Lab11 directory. Create a new verilog module named FIFO Top in its ownnew file in the Lab11 directory. In FIFO Top.v, instantiate your FIFOStateMand Mem1kx32 and connect them together. Supply top-level input and output databusses to read and write to the FIFO. You will have to combine the state machine’sseparate read and write address busses to provide a single address for the mem-ory; use a top-level continuous assignment statement and a conditional operatorto do this.

After simulating enough to satisfy yourself that your FIFO is working (seeFig. 8.15 and 8.16), synthesize the FIFO in random logic and optimize first forarea and then for speed.

Step 5. If time permits, and if you have not already done so, double check yourstate machine combinational block for possible race conditions as discussed in thelecture notes above. Race conditions can be avoided by using a single statement tocheck at one point in simulation time, in each state, to decide what the requestedoperation (or state transition) shall be.

Fig. 8.15 First-cut FIFO simulation – limping, but with obvious race conditions avoided

Fig. 8.16 Close-up of the FIFO simulation, showing the first read-out of the register file


What is the relationship of FIFO design across clock domains and the use of Graycode counters?



Read Thomas and Moorby (2002) section 3.5 on functions and tasks.Read Thomas and Moorby (2002) section 4.9 on fork-join.Read and understand the simple memory model in the Thomas and Moorby (2002)section 6.3, exercise 6.9.

(Optional) Thomas and Moorby (2002) contains several different perspectives onstate-machine modelling. If you would like to know more about this, try (re)readingsections 1.3.1, 2.6–2.7, chapter 7, and appendices A.14–A.17. See the optional read-ings below, too.


Read Chapter 8 on tasks and functions.The code in Appendix F.1 represents a synthesizable FIFO. The use of coun-

ters to track the FIFO state is important; some of our References discuss gray-codecounters and other FIFO subtleties. Notice that Palnitkar’s appendix F FIFO hasonly four storage registers.

There are state machine models in sections 7.9.3 and 14.7. Look through theseto see how the control is implemented.


9.1 Rise-Fall Delays and Event Scheduling

This time we shall look into delays, timing, and the way they are used in verilogstatements.

9.1.1 Types of Delay Expression

We’ll deal with path delays, component internal delays, and switch-level delayslater in the course. For now, we shall concentrate only on gate-level (gate-to-gate)and procedural delays.

Regular vs. Scheduled Delay. Two kinds of delay expression are regular andscheduled. The scheduled kind is called “intra-assignment” in the Thomas andMoorby (2002), section 4.7. Although the Thomas and Moorby (2002) authorsseem to find scheduled delays useful, for example in section 8.3.1, in practice theyrarely are so. A regular delay appears to the left of the target of a statement; ascheduled delay appears to the right of the assignment operator (‘=’ or ‘<=’). Thepossibilities are,

#(delay) variable1 = value; // 1. regular blocking.#(delay) variable2 <= value; // 2. regular nonblocking.variable3 = #(delay) value; // 3. scheduled blocking.variable4 <= #(delay) value; // 4. scheduled nonblocking.#(delay) variable5 = #(delay) value; // 5. both, blocking.#(delay) variable6 <= #(delay) value; // 6. both, nonblocking.

The regular delay statement delays the statement until the specified delay valueof time has lapsed; then, the RHS expression is evaluated, and the statement isexecuted.

All nonblocking delay statements are flagged as errors or with serious warn-ings by the synthesizer. Blocking delays are synthesized with the delay values ig-nored. As we have mentioned before, the synthesizer can’t synthesize delay values,



especially those scheduling concurrent, future events, and delayed nonblocking as-signments might be meant to be executed in any arbitrary order, from the synthe-sizer’s perspective.

The scheduled delay statement causes immediate evaluation of the RHS of thestatement and delays assignment of the result for when the specified value; then,the statement is executed. This is equivalent to a VHDL simulator’s transport delayscheduling mode.

The “both” delay statement above combines regular and scheduled delays. Weshall examine it a little more in lab.

Terminology Difference:Thomas and Moorby (2002) introduces the term regular event in section 8.4.3;this is meant to refer to something different from our regular delay. TheThomas and Moorby (2002) regular event is what this book and the IEEEStd 1364 would call an active event, as explained below.

It is strongly discouraged to use the scheduled delay for anything. In effect, thisis an analogue, active-element RC delay-line delay which almost always will beunrealistic in the simulation of on-chip digital logic. Furthermore, this constructin effect creates a new concurrent thread of execution within a procedural block,negating the value of sequence to control functionality. For concurrency, instead useindependent always blocks or continuous assignment statements. If you are notplanning to synthesize the logic, you may use a fork-join. However, this shouldbe a last resort. Keep in mind that a fork-join can not allow delays to time outindependently; rather, the join restores sequence only after all forked statementdelays have lapsed.

It might seem reasonable to use scheduled delays for appearances: To separatesimultaneous edges in a simulator waveform display for better visibility. However,no simulator known to the author will display scheduled-delay displacements dif-ferently from regular-delay displacements; thus, the verification tool may becomerisky for verification, by confusing modelled delays (setup or hold timing slacks,for example) with scheduled delay displacements which are not part of the design.

For example, suppose a D flip-flop was clocking in data, but the setup delay wasnot evident (see Fig. 9.1 A). A scheduled delay would make this clearly visible, butonly if everything went well, and no design problem developed:

Fig. 9.1 Scheduled delays can cause confusion. A, original display; B, appearance fixed by sched-uled delay on C1K and Q; C, external events cause loss of setup, but scheduled delay conceals thereason; D, same external events as C, and omission of scheduled delays exposes the setup loss

9.1 Rise-Fall Delays and Event Scheduling 171

As pointed out in Thomas and Moorby (2002) (section 8.4.1), a scheduled delayalso may be viewed as though it created a temporary reg variable which is invis-ible and thus protected against other assignment statements: For a delay value d,“x <= #d y;” is the same as, “temp = y; #d x = temp;” in which tempis protected against other assignment during the intervening #d delay. This can bewasteful of simulator memory, and it may introduce design bugs where otherwisenone would be. The hidden value can’t be cancelled by input changes, making real-istic inertial scheduling impossible.

Multivalue Wire-Delay Expressions. Even in CMOS technology, in whichgates are relatively symmetrical, rise vs. fall, there is some difference in performancebetween the delay to a ‘0’ and the delay to a ‘1’. For example, capacitive interactionwith ground or power planes can change timing differently. Also, NMOS transistorscan be very slow to pull up to a ‘1’, and PMOS to a ‘0’. To model this kind ofdifference, verilog allows primitive-gate or wire-connecting timing expressions toinclude two, and sometimes three, values, thus:

assign #(3, 5) OutWire 001 = newvalue;bufif1 #(3,5,7) GateInst 001(OutWire 001, InWire 001, Control 001);

The two values are in (rise, fall) order, and the three values are in (rise, fall, toz)order and represent the rise delay from ‘0’ to ‘1’ and the fall delay from ‘1’ to ‘0’.For assignments, or for components capable of it, the third value represents high-impedance delay. We shall study these differences in detail later in the course. Theyare not allowed in procedural assignments; a procedural statement only is allowedone delay value.

In summary, a connecting delay expression may have one, two, or three delayvalues:

#(every delay) Simulator uses this for all scheduling.

#(rise delay, fall delay) Simulator uses rise delay to schedulechanges to ‘1’.

Simulator uses fall delay to schedule changesto ‘0’.

#(rise delay, fall delay, z delay) Simulator uses rise delay and fall delay asfor 2 values.

Simulator uses z delay to schedule changesto ‘z’.

Delays to ‘x’, or two-valued delays to ‘z’, use the shortest delay in the expres-sion; this is one facet of the principle of “delay pessimism”.

Multivalue delay expressions may be used in:

• Continuous assignment statements.• Primitive component instantiations.


Scheduling Surprises. Here are a few perhaps surprising points to ponder, be-fore we look into the verilog event queue. In the following, keep in mind that manysimulators designed for VLSI netlists will not simulate inertial delay properly, whenthe delays are hand-entered in continuous assignments or procedural blocks:

First, when a list of #0 nonblocking statements is encountered in a proceduralblock, the simulator schedules them all at once in the current time, after undelayedevents, and they can not be depended upon to be run in any well-defined order (orderis implementation-dependent):

x <= 1’b1;y <= 1’b0; // z gets the new 1’b1 value of x.

#0 z <= x;

Of course, if the designer only writes synthesizable code, this problem neverarises.

Second, when a list of undelayed statements is encountered, they are guaranteedto be run in the order listed, in an interval of zero time duration. However, nonblock-ing statements assign the right-hand value read, whereas blocking statements blockevaluations, update the left-hand sides in order, and use only the updated values:

x <= 1’b1;y <= 1’b0; // z gets the old value of x;z <= x; // x gets the 1’b1 originally scheduled.

Third, only one evaluation can be held scheduled for a future time. Thus, whenconflicting assignment statements are read, the last one read determines the delayand value to be scheduled:

#2 x <= 1’b1;#2 x <= 1’b0; // x will be scheduled for 1’bz#2 x <= 1’bz; // at 2 time units from current time.

This is pathological code, and many tools will refuse to simulate it correctly.Again, writing only synthesizable code avoids this whole issue; but, it is good to beaware of it – for example, when writing a testbench.

Now, let us proceed to an understanding of why we should not have beensurprised.

9.1.2 Verilog Simulation Event Queue

Specifications for simulation are built into the verilog language, including specifi-cation of how events shall be queued for execution and scheduled for future simu-lation times. From here on, the word time will refer only to simulation time, unlessexplicitly stated otherwise.

As described in IEEE Std 1364, the verilog language depends on a stratifiedevent queue. A conforming simulator initially starts with a list of undelayed events,


executes them, sets time to 0, and then proceeds forward in time in a completelydeterministic way. However, when two or more events are scheduled for the sametime, the language may not specify an order of execution; in other words, all may befound in the same stratum. Thus, when events are simultaneous, different simulatorimplementations sometimes may produce logically different, but equally correct,results. It is up to the designer to write verilog avoiding such differences, when theymatter to the validity of the design. Thus, it is important to understand the stratifiedevent queue.

The verilog stratified event queue is illustrated in Fig. 9.2. Before 2005, the ver-ilog IEEE Std 1364 defined additional, PLI-related strata which now are obsolete.

Keep in mind that an event is a change in value, caused by an output driver(structurally) or by execution of an assignment statement. Mere evaluation of aninput, or of the expression on the right-hand-side of a statement, is not an event.

Fig. 9.2 Verilog Event Queue.. The five-region queue proper is on the right, separated from simu-lator initialization actions. Solid lines represent current-time flow or processing order; dotted linesrepresent input obtained on advancing the current time to the next (future) time

The solid lines in Fig. 9.2 are meant to represent current-time accesses to data,or the flow of control in the current time. For example, one of the solid lines (“Toactive in any order”) represents access to the inactive region of the stratified queueafter all active events have been executed (after all new values are assigned to theirvariables). That solid line means that all inactive events found are moved to theactive region. Any ordering of statements in the verilog source code, or any order-ing which occurred while moving events into the inactive region, is lost during thetransfer from the inactive to the active region.

It is interesting that, even though #0 means zero time delay, events scheduledat the current time with #0 are placed initially in the inactive region, until allactive events have been processed. The active events can include undelayed blocking


assignments, changes in verilog primitive inputs resulting in undelayed outputchanges, or other events which had been scheduled for the current time becauseof being delayed from some previous time.

Other solid lines in Fig. 9.2 show that the active region may trigger new inactiveevents (“New #0 events”), new nonblocking assignments, and so forth. When allactive events have been exhausted, and no new event can be created in the currenttime (by transfer from the inactive, nonblocking assignment, or monitor region, inthat order), the time is advanced to that of the next event, in the future, which wasread and scheduled. The dotted lines in Fig. 9.2 then represent the flow of new datafrom the new current time into the stratified queue.

The often-useful $strobe system task is executed from the monitor stratum,which is after the nonblocking assign stratum; this is why $strobe can report theresult of a nonblocking assignment.

9.1.3 Simple Stratified Queue Example

There is an example named InactiveStratum.v in your Lab12 directory. Youmay wish to simulate it to understand the reasoning here. The verilog, with some ofthe comments and extra spacing omitted, is as follows:

‘timescale 1ns/100psmodule InactiveStratum;reg Clk, A, Z, Zin;always@(posedge Clk)begin

A = 1’b1;#0 A = 1’b0;end

‘ifdef Case1// Case 1: Z inactive:always@(A) #0 Z = Zin; always@(A) Zin = A;

‘else// Case 2: Zin inactive:always@(A) Z = Zin; always@(A) #0 Zin = A;

‘endif//initialbegin#50 Clk = 1’bz;#50 Clk = 1’b0;#50 Clk = 1’b1;#50 $finish;end

endmodule


For either condition of compilation, there are two always blocks sensitiveto change on A. There is only one active Clk edge in the testbench, and thattriggers two successive changes in A: The first change is from ‘x’ to ‘1’, andthe second, without advancing simulation time, is a change of A from ‘1’to ‘0’.

Here is how the stratified event queue determines this simple simulation:

1. Nothing happens (except to Clk) until time 150, when Clk goes to ‘1’, whichis a positive edge in the active region during time 150.

2. At time 150, the top always block is desensitized, and the simulator puts theone new active event, the top assignment statement, into the time = 150 activeregion.

3. Evaluating and finishing the one active assignment statement event, the simulatorassigns the value ‘1’ to A.

4. This assignment allows the blocked, second statement in the top always blockto be read and placed in the time = 150 inactive region.

5. The always blocks sensitive to the change in A to ‘1’ now must be read.

The remaining events depend on whether we are in Case 1 or Case 2.

Case 1

5. Both other always blocks are desensitized, and the undelayed assignment ofA to Zin becomes a new event in the time= 150 active region. The #0 de-layed assignment to Z in the other always block becomes a new event in thetime = 150 inactive region.

6. There is only one active event, so A is evaluated to ‘1’, and Zin isassigned ‘1’.

7. There are no more active events, so the two events in the time = 150 inactiveregion are moved in random order into the active region: These events are:

A to 1’b0; Z to Zin;

8. We know Zin definitely is 1’b1, so Z goes to 1’b1. Whether or not executedfirst, A goes to 1’b0.

9. The change in A again triggers both of the Case 1 always blocks, and thisputs the assignment to Zin into the active region and the assignment to Z inthe inactive region.

10. The value of Zin is changed to 1’b0, and this empties the active region. Theone event in the inactive region then is moved into the active region and exe-cuted, changing Z also to 1’b0.

11. All statements in all always blocks have been read; so, there are no further ac-tive events. All always blocks are resensitized. The simulator moves all non-blocking assignments, without reevaluation, to the active region of time = 150(there are none). After that, all events from the time = 150 monitor region aremoved into the active region (there are none).


12. The simulator locates the first event in future time, the $finish, puts it in thetime= 200 active region and executes it, terminating the session.

The final values in Case 1 then should be: A=0, Z=0, Zin=0. Notice that de-laying the assignment to Z has guaranteed that it will be determined by the value ofZin, whenever A changes.

Case 2

5. Both other always blocks are desensitized, and the undelayed assignment ofZin to Z becomes a new event in the time= 150 active region. The #0 delayedassignment to Zin in the other always block becomes a new event in thetime = 150 inactive region.

6. There is only one active event, so Zin, never yet assigned, is evaluated to ‘x’,and this value also is transferred to Z.

7. There are no more active events, so the two events in the time = 150 inactiveregion are moved in random order into the active region: These events are:

A to 1’b0; Zin to A;

8. We know A definitely will go to 1’b0, but we can not tell how the race toassign Zin will be resolved. However, we know that either the old or the newvalue of A will be used, so Zin definitely will become either 1’b1 or 1’b0,not 1’bx.

9. The change in A again triggers both of the Case 1 always blocks, and this putsthe assignment to Z into the active region and the assignment to Zin into theinactive region.

10. The value of Z is changed either to 1’b0 or 1’b1, and this empties the activeregion. The one event in the inactive region then is moved into the active regionand executed, changing Zin to 1’b0.

11. All statements in all always blocks have been read; so, there are no furtheractive events. All always blocks are resensitized. The simulator moves allnonblocking assignments to the active region of time = 150 (there are none).After that, all events from the time = 150 monitor region are moved into theactive region (there are none).

12. The simulator locates the first event in future time, the $finish, puts it in thetime=200 active region and executes it, terminating the session.

The final values in Case 2 then should be: A=0, Z=(0 or 1), Zin=0. No-tice that delaying the assignment to Zin by #0 has created a race condition on Z.

This example was unusually complex to decipher and understand manually; itwould be an error to write such code in a real design, because maintenance or modi-fication would be difficult, time-consuming, and error-prone, especially if there wasmore to the module than three simple always blocks. Also, as commented in theon-disc Lab12 file, two very widely used and well-written simulators produce dif-ferent simulations!


Application of even one of our rules of thumb would mitigate the complicationsof this instructional example. The example purposely ignores three of our codingrules of thumb:

• Use nonblocking assignments in a clocked block.• Never put a #0 delay in a design.• Don’t allow strange latches (the two always@(A) blocks).

9.1.4 Event Controls

There are just two kinds of event control, other than (re)location of a statementin a specific procedural block: The @ statement and the wait statement. All otherstatements merely are executed or not.

The @ statement is allowed only in (or at the start of) a procedural block, andit causes a wait in further reading of the block until the event expression changes.This change makes @ effectively edge-sensitive. We already have used this statementextensively, so examples should not be necessary.

The @ event expression may be the name of a data object or the choice ofan object edge (posedge or negedge). When the expression changes fromsome other logic level either to a logic ‘1’ or ‘0’, the expression comes true,the @ statement is executed, and subsequent procedural statements may be readand possibly executed. When the edge-chosen expression changes to the speci-fied level, up to ‘1’ (posedge) or down to ‘0’ (negedge), the @ statementlikewise is executed. The @ statement itself changes nothing in the simulationdata.

Introducing @ with the concurrent always (“always @”) allows an @ state-ment concurrently to initiate reading of a procedural block of otherstatements.

The declared event, although rarely used in practice, is mentioned here for com-pleteness. A declared event is introduced by the verilog keyword event, and itallows for the assignment of names to nondesign objects called events, so that eventcontrols (@ statements) might be triggered remotely from anywhere in the samemodule. The syntax is, “event MyEventName;” followed somewhere else byan event statement using the name. For example, suppose a task included the event-control line,

@(MyEventName)do something;

in which do something was some statement. The event control would be triggeredby a statement consisting of an arrow (hyphen plus greater-than) and the declaredname; for example,

if (expr) -> MyEventName;


Declared events effectively are goto constructs, and designers have avoided us-ing them; they depend on an unusual syntax and thus provide complexity with noredeeming special functionality. Better to declare a function or task and simply callit under event control.

The wait statement is a level-sensitive construct, and it depends on a logicalexpression, not a design object name. The syntax is

wait (expr)statement;

When expr comes true, wait executes its statement. For example, wait (x>5)x = 0;Although a construct of the same name is used frequently in VHDL, waitin verilog rarely is seen; this probably is because of availability of the more versatile@ event control.

9.1.5 Event Queue Summary

All the preceding considered, it’s a good idea not to use #n delays in procedural codeor anywhere else in a design module. If necessary, put #n delays only on moduleoutput drivers.

If this advice is taken, then the event queue complexities can be ignored in syn-thesizable coding; and, the following simple considerations are the only ones whichwill remain:

1. Nonblocking assignments in a begin-end block always use old values and areexecuted after all other statements at a given simulation time.

2. Blocking assignments in a begin-end block always use updated values, as inC, and are completed before the first nonblocking statement at a given simulationtime.

3. Therefore, under ordinary circumstances, use only nonblocking assignments inclocked (edge-sensitive) blocks, and use only blocking assignments in otherblocks. This simulates normal setup and sequential activity properly and tellsthe synthesizer where setup is required.

Because understanding of the stratified event queue is necessary to an under-standing of verilog, we shall be using procedural delays frequently for instruc-tional reasons. But, in general, one should not use procedural delays in onesdesign.

9.2 Scheduling Lab 12 179

9.2 Scheduling Lab 12

Do this work in the Lab12 directory.

Lab Procedure

Step 1. Two scheduling and delay examples.

module SchedDelayA;reg a, b;initialbegin#1 a <= b;#1 a = 1’b1;#2 a = 1’b0;#2 a = 1’b1;#1 a = 1’b0;#0 b = 1’b1;#0 b = 1’b0;#0 b <= 1’b1;#5 $finish;end

//always@(b) a = b;//always@(a) b $<$= a;//endmodule // SchedDelayA.

module SchedDelayB;reg a, b;initialbegin#0 b = 1’b1;#0 b = 1’b0;#0 b <= 1’b1;#1 a <= b;#1 a = 1’b1;#2 a = 1’b0;#2 a = 1’b1;#1 a = 1’b0;#5 $finish;end

//always@(a) b <= a;//always@(b) a = b;//endmodule // SchedDelayB.

Here are a few questions to ponder, based on the two examples above. You maysimulate to answer them, but try first to guess without simulation.

In SchedDelayA, at what time does a first rise? b?In SchedDelayA, at what time does a last change? b?Answer the preceding questions for SchedDelayB.


Step 2. Scheduling and delay with rise-fall timing. Here is SchedDelayC:

module SchedDelayC;wire a;reg b, c, d;initial

begin#0 b = 1’b0;

c = 1’b1;d = 1’b1;

#5 d = 1’b0;#10 c = 1’b1#20 $finish;end

//assign #(1,3,5) a = c;assign #(2,3,4) a = d;//endmodule // SchedDelayC.

A. When does a first get a well-defined logic level (‘1’ or ‘0’)?B. Is the order of first assignment of b, c, and d predictable? If not, why?C. What is the final value of a?

Step 3. Scheduling with mixed-up assignments. First, using the file,Scheduler.v in your Lab 12 directory, simulate to see the difference in strati-fied order of evaluation between events from blocking assignments, zero-delay (#0)assignments, nonblocking assignments, and monitor events. See Fig. 9.3. The com-ments in the verilog file explain the simulation result.

Fig. 9.3 Scheduler.v simulation


Optional: For the rest of this Step, use Silos, if you wish to see exactly correctverilog, because simulators designed for serious coding do not handle mixed block-ing and nonblocking assignments, or nonblocking concurrency, properly. If not, justread through the verilog files provided. This exercise makes a point about using reg-ular delays, only. In the Lab12 directory, open the file BothDelay.v and readthrough it. It is based on the code example immediately following the topic abovein this section entitled, “Regular vs. Scheduled Delay”.

Try to predict when each change will take place. If you simulate this module inSilos, you will find that the initial block ends with what may be a surprise: Canyou explain it?

Finally, suppose these assignments in a new module:

initialbegin

Y <= 1’b0;#0 Y <= 1’b1;#0 Z <= 1’b1;

Z <= 1’b0;$display("display: %04d: Y=%1b Z=%1b", $time, Y, Z);$strobe( "strobe: %04d: Y=%1b Z=%1b", $time, Y, Z);$finish;

end

What will be the final values of Y and Z whenever this block is run? Which sys-tem function will report them correctly? Should the simulator compiler place thedelayed assignments initially in the inactive region or in the nonblocking delay re-gion? What on Earth would be the reason for writing such code!? By now, it shouldbe clear why the synthesizer rejects, or warns about, delayed procedural statements,especially nonblocking ones.

Step 4. Event control sensitivity. Below are two different modules. Instantiate bothof them in a third, containing module named EventCtl, of course in a file namedEventCtl.v. Simulate to check functionality (see Fig. 9.4); then, synthesize.


module EventCtlPart(output xPart, yPart, input a, b, c);reg xReg, yReg;assign xPart = xReg;assign yPart = yReg;always@(a,b)begin: PartListxReg <= a & b & c;yReg <= (b | c) ˆ a;end

endmodule // EventCtlPart.//module EventCtlLatch(output xLatch, yLatch, input a, b, c);reg xReg, yReg;assign xLatch = xReg;assign yLatch = yReg;always@(a)begin: aLatcherif (a==1’b1)xReg <= b & c;

endalways@(b)begin: bLatcherif (b==1’b1)yReg <= (b | c)ˆa;

endendmodule // EventCtlLatch.

Fig. 9.4 Simulation of EventCtl. The �Part and �Latch wires respectively are from its twoinstances

The omission of one variable from the sensitivity list should cause synthesis of alatch. The module names are to help keep track of what is what in the synthesizednetlist.

The synthesizer is biased not to infer latches; and, if you want latches, in additionto an incomplete sensitivity list, for guaranteed success, you should code for latchmodules with 1-bit output ports.


Examine the synthesized verilog netlist to see whether the expected latches havebeen created.

Step 5. Rise-fall delays. Rise-fall delays mainly are used with structures in anetlist, such as gates or IP (“Intellectual Property”: large, predesigned blocks); how-ever, they work with RTL code, too, with some adaptation.

Use parameters tR, tF, and tZ, with regular delays, in a module namedRiseFall to schedule the following assignments on each of the ∗Reg variablesshown in the code block below:

1. a rise (tR) after 4 time units;2. a fall (tF) after 3; and3. a high-impedance (tZ) after 5 time units.

All rise-fall delays specify delay characteristics of a wire or port, not a statement,so it won’t work to insert them in procedural code, for example in an alwaysblock.

Therefore, the values of the regs declared below will have to be assigned towires in continuous assignment statements to see the effect of specifying differentdelays for rise and fall. Declare these wires (use the names given, without “Reg”)and assign them:

reg[3:0] OutBusReg;reg[7:0] DataBusReg;reg Out2valReg, Out3valReg;...always@(negedge Clk)beginOutBusReg <= 4’bzz01;DataBusReg <= 8’b1111 0zzz;Out2valReg <= 1’b1;Out3valReg <= 1’bz;end

always@(posedge Clk)beginOutBusReg <= 4’b0101;DataBusReg <= 8’b1zzz 0000;Out2valReg <= 1’b0;Out3valReg <= 1’b0;end

Simulate a module which includes the code above, and the specified delays. Youshould obtain waveforms as in Fig. 9.5.


Fig. 9.5 Simulation of RiseFall, showing the delay differences


Should the verilog event queue have fewer strata? More?Mull over and explain the BothDelay.v result.What is the value of named blocks in synthesis?What is the use of @(�)?Why not use nonblocking assignments in combinational blocks?What if a ‘1’ goes to ‘z’? Is that a rise or a fall? How about ‘0’ to ‘z’?


Read Thomas and Moorby (2002) section 6.5 on rise, fall, and three-state net-connection delay specification.

Read Thomas and Moorby (2002) chapter 8 on scheduling of procedural andbehavioral events.

(Optional) Thomas and Moorby (2002) explains scheduled (“intra-assignment”)delay usage in section 4.7. This construct usually should be avoided, but it is worthunderstanding what it is supposed to do. The authors point out that even where ap-parently useful, a scheduled delay can not do any better than a regular nonblockingdelay in fixing an otherwise malfunctioning design.

(Optional) If you are interested, you can find further information on the verilogsimulator event queue in IEEE Std 1364, section 11.


Read chapter 7, with special attention in sections 7.2.2 and 7.3 to the use of a # delayexpression on the RHS of an assignment statement (“intra-assignment delay”). Weshall avoid this usage, but it is important to understand what it is intended to do.

Read section 5.2.1 on rise-fall delays, which primarily are used when describingthe timing in netlists.


10.1 Built-in Gates and Net Types

In this chapter, we shall study verilog netlists in some detail. Netlists are purelystructural implementations made up of elementary gates and hierarchical (some-times large) other components.

10.1.1 Verilog Built-in Gates

Thomas and Moorby (2002) section 6.2.1 and appendix D discusses all the verilogbuilt-in primitive gates; they are specified in IEEE 1364, section 7. We list thembelow for convenience, omitting for now the switch-level primitives. Notice thatnone of them is sequential logic, although sometimes a three-state output can beconsidered as saving briefly its most recent non-z logic state.

MultipleInputs

MultipleOutputs

One Input &Output

One Output

and buf bufif1 pullupnand not bufif0 pulldownor notif1nor notif0xorxnor

We already have used bufif1 in a lab; notifx is just an inverting bufifx.In their port connection lists, these primitives always have their output(s) as the

first, leftmost pin(s); and, they are allowed instantiation without an instance name.Verilog primitives are the only gates which can be instantiated without an instancename. As shown already in lab, any of the gates above may be instantiated with astrength specification.



The pullup and pulldown components should be avoided, because they arenot synthesizable.

10.1.2 Implied Wire Names

It isn’t always necessary explicitly to declare a net if it is of wire type. When oneend of a connection is a module port connection, merely providing the name of theport connected is enough to declare implicitly the name of a net. For example, thisis a complete module declaration:

module NoNets (output Xout, input Ain, Bin);and And01(Xout, Ain, Bin);endmodule

The following verilog unidirectional buffer or pass-through module also is com-plete and legal, although it probably should be removed during logic optimization:

module NoThing (output Xout, input Ain);assign Xout = Ain;endmodule

10.1.3 Net Types and their Default

There is a refinement of the idea of implied net names: It is legal to change the de-fault type of implied connection nets, as explained in IEEE Std 1364, section 19.This is done by a compiler directive, ‘default nettype, which affects allimplied nets following it during compilation. The type named may be any nettype; the default type is wire under the usual circumstance of no directive. A‘default nettype directive changes this default. We have used the wor typein a lab, and the others with logic function work the same way. The types whichmay be defaulted are as follows:

Type Net Functionality

wire Connection only.tri Connection; identical to wire except in name.tri0 Pull down to logic ‘0’ level, with resistive (pull) strength.tri1 Pull up to logic ‘1’ level, with resistive (pull) strength.wand Logical and of driver logic levels.triand Logical and; identical to wand except in name.wor Logical or of driver logic levels.trior Logical or; identical to wor except in name.trireg Unique. Storage of a (capacitive) charge at a given strength level

when its driver(s) all are in high-impedance (‘z’) logic state.

10.1 Built-in Gates and Net Types 187

The default net type none is discussed below. The two remaining verilognet types, supply0 and supply1, are power-supply elements and may not beused as default implied net types. Note that all these net types are gate, RTL,and behavioral modelling constructs and are independent of verilog strength ex-pressions and of the verilog switch-level constructs we shall study later in thecourse.

Use of the ‘default nettype directive may incur a big risk for very lit-tle advantage: What if a module simulates correctly because of implied nets in adefault state of, say wor, and then is reused or moved elsewhere, so that it is com-piled before ‘default nettype wor appears? The simulation probably thenwill fail, possibly with mysterious symptoms.

Most importantly, the default type may be set to none. After‘default nettype none is encountered, no implied net connection of anytype is allowed; all nets must be declared explicitly.

The safest way of using ‘default nettype is to avoid it. If this directive hasto be used, it is recommended to use it only to set the default type to none.

10.1.4 Structural Use of Wire vs. Reg

Thomas and Moorby (2002) (section 5.1) explains the verilog port connection rules.The simulator will enforce them when they are overlooked. It is easy to overlookthese rules, especially when writing a testbench, where design considerations arenot a priority.

Here’s another perspective on the connection rules:

• First, drivers may be a reg or any net type. This is shown in Fig. 10.1, in whichdeclarations in the containing module are assumed to include “reg InReg;”and “wire InWire;”.

Fig. 10.1 M m01( . . ., .In1(InWire), .In2(InReg) );

• Second, anything driven by a module output and external to that module mustbe a net. The reasoning here is very simple: A reg may hold a value or be adriver (first rule); if a module output could be connected directly to a reg type,there would be contention between the reg, as a driver holding a value, and themodule output port. Therefore, module outputs must be connected to net typesonly. This is shown in Fig. 10.2; the reg connection to an output port is illegal –and, the dual-meaning ‘X’ emphasizes the reason why.


Fig. 10.2 M m01(.Out1 (OutWire), .Out2 (OutReg), . . .);

• Third, anything driving internally from a module input must drive a net; in otherwords, internal connections to input ports must be nets. There is no way to avoidthis, because driving anything with a module input, including a (internal) reg,uses the implied net associated with every input port.

• Fourth, these considerations also mean that a bidirectional port (an inout) mustbe connected on both sides to a net. This is because an inout port is subject bothto the input and the output restrictions just described; these leave only a net typeas a way to connect on either side of an inout port. Thomas and Moorby (2002)says that an inout port must be connected to a three-state gate; however, a plainwire may be used, assuming that internal-external contention can be resolvedin a meaningful way because of strength differences.

10.1.5 Port and Parameter Syntax Note

We have done many port and parameter connections already in lab. We point out thatboth ports and parameters are declared similarly in ANSI format; and, in instanti-ation, the connection format also is very similar. Parameter declarations, like delayvalues, are introduced by the ‘#’ token. They are easily distinguished using ANSIdeclarations, because the keyword, parameter, always appears in a parameterdeclaration. For example, there is no delay involved in this declaration (the param-eter named Delay might be used for anything):

module DeviceM#(parameter BusWidth=8, Delay=1) // No resemblance to a delay time.(output[BusWidth-1:0] OutBus,, input[BusWidth:1] InA, InB, input Clk, Rst);

...

In another module instantiating the one above, parameter overrides by name alsoobviously are not delays:

DeviceM #( .BusWidth(16), .Delay(5) ) // Default overrides, not delays.DevM 01 ( .OutBus(Dbus), .InA(ArgA), .InB(ArgB)

, .Clk(ClockIn), .Rst(Reset));


A confusion can occur when assigning a parameter value by position, rather thanby name. Suppose in the preceding example that we overrode parameter values byposition and not by (ANSI) name. The following might look like a delay assignment:

DeviceM #(16, 5) // Default overrides may resemble delays.DevM 01 ( .OutBus(Dbus), .InA(ArgA), .InB(ArgB)

, .Clk(ClockIn), .Rst(Reset));

For comparison, here is a delay associated with the output of an instantiation ofa verilog primitive:

xnor #(16, 5) Adder U12 ( Z, A, B, C );

Parameter overrides, like delays, immediately precede the instance name; toavoid misunderstanding, not only between delays and parameters but also amongdifferent parameters, it is strongly recommended to override parameters only byname.

10.1.6 A D Flip-flop from SR Latches

In our next lab, a gate-level, simple D flip-flop will be constructed from nand gatesassembled as three SR latches.

To understand the logic, start with an output and choose output states whichimply fully defined inputs. A nand gate outputs ‘1’ if any input is ‘0’; however,when the output is ‘0’, all inputs are fully determined to be ‘1’. So, look first at thecase of a nand output of ‘0’.

Why not begin by considering an SR latch with a small modification in whichboth SR inputs are tied together?

Fig. 10.3 An SR latch withone input

In Fig. 10.3, arbitrarily labelling the outputs q and qn, suppose q=0. Then,qn must be 1 and In must be 1. We could have had qn=0, also, if In was 1.Therefore, the latched state is with In=1. If In=0, then both q and qn must beforced to 1.

This design doesn’t permit any specified input value to be latched, but it doesexhibit sequential behavior.


Now let us add an input which we hope might supply the value to be latched(Fig. 10.4):

Fig. 10.4 A one-input SRlatch with a second input

To latch data, we now must have both In=1 and X=1. With X=1, if In goesto 0, q and qn go to 0; if In then returns to 1, there is a latched value, but it isindeterminate.

From the latched state, if X goes to 0, q is forced to 1 and qn to 0. If X returnsto 1, the latched state will remain q=1 and qn=0.

In this second design, X does act somewhat as a data input, with In a clock ormaybe an asynchronous clear, if we look only at qn as the stored bit. Suppose wehold X at 0 and toggle In from 1 to 0 and back to 1: We get qn=0. If we couldinvert X and store its qn of 0 in another SR latch which always would be interpretedas inverted data, maybe we could get closer to a real D flip-flop?

For starters, we need something more loosely tied to the In. Let’s go back andlook at a plain SR latch again, as in Fig. 10.5:

Fig. 10.5 An SR latch whichacts like a flip-flop, but onlywhen D is ‘0’

Truth-table for the device in Fig. 10.5

time D Clk q1 qn10 0 0 1 11 0 1 1 0 qn1 follows D2 1 1 1 0 qn1 latches D3 1 0 0 1 q1 follows Clk4 1 1 0 0 q1 latches Clk5 0 1 1 0 qn1 follows D


Just as above, we see that it acts like a positive-edge flip-flop, but only when thedata input is at ‘0’.

To make this work for data input ‘1’, we can invert the data as in Fig. 10.6:

Fig. 10.6 An SR latch whichacts like a flip-flop when Dis ‘1’


time D !D Clk q2 qn20 1 0 0 1 11 1 0 1 1 0 qn2 follows !D2 0 1 1 1 0 qn2 latches !D3 0 1 0 0 1 q2 follows Clk4 0 1 1 0 0 q2 latches Clk5 1 0 1 1 0 qn2 follows !D

So, from the two preceding tables, all we need do now is to build a flip-flop whichassigns qn1 to its Q when D was latched as a ‘0’ and assigns qn2 to its Q when Dwas latched as a ‘1’. We can’t use a mux for this, because a mux is combinationaland can not retain the past state of D while selecting the current assignment to Q.

However, we can use a third SR latch in a latched state whenever clock is low;we do this simply by driving the inputs of the third SR latch with the outputs of thenands which receive the clock.

The result is our D flip-flop, shown in Fig. 10.7:

Fig. 10.7 Two SR latches onD, one for ‘1’ and the otherfor ‘0’, with a third SR tolatch the result



time D !D Clk qn1 qn2 Q0 1 0 0 1 1 ? L

1 1 0 1 1 L 0 12 0 1 1 0 0 L 13 0 1 0 1 1 1 L

4 0 1 1 0 1 L 05 0 1 1 0 1 L 06 0 1 0 1 1 0 L

7 1 1 0 1 1 0 L

L = latched

10.2 Netlist Lab 13


Lab Procedure

Step 1. A gate-level D flip-flop. Figure 10.8 gives a gate-level schematic of a Dflip-flop, with a reset:

Fig. 10.8 A D flip-flopimplemented structurallywith nand gates

Use this schematic to create your own D flip-flop module named DFFGates; useverilog primitive nand gates for the gates shown. In your design, make the D flip-flop positive-edge triggered and with asserted-high clear. As usual, put DFFGatesinto a file named the same as the module.

Suggestion: Start by assigning names to the nand gates in Fig. 10.8, and use thosenames as instance names in your netlist. Also, consider the possibility of breakingdown the D flip-flop into S-R latch elements and connecting latches instead of nandgates. The on-disc answer for this version of the exercise includes a PDF schematicof the S-R latch approach.

Simulate your structural D flip-flop to verify its functionality.

10.2 Netlist Lab 13 193

Step 2. A gate-level synchronous counter. Go back to Lab08 and find yourSynch4DFF design, a synchronous counter assembled from behavioral D flip-flops. If you did not complete this Lab08 step, use the answer module provided.

Copy both your behavioral DFFC.v and Synch4DFF.v into the Lab13 direc-tory. Duplicate Synch4DFF.v as Synch4DFFGates.v, renaming the moduleinside correspondingly.

Replace the DFFC instances in Synch4DFFGates.v with DFFGates in-stances. Simulate to verify your completely gate-level design (see Fig. 10.9).

Fig. 10.9 Simulation of the synchronous counter of DFF’s defined structurally by nand gates

Step 3. Synthesis at gate level. Synthesize both the Synch4DFF andSynch4DFFGates designs. Keep your synthesis design rules the same for all con-ditions, but vary the constraints as follows:

Optimize both for area and then for speed. When doing the area optimization,use no constraint except one for area; set that to 0.

When doing the speed optimization, impose no constraint except one for maxi-mum output delay and one for clock period. Adjust the result so that both of theseconstraints are unfulfilled but are no more than 1 ns too small for the actual netlistresult reported by the synthesizer.

Compare the speed and area netlist sizes: Did the design make any difference?Optional: Simulate the speed-optimized netlist (see Figs. 10.10 and 10.11).

Fig. 10.10 Simulation of the speed-optimized Synch4DFFGates synthesized netlist


Fig. 10.11 Closeup of the Synch4DFFGates synthesized netlist simulation near a counter reset


Do you understand the relationship of built-in gates and strengths?What is the relationship of structural design to logic optimization and technology

mapping?In synthesizing from a behavioral vs. gate-level design, as in the lab Step 3,

should the technology mapper do better when the D flip-flops are synthesized frombehavior or from gates? Why?


Read Thomas and Moorby (2002) sections 6.2.1–6.2.3 on net types and primitives.Read Thomas and Moorby (2002) chapter 10 on strength and switch-level mod-

elling.Read Thomas and Moorby (2002) section 5.1 on port connection rules.


Section 4.2.3 shows the verilog port connection rules.Read chapter 10.1 on assignment of delays to gates in a netlist. Look through the

rest of chapter 10 if you are interested; we’ll study component internal delays laterin the course.

Do the exercises in section 5.4.Work through the verilog of the examples in section 5.1.4; the code is available on

the Palnitkar CD, so these examples can be simulated without any keyboard entry.The simulator will display waveforms more realistically if you add a ‘timescale.

There are procedural models of the devices of section 5.1.4 in section 6.5, so itmay be interesting to contrast the differences. Palnitkar gives a gate-level schematicfor a D flip-flop in his figure 6.4 (section 6.5.3).


11.1 Procedural Control and Concurrency

In this chapter, we’ll review and extend our understanding of verilog proceduralcontrol statements. We’ll also look further into concurrent control with tasks andother named blocks.

11.1.1 Verilog Procedural Control Statements

The for is the most versatile looping control statement in verilog; and, synthesizerstypically can make better use of a for statement than the others. For completeness,now, we shall describe the others. Except in this lab, you are encouraged to use foras much as possible in your work.

We already have used if extensively, and we shall discuss the case variantsbelow. The remaining control statements are forever, repeat, and while.forever. The syntax simply is forever, followed by a statement or possibly

a block of them.The forever is a looping statement which is not intended to be terminated

once it is executed; however, other verilog constructs may be used to terminateit. It is used almost always in an initial block, for obvious reasons. Usu-ally, a forever loop will continue until $stop or $finish is executed by thesimulator.



initialbegin : Clock genClock1 = 1’b0;forever #10 Clock1 = ∼Clock1;$finish;end

...initial

begin : Test vectorsAbus = 32’h0101 1010;Dbus = ’b0;#5 Reset = 1’b0;...#220 $finish; // Terminate the simulation after about 10 clocks.end

A concurrent clock generator such as, “always@(Clock1) #10 Clock1<= ∼Clock1;”, is not equivalent to the forever above. The main difference isthat the above forever clock has been implemented with a blocking assignment,preventing reliable setup of clocked variables unless special delays are added torelated combinational logic.

Another difference is that the forever schedules an assignment to Clock1every 10 units regardless of anything; the always can not schedule a new assign-ment until a change in the current value has been simulated. Thus, the alwaysclock can not oscillate if the assignment is a blocking one, while the foreverclock can.

Also, always is itself a concurrent construct, not a procedural one. A loopingalways can not be included in a procedural block.

The forever example is not synthesizable; initial blocks in general haveno corresponding hardware. Of course, neither is the always synthesizable.

Finally, notice that the $finish in the Clock gen initial block above neverwill execute. Not until after forever, anyway.

Two different termination techniques for forever:

// Technique 1:...foreverbegin#1 Count = Count + 1; // Include a delay or @!if (Count>=1000) $finish;end

11.1 Procedural Control and Concurrency 197

// Technique 2:begin: Mod16 Counter

foreverbeginCount = Count + 1; // Count is an integer or wide reg.#5 Dbus[4:0] = Count%16; // This delay avoids hanging the simulator.end

end // Mod16 Counter.// Somewhere else, or in the loop, put disable Mod16 Counter.

A restartable clock generator is easy to do:

always@(posedge Run)begin : RunClock1Clock1 = 1’b0;forever #10 Clock1 <= !Clock1;end

always@(negedge Run) disable RunClock1;

Again, one should consider the setup and hold implications of the use of a block-ing vs. a nonblocking assignment in any clock generator.

repeat. The syntax simply is repeat (number of times), followed by astatement or a block of statements. The repeat is strictly a loop-counting con-struct. When the repeat is read, the value of number of times is stored, and thestatement is executed that number of times. Unlike the for iteration value, the valueof number can not be reread or changed during loop execution. The repeat state-ment(s) may be terminated early the same way as shown above for forever.

Example of a repeat:

...i = 0;repeat(32)begin#2 Abus[i] = LocalAbus[i+32] + AdrOffset;i = i + 1;end

You’ll notice that the coding overhead for useful invocation of repeat is similarto that of for in a loop-counting construct; but, the code in a disable’d repeatwould be spread out and thus may be more prone to error. In a block of severalhundred lines, the control might be difficult to find. The author has worked withstate machine designs of over a thousand lines in a module. Usually, for will bepreferred to repeat, unless no loop control is to be exercised.while. The syntax is while (expression), followed by a statement or block

of statements. The expression is evaluated on first reading, and it is reevaluated eachtime after that, as soon as the controlled statement ends. On any evaluation, if theexpression is not equal to false (a logic-level or numerical ‘0’), the while executes


its statement(s). Otherwise, the while terminates, and execution picks up after thestatement(s) controlled by the while.

The while arguably is a reasonably efficient alternative to a for in many con-texts. The while’s expression may be a relational one involving a count, so whileis more versatile than repeat. Also, a while may be almost interchangeable withfor in a loop-counting context involving a delay. Examples follow.

// Example 1:...SleepState = 1’b1;while(SleepState==1’b1)beginslowRefresh; // A task.#SleepDelay Status = CheckUserInput; // A function call.if (Status!=Asleep) SleepState = 1’b0; // Exit the while.end

// Example 2:Count = 10;while(Count > 0)begin... (do stuff)#3 BusA[Count] = BusB[Count];Count = Count - 1;end

The following for statements are equivalent to the while examples above:

// Example 1:...for (SleepState = 1’b1; SleepState==1’b1;

SleepState = (Status!=Asleep)? 1’b0: 1’b1)

beginslowRefresh;#SleepDelay Status = CheckUserInput;end

// Example 2:...for(Count=10; Count>0; Count=Count-1)begin... (do stuff)#3 BusA[Count] = BusB[Count];end

Assignment of an updated control variable has to be located away from thewhile header, in any while statement block. This brings out an advantage ofthe while: It works the same way whether or not the control variable is changed


with a delayed assignment. A for control header is not allowed to include a delay;so, iterator updates sometimes have to be moved out of the for control header.

In the for, usually all control is up-front, in the header; this makes it easier tounderstand the control or to debug errors. For this one reason, it is recommended touse for in preference to any other loop control, whenever it is reasonably possible.

11.1.2 Verilog case Variants

The main advantage of if is that relational expressions can be used, so whole rangesof values may be covered in one expression. By comparison, in a verilog casestatement, only specific values are allowed in the alternatives. These values maybe variables and may be combined with comma separators, but they still have tobe enumerated explicitly. However, the table-like arrangement of case alternativesoften makes a case more readable than a chain of if . . . else’s. In addition, ifand case differ importantly in the way they handle expressions containing ‘x’ or‘z’ logic levels.

An if attempts equality matches on all logic states at once, including ‘x’ and‘z’. If a bit pattern includes an ‘x’, then if interprets the ‘x’ as a combined ‘1’ and‘0’ and returns ‘x’ ; thus, it can not express a match, even if that ‘x’ is includedexplicitly in the expression. If a variable X contains any bit at ‘x’ or ‘z’ level, theneven “if (X==X)” will fail to match, and the else (if any) will execute!

A case attempts a match on the specific pattern of bits, whether or not any ofthem is ‘x’ or ‘z’. If a vector in the case expression contains some ‘x’ or ‘z’levels, and one of the alternatives contains the same pattern, case evaluates toa match and will selectively execute the statement for the matching alternative.

For example,

X = 4’b101x;Y = 4’b101z;//if (X==4’b101x) ...; // Evaluates as ’x’; won’t match the value above.if (X==Y) ...; // Won’t match; and if (X!=Y) won’t, either.if (X==X) ...; // This won’t match, either!//case (X)4’b1010: ...;4’b1011: ...;4’b101z: ...;4’bxxxx,4’bzzzz: ...; // Comma separation is legal.default: ...; // This will execute because nothing else did.

endcasecase (X)4’b1010: ...;4’b101x: ...; // This matches and will execute.4’b101z: ...;4’bxxxx,4’bzzzz: ...;default: ...;

endcase


One other thing to keep in mind: Although the conditional operator (“?:”) canbe used as a substitute for if under some circumstances, it is an operator and nota statement; it returns an expression. The logic is the same as that of if for 1-bitalternatives. For multibit vector alternatives, when an ‘x’ or ‘z’ appears in the con-dition, this causes the return value of the expression to go bitwise to ‘x’, unlesscorresponding bits in the ‘?’ and ‘:’ vector values agree.

A special operator is provided to allow if to match unknowns, the case equalityoperator, “===” (negation is “!==”), which we briefly have mentioned before. Thisoperator can not cause an evaluation to be ‘x’. Using this operator, ‘x’ and ‘z’match themselves, just as do ‘1’, or ‘0’, wherever they appear, just as in a casestatement:

X <= 4’b101x;Y <= 4’b101z;//if (X===4’b101x) ... else ...; // expr = 1; the if executes.if (X!==Y) ... else ...; // expr = 1; the if executes.if (X!=Y) ... else ...; // expr = x; the else executes.//// The conditional operator is not an if but accepts case equality:Z = (X==Y )? 1’b1: 1’b0; // Assigns 1’bx.Z = (X===Y)? 1’b1: 1’b0; // Assigns 1’b0.Z = (X!==Y)? 1’b1: 1’b0; // Assigns 1’b1.

To take advantage of the table-like case syntax, two variants of that statementhave been defined in verilog, casex and casez.

casex. A casex expression matches an alternative as though ‘x’ and ‘z’ werewildcards: Any bit matches an ‘x’ or a ‘z’, including an ‘x’ or a ‘z’. The statementends with endcase, the same as for the case statement we have been using. It canbe very confused to allow a specific logic level such as ‘z’ become an “anything”character; so, to clarify intent, a ‘z’ wildcard in an alternative may be written as‘?’, rather than arbitrarily writing either ‘x’ or ‘z’, or even mixing them. Examplesfollow.

X <= 4’b101x; Y <= 4’b101z; // X and Y are reg[3:0].// Example 1:casex (X)4’b100z: ...; // No match.4’b10xx: ...; // Executes, because it matches and comes first.4’b11xz: ...; // Can’t execute on this value of X.4’bxxxx: ...; // Would execute if 4’b10xx didn’t.default: ...; // Would execute if nothing else did.

endcase// Example 2: Some sort of bit-mask or decoder:casex (Y)4’b???1: ...; // Executes, because of casex LSB wildcard ’z’ in Y above!4’b??1?: ...; // Would execute if the first one didn’t.4’b?1??: ...; // Can’t execute on this value of Y.4’b1???: ...; // Would execute if nothing above did.default: ...; // Would execute if no ’1’ or wildcard in Y.

endcase


There might seem to be an advantage in using casex when decoding or search-ing for patterns in data, especially in sparse matrices of data. In Example 2 above,either a chained if would have had to be used, or a case would have had to bewritten which individually rejected 12 of the 16 patterns of ‘1’ and ‘0’ possible ina 4-bit object.

Unfortunately, casex is extremely dangerous and error-prone. In the codeabove, the designer’s intent presumably was a prioritized search for 1’s in the fourbits of Y. However, any failure in an assignment to Y during simulation, or a high-impedance value as actually shown, might cause the casex to switch to a Y with-out a ‘1’. Thus, if the intent was to look for a ‘1’ in well-defined but arbitrary data,chained if’s would be a better way:

Y <= 4’b101z; // Y is reg[3:0].//if (Y[0]==1’b1) ...; // ’z’ means no match (but === 1’bz would match).else if (Y[1]==1’b1) ...; // This one executes.else if (Y[2]==1’b1) ...; // No, not ’1’ and below the first match.else if (Y[3]==1’b1) ...; // No, below the one that first matches.else ...; // Executes if no ’1’ in Y.

Notice that indenting the chain this way makes it fairly readable, although theexpression is changing on every line, risking a typo. Also, the if’s are a little risky,because the evaluations would be wrong if Y had been declared reg[0:3] insteadof reg[3:0], and this is harder to see with the if’s instead of a case.

The consequences of using casex are even worse when one considers synthe-sis. The “don’t care” entries in the casex alternatives might seem advantageousin synthesis, because usually a synthesizer uses ‘x’ or other don’t-cares to its ad-vantage: The designer doesn’t care, so the synthesizer is free to do its best. Whatthe synthesizer does, then, is ignore wildcarded alternatives. In the first casex ex-ample above, then, the second and third alternatives are folded into one, the earlierstatements are synthesized but not the later (depending on the synthesizer imple-mentation), and the resulting netlist probably will not simulate the way the origi-nal verilog does, assuming that Y can vary during design operation. In addition, inthat example, the default alternative probably will be folded into the 4’bxxxxone above it during synthesis, which may or may not have been foreseen by thedesigner.

Because of unexpected consequences of too much wildcarding, it is not recom-mended ever to use casex for anything. When data are sparse, or when wildcardingis unavoidable for other reasons, use casez in place of case or if. At least withcasez, an unexpected simulator ‘x’ will not be wildcarded, and the synthesizerwill jump on the don’t-cares more lightly.

casez. A casez expression matches an alternative with any ‘z’ in the expressionor in an alternative treated as a wildcard. No ‘x’ is a wildcard. Also, as for casex,a ‘?’ may be used as a wildcard character in an alternative instead of ‘z’.


X <= 4’b1x00; // X is reg[3:0].// Example 1: No match for any of the alternatives:casez (X)4’b100z: ...;4’b10xx: ...;4’b10xz: ...;4’bxxxx: ...;4’b0zzz: ...;default: ...; // Executes.

endcase// Example 2: ’?’ is same as ’z’casez (X)4’b???1: ...; // No match.4’b??1?: ...; // No match.4’b?1??: ...; // No match.4’b1???: ...; // Executes; X[3] matches 1==1, and others are wild.default: ...; // Can execute on X == 4’b0000, or on anythingendcase // with no ’1’ or ’z’ anywhere, etc.

The casez, like casex, should be avoided; when one is tempted to use it, seewhether a simple case or chain of if can be used instead. If there is no otherway, casez may be used. The casex never should be used, because it has nowildcarding advantage over casez, and it is far more prone to create results inwhich simulation of the synthesized netlist can not be made to match simulation ofthe original design.

11.1.3 Procedural Concurrency

We shall use the word thread somewhat generically and intuitively here; there isno intentional reference to the similar concept in software parallelism, in which a“thread” is differentiated from a “process”.

We already have studied parallel blocks (fork-join blocks) and have workedwith them a little in lab. It is possible to take advantage of parallelism not only inindividual statements, as we have done, but in entire threads of execution of thesimulator. This most easily is done by parallelizing tasks.

Assigning two or more parallel threads each to its own task has many advantages:(a) the thread is completely well-defined, because it consists just of the statements ineach task. (b) Tasks parallelized in a given procedural block easily can be identifiedfor design debugging purposes; one can know where they start (fork) and wherethey finish (join). This can’t usually be said of a collection of concurrently execut-ing always blocks. (c) The block names (task names) generally will be preservedin the synthesized netlist; otherwise, one must dedicate special attention to namingalways blocks or whatever else one has chosen instead of tasks. (d ) Modificationof the statements in a task can be relatively free of unintended side effects, becausestatements in a task are localized and thus can be changed independent of any ofthe design constructs which cause the task to execute. By contrast, parallelizing a


collection of unnamed statements requires that any change be made in a collectionof code containing all the statements parallelized, all at once.

Let’s look at a typical example of concurrency implemented with the concurrentconstruct we have been using most throughout this course, the always block. Hereis the essential code of a device which detects toggling bits on a 4-bit bus and reportsthe bit number on a 2-bit output bus. The reason for the concurrency is to simulatenonprioritized evaluation of the bus bits; checking them in a procedural block wouldimply a priority of some bits, which would be checked before others:

...task CheckToggle(input[1:0] BitNo);begin@(InBus[BitNo]) #1 ToggledReg = BitNo;end

endtask//always@(posedge CheckInBus) CheckToggle(0);always@(posedge CheckInBus) CheckToggle(1);always@(posedge CheckInBus) CheckToggle(2);always@(posedge CheckInBus) CheckToggle(3);...

Very similar functionality can be achieved by replacing the multiple alwaysblocks with a single always block containing a fork-join:

...always@(posedge CheckInBus)beginforkCheckToggle(0);CheckToggle(1);CheckToggle(2);CheckToggle(3);joinend

...

However, the fork-join is not entirely equivalent to the multiple alwaysblocks, which is the reason we prefer to call it a fork-join rather than a parallelblock: The fork-join block does not exit until all four input bits have toggled,meaning that all four concurrent tasks have been run to completion. If one inputnever toggles, then each change of an InBus bit will spawn a new fork-joinblock instance, which will join the others in the simulator event queue, waitingfor all four bits to toggle so it can exit. In effect, this creates a simulator softwarememory leak, and it might crash the simulator or cause it to become very slow atevaluating new events.

The point here is that, when using a fork-join block, one must be certain thatthe join eventually will be reached.


The effect of fork-join may be seen easily by its effect on mixed block-ing and nonblocking assignments. The following example is for illustration only(Fig. 11.1); it is a bad idea to mix blocking and nonblocking assignments in a realdesign:

Fig. 11.1 A fork-join radically can change scheduling of delayed events

There is another very obvious way of achieving concurrency in verilog: Put theconcurrent functionalities into separate modules or different module instances.Events in different module instances are scheduled concurrently, just as are thealways and initial blocks within them, and just as are the different hardwarechips on a board.

11.1.4 Verilog Name Space

One final thing: Up until now, all the examples have built upon assumed experiencewith C language, or other programming languages, in which the name declarationsgenerally precede the programming. Verilog does not allow “global” names, the wayC does, declared outside a module; so, we have assumed this sort of layout in theverilog source file:


module module name ( I/O’s );reg name declarations; ...wire name declarations; ...(programming stuff, using previously declared (or implied) names in statements)endmodule

In many other languages, it is bad style, or illegal, to add declarations anywherebut before the programming. Declare everything first (to avoid name conflicts); thendo the programming. However, verilog only requires that a variable name be de-clared before it is used. Furthermore, functions or tasks may be declared anywherein a module.

We have seen that verilog variable names may be declared locally in modules,or in functions, tasks, and, for temporary use, even inside always blocks.These local names do not conflict with one another because the name is visible tothe compiler only in the local block of code involved.

Verilog allows declarations in a named region. So, it is possible to declare newreg or net names anywhere in a module, which is a named region, even betweenalways or initial blocks, and have those names usable anywhere in the filefollowing the declaration.

This feature should be used with caution; but, in our next lab, it may be useful todeclare a few variables close to the always blocks in which they are used, ratherthan far away, at the beginning of the source file. This is a feature which makesverilog more object-oriented than C: Objects which do distinct things can be moreself-contained than otherwise. For example:

module ...... (500 lines of verilog) ...reg[7:0] ClockCount;always@(negedge ClockIn)begin : Tickerif (StartCount==1’b1)

ClockCount = ’b0;else ClockCount = ClockCount + 8’h1;if (ClockCount >= 8’h3a) (do something);end

It should be mentioned that the synthesizer may not synthesize this, because thealways block contains more than just one top-level if statement.

Notice in the next example that the count would not be preserved by some simu-lators if the counter was declared local to the always, because on every negedgeof ClockIn, the reg would be redeclared and would contain 4’bzzzz by de-fault, going into the if:


always@(negedge ClockIn)begin : Tickerreg[7:0] ClockCount; // ERROR! Redeclared every time always is read!if (StartCount==1’b1)

ClockCount = ’b0;else ClockCount = ClockCount + 8’h1;...

end

Variables declared in a task, however, are static and hold their most recentvalues no matter how many times the task is called. Also, concurrently runninginstances of the same task share its declared variables. This holds except for tasksdeclared automatic: task automatic Mytask gets private copies of its lo-cal variables for each instance. Because of this, an automatic task may becalled recursively.

In the CheckToggle examples above, the examples actually will not workas expected, because the input BitNo will be declared implicitly as a reg justonce, and this one variable will be shared among all executing instances ofCheckToggle.

Thus, all executions of CheckToggle above will be checking the same bitof InBus! The effect is exactly the same as though reg[1:0] BitNo hadbeen declared in the module but external to the task declaration, and the taskdeclaration included no I/O. Task local variables and passed variables (declaredtask I/O’s) are static reg variables shared among all calls. Sometimes this shar-ing may be desired, because it allows running instances to communicate with oneanother.

In the CheckToggle examples above, and in general when a task may be calledmore than once in the same simulation time interval, one would want independentvariables, not shared ones. Thus, the example task declaration above should haveincluded an automatic keyword:

...task automatic CheckToggle(input[1:0] BitNo);begin@(InBus[BitNo]) #1 ToggledReg = BitNo;end

endtask...

A verilog function also may be declared automatic for purposes of recur-sion. A simple example of useful function recursion is calculation of a factorial:

11.2 Concurrency Lab 14 207

...function automatic[31:0] Factorial(input[3:0] N);

beginif ( N>1 )

Factorial = N * Factorial(N-1);else Factorial = 1;end

endfunction...

Notice the location of the function width index expression, and the use of the de-clared input as an implied reg. The externally-called Factorial function can notreturn until Factorial internally has been called with N==1, resulting in evalua-tion of N∗(N-1)∗(N-2) . . . 2∗1 == N!. Recursive functions or tasks may notbe synthesizable; so, they should be avoided when not essential to the design.

11.2 Concurrency Lab 14

Do this lab in the Lab14 directory.

Lab Procedure

Step 1. A forever block. Write a small simulation model with a testbench clockimplemented by means of a forever block. Consider a clock generator an excep-tion to the general rule that one should not have more than one initial block ina module. Simulate the clock to verify functionality.

Step 2. A repeat block. To the Step 1 design module, add a clocked alwayswhich contains a repeat block to initialize a 32-bit bus with a fixed pattern ofalternating ‘1’ and ‘0’ (“. . .010101. . .”), one bit at a time. Simulate it.

Step 3. A while block. To the Step 2 module, add a while in its own alwaysblock to check a 32-bit input bus, one bit at a time, for a specific pattern of ‘1’ and‘0’ (anything you want). If the pattern should be found, cause the while to set anoutput flag bit. Do this without using a disable statement. Simulate the combinedmodule.

Step 4. A case application. Write an encoder which receives as input a one-hotbit pattern in an 8-bit register and which outputs the 3-bit binary value giving thenumerical position of the ‘1’ in the register (starting at 0). The encoder also shouldhave a second output equal to the ASCII code for the bit position ( ‘0’ = 8’h30,‘1’= 8’h31, . . .). Use a case statement to do the encoding. Simulate the modelto verify it.

Step 5. Concurrency exercise. This is what may be a difficult exercise in under-standing simulation; don’t worry about making the design synthesizable.


Fig. 11.2 The CPU and WatchDog design. Dotted blocks represent logic, not hierarchy

Set up a top-level module named CPU Board with two named always blocks,CPU and WatchDog. For this exercise, implement the two always blocks as de-scribed below; normally, a good design would put the CPU and WatchDog in sep-arate modules (Fig. 11.2).

The module should have a 32-bit Abus output, a 32-bit Dbus bidirectionalbus, and Halt and Clk inputs. It also should have three other 1-bit outputs,RecoveryMode, INT00, and INT00 Ack. Supply the same clock to bothalways blocks, but different edges may be used.

We don’t care much about CPU functionality in this exercise, so we’ll use the ver-ilog $random system function to supply bit patterns on the CPU busses, somethinglike this:

...always@(posedge Clk)

begin : CPU...#1 Dbus = $random($time); // $time repeats only between simulations;#1 Abus = $random($time); // so should the (32-bit) random patterns....end

A watch-dog device is a simple and essentially nonfunctional component of asystem. The watch-dog device monitors activity of a more complex device and helpsthe system recover if that activity should indicate a malfunction.

Typically, the watch-dog starts a timer whenever activity of some kind ceases;when the timer lapses, the watch-dog device interrupts or reboots the system. In thisway, a high-reliability system can be protected from hardware or software defectswhich cause a communications deadlock. One of the shortcomings of distributed orparallelized systems is that they can deadlock if two or more components persisteach in waiting for the other(s) to release shared resources.

11.2 Concurrency Lab 14 209

A. The WatchDog always block. It should do these four things: (a) It should countclocks after each change on the CPU address bus, resetting the count to 0 after eachsuch change; (b) when the count exceeds 10 cycles, it should interrupt the CPUwith an INT00 pulse to attempt to restore it to activity; and, it should repeat theINT00 pulse periodically until the CPU acknowledges; (c) it also should assert aRecoveryMode output on the CPU Board module, to signal external devicesof its action; (d) on assertion of INT00 Ack by the CPU, the WatchDog shoulddeassert the RecoveryMode output, stop issuing INT00, and resume countingclocks.B. The CPU always block. It should do these four things: (a) It normally shouldoutput a variety of Dbus and Abus patterns, as described above; (b) it should halton command, by an external (testbench) Halt input pulse; (c) it should servicean INT00 interrupt by resuming activity (assuming Halt not asserted); and (d)it should assert INT00 Ack briefly upon resuming activity, or while continuingactivity if the interrupt should occur while the CPU was not halted.

Implement the CPU Board to fulfill these requirements, using two namedalways blocks and at least one verilog task.

It is suggested to implement one always block completely before worryingabout the other one. The CPU should be easier than WatchDog, so maybe try itfirst. Simulate the result to verify it (see Fig. 11.3).

Fig. 11.3 Simulation of an implementation of the CPU Board design

The clock counts in this design are for lab exercise; a real embedded CPU prob-ably would require many more clocks than 5 or 10 to service an interrupt.


What’s wrong with casex?How should the CPU Board design be described by state-machine bubbles? One

or two machines?


Read Thomas and Moorby (2002) sections 3.1–3.4.4 on procedural control con-structs. Note that the repeat statement is synthesizable with current software.


Also, keep in mind that casex never should be used in design, and that casezshould be avoided unless absolutely necessary.

(Re)read Thomas and Moorby (2002) section 4.9 on fork-join.(Optional) Read the paper by Mike Turpin on the dangers of casex and casez

in synthesis: “The Dangers of Living with an X (bugs hidden in your Verilog)”,downloadable at: http://www.arm.com/pdfs/Verilog X Bugs.pdf.


Read sections 7.5–7.7 on the topics presented this time.


12.1 Hierarchical Names and generate Blocks

Our purpose will be to introduce in this chapter several different constructs usefulto create regular patterns of hardware structure.

12.1.1 Hierarchical Name Access

We introduced this briefly in an earlier lab. It is possible in verilog to access anyelement in a design tree from any other location in that tree. The rationale is verymuch like that of access to files in a modern filing system by means of path names:The file is referenced by a path either beginning at the root directory (top of thedesign) or beginning in the current, working directory (current module instance). Ina unix or linux filing system, the name separator is a slash ‘/’; in a verilog design,the separator is a dot ‘.’.

Fig. 12.1 Module A andits design hierarchy as a“floorplan”

For example, consider an arrangement of module instances in which module Ainstantiates modules B, C, and D; and module D in turn instantiates module E; A is amodule name; the others are instance names, the instances being of various differenttypes (declared modules).



This kind of system is called a hierarchy, because each module instance is con-tained in only one other one. Representation of this containment resembles a floor-plan of a building, as illustrated in Fig. 12.1.

Another way of representing this is as a sort of tree – a family tree, or any otherone which grows with its root spreading outward and downward. In a design tree, asin a tree’s root system, each element has only one precedent (parent) element. Thisrepresentation is shown in Fig. 12.2.

Fig. 12.2 Module A and itsdesign hierarchy “tree”

Verilog hierarchical names are implied when any object is named; it does notapply only to module instances. So, naming a task or a fork-join block,or declaring a variable, establishes a name which may be used hierarchically. Anyprocedural begin also may be named.

In verilog, hierarchical name access is allowed anywhere in the design by use ofthe full path in the hierarchy, through module names or instance names or both. Thisisn’t to say that such access is advisable as a general design practice.

For example, suppose in Fig. 12.1 and 12.2, module DFFC was a flip-flop, mak-ing C a flip-flop instance. Then, inside module A, the Q output port of module DFFC,instance C, could be described by the hierarchical name, C.Q. For example, C.Qcould be connected to the B.D input port by

assign B.D = C.Q;

Notice that the preceding statement is written with paths to the ports involved(and their implied nets) implying that the statement must be located in module A.The instance names of the two DFFC’s can be known from A ; therefore, instanceaccess is allowed without specifying the full path in the design.

When a lower-level statement is made, hierarchical references to containing (par-ent) objects only may be made through the containing module names; instancenames are not allowed.

If located in D (actually, if written in module D Type), the preceding assignmentstatement would have to be written as,

assign A.B.D = A.C.Q;

In instance D of module D Type, writing assign B.D = C.Q would not belegal, because B and C are instance names of objects not instantiated in D.

12.1 Hierarchical Names and generate Blocks 213

As another example, suppose module E Type had an output port wired byOutBus, and suppose module DFFC had a net OutWire; then, to connect a wireOutWire to that port, using a statement in module DFFC, one might write,

assign A.D.E.OutBus = OutWire;

The same connection made by a statement in module A might be written,

assign D.E.OutBus = C.OutWire;

Obviously, hierarchical names are a complex and dangerous feature, and theyare outlawed in most design projects except only under very restricted circum-stances – usually, they are permitted in the design only for reference to objectswithin the current module. Hierarchical name references are similar to the in-famous goto in software engineering; a rarely useful feature which generallycreates bugs.

This use of the hierarchy is not recommended as a general practice; proper designwould dictate that wiring be routed through the ports of modules, and not across theverilog name space. However, it is allowed by the language; and, it can be usefulin a testbench or assertion, in which reference across the design does not imply abreach of the design structure.

Hierarchical reference can be useful, and consistent with good design, when it isused to access automatically generated substructure within a module. We shall seesoon how this kind of downward hierarchical access can help us use generate’dstructures.

12.1.2 Verilog Arrayed Instances

An interesting feature of the language is that it permits instances to be arrayed,in a way very much like the memory arrays we have studied before. The rangespecification again is an enumeration range, not a bit width. For example,

and #(3,5) InputGater[2:10](InBus, Dbus1, Dbus2, Dbus3);

inserts an and gate so that it drives the corresponding InBus bit with the and ofthree data bus bits, each corresponding in bit number. The preceding statement isequivalent to nine instantiations:

and #(3,5) InputGater[2] (InBus[2], Dbus1[2], Dbus2[2], Dbus3[2]);...and #(3,5) InputGater[10](InBus[10], Dbus1[10], Dbus2[10], Dbus3[10]);

Notice that the array number remains as an index value in brackets []; also, theport connections remain indexed.


Instance arrays may be applied to user-defined primitives or modules. The rangemay be parameterized. Also, the enumeration ranges may be replaced by a concate-nation to change the connection bit order. For example,

MyBuffer Xbuff[1:3](.OutPin(OutBus), .InPin({InBus[5], InBus[3], InBus[7]}));

is equivalent to,

MyBuffer Xbuff[1] (.OutPin(OutBus[1]), .InPin(InBus[5]));MyBuffer Xbuff[2] (.OutPin(OutBus[2]), .InPin(InBus[3]));MyBuffer Xbuff[3] (.OutPin(OutBus[3]), .InPin(InBus[7]));

Note: The Silos simulator (demo version) may not be able to compile arrayedinstances of user-defined modules; apparently, it can compile arrays of verilog prim-itive gates with ascending-order array indices, only.

Next, we shall study the more flexible generate statement to create bussed de-sign objects. However, simple bussing can be accomplished easily and transparentlyby the arrayed instance construct.

12.1.3 generate Statements

A generate statement is a concurrent statement invoking one of the verilog pro-cedural control statements. A generate can be conditional and be based on anif or case; but, more frequently, generate is used with for to create arrays ofdesign objects. A generate statement can contain design structure such as gateor module instances, regs, nets, continuous assignment statements, or alwaysor initial blocks. A generate statement is not allowed to contain anothergenerate.

There are two somewhat different kinds of generate statement, conditionaland looping. We shall discuss the conditional kind first, introducing it in the contextof conditional compilation:

12.1.4 Conditional Macroes and Conditional generates

In C or C ++, there is a collection of directives, also called macroes, for the com-piler preprocessor; these are used for conditional compilation. For example, in thefollowing, each line beginning with ‘#’ introduces a preprocessor directive:


#define MacroName...#ifdef MacroName2... (compile something)#else... (compile something different)#endif

Verilog has similar directives, but they are introduced with an accent grave ‘‘’,also called a “backquote”, and not a ‘#’, the latter being reserved in verilog forquantification by delay or parameter. We have seen ‘timescale in almost everyverilog source file, and we used ‘include in the first lab exercise of the course.We discussed ‘default nettype at some length.

Verilog’s compilation directives include ‘define, ‘ifdef, and so forth (seeIEEE Std 1364 section 19, or our Week 12 Class 1 chapter, for a complete list), buttheir usage is not especially subtle or informative, and we shall introduce them onlyas they become useful to us. However, it is important to distinguish their meaningfrom that of a conditional generate. Compiler directives can create alternativesimulations by acting outside the simulation language to rearrange it; conditionalgenerates create language-based alternative structures.

A generate can be parameterized just as can be any other verilog statement;so, the generate statement can facilitate parameterized modelling of large IPblocks.

For example, here is a conditional generate fragment which instantiates amultiplexer named Mux01 either with latched or unlatched outputs:

parameter Latch = 1;...generateif (Latch==1)

Mux32BitL Mux01(OutBus, ArgA, ArgB, Sel, Ena);else Mux32Bit Mux01(OutBus, ArgA, ArgB, Sel, Ena/*unused internally*/);endgenerate

A similar result might be achieved by conditional compilation directives:

‘define Latch 1 // Macro name, but no macro; used as flag....‘ifdef Latch // NOT ‘ifdef ‘Latch; that substitutes a ‘1’!

Mux32BitL Mux01(OutBus, ArgA, ArgB, Sel, Ena);‘else

Mux32Bit Mux01(OutBus, ArgA, ArgB, Sel, Ena/*unused internally*/);‘endif

However, conditional generate should be preferred. The reason is that‘define’d token states persist during compilation from the point at which they


occur, over all modules subsequently accessed by the verilog compiler. If the com-pilation order of modules using the ‘define’d Latch above should change, (a) itmight not be defined yet (or even might have been ‘undef’ed) when the ‘ifdefabove was encountered; or, (b) Latch might be ‘define’d for unrelated reasonsbefore the ‘ifdef above was encountered. Either alternative could cause unex-pected results not accompanied by any warning.

A similar problem was described with ‘default nettype previously. Avoidcompiler directives when possible, except ‘timescale. If they are necessary, tryto use them only within a single module file, and then ‘undef every one possibleat the end of that file.

12.1.5 Looping Generate Statements

A looping generatemay be used for hardware alternatives, but its primary benefitis its applicability in building parameterized, repetitive hardware structures whichwould be time-consuming and error-prone if done manually (even by cut-and-paste).A looping generate statement itself can be generated conditionally, but we shallnot concern ourselves here with that level of complexity.

The Generate genvarThere is a special, nonnegative integer type required for use in a looping generateand which is guaranteed not to be visible in synthesis or simulation; this type isgenvar. A genvar is used for indexing and instance name generation by the com-piler as it converts the generate statement to the structure it represents. The con-version may be viewed as an unrolling process: The looping statement is unrolled,like a carpet, to a netlist, which then may be simulated.

One or more genvars may be declared inside the generate statement, pro-vided each is declared before the looping construct using it. Different genvarsmust be used at different levels in nested loops. The looping construct almost al-ways is a for statement.

12.1.6 generate Blocks and Instance Names

The block containing the looped statements must be named; this name is propagated,indexed by one or more genvar values, as the hierarchical root path to the instancesgenerated.

In effect, the generation block is a structural element, a submodule, in which arelocated the instances. This makes it possible to use object names within each blockwithout indexing, or modifying, them to make them unique.

A generate block, looping or not, may not contain another generate block;the verilog language forbids it. It may not contain module I/O declarations, a


specify block, or parameter declarations. However, multiple levels of looping(for loops) are allowed.

As our first example, suppose we want to generate an 8-bit bus of wires, eachdriving the D input of a flip-flop the output of which should be gated by an invertingthree-state buffer.

Also, suppose these other requirements: The library type of the flip-flop is DFFa;the instance names all should be FF-something.

In addition, the data input bus should be named DBus; the buffered output shouldbe DBusBuf. There should be common Clk, and Rst; and there should be indi-vidual QEna signals; all with the obvious functionality, as shown in Fig. 12.3.

Fig. 12.3 Generic schematicof one of the required array offlip-flops

The code for the generated array is below.

// module I/O’s include 8-bit DBusBuf output, and DBus & QEna inputs....generate

genvar i;for(i=0; i<=7; i=i+1)begin : BuffedBus // Here is the block name.wire QWire, QWireNot;DFFa FF (.Q(QWire), .D(DBus[i]), .Clr(Rst), .Clk(Clk) );not Inv(QWireNot, QWire);bufif1 Buf(DBusBuf[i], QWireNot, QEna[i]); // notif1 would be OK here.end

endgenerate

We used a verilog primitive inverter (not) in this example to illustrate localwiring; a notif1 would have been fine instead of the not and the bufif1.

The result of the above generate loop, unrolled, is an array of 8 namedblocks, BuffedBus[0] . . . BuffedBus[7]. Each unrolled block contains itsown nets named Qwire and QWireNot, and its own three gate instances, oneDFFa, one bufif1, and one not, each with the instance name exactly as given inthe loop statement; thus, logic levels are propagated properly among the unrolled,i-indexed array of named blocks.

To clarify the process, consider the code example immediately above. The firstgenerate, with index number 0, would unroll as a set of instances with the followingidentifiers:


BuffedBus[0].QWireBuffedBus[0].QWireNotBuffedBus[0].FF(.Q(BuffedBus[0].QWire), .D(DBus[0]), .Clr(Rst), .Clk(Clk))BuffedBus[0].Inv(BuffedBus[0].QWireNot, BuffedBus[0].QWire)BuffedBus[0].Buf(DBusBuf[0], BuffedBus[0].QWireNot, QEna[0])

The index numbers 0 for DBus[0], QEna[0], and DBusBuf[0] are fromexternally indexed vectors, not from the genvar i.

We don’t get the desired flip-flops named “FF[i]”, but we do get somethingjust as good, “BuffedBus[i].FF”.

The unrolled loop is shown in Fig. 12.4.

Fig. 12.4 Generated array of flip-flops with individual output enables

Notice the use of the genvar in the verilog coding above: It generates theinstance names of the unrolled blocks, and it is used to connect generate’d in-stances to outside indexed objects. There is no special reason to index componentsor nets within a generate loop statement, because the differently named, unrolledblocks create their own, unique new local instances and nets. For example, the notinstance in the 3rd block unrolled would be addressed in the simulator or in a syn-thesized netlist as, BuffedBus[2].Inv. If the block name was changed in theloop statement to DB, then that not would be named, DB[2].Inv.

So, if an external object is vectored or arrayed, that object must be refer-enced in the generate loop statement using a genvar index value. However,something in the loop must be indexed in order to create a mapping to morethan one indexed block name. External objects (nets) which are not vectored orarrayed, are not replicated within the unrolled loop and instead are fanned outwithin it. We can see this with the external nets, Clk and Rst, in the exampleabove.

Declarations allowed inside a generate loop include net types, reg, andinteger. A task or function declaration, being allowed in a module, alsomay be located inside a generate block, but these declarations are not allowedinside the generate loop statement.

It is easy to understand how the unrolled naming works, if the above generateloop statement is to be used with externally generated vectors of local wiring. If so,then the genvar index has to be referenced within the loop block for connectivity.For example,


...wire[7:0] QWire;//generategenvar i;for(i=0; i<=7; i=i+1)

begin : IxedBus // Here is the block name.DFFa FF (.Q(QWire[i]), .D(DBus[i]), .Clr(Rst), .Clk(Clk) );notif1 Nuf(DBusBuf[i], QWire[i], QEna[i]);end

endgenerate

In this example, the first, i=0 (inverting), buffer would be named,IxedBus[0].Nuf. It would be driven by IxedBus[0].FF.Q through exter-nal Qwire[0] and would drive external bit DBusBuf[0]. It would be enabledexternally by QEna[0].

The following example shows an equivalent generate fragment which doesnot use an external wire to drive Nuf:

for(i=0; i<=7; i=i+1)begin : IxedBus // Here is the block name.wire Qwire;DFFa FF (.Q(QWire), .D(DBus[i]), .Clr(Rst), .Clk(Clk) );notif1 Nuf(DBusBuf[i], QWire, QEna[i]);end

An alternative approach to this whole buffered FF example would be to put theDFF and its buffer in a new module, perhaps named FF, and then to generate (orinstance-array) on the module.

Finally, here is an example of a combinational decoder implemented bygenerate:

parameter NumAddr = 1024;...generategenvar i;for (i=0; i<NumAddr; i=i+1)begin : Decodeassign #1 AdrEna[i] = (i==Address)? 1’b1: 1’b0;end

endgenerate

This would create a structure with an input bus of the same width as the vari-able Address, which may be assumed to be log2(1024) = 10 bits wide, minimum,and an AdrEna bus with 1024 one-bit outputs. Whenever the value of Addressequalled that of some number n in the range 0–1023, the n-th bit in AdrEna would


go high after 1 unit of delay, and all other bits would go (or stay) low. This could beused to enable output in reading a word from a RAM.

An alternative procedural (RTL) way of implementing this decoder, which is verysimple and includes no generate or component instance, would be somethinglike this:

parameter NumAddr = 1024;integer i;...reg[NumAddr-1:0] AdrEna;always@(Address)begin : Decoderfor (i=0; i<NumAddr; i=i+1)#1 AdrEna[i] = (i==Address)? 1’b1: 1’b0;

end

12.1.7 A Decoding Tree with Generate

Let’s look further at the design of a big decoder. If the designer had a preferredlibrary component such as a 4-to-16 decoder available, the desired fanout tree mightbe shown abstractly as in Fig. 12.5.

Fig. 12.5 Fanout representation of a 10-to-1024 decoder composed of 4–16’s


The leftmost, level 1 decoder, selects one of the four level 2 decoders. Each ofthem selects one of 16 others, totalling 64. The 64 level 3 decoders each selects oneof 16 addresses, for a total decode of 1024 addresses (10 bits on the address bus).

For this arrangement to work, decoders with an enable input must be available;a disabled decoder should put ‘0’ on every one of its output pins, thus decodingnothing. A simplified schematic showing this idea is given in Fig. 12.6.

Fig. 12.6 Detail of the select logic for A = 1, showing only 2–4 decoders for simplicity

This arrangement easily is mapped to verilog concepts: In the full-sized tree, thelevel 1 selection can be represented by the state of the two LSB’s on a 10-bit bus.

Each subsequent level then adds 4 bits of information, because the number ofoutputs is just 4 times the number of (binary) inputs. Therefore, the level 2 statesmap to the next 4 bits, bits 2–5; and, the level 3 to the remaining bits, 6–9, whichinclude the MSB.

To fix the idea, Fig. 12.7 shows the path of enabled decoders in the tree whenaddress 0 is selected:

Fig. 12.7 Address 10’h0 decoded. Large arrows indicate the enabled branch


The trick in doing this kind of thing easily is to ignore whether the numeri-cal value of the address which is decoded happens to equal the bit position onthe decoded address bus; in almost all applications, it won’t matter, because gen-erating an address to write data always will result in the same decoded valuewhen a read is intended at the same address. What is important is that each de-coded bit position map uniquely (1-to-1) to the encoded input value. For exam-ple, looking at Fig. 12.7, which shows the decode of address 10’h0, we can seethat address 10’h1 would enable Level 2 decoder 1; with all other address bits0, this would select output location 256. But, the fact that address 1 is schemat-ically decoded to location 256 would be completely irrelevant in the majority ofapplications.

Given that the level 1 decode above requires less than one 4-to-16 decoder, thereisn’t any need for a generate to implement it. Let’s assume an optimized 4-to-16 decoder component is available in the synthesis library named, Dec4 16.We’ll keep track of the process with the help of mnemonic identifiers; each inputwire will be named in something, and each output wire will be named Decodedsomething.

// Level 1 decode is trivial:wire[3:0] DecodedL1;wire[1:0] inL1;wire[15:4] DecodedUnused; // Stub to suppress warnings.//// Get Address LSB’s:assign inL1 = Address[1:0];// The level 1 decode, using verilog concatenation:// output[15:0] , input[3:0]Dec4 16 U1({DecodedUnused, DecodedL1}, {2’b0, inL1}); // Done!...

Next, we’ll see that generate comes in handy for the level 2 decodes, but thatthey still could be cut-and-pasted conveniently.

What is valuable about using generate at level 2, is that the level 2 generateprovides a template for implementation of the level 3 decode, which, with 64 de-coder instances, would be a difficult thing to do correctly any way but by generation.This is how the level 2 decode might be done:


// The Level 2 decode, which requires 4 decoders, fully utilized:

wire[16∗4-1:0] DecodedL2;

wire[3:0] inL2;

assign inL2 = Address[5:2];

//

generate

genvar i;

// Generate the 4 4-16’s:

for(i=0; i<=3; i=i+1)

begin : DL2

wire[15:0] tempL2;

Dec4 16 U2(tempL2, inL2); // Each gets bits 2 - 5.

//

// Compose the decoded address from L1 and L2, and assign the bit:

assign DecodedL2[(16∗(i+1)-1):16∗i] =

(DecodedL1[i]==1’b1)? DL2[i].tempL2: ’b0;

end

endgenerate

The level 3 decode, requiring 64 decoders, is hardly any more complicated whendone with generate; it assigns everything to the 1024-bit address-enable bus,which might represent logic in a memory chip used to select a word for read orwrite:

// The Level 3 decode requires 64 x 4-16’s:reg[(16∗4)∗16-1:0] AdrEna;wire[3:0] inL3;assign inL3 = Address[9:6];//generate

genvar j;// Generate the 64 4-16’s:for(j=0; j<=4∗16-1; j=j+1)

begin : DL3wire[15:0] tempL3;Dec4 16 U3(tempL3, inL3); // Each gets bits 6 - 9.//// Compose the decoded address from L2 and L3, and assign the bit:assign AdrEna[(16∗(j+1)-1):16∗j] =

(DecodedL2[j]==1’b1)? DL3[j].tempL3: ’b0;end

endgenerate

The most important rule to keep in mind when writing a generated structure, isthat the result will be structural; nothing in the generated structure will be allowedto change during simulation time. Everything is a data object or a connection.


12.1.8 Scope of the generate Loop

What difference do these make?

generatefor (i=0; i<Max; i = i+1)begin : Stuffreg temp;and A(Abus[i], InBus[i], InBus[i+1]);always@(posedge Clk) #1 temp <= &InBus;assign #1 OutBit1 = temp;end

endgenerate

or,

reg temp;always@(Abus) #1 temp <= &ABus;assign #1 OutBit1 = temp;generatefor (i=0; i<Max; i = i+1)begin : InAndand A(Abus[i], InBus[i], InBus[i+1]);end

endgenerate

or,

generatereg temp;for (i=0; i<Max; i = i+1)begin : InAndand A(Abus[i], InBus[i], InBus[i+1]);end

always@(Abus) #1 temp = &ABus;assign #1 OutBit1 = temp;endgenerate

12.2 Generate Lab 15

Work in the Lab15 directory, and be sure to save the results of Step 1 and Step 2separately, so they can be compared in Step 3.

12.2 Generate Lab 15 225

Lab ProcedureFirst, we’ll do some generate . Then, we’ll reimplement our Mem1kx32

memory using generated structure; we then can use this memory in our FIFO.

Step 1. Arrayed instances. Implement the second generate example in the dis-cussion above, the one with notifs (“IxedBus”), using arrayed instances andwithout using generate. Simulate your result to verify it. Synthesize a netlist andexamine the resulting schematic. Keep the netlist for the next Step.

Step 2. Generated instances. Put the earlier (“BuffedBus”) generate ex-ample in the discussion above (the one with bufifs and nots) into a modifiedmodule with parameters defining all indexed widths. Use a parameter value in thegenerate loop to ensure consistency of bus widths with the genvar variable.Simulate the result to verify it. Synthesize a netlist with your same constraints as inStep 1, and compare the netlist schematic with the one from Step 1. Do the resultsdiffer?

Step 3. Area vs. speed in a small design netlist. Pick a design from Steps 1 or 2,synthesize it to compare areas when optimized for area (no speed constraint) insteadof speed (no area constraint).

Step 4. RAM generate exercise. In our Memory Lab 7 (Week 3, Class 1), wedesigned a verilog RAM behaviorally; it was called Mem1kx32. Below the instruc-tions here in this lab is a slightly modified copy of the RAM requirements.

Implement this RAM again, this time in a module named Mem1kx32gen. Usea generated vector of 33 D flip-flops (33rd for parity) for storage at each address.Simulate to verify your work.

Consider using the following procedure for this exercise:A. Delete the old memory array declaration. Don’t copy anything but module

I/O’s and their associated variables from Mem1kx32; the old design is not verycompatible with a structural generate, so attempting extensive reuse may wasteconsiderable time. However, your testbench should work with this lab’s design, sokeep a copy of the Lab 7 testbench handy.

B. Implement a simple behavioral D FF with D and Clk inputs, and a Q output.No clear or inverted Q. Put it in a module named Bit.

C. Use a loop generate statement with genvar j to generate a single vectorof 33 FFs.

Fig. 12.8 The j-th element ofthe vector generate loop

Every Bit instance should be connected to a common clock, as shown inFig. 12.8. One bit (j) of a 32-bit data-in wire vector should be connected to each


D; likewise connect a data-out wire vector to each Q. The unrolled word structureshould be as in Fig. 12.9.

Fig. 12.9 The j-generate’d vector structure

After getting the 32 bit vector of storage to simulate, add a bit and implement theparity check. At this point, you should have a generate which creates a single33-bit vector.

D. Use another loop in your generate statement with genvar i to generatejust four RAM storage locations (words), each one consisting of the unrolled vectorloop generate statement from Step 4 C. To select one word rather than another,the simplest way is to add a buffer to each read bit, so that words not addressed willhave their read output (flip flop Q pin) disabled. The disable can be by a decode ofthe word address.

This will be a prototype of the final 32-word memory. Your testbench can exer-cise all corner cases conveniently on just four words. A schematic representation ofthe prototype is shown in Fig. 12.10.

Fig. 12.10 Sketch of the 4-address generate’d memory prototype. Parity is not shown; namingis descriptive, only


E. Now, stop to make sure things are under control. If you haven’t already, definea module-level parameter, DHiBit, to control indirectly the value of genvarj. You may do this somewhat as follows:

If the data bus is declared directly by use of the parameter DHiBit as areg[DHiBit:0], the number of data wires (bits) then may be calculated by thecompiler as

localparam DWidth = DHiBit+1;

Then,

genvar j; ... for (j=0; j<DWidth; ...

Likewise, you may define AHiBit to set the number of address lines. The com-piler then can calculate,

localparam AWidth = AHiBit+1;

Given this, the number of storage locations (addresses) is,

localparam NumAddrs = 1<<AWidth; // Same as 2∗∗AWidth.

You then may use the simulator compiler to set the upper value of genvar i,

genvar i; ... for (i=0; i<NumAddrs; ...

For example, with AHiBit at 1, the address bus width is 2, and the number ofaddresses is 1<<2 = 4. It’s also fine to calculate AWidth-1, etc., for the upperlimit of the memory array index range.

F. Add functionality according to the requirements below, until your RAM modelseems complete. Simulate with just a few addresses for verification. Then, setAHiBit up to 4 to simulate a full 1k × 32 RAM (see Fig. 12.11). Synthesisand optimization of the full RAM may take quite a while (see the answer synthesisscript files for why); so, defer synthesis until after including the RAM in the FIFOof the next Step.


RAM Mem1k×32gen RequirementsA verilog 1k×32 static RAM model (32×32 bits) with parity. Call it,

“Mem1kx32gen”.Create a structural memory core from D flip-flops using generate state-

ments, one for generating the vector of FFs, and another, operating on the first,to generate the addressable array of vectors. You may use behavioral or RTLverilog to manipulate this core (which may be put in its own submodule, evenin a separate file, if you like).

Use verilog parameters for the 5-bit address bus size and total address-able storage. Parity bits are not addressable and are not visible outside thechip.

Use two 32-bit data ports, one for read and the other for write. Supply aclock; also an asynchronous chip enable which causes all data outputs (readport) to go to ‘z’ when it is not asserted, but which has no effect on stored data.The system clock should have no effect while chip enable is not asserted.

Supply two direction-control inputs, one for read and the other for write.Changes on read or write have no effect until a positive edge of the clockoccurs. If neither is asserted, the previously read values continue to drive theread port; if both are asserted, a read takes place but data need not be valid.

Assign some time delays for data and address bus changes, and supply adata ready output pin to be used by external devices requiring assurance thata read is taking place, and that data which is read out is stable and valid. Don’tworry about the case in which a read is asserted continuously and the addresschanges about the same time as the clock: Assume that a system using yourRAM will supply address changes consistent with its specifications.

Also supply a parity error output which goes high when a parity error hasbeen detected during a read and remains high until an input address is readagain.

Fig. 12.11 Cursory simulation of the generate’d 32-word Mem1kx32gen RAM

Step 5. Implement a FIFO using a generate’d RAM. Create a new subdirec-tory in the Lab15 directory, and copy your FIFO model from Lab11 into it. This


model should consist at least of three files, FIFO Top.v, FIFOStateM.v, andMem1kx32.v.

Use your new verilog generate’d RAM to provide the storage for a FIFO bysubstituting Mem1k×32gen for Mem1k×32. This might be too big for the demoversion of Silos.

Simulate. After verifying the functionality, it would be feasible to synthesize thememory alone; but, why not synthesize the whole FIFO as follows:

First, with area optimization only, and no design rule constraint (fanout, load, ordrive limits) or speed constraint of any kind. This is a fairly large module for syn-thesis (about 30,000 transistor equivalents); a register file this large usually wouldbe implemented as a hard macro rather than being synthesized in random logic.

Notice that the synthesizer will report that none of the parity bits are connectedto anything. We have omitted parity detection in this model, so we should expectthese warnings.

Second, synthesize with no design rule, a zero-area constraint, a 500 ns clock pe-riod, and just a modest 50 ns max delay speed constraint on the outputs. A substan-tially stiffer speed constraint or any design rule constraint will prolong optimizationtime noticeably in this design; the cause is the generate, not the design itself. Donot apply an input or output delay constraint on clocked data, because this also willlengthen the optimization time greatly.

Third, optionally, if time permits, try synthesis using this procedure: Define a500-ns clock and apply a zero-area constraint, then compile with no other constraint.Then, in the same synthesis script run, set a don’t-touch on the memory module,apply the design rules and other constraints just below, and compile a second timewith incremental mapping, only.

Optional rules and additional constraints for incremental compile:

set drive 10.0 [all inputs]set load 30.0 [all outputs]set max fanout 30 [all designs]#set max delay 50 [all outputs]set output delay 1 [all outputs] -clock Clockerset input delay 1 [all inputs] -clock Clocker

The generated FIFO netlist will not simulate correctly – not because of thegenerate but because we have not yet designed a synthesizable FIFO state ma-chine. We shall postpone netlist simulation of the FIFO until later in the course.


Contrast the use of index values in arrayed instances vs. generate’d instances.Can a genvar name conflict with that of an integer in the same module?Does a generate block create its own name space?



Read Thomas and Moorby (2002) section 3.6 on scope of names and hierarchicalnames.

Read Thomas and Moorby (2002) sections 5.3–5.4 on arrayed instances andgenerate blocks.


Look over Appendix C for the various verilog compiler directives. You may be ableto guess what half of them do just by their names.

Read section 7.8, on generate.


13.1 Serial-Parallel Conversion

We put aside our serdes project a while ago to strengthen our understanding ofverilog; now we return to it to implement another part of the deserializer.

13.1.1 Simple Serial-Parallel Converter

We have implemented the serialization frame encoder (Lab 6, Step 8; Week 2Class 2), and we studied the processing, which is to say, the decoding, of the in-coming serial stream when we developed a way of synchronizing a PLL with ourframed packet protocol; this was toward the end of Lab 10 (Week 4 Class 1).

So, in terms of our project, we have only a little more to do beyond puttingtogether some things we have done already, and modifying a couple of modules forsynthesis. However, let’s look some more into the parallelizing of a serial stream atthis point.

Fig. 13.1 Genericserial-parallel converter

The general functionality is given in Fig. 13.1. A parallel bus of output latches orflip-flops should be present, as well as a clock or other synchronizer, and at least oneserial input. A purely combinational deserializer is possible, but it would be difficultto use for anything without output synchronization.

There has to be a serial clock, but this may be the same as the parallel clock. Theserial clock, if not the same as the parallel clock, may be generated either from theserial side directly, or, as in our serdes project, derived from the serial data.



There should be an internal register to hold partially deserialized data, but thisshouldn’t be the same as the one latching the parallel output, unless the design is toallow the fully parallelized data to be available on the output for a duration of lessthan one clock.

Optional functionality might include a flag announcing when parallel data onthe output bus are valid; however, this functionality in principle could be achievedby clock-counting. There should be a SerValid input from the serial side to flagwhen serial data are available to convert. This SerValid flag might toggle witheach serial bit or byte received; or, instead of such a flag, or we might provide aserial clock based on the assumption that the serial stream can be synchronized withit. If we want to be able to start up the device with a zeroed shift register, we mayprovide an optional parallel-side reset to do this. We shall assume that the convertercould have no control over the serial line’s transmitter, so it would not be meaningfulto provide a serial reset.

Summary of Serial-parallel Conversion Functionality

• Latched parallel data out.• Parallel-side clock in.• Serial data in.• Serial-side clock in (optional).• Internal deserialization register.• Parallel-valid output flag (optional).• Serial-valid flag (optional).• Parallel-side reset (optional).• No serial-side reset.

In our serdes project, the serial clock is embedded in the serial stream and mustbe decoded by the receiver without any SerValid flag. This has the obvious dis-advantage of requiring some startup overhead before the clock phase and frequencycan be established; it has the more-than-redeeming advantage of not requiring anyseparate clock, with attendant differential phase-lag, separate cross-talk, or relatednoise issues. So long as the data are usable, the embedded clock will be usable, too.This permits a data stream independent of any other signal in the design; in prac-tice, this is an important factor permitting GHz frequency-range data transmissionalmost with a zero error rate.

13.1.2 Deserialization by Function and Task

If we don’t worry about frame boundaries, parallelization is almost trivial in verilog:We shift in the serial data until we have enough for a parallel word; we unload theshift register onto the parallel bus; and, we continue shifting. We may assume thatparallel clocking is fast enough that no serial data will be lost.

13.1 Serial-Parallel Conversion 233

Here’s a verilog representation of a simple, generic deserialization:First, we do the shift. This can be implemented by a function, which can return

the shifted value at some reasonable delay time, but which can be allowed to performthe shift itself in zero simulation time. Verilog has a built-in shift operator, so all thefunction has to do is be sure to append the new bit being shifted in:

parameter ParHi = 31;...function[ParHi:0] Shift1(input[ParHi:0] OldSR, input NewBit);reg[ParHi:0] temp;begintemp = OldSR;temp = temp<<1; // MSB goes lost.temp[0] = NewBit;Shift1 = temp;end

endfunction

When calling the Shift1 function, we pass it the current shift register and thenew serial bit to be shifted in. This function can be simplified; it is written as abovefor instructional reasons.

Second, we do the conversion. Unloading the shift register onto the parallel businvolves a ParValid flag and probably a few delays, so why not implement it asa task?

parameter ParHi = 31;...reg ParValidFlagr;reg[ParHi:0] ParSR, ParBusReg;// rise, fallassign #( 1, 0 ) ParValidFlag = ParValidFlagr;...task Unload32; // Copies the parallel SR to the output bus.

begin // Also clears the SR for the next word.ParValidFlagr = 1’b0; // Lower the parallel-valid flag.ParBusReg = ParSR; // Transfer the data.

#5 ParSR = ’b0; // Clear the SR.ParValidFlagr = 1’b1; // Raise the flag.

endendtask

The delays are included just for illustration. Keep in mind that delays in proce-dural code should be avoided for design purposes; module output lumped delays, ifany, should be estimated to account for delays synthesized from procedural code.

Third, we regulate the shift. We have to put things together in a way that theshift-register shifts in the bit on the serial line on each clock, provided the serial bitis valid and the device is not being reset.


Actually, there was no design reason in the second step above to clear the shiftregister after each conversion; a shift-counter will be used to determine unloading ofnew data, not left-over data, to the parallel bus. However, starting each conversionwith a clear register does make it easier to see new data arriving during simulation.

The temp register in the Shift1 function above was included for the samereason of clarity during simulation; Shift1 could be written more minimalisticallythis way:

function[ParHi:0] Shift1(input[ParHi:0] OldSR, input NewBit);beginOldSR = OldSR<<1;Shift1 = {OldSR[ParHi:1],NewBit};end

endfunction

An always block to assemble our code fragments is shown below.

always@(posedge ParClk, posedge ParRst)begin : Shifterif (ParRst==1’b1)

beginN <= 0; // N counts the bits shifted.ParSR <= ’b0; // The shift register.ParBusReg <= ’bz; // The parallel out bus.ParValidReg <= 1’b0; // ParValid.end

else if (SerValid==1’b1) // Ignore the serial line if 0.beginParSR <= Shift1(ParSR, SerIn); // function called.N <= N + 1;if (N>ParHi) // If 32 bits shifted.beginUnload32; // task called.N <= 0;end

end // SerValid.end // Shifter.

13.2 Lab Preface: The Deserialization Decoder

The next lab will begin with a couple of very important exercises on deserialization.After that, there will be a first cut at implementing the deserialization decoder of ourserdes project.

13.2 Lab Preface: The Deserialization Decoder 235

It is essential to understand what the deserialization decoder (DesDecodermodule) is supposed to do: It identifies data frames which were encoded in theserial input data stream. It also generates a 1 MHz clock meant to be synchronizedwith the sender’s clock.

In our design, each frame is 16 bits wide and ends with an 8-bit pad pattern. TheDesDecoder decodes the packets; it does not do a simple logical decode, suchas did the 4–16 decoder of our previous lab. In decoding the packets, it also ex-tracts the sender’s clock, which is implied by the format of the incoming serial datastream.

Our deserializer spans two independent clock domains. The serial stream comesinto the deserializer at an approximately known frequency depending solely on theclock in the sender’s clock domain. However, the deserializer also clocks data outof its FIFO using a clock in the receiver’s clock domain. The FIFO input is in thesender’s domain; the FIFO output is in the receiver’s domain.

Looking at our Deserializer’s PLL, it is clocked by a free-running 1 MHzclock which the DesDecoder attempts to synchronize to the sender’s clock. Thisclock emulates, in verilog, an on-chip clock controlled by a variable-capacitanceoscillator. This clock is concerned only with the deserialization and with the FIFOinput. The PLL’s comparator receives two different 1 MHz clocks: One is the free-running one which is multiplied by 32 to clock in the serial data; the other is aclock extracted by the DesDecoder from the serial data stream. If the free-runningclock was perfectly synchronized with the serial data stream coming in, it wouldbe perfectly in the sender’s clock domain, and every packet could be identifiedcorrectly.

Any DesDecoder failure to decode a packet means that there is no guaran-tee that the free-running clock is indeed in the sender’s domain. The PLL com-parator keeps track of any frequency difference between the incoming streamand the free-running clock; however, the PLL does not adjust its frequency un-less the DesDecoder can confirm that it has identified a packet. Whenever theDesDecoder identifies a packet, it allows the PLL to adjust its frequency slightly,and it synchronizes an edge of the free-running clock with the identified packetboundary.

Our design uses integer arithmetic to equate the two sender-domain clocks, theone actually in use by the sender, and the free-running clock of the DeserializerPLL. Because 1/32 of the period of a 1 MHz clock is not an integer value in ns butonly can be rounded near to an integer, our PLL never will be truly locked in theway an analogue PLL could. However, it can be locked in approximately, and thiscan be close enough that several packets can be extracted correctly before the serialstream drifts out of synchronization.

Before beginning the lab, you may wish to review our original plan for the serdesproject as described for Week 2, Class 2. In the present lab, we shall not extract thesender’s clock (that will be later); however, we shall decode the packets being sent.

Here again, in Fig. 13.2, is the block diagram of the conceptual serial-parallelconverter (deserializer) as described in Week 2 Class 2. We shall implement a sim-plified, first-approximation Deserialization Decoder in this lab.


Fig. 13.2 Week 2 serdes Deserializer data flow

Recall what was done in Step 4 of Lab 6 of Week 2 Class 2: We decided then ona packet format in which a frame of 16 bits would be used to represent each 8-bitbyte of serial data.

The data bits would be contiguous and bounded below (later in the stream, whicharrives MSB first) by a pad byte consisting of three ‘0’ bits, then a 2-bit countlocating the preceding data byte in its 32-bit word, then another three ‘0’ bits.A packet of 32 bits of data then would look like this, each ‘x’ representing a data‘1’ or ‘0’:

64’bxxxxxxxx00011000xxxxxxxx00010000xxxxxxxx00001000xxxxxxxx00000000,

with serial arrival being from left (earliest) to right (latest).

13.2.1 Some Deserializer Redesign – An Early ECO

According the block diagram of Fig. 13.2, which was meant to be conceptual, wecan count on the incoming data’s having been deserialized and collected into 16-bitwords before the Decoder. However, given our serial protocol, no framing can bedone in the Frame Buffer until the serial clock has been decoded; so, as shown, itis not obvious that the 16 bit buffer would not contain data crossing encoded-byteboundaries.

The Frame Buffer, then, interpreted as a serial-parallel converter, first has to bealigned on data+pad boundaries. This might be done in the diagram of Fig. 13.2 byincluding a PLL feedback from the Decoder to the Serial Receiver.

But, it is unclear how this feedback could be separated from deserialization; so,probably, for data flow purposes, the best plan is to fold the Frame Buffer into theDecoder logic and no longer view it as an independent block.

The Decoder should manipulate the digitized serial stream using its own registersand buffers. Our Deserializer design data flow block diagram then should beamended as shown in Fig. 13.3.

13.2 Lab Preface: The Deserialization Decoder 237

Fig. 13.3 Amended serdes Deserializer data flow

We are not interested in PLL clock synchronization in this exercise. If we assumethat the PLL is located in the Serial Receiver block, we can delete the Frame Bufferblock; we then have no reason to worry about distribution of the PLL clock at theblock level, and we need not show any feedback from the Decoder.

This block diagram, after all, is just data flow and need not include any represen-tation of the clocking scheme.

Because it takes 64 bits to encode 32 bits of data, it will take two ParClk cycles,at 1 MHz, to process each 32-bit decoded word. These considerations are discussedin detail in Step 4 of Lab 10. We only assume here that we can shift in the serialdata from port SerIn at 32 Mb/s, using an externally supplied, 32 MHz serial clocknamed SerClk.

13.2.2 A Partitioning Question

Going beyond data flow, another question here is one of second thoughts: Shouldwe indeed establish frame synchronization here, in the Deserialization Decoder, orshould we leave it to PLL-related clock-extraction code in the Serial Receiver block?The verilog already mostly has been written (see the final code example in Lab 10of Week 4 Class 1). Frame synchronization is essentially equivalent to serial clockextraction.

If we leave serial clock extraction in the Serial Receiver, then the PLL will belocalized entirely there, and there will have to be considerable digital structure,maybe a shift register or small register file, to handle the pad-pattern extraction.But, serial data arriving at the Deserialization Decoder (DesDecoder) will be ac-companied by frame-boundary flags from the Serial Receiver, and parallelization inthe DesDecoder block could be very simple and generic.

On the other hand, if we put serial clock extraction in the DesDecoder, the PLLmay be located either in the Serial Receiver block or in the DesDecoder. If theformer, synchronization feedback will have to be provided by the DesDecoderto the Serial Receiver; if the latter, we have mixed digital and analogue in the


DesDecoder block, requiring special layout procedures and other DesDecoderdesign overhead.

The answer we shall adopt here, is to put the PLL in the Serial Receiver block,and to do the frame synchronization (serial clock extraction) in the DesDecoder.This will isolate analogue functionality to the Serial Receiver block and will reduceduplication of digital effort in deserializing the data. The DesDecoder will haveto extract a (nominally) 1 MHz clock from the serial stream being parallelized andprovide it to the PLL. This is shown in Fig. 13.4.

Fig. 13.4 Final Deserializer detailed block diagram

A 1-MHz clock is a low-frequency signal in our design, so no special precautionwill have to be taken, except possibly to account for the transport distance delay, asmall phase lag, from the DesDecoder to the PLL.

ECO completed; now we move on to the lab exercise

13.3 Serial-Parallel Lab 16


Lab Procedure

After some design warm-up, we’ll discuss, and then implement, a first-cutDeserializer for our serdes project.

Step 1. Generic deserializer. Using the lecture material as desired, implementa deserializer which fulfills the generic description above and realizes the aboveSerToParallel block diagram’s I/O’s. The generic requirements are just toclock in the serial data until 32 bits have been acquired, and then to copy themonto the parallel bus, using the same clock (possibly the opposite edge). Assumethat the data are unframed and should be grabbed in 32-bit words so long as theSerValid flag is asserted.

13.3 Serial-Parallel Lab 16 239

Simulate the design at least for three deserialized, 32-bit words. Use $randomfor serial data (see Figs. 13.5 and 13.6). Do not spend time synthesizing thisdesign.

Fig. 13.5 The generic deserializer

Fig. 13.6 Generic deserializer close-up of the serial bit counter wrap-around

Step 2. Deserialization data-stream synchronization. Modify your generic dese-rializer from Step 1 so that whenever the serial stream contains 12 successive ‘0’bits in a row, those bits should be rejected, and deserialization should cease. Afterceasing, with any incomplete word data (prior to the 12 0’s) saved, the device thenshould wait until 12 ‘1’ bits in a row arrive; after the 12th ‘1’, the saved datashould be revived and deserialization should resume with the shift register where itleft off.

One good approach to Step 2 would be to start by just shifting in a stream ofbits and identifying the two bit-patterns by setting two flags, Found stop andFound start.

After you have simulated identification correctly, create a parallelizing registerand copy incoming bits to it until Found stop is asserted. After Found startsubsequently is asserted, continue copying. The code asserting Found stop coulddeassert Found start, and vice-versa. Every time 32 bits have been parallelized,offload them to an output bus and set a parallel valid flag.

There are numerous different possible ways of implementing this design. Prob-ably, there should be a shift-register which continues shifting while deserializationto the parallel bus is stopped.


This design also should be simulated (Fig. 13.7) but not synthesized.

Fig. 13.7 Synchronizable, but unsynthesizable, generic deserializer, as described in the lab text

Step 3. Deserialization Frame Decoder. We return now to our serdes project.To establish data framing, we’ll allow, optionally, a less rigorous criterion than

when we locked in the PLL in Lab 10:As a synchronization criterion, the current packet may be considered locked in

on the data byte received immediately after a pattern of 8’b000 00 000 has beenshifted in on SerIn. Deserialization may be performed throughout every period ofsynchronization. This criterion optionally might be stiffened to allow lock in onlywhen all four pad patterns are in the shift register.

However, regardless of synchronization, the ParClk should be toggled on every16th bit received on SerIn following the most recent synchronization, so that therewill be 4 edges and thus two ParClk cycles for every 64 bits received, regardlessof current synchronization state.

The synchronization loss criterion optionally may be that the embedded, 2-bitsequential pad number (. . ., 2’b11 -> 2’b10-> 2’b01-> 2’b00, . . .)is found to miscount, or just that the first pad (pad 00) can’t be located 64bits after the previous one. Either way, synchronization loss should have no ef-fect on ParClk; the receiving PLL clock is free-running unless commanded to(re)synchronize.

Deserialization should be stopped during desynchronization, and incomplete datapackets should be discarded. Deserialization should resume with the first data bytefollowing resynchronization by either criterion above.

It is understood that resynchronization might cause a ParClk glitch. But, theDesDecoder can be designed not to malfunction because of such a glitch; and,for other design blocks on the digital side, that’s one reason we have a FIFO in thisdeserializer.

In summary, the DesDecoder block will be supplied a digital data streamclocked in by an externally-generated signal, SerClk, but this clock will not bemade available to the deserializer. The SerClk, and a simulated SerIn datastream, using our 64-bit packet-framing protocol, should originate in a testbenchmodule in the lab exercise. See Fig. 13.8.

13.3 Serial-Parallel Lab 16 241

Fig. 13.8 DeserializationDecoder: First-cut blockdiagram

So, implement the DesDecoder to extract a nominal 1 MHz clock from theserial data stream and use it to clock out the properly-framed, parallelized data in32-bit words. Do not consider PLL functionality; just provide a 32 MHz serial clockinput along with the stream of serial data: SerIn = serial data; SerClk = se-rial input clock; ParClk = output clock, extracted from the serial input stream;ParBus = 32-bit output parallel word, clocked by ParClk.

To complete this Step, use anything you wish from Lab 10, or previous work,and implement the deserialization decoder block as shown here. Keep in mind thatit may be simpler to start this design from scratch and only copy small fragmentsof your previous work. In particular, the shifting of the shift register can be muchsimpler here than for the Step 2 problem of this lab.

Consider breaking down the problem into several small tasks, setting up the call-ing of these tasks, and then implementing the bodies of the tasks.

Simulate your result, only (see Figs. 13.9 and 13.10); we shall synthesize thissubblock of our serdes later in the course.

Fig. 13.9 The deserialization decoder (DesDecoder), not yet synthesizable

Fig. 13.10 The DesDecoder, zoomed in on the serial bit counter wrap-around



Think about problems in changing the packet width (to 128 bits, 36 data bits withparity), etc.


Reread this lab’s instructions and review previous labs referenced in this one.


14.1 UDPs, Timing Triplets, and Switch-level Models

We again put aside our serdes project to study some verilog in depth. This time, weshall look into the basics of the design of small devices at and below the gate levelof complexity.

14.1.1 User-Defined Primitives (UDPs)

The UDP primitive is a verilog feature which gives a user a way to create SSI(Small-Scale Integrated) device models which simulate quickly and use little sim-ulator memory. The target component for a UDP is any specialized kind of latch(flip-flop) or a complex combinational gate.

UDPs exist at the same design level as modules, and they have no functionalitythat a module can not have. Usually, several UDP models will be placed in onelibrary file; they may coexist in such a file with models implemented as modules.However, often, these primitives are instantiated in a module wrapper to givethem timing, multiple outputs, or other essential functionality.

UDPs are allowed to have only one output and any number of inputs; the verilogStd 1364 specifies a limit of at least 9 inputs for combinational UDPs and at least10 for sequential. Every I/O must be one bit wide. The functionality is by a look-uptable.

UDP’s are not synthesizable.The verilog structure of a UDP is: primitive keyword, UDP name, I/O decla-

rations, a reg declaration if the device is sequential, (traditionally) an initializationblock, and finally a table defining the logic. If an ANSI header is used, initial-ization may be done in the header; however, here we shall initialize only in theinitial block. The table columns are somewhat different for combinationalvs. sequential UDPs.

In exchange for simplicity and speed during simulation, a UDP may not includedelays, ‘z’ states, or bidirectional (inout) ports. A ‘z’ on a UDP input is handled



internally as an ‘x’. Modern library components may be difficult to model as UDP’sbecause of the lack of delay or other technology-dependent functionality.

A typical table row for a combinational UDP has this organization:

(inputs in declared order) : (output) ;

For example, three UDP table rows for a three-input and gate:

table...1 0 0 : 0;1 1 1 : 1;1 1 x : x;...endtable

Of course, verilog already includes a primitive representing an n-input and gate;a more practical UDP implementing an and-or combination would be as follows:

primitive u2And1Or(output uZ, input uA1, uA2, uOr);//// Models uZ = (uA1 & uA2) | uOr.//table// Output on right; inputs in declared order:// and’ed inputs or’ed input// uA1 uA2 uOr uZ

0 0 0 : 0;1 0 0 : 0;0 1 0 : 0;1 1 ? : 1;? ? 1 : 1;

endtableendprimitive

Note on terminology: Compound combinational gates often are included inan ASIC library. Whether or not implemented as UDP’s, they are namedaccording to their functionality and number of logic-term inputs: “A” for and,“O” for or, “I” for inversion, etc. Consistent naming facilitates library main-tenance. For example, a cell evaluating (A&&B)||C might be named, AO21.A cell evaluating !((A&&B)||(C&&D&&E)) might be named, AOI23.

For a sequential UDP, there are two columns in the table delimited by colons.The left column, surrounded by colons on both sides, represents the current state ofthe one, 1-bit, storage register allowed and maps to the output port; the rightmostcolumn represents the next state, given the current state and all inputs.

14.1 UDPs, Timing Triplets, and Switch-level Models 245

For example, here is a UDP which may used to implement a D flip-flop:

primitive uFF(output reg uQ, input uD, uClk, uRst);//initial uQ = 1’bx; // Not same as a module initial block.//table// Output on right; inputs in declared order:// current next// uD uClk uRst uQ uQ

0 (01) 0 : ? : 0 ; // Clock in 01 (01) 0 : ? : 1 ; // Clock in 10 (0?) 0 : 0 : 0 ; // Default to keep same 01 (0?) 0 : 1 : 1 ; // Default to keep same 1

// Unclocked:? (1?) 0 : ? : - ; // Ignore negedge.

(??) ? 0 : ? : - ; // Retain state.// Reset asserted:

? ? (01): ? : 0 ; // Posedge reset? (??) 1 : ? : 0 ; // Ignore clock edge

(??) ? 1 : ? : 0 ; // Ignore clock state? ? 1 : ? : 0 ; // Ignore clock state

endtableendprimitive

The table row entries in any UDP should be separated horizontally by whites-pace for readability, but they need not be. UDP’s derive from the days in whichit was time-consuming work for a workstation computer to parse a few extra blankcharacters as spacers – and designers, used to punching and collating Hollerith cardsto enter a program, couldn’t read the input very well if they wanted to. Anyway, eachrow ends with a semicolon (‘;’).

Edge sensitivity in a sequential UDP is described by the two states of an edge,written inside parentheses, initial state to the left; thus, “(01)” represents a risingedge, and “(10)” a falling one. Each table row is allowed only one edge. Whenan input change affects both a level and an edge column, the edge is evaluated first,then the level; this means that the level prevails in the event of a conflict. As in acase statement, ‘?’ means a don’t-care input. A hyphen (‘-’) means no change onan output.

The UDP table should be complete, because a state or edge definition in onerow opens up possibilities for the simulator to misinterpret any other possiblepermutation of the values; thus, it is important to add a row for every possible fore-seeable event. This becomes complicated for sequential UDPs, in view of the edgepossibilities.

Several more rows would have to be added to the UDP above, if it became nec-essary to include ‘x’ output states selectively.

Like the verilog builtin primitive gates (and, or, etc.), UDPs may be instantiatedwithout an instance name. An instantiation may include a delay expression.


To summarize what we have said about UDPs:

UDP’s are look-up table-defined primitives. Keyword is primitive.Structurally interchangeable with modules.Advantages:

• May be instantiated without an instance name.• Accept delays when instantiated.• Fast compilation and simulation.

Limitations:

• Allowed only one, 1-bit output.• Only 1-bit inputs (up to at least 9 of them).• No timing or parameter declaration.• No specify or other internal block.• No ‘z’ or bidirectional functionality.

Not used in VLSI design.Not synthesizable.Sometimes used in verilog simulation library development.

14.1.2 Delay Pessimism

We move on, now, to a discussion of verilog delays. It is important to understandthat functionality and timing are almost-orthogonal features of a simulation.

Functional edges. When a ‘1’ level occurs after any other level, including an‘x’, the functionality is that of a rising edge, and an always event expression orany other edge expression will treat it functionally as a posedge event. Likewise,when a change results in a ‘0’, it is functionally a falling edge, and functionally itmeans a negedge event.

Timing edges. When a (rise, fall) timing expression is associated with a changefrom ‘0’ to ‘1’, the rising-edge (posedge) delay will be given by the rise value.The negedge delay for a change from ‘1’ to ‘0’ will be given by the fall value.But, this correspondence with functional edges ceases when ‘x’ (or ‘z’) levels areinvolved.

When the change is to an ‘x’ level, neither the rise nor the fall in a (rise, fall)or (rise, fall, to z) expression has a specific meaning. Instead, the smallest availabledelay value is used by the simulator. This quick change to an unknown lengthensthe duration of unknown states and thus is called “pessimistic” in regard to knowingthe hardware value, which must be ‘1’ or ‘0’.


When the change is from an ‘x’ level, the largest available delay value is used,again lengthening the duration of unknown states, and, thus, again, being “pes-simistic”.

Thomas and Moorby (2002) Table 6.6 (section 6.5.2) also explains delaypessimism.

14.1.3 Gate-Level Timing Triplets

In Week 4 Class 1, we saw how strength might be assigned to a gate output byputting one or two strength keywords in parentheses. For example, an NMOS orgate:

or (strong0, weak1) or 01(out or, in1, in2, in3, in4);

We also have seen the same approach to assign delays, except that the delayvalues were preceded by a ‘#’. For example,

or #2 or 01(out or, in1, in2, in3, in4);

Parentheses are optional around the single delay value. If both strength and delayare assigned, the strength specification is to the left of the delay specification:

or (strong0, weak1) #(2,1) or 01(out or, in1, in2, in3, in4);

As we have seen, delay values also may be assigned in multiples of two or threein parentheses. The interpretation of such values is as follows:

1 value:#t inst name(. . .);

Every change on the output(s) isscheduled with this delay.

2 values:#(tr,tf) inst name(. . .);

Every rise is scheduled with the firstdelay value tr; every fall with thesecond delay value tf.

3 values:#(tr,tf,tz) inst name (. . .);

The first two are the same as for 2 val-ues; the third delay is the delay to‘z’ state for gates which are capa-ble of it.

To repeat the verilog delay scheduling rules, when selecting multiple delay val-ues, as for wires, a rise in the table above only is a change from ‘0’ to ‘1’; a fall onlyis a change from ‘1’ to ‘0’. For changes involving ‘x’ or ‘z’, the values specifiedare used “pessimistically” by the simulator: When only two values are given and thegate is capable of ‘z’, the delay to ‘z’ is the shorter of the two. In all cases of twoor more values, the recovery from ‘z’ or ‘x’ is the longest value, and the delay to‘x’ is the shortest value.


A Second Dimension of Delay. In the old days, when designs were mostlyboard-level assemblies of relatively small digital IC’s and discrete passive compo-nents, logic simulators had to account for temperature, supply voltage, and fabrica-tion differences across the board. Minimum and maximum delays were estimatedseparately for each chip. The resulting timing differences were simulated by as-signing ‘x’ wherever the simultaneously-estimated minimum vs. maximum delaysallowed it (=pessimistic). See Fig. 14.1.

However, most modern designs use VLSI IC’s structured in blocks with latched(clock-synchronized) outputs, so this kind of pessimism has not been useful dur-ing simulation, even in recent deep-submicron designs. The chip itself tends to bemore or less uniformly fast, typical, or slow; so, only one condition or extreme hasbeen applicable in any one simulation evaluation. The min-max approach, however,usually is used during static timing verification, which we shall touch upon brieflylater.

Fig. 14.1 Board-level simu-lation represents edge delayspreads as unknowns (top);IC-level simulations are runwith no spread and each delaystate separately (lower three).The sloping edges representindividual uncertainty inter-vals caused, for example, byskew or jitter

In recent VLSI designs, with pitch down to about 90 nm, the simulator has notbeen used to choose the condition, best, typical, or worst case; the designer hasmade this choice. Gates in libraries for logic synthesis still have to be characterizedfor temperature, supply voltage, and process variations, so there still is a need fora simulator to handle multiple possible delays on a gate. But now, devices beingsimulated almost always are assigned just one, single, specific, global range of delay,minimum, typical, or maximum, representing the expected global uniformity of agiven, operating IC. The simulation then is repeated several times, each time witha different global alternative, min, max, or typ. VCS, as other verilog simulators, istold which global value to use by setting a command-line option when it is invoked.The default is typical.

There is some indication that verification of very large designs (90 nm and below)may require two, or all three, of each triplet to be used in a single simulation runin order to display local uncertainty of timing in a simulation waveform. This maybecome necessary, for one reason, because of the temperature differences whichcan develop across a chip as a function of operating time or mode of operation. Themore that is put on a chip, the more likely that different functional blocks will beoperating under different conditions. Thus, the old board-level simulation displays,showing a range of unknowns around every edge, again may become common.


Regardless of simulator design or practice, in the verilog, to assign these tripletvalues to a gate output, they simply are substituted, separated by colons, for thesingle values which we have been using up to now. So, when indicated by char-acterization of the synthesis library, the table above may be modified to replace twith t min:t typ:t max; and, each of tr, tf , and tz in that table above then will betriplicated correspondingly.

The first or gate example above then may be changed to,

or #(1:2:3) or 01(out or, in1, in2, in3, in4);

Likewise, if a bufif1 were used to model a gate with a relatively slow turnoff,instantiation might require a delay specification given by,

bufif1 #(1:3:4, 1:2:4, 6:7:8) triBuf 2057(OutBit, InBit, CtlBit);

The simulator does not impose or require order in the values in a triplet; in prin-ciple, one might find min:typ:max specified with min ≥ typ > max.

This isn’t all there is to the calculus of delay in verilog; but, we shall put offinternal delays in library components and other devices until later in the course.

14.1.4 Switch-Level Components

We introduced the different verilog strengths, and some of the different types ofnets, in Week 4 Class 1. The net types and built-in gate types were studied furtherin Week 5 Class 2. We now shall go beyond this in modelling devices at the switchlevel, which is to say, at a level in which gates are treated as made up of individualsubstructures (transistors) which may be simulated as switching on and off.

To model at the switch level, we require switch-level primitives. These are sup-plied in verilog as the MOS, CMOS, bidirectional, and source switches.

Here is a list of all the available switch-level primitives:

MOS switches:nmos, rnmos (like bufif1)pmos, rpmos (like bufif0)cmos, rcmos

Bidirectional pass switches:tran, rtrantranif1, rtranif1tranif0, rtranif0

Switch-level net:trireg

Power sources:pullup, pulldown


MOS Switches. MOS stands for Metal-Oxide-Silicon, the main layers used tofabricate devices in this technology. It represents an advance in semiconductor tech-nology over the more power-hungry bipolar (P-N junction; current-operated) semi-conductors. MOS transistors have a source-drain potential which supplies energy foramplification, and a gate which turns source-drain current on or off electrostatically.

A P-channel transistor (pmos) conducts more current when its gate is more neg-ative; an N-channel transistor (nmos) when its gate is more positive. Furthermore,P devices, which have hole majority carriers, are slower than N devices of the samesize; so, in CMOS technology, the P side generally is made larger than the N side tomake the response speeds more nearly equal. Thus, CMOS P devices are fabricatedto operate geometrically nearer the high supply voltage rail than ground, because,there they can be operated with delay comparable to that on the N side. Anyway,for a variety of reasons, P devices are fabricated near the supply1 rail and N devicesnearer the supply0 rail. Because of this device gravitation, the elementary logic gatesin CMOS technology tend to be nand and not rather than and and buf.

The MOS primitives are nmos, pmos, rnmos, and rpmos. The r means “re-sistive”, and the r* primitives are meant to represent physically higher-impedancedevices than the others.

All the MOS devices individually are functionally identical to the bufif1(nmos) or bufif0 (pmos) primitives we have used already in lab, except thatthe r* primitives have outputs which always reduce the strength applied at their in-puts. Only small-capacitor or highz strengths are not reduced. See Week 4 Class1 for a table of verilog strengths. The rnmos and rpmos strength-changing rulesare given in section 10.2.4 of Thomas and Moorby (2002), or in IEEE Std 1364,section 7.12, as follows:

Resistive MOS Strength RulesStrength Strength Keyword

In Out In Out

supply pull supply0/1 pull0/1strong pull strong0/1 pull0/1pull weak pull0/1 weak0/1large cap medium cap large0/1 medium0/1weak medium cap weak0/1 medium0/1medium cap small cap medium0/1 small0/1small cap small cap small0/1 small0/1highz highz highz0/1 highz0/1

CMOS Switches. The name stands for Complementary Metal-Oxide-Silicon.These devices are composed functionally of paired nmos and pmos transistors, justas are actual CMOS gates fabricated on a chip. A cmos switch even has two controlinputs, just as a CMOS transistor on a chip would have two gates. Of course, rcmosswitches are composed functionally of paired rnmos and rpmos switches.


Verilog assumes a positive-logic regime (‘1’ = higher voltage; ‘0’ = lower), soan N device turns on when its gate is at logic ‘1’; a P turns on when its gate is atlogic ‘0’.

There are limits to the accuracy of a digital simulator at this level. For example,a cmos switch can be imagined to represent two MOS transistors in parallel. Thisdoes a fine job as a switch level model of a bufif1, as shown in Fig. 14.2.

Fig. 14.2 CMOS bufif1modelled as nmos and pmosin parallel

But, in verilog, the nmos and pmos primitives pull up their outputs (to ‘1’) withthe same strength as they pull down (to ‘0’); actual P vs. N devices differ in strengthat the two logic levels (N pulls low stronger than high, and vice-versa for P).

Section 10.2.1 of Thomas and Moorby (2002) gives a nice switch-level model ofa static RAM cell, and there is a model of a shift register in section 10.1, but there issome question as to what the purpose might be of such models, when a device of anycomparable size can be modelled easily in SPICE, with extremely high accuracy.

Even so, CMOS devices are much closer to being symmetrical than individualPMOS or NMOS devices; for this reason, they are preferred in modern designs.

In this vein, consider the two different arrangements of transistor elements whichcould implement an inverting buffer (verilog not) on a chip. Switch-level mod-elling does a fine job. By tying a pmos input to supply1, and an nmos input tosupply0, it is possible to invert the control input, which is just what an invertingbuffer does at the switch level.

Fig. 14.3 CMOS not gatemodelled as nmos and pmosin series


Looking at the CMOS not gate model in Fig. 14.3, one might think that byreversing the N and P devices, the gate would become a noninverting buffer. Thiscould be done, and the logic indeed would be noninverting; however, a P deviceoperating near supply0 is extremely slow and inefficient; likewise an N deviceoperating near supply1. Thus, for these technical reasons, simple noninvertingbuffers never are used in CMOS technology. If one requires a noninverting bufferin CMOS, the usual solution is to drive the input of a big inverting buffer with theoutput of a small one, the two inversions yielding a net noninversion.

The verilog cmos primitive approximates reality better than either pmos ornmos, in that a cmos also does not reduce the strength of an input (except thatsource becomes strong). Because a cmos has two control inputs, it has a dif-ferent port definition than a bufifx. The ports declared for a (r)cmos are asfollows:

(r)cmos optional inst name ( out, in, N ctl, P ctl)

In the event that one control is off and the other on, a cmos still will turn on;so, the second control presumably is intended to model requirements for on-chipconnectivity, rather than to model functionality.

Bidirectional Pass Switches. These primitives model charge-transfer gates (passtransistors) and thus are called tran (Fig. 14.4), tranif1, and tranif0, withthe corresponding high-impedance rtran, rtranif1, and rtranif0. They arenot allowed to be assigned delays. Unlike trireg primitives, they don’t store state.They merely pass a logic level applied at one terminal to the other terminal. Theyare used to model nets of transistors connected in arbitrary ways.

Fig. 14.4 Schematic symbolof a tran switch

Instantiation of a tran transfer gate follows the same format as buf, but withexactly two ports, both inout. A tranifx has two inout ports and a third,control port; thus, it is analogous to a bufifx, except that it is bidirectional. Moreinformation on applications of transfer gates may be found in the layout-relatedReferences and in the Additional Study suggestions below.

Source Switches. There are two of them, pullup and pulldown. Each ac-cepts just one output net as argument. The pullup drives its net high with constantsource1 strength; the pulldown drives it with source0. Whereas the tran-sistors being modelled in verilog at switch level usually are assumed to operatein enhancement mode (normally off), these would be assumed to be large tran-sistors in depletion mode (normally on), or direct connections to an IC power orground rail.


14.1.5 Switch-Level Net: The Trireg

The functionality of the trireg net type is described in IEEE Std 1364,sections 7.13.2 and 7.14. In a word, a trireg, like a capacitor modelled as a time-limited reg, just stores a logic state.

A trireg is unique in that it makes special use of the small, medium, andlarge strength values. For a trireg, these strengths are meant to represent thesize of a capacitor which stores charge whenever the drive to the trireg entersthe high-impedance (‘z’) state. However, a trireg driven at ‘z’ never enters the‘z’ state; instead, it holds its last non-z state. If and only if the trireg has beenassigned a delay, this last non-z state decays from ‘1’ or ‘0’ to ‘x’ as soon as thedelay lapses.

The strength of a trireg is used to determine which of several triregsin contention will determine the delay to ‘x’; this is the only use of the chargestrengths, small, medium, and large. When strengths are equal, the rule ofpessimistic simulation is used to determine trireg decay to ‘x’ just as it is todetermine the result of other timing conflicts.

A delay value may be assigned to a trireg net when the net is declared. Thedelay format is different in one way from that of a three-state component: Whenthree delay values are given, the first two refer to rise and fall, as for a three-statecomponent; however, the third value refers to time to ‘x’, not time to ‘z’. A triregcan not enter a ‘z’ state, although it could be initialized to one. A trireg with adelay value decays to ‘x’ after the given delay as soon as the net’s last driver hasturned off (to ‘z’).

If a trireg net has no delay associated with it, it continues forever to drive itsoutput(s) at the strength declared for it, at the logic level with which it last was beingdriven (‘1’, ‘0’, or ‘x’). Example of a trireg declaration:

trireg (medium) #(3, 3, 10) medCap1;

Connectivity of trireg nets may be established by continuous assignment, or byport-mapping to switch-level component instances.

Here is an example of the use of trireg nets:

pullup(vdd);trireg (small) #(3,3,10) TriS;trireg (medium)#(6,7,30) TriM;trireg (large) #(15,16,50) TriL;// Pass transistor network:tran (TriM, TriS); // left always wins.tran (TriM, TriL); // right always wins.// NMOS network:rnmos #1 (TriM, TriS, vdd); // input has no effect.rnmos #1 (TriS, TriL, vdd); // input controls output.rnmos #1 (TriM, TriL, vdd); // Contention on output.


14.2 Component Lab 17

Do the work for this lab in the Lab17 directory.

Lab Procedure

Keep in mind that many VLSI simulators will not simulate switch-level verilogcorrectly; Silos generally will work, except for resistive primitives.

Step 1. Combinational UDP. Design a UDP in a module named AndOr2Not4which evaluates this logic function: X = ( ∼a | ∼b ) & ( ∼c | ∼d ), inwhich a - d are input names, and X is the output name.

SSI discrete component databooks or large ASIC libraries might include such adevice.

Suggestion: Include a verilog comment line naming the table columns to helpreduce entry errors.

Instantiate your component in a testbench module and simulate it to verify func-tionality.

Step 2. Sequential UDP. Design a UDP named AndLatch which functions asa simple transparent latch but has two data inputs anded together before beinglatched.

Instantiate this latch in a module and simulate it to verify functionality.

Step 3. Switch-level model of an inverter. Create a module named Notting-ham. Give it one 1-bit input and two 1-bit outputs. Combine a pmos and nmosprimitive as described in the presentation above (Fig. 14.3) to model a not. Thisamounts to a CMOS inverter. Drive one of the two outputs with thiscomposed not.

Instantiate a verilog not gate and use it to drive the other Nottingham output.This will allow you to compare a verilog not with your nmos-pmos not in asimulation, both driven by the same input.

You could test such a design by feeding both outputs to the inputs of an xorgate: A ‘1’ on the xor output would indicate that the two gates were not identicallogically.

Step 4. The cmos control inputs. Add a third output to the module in the previousStep and connect a cmos to drive it from the module input. Declare local controlnets routed through new I/O’s from your testbench.

Simulate to fill in the cmos truth table below. Can you predict what the outputwill be under each of these conditions? (“unconn” = ‘z’ state)

14.2 Component Lab 17 255

cmos Truth Table

out in n-ctl p-ctl

1 1 11 1 01 0 01 0 10 1 10 1 00 0 00 0 11 x x1 x 01 x 11 1 unconn1 unconn 0

Step 5. Pass transistor mux model. The easiest demonstration of pass transistorlogic is a multiplexer: The select input is decoded to turn on just one pass tran-sistor; the turned-on logic level then is transferred to the output, where it wins thecontention against the other output(s), which must be in ‘z’ state.

For a 2-input mux, all it takes is one tranif1 and one tranif0 with outputstied together, and a one-bit select input to both control inputs.

Fig. 14.5 Two-input muxmodelled by tranif’s

The relevant verilog for the 2-input mux design shown in Fig. 14.5 would looksomething like this:

tranif0 UpperTran(Out, In1, Sel);tranif1 LowerTran(Out, In2, Sel);

For this exercise, create a module named TranMux4 and design a 4-input muxfor it, using nothing but pass transistors and nets. Simulate to verify your design.

After verifying your design, replace the tranifx switches with rtranifxswitches and simulate again. (This may not work in Silos).


Step 6. nand and nor gates. Create a module named Nand Nor. Give itthree inputs and two outputs. The outputs should be named Nand and Nor. Usethe schematic of Fig. 14.6 to enter a switch-level model of a 3-input nand gate anda 3-input nor gate. Simulate to verify your design (see Fig. 14.7).

Fig. 14.6 A CMOS 3-input nand gate and a 3-input nor gate

After this, add a CMOS inverter on each output to change the functions to andand or.

Fig. 14.7 A Nand Nor switch-level simulation

14.2 Component Lab 17 257

Step 7. Trireg pulse filter. Create a new module RCnet and use triregswitches as capacitors to implement the RC network shown in Fig. 14.8.

Fig. 14.8 RC network for digital simulation

Use an enabled rnmos to represent each resistor shown, in on left; out on right.A large trireg gets a #(15,15,50) delay, a medium #(7,7,30) and asmall #(3,3,10). Simulate an input pulse to see (Fig. 14.9) how well thesedigital switches approximate analogue functionality.

Fig. 14.9 Simulation of a delay line using enabled rnmos resistor models

Replace each rnmos with an rtran to specify delays but with less interestingstrength (Fig. 14.10).

Fig. 14.10 Simulation of a delay line using enabled rtran resistor models


What do the three delay values mean when specifying delay on the output of atrireg?

Can an rtran be used to simulate a delay?



Read Thomas and Moorby (2002) section 6.5.2 on delay conflicts and pessimism.Read Thomas and Moorby (2002) sections 6.5.3 and 6.5.4 on time units and

timing triplets.(Re)read Thomas and Moorby (2002) chapter 10 on switch-level modelling.

However, ignore the “minisimulation” code.(Optional) Read Thomas and Moorby (2002) chapter 9 on UDP’s.


Read chapter 5.2 on the basics of gate-level delays.Read chapter 11 on switch-level modelling.Read chapter 12 on UDP’s. Notice especially section 12.5 which summarizes

features of UDP’s as contrasted with modules. Do section 12.7, problem 1, a 2-1mux. Compare your result with the solution on the Palnitkar CD.

Study the section 11.2.3 model of a switch-level latch or flip-flop. There is amodel of a flip-flop named cff.v on the Palnitkar CD; simulate it to see how itworks.


15.1 Parameter Types and Module Connection

15.1.1 Summary of Parameter Characteristics

• Parameters are unsigned integer constants by default.

parameter Name = value;

• May be declared signed (but many tools reject it).

parameter signed Name = - value;

• May be declared real (but not synthesizable and often rejected).

parameter real Name = float value;

• May be typed by vector index range.

parameter[6:0] Name = 7 bits of value;

• Declaration allowed anywhere in module, but localparam preferred in body.

localparam Name = value;

• Must be assigned when declared. Width defaults to enough for the value assigned.

15.1.2 ANSI Header Declaration Format

module ModuleName #(parameter Name1 = value1,...)( port decs);



In the module header, we have advocated only the ANSI declaration of parame-ters, with pass of value by name. For example,

// Declaration:module ALU #(parameter DataHiBit=31, OutDelay=5, RegDelay=6)

(output[DataHiBit:0] OutBus, ...);...

// Instantiation:ALU #(.DataHiBit(63), .RegDelay(7)) // OutDelay gets the default.

ALU1 (.OutBus(ResultWire), ...);...

15.1.3 Traditional Header Declaration Format

module ModuleName (PortName1, PortName2, OutPortName1, ...);parameter Name1 = value1; ...direction[Name1:0] PortName1; // direction = output, input, inout.direction[range] PortName2; ...reg[range] OutPortName1; ... // output assigned procedurally....

The traditional module header ends with the first semicolon, just as does theANSI header. However, the rest of the traditional header does not end at any well-defined place in the module. This makes specification of the module interfaceerror-prone and sometimes ambiguous.

15.1.4 Instantiation Formats

ANSI and traditional instantiation formats are identical.

15.1.4.1 Instance Parameter Override By Name

ModuleName #(. ParamName1( value1),. ParamName2( value2)...)

moduleInstName(. PortName1( NetName1), ...);

15.1.4.2 Instance Parameter Override By Position

ModuleName #( value1, value2, ...)

moduleInstName(. PortName1( NetName1), ...);

15.1 Parameter Types and Module Connection 261

Override by position is not a recommended practice.By default, a parameter is like an unsigned constant reg. When such a parameter

is assigned to a variable, or used in an expression, it takes on the width and type ofthe destination. However, an index range may be specified when the parameter isdeclared, making it an object of a certain width. For example, parameter[7:0]CountInit = 8’hff; declares a specific width, keeping the designer aware ofwhat happens when the parameter is assigned to a variable of a given width.

Also, a parameter, like a variable, may be declared signed; if so, arithmeticinvolving it may become signed arithmetic, if the other operand(s) also are signedtypes. A width or signedness modifier in a module-header parameter declaration cannot be overridden, although the value may be changed in an instantiation.

Examples:

// Note: 377 = 12’h179; -377 = 12’he87.parameter signed[31:0] mul coeff = -120�Pi; // Gets -376.9911 = -377.//// If the next were overridden by -120�Pi, it would get 32’hffff fe87:parameter signed[31:0] div coeff = 32’h0000 0179;

Overriding a header-assigned default during instantiation is the only recom-mended way of changing the declared value of a parameter.

Note: The Silos demo simulator which came with Thomas and Moorby (2002)or Palnitker (2003) may not recognize signed parameters.

15.1.5 ANSI Port and Parameter Options

Using the ANSI format, it is legal to pass values by position or by name, but never bya mixture of both. When passing by position (again, not recommended), all valuesto the left of each position must be provided. Example (compare above):

module ALU #(parameter DataHiBit=31, OutDelay=5, RegDelay=6)(output[DataHiBit:0] OutBus, ...);

...ALU #(31,5,8) // Must supply first two to change third one.

ALU1 (.OutBus(ResultWire), ...);...ALU #(.DataHiBit(31),5,8) // Illegal.

ALU2 (.OutBus(ResultWire[63:32]), ...);

15.1.6 Traditional Module Header Format and Options

We briefly review here the traditional, verilog-1995 module header format. Thisformat is based on the old, pre-ANSI C language function header format (“K&R” C),


invented by software pioneers Kernighan and Ritchie. While obsolescent, it stillis generated by many automation tools such as netlist writers or file convert-ers. The format is essential to understand because of these tools; manual editingof a verilog netlist may be required to obtain a design which can be fabricatedcorrectly.

The header declaration lists the names, only, of the I/O’s, if any. Immediatelyfollowing the header, the parameters are declared, and the directions and widthsof the I/O’s are specified. After that, the types of the I/O’s are specified; primar-ily, this means that outputs not driven internally by wires are assigned to regtype. So, just as in ANSI format, parameter values may be used to assign widthsto I/O’s.

Example of traditional declaration:

module ALU (OutBus, InBus, Clock); // Parens & contents optional.parameter DataHiBit=31, OutDelay=5, RegDelay=6;output[DataHiBit:0] OutBus;input ...reg[DataHiBit:0] OutBus;...

The more modern ANSI format avoids redundant name assignments and reducesthe possibility of an entry error, so it should be preferred.

Override in instantiation is done the same way whether the header has been writ-ten in ANSI or traditional format.

15.1.7 Defparam

There is a defparam construct in verilog which permits override of any param-eter value in the design by any module, however distant or unrelated. The syntaxis just,

defparam hierarchical path to parameter = new value;

This is a very risky construct, and it may be removed from the verilog languagestandard at a later date. It makes no sense within a given module, because withinany one module, the parameter itself can be assigned just as easily (and under thesame allowed circumstances) as it can be defparam’ed.

The defparam is the main reason for the existence of localparam’s: Alocalparam is identical to a parameter, except that it can not be overridden bydefparam. A localparam is not allowed in an ANSI module header – becauseit also can’t be overridden there.

It is recommended never to use defparam in a design. Like a goto, or likehierarchical references in general, it tends to introduce more of defects than it doesof functionality.

15.2 Connection Lab 18 263

15.2 Connection Lab 18

Do this work in the Lab18 directory

Lab Procedure

Step 1. Traditional port mapping. Type in and rewrite the following module asnonANSItop with its header in traditional port mapping format:

module ANSItop #(parameter A=1, B=3, parameter signed[4:1] List=4’b1010)(output[3:0] BusOut, output ClockOut, input[3:0] BusIn, input ClockIn, input[1:0] Select);

reg ClockOutReg;assign #(2,3) ClockOut = ClockOutReg;...endmodule

Fill in some sort of module functionality – anything you want.At the end of this Step, you should have two modules, in files ANSItop.v and

nonANSItop.v, which will compile correctly in your simulator.

Step 2. Parameter overrides. Create a new verilog file named ParamOver.v andinstantiate both ANSItop and nonANSItop from Step 1 twice each in a newmodule ParamOver (you may use ParamOver for all of Steps 2 – 4). For eachof ANSItop and nonANSItop, override once by position and once by name asfollows: Override B to be 20 and List to be -2.

Fig. 15.1 ParamOver hierarchy views: Branch panes on right are for ANSI01 (left window) andNANSI01 (right)


View the hierarchy in the simulator (see Fig. 15.1). QuestaSim may be invokedfor a look at the parameter values, because VCS doesn’t display the compiled val-ues of parameters. For best results, assign each parameter value to a 64-bit regin an initial block in each module and use the simulator to display the regvalue at time 0. With a width of 64 bits, the entire parameter value should be easilyunderstood – aligned, of course, on the LSB of the reg. A reg used only in aninitial block isn’t doing anything in the design, so it will be removed duringsynthesis.

Side question: What was the default value of List in Step 1, expressed as asigned decimal number?

Step 3. Parameter width override. Instantiate ANSItop, overriding A by positionto be 8’hab. Check the result by assigning the value to a 64 bit reg in an initialblock and using the simulator.

Step 4. Parameter type override. Instantiate ANSItop again, this time declaringA and B signed, but otherwise with the same default values.

Override A by name to be 8’hab and B to be -120∗π (don’t bother to write outπ exactly). See Fig. 15.2.

Check the result by assigning the value to a 64 bit reg in an initial block andusing the simulator. Change the display formats to 2’s complement to see the signedinteger values.

Fig. 15.2 Parameter real-valued expression in 2’s complement wave display format


Step 5. ‘define and ordering problems. The text of a skeleton HierDefinedesign is located in the HierDefine subdirectory of Lab18. Change to theHierDefine subdirectory to find the files, which are named HierDefine.v,Level2.v, Level3.v, and Level4.v. The bidirectional data bus is configuredat each level so that its width is halved each time and it is distributed differentlyto each instance. This might be a datapath design, for example part of a Fouriertransformer.

A block diagram is given in Fig. 15.3 (see also the next Step):

Fig. 15.3 Block diagram ofthe HierDefine instancehierarchy

A. Set up a .vcs compilation file in correct order which would allow VCSto compile this design for simulation. The files contain “...”, indicating omittedfunctionality; you will have to comment these to compile the (nonfunctional) design(Fig. 15.4).

What happens if the order of two entries is reversed in the .vcs compilationfile? What if Level2 included a duplicated set of macro definitions, ‘defineWid 16 and ‘define ResWid 16?

B. Suppose the design would work in different contexts if we doubled or halvedall bus widths; how should HierDefine be modified, still using ‘define, tomake such reuse convenient?


Fig. 15.4 VCS display of the HierDefine hierarchy

Step 6. Parameter passing in depth. Assume a design HierParam with 4 levelsof hierarchy, an exact copy of the HierDefine of the previous Step, except thatparameters have been declared and there are no ‘define assignments.

The width of DataB must be halved at each level of hierarchy, as implied by thestructural fanout at each level. This is shown in Fig. 15.5.

Fig. 15.5 Block diagram of the HierParam instance hierarchy


All the module declarations have been collected into one file, HierParam.v inthe Lab18 directory, to make the problem more easily visible.

Fig. 15.6 VCS display of the HierParam hierarchy

Modify HierParam.v, overriding the parameter default assignments, so thatthe value of the width of the DataB bus is changed consistently in every moduleby assigning just one parameter value at the top level. Don’t change the defaultassignments declared in the file. Check your answer by compiling in a simulatorand looking at the hierarchy (compare Fig. 15.6).

15.2.1 Connection Lab Postmortem

What if a designer wants to assign delays and pass parameters to the same instance?


15.3 Hierarchical Names and Design Partitions

15.3.1 Hierarchical Name References

We have studied hierarchy in several contexts, mainly module instance hierarchyand hierarchical name references in generated instances.

The hierarchy in verilog begins where the compiler (simulator or synthesizer)sees it, whether we are preparing for simulation or synthesis. If a submodule isbeing designed and tested, the names in it will be rooted at the top submodule;when this submodule is linked in to the larger design, and that design is loaded intothe simulator or synthesizer, its name references will be rooted elsewhere.

For this reason, hierarchical references (A.B.C . . . etc.) should be restricted tothe current module (as in a generate statement – recall Week 6 Class 2), or tothe submodules of the module in which they appear. A hierarchical reference fromthe current module downward, toward the leaf cells of the design, can be controlledby the designer. This direction of reference is reasonably safe, because the cur-rent module depends on its instances for its functionality; changing the leaf cell(instantiation) structure means a redesign; in this redesign, the hierarchical namesmay be modified as necessary.

On the other hand, upward references easily are broken beyond repair. Any mod-ule has some functionality as such; and, so, it may be reused not only in its specifiedlocation (if any is specified) but also elsewhere. If reused this way, it is likely thatthe new instantiating module will differ from the previous one, probably invalidatinghierarchical upward name references. The extreme case would be a verilog modelof a library component; it should be designed for use anywhere; so, any hierarchicalupward reference in it would render it nonfunctional.

15.3.2 Scope of Declarations

Here’s a review of the scope of identifiers of all sorts. We introduced some of theseconcepts in Week 6 Class 1 & 2.

By scope, or name scope, is meant that region, in the verilog, of visibility. Vis-ibility implies a name conflict when the visible object’s identifier is redeclaredor redefined. Objects of wider scope conceal the names of those of narrowerscope. For example, two different modules (wide scope) each may contain a dif-ferent net named Clock: The names of nets have narrower scope than namesof modules, so the net names declared in one module are not visible in anothermodule.

Inside a procedural block (including inside a task or function declaration), aname must be declared before its first use (“before” means above the use in the file).However, in a module, a named block declaration will be found by the compiler ifit is anywhere in the module.

15.3 Hierarchical Names and Design Partitions 269

Here is a listing of name scopes which are reasonably distinguishable:

• ‘define macro identifiers: No real scope limitation or hierarchy; definedeverywhere after the compiler encounters them, in compilation order (which maybe unrelated to design structure).

• module names: Global scope, shared with UDP’s (primitives) only. Onelevel of name scope; no hierarchy.

• module instance names. These may be extended to any number of levels of hi-erarchy. The top instance must be in a module; but, with this exception, nothingbut instances may exist in a module instance hierarchy.

• concurrent block names: These include any named initial or always blockswithin a module. The specify blocks to be covered later also fall here. Concur-rent blocks are allowed only one level of name scope, which means no hierarchy.However, generate blocks are in this scope and may contain a generatedhierarchy after unrolling; such a hierarchy may include named always andinitial blocks.

• procedural block names: These include tasks and functions, as well as anynamed procedural begin-end block, such as one in a procedural for loop.These objects may be mixed with one another and extended to any number oflevels of hierarchy.

• element names: Variables (reg, wire, etc.) and constants (parameters,localparams, specparams, etc.). These exist within a scope, but they de-fine no scope of themselves.

Other language constructs such as expressions or literals (including delay values)have no name; and, so, scope is irrelevant to them

We shall look briefly at verilog config blocks later. A config is assigned anidentifier (name) and specifies a collection of design objects, but it has no designfunctionality. A config may be located anywhere a module may be located. Aconfig is more like a makefile, or a system disc file or directory, than a namedblock with a meaningful relationship to a hierarchy. The names in a config gen-erally refer to objects in libraries external to the design.

15.3.3 Design Partitioning

Partitioning of a design may be done to reduce the complexity of the task of anyone design engineer, to allow concurrency in the design work to speed the design tocompletion, or to achieve some design-tool related goal.

The most important single consideration in partitioning is that of the interfacesbetween the parts. Each different part has to have stand-alone, specific functionalitydifferentiating it from all the others. When a part is to be used in several differentplaces in a design, all uses have to be taken into consideration when writing designspecifications for that part.

The interfaces have to be logical and intuitive, so that different designers, or thesame designers at different times, recognize and easily understand the objectives


of the partitioning. The interfaces also have to be well thought-out so that the netscrossing partition boundaries have completely specified types, widths, timing andsequential protocol. Any change in an interface must be considered in a systemcontext and should be made with great reluctance, because it may imply redesign ofseveral or all parts involved.

An elaboration of the verilog language, System Verilog, has a construct called aninterface as one of its features. This construct essentially predefines module headersand adds design I/O functionality; it may be applied unmodified to several modules,thus guaranteeing I/O consistency among them.

In verilog (or System Verilog), partitioning always should be done on moduleboundaries. A module is the smallest verilog design unit which can be compiledseparately. If a partition has to be redrawn, new modules separating the differentparts should be declared, so that the parts might be compiled and tested separately.This allows the designer working on one partition to write a behavioral model ortestbench representing the other(s), permitting effective debugging independent ofother parts of the design.

Some rules of thumb:

Clock domains. Partitioning should be done to separate clock domains. Partswith different clock rates or sleep-mode states should be designed separately fromone another. Synchronizers (see below) should be used wherever a foreign clock en-ters a new domain; this makes module inputs the logical place to insert such devices.

Voltage islands. Parts operating at different supply voltages should be separated;the gates generally have to come from different libraries and often different vendingorganizations. It is true that some CMOS libraries can be used in a range of voltages,for example 3 V to 5 V, but the power consumption or speed will be optimal only inone narrow voltage range. Separation also allows intelligent and efficient selectionof voltage level-shifters.

Level-shifters usually are located in their own partition, or in the voltage partitionof the shifted levels.

Output latches. Each part of any substantial complexity should have clocked,latched outputs, making its internal timing separate from, and independent of, thatof any other part.

Control timing. When different parts pass control signals to one another suchthat the sequence of these controls is significant (for example, wait states on a readfrom a memory), the sequence may have to be enforced by latches on inputs oroutputs, or by clocking on different cycles or opposite edges.

IP reuse. Large commercial IP (“Intellectual Property”) blocks may be pur-chased separately and make up a large fraction of any modern VLSI IC. Micro-controller cores and memories are typical examples. Each of these blocks should beallocated its own partition in the design.

Test considerations. The partitions usually should allow for internal scan in-sertion and possibly boundary scan; in any case, parts should allow observabilityof crucial intermediate results. Latches on outputs make ideal components to be re-placed by scan components with little or no performance decrement; this means thatthe same simulation test vectors may be used before and after scan insertion.

15.3 Hierarchical Names and Design Partitions 271

Synthesis considerations. In general, logic optimization will work best onblocks of random logic in a certain size range, somewhere between about 300 to50,000 transistors. This roughly would be about the same as 30 to 5000 instances,depending on the library, or 75 to 15,000 gate-equivalents. A block too large maytake too long to optimize, lengthening the debug cycle; a block too small will notgive the optimizer enough to work on, so that the designer’s original input will notbe improved. Partitioning should be planned to combine or split such logic, with thesynthesizer’s capabilities an important factor in the decision.

It’s probably best for the designer to assume reliance on automation in the firstapproximation – the initial cycle of write, debug, and synthesize. After seeing thefirst working area and timing result, automated tuning followed by manual con-straint tweaking would be a typical progression. After fully constrained synthe-sis, manual edits occasionally may be necessary. Manual editing of a synthesizednetlist can produce results better than anything the tool can accomplish; but, on theother hand, optimization by the tool may be obstructed by too many “don’t touch”,manually-inserted structures.

A synthesizer usually allows for optional shifting of gates across sequential logicboundaries, thus combining random logic across the boundary and improving op-portunities for optimization. In a flattened netlist, this kind of retiming may improvecertain module instances more than others, providing refinements not available ina partitioning context. Except within a single module, this feature should be usedonly in the final design stages, because breach of the design partitioning is irre-versible, and localization of defects or manual corrections may become very diffi-cult after logic has been shifted in or out of what originally were separate verilogpartitions.

However, keep in mind that a back-end tool such as a floorplanner can be made toreconstruct in the physical layout the original verilog source hierarchy, even from aflattened netlist. This generally is possible because the naming convention imposedduring uniquification and flattening by the synthesizer or optimizer is consistentwith the original verilog module names.

15.3.4 Synchronization Across Clock Domains

A clock domain is defined by a clock which runs independently (by a differentoscillator) of other clocks. A derived or generated clock created by PLL or frequencydivider may be said to be in a different domain from the original clock, but we donot use this definition in the present lecture. The reason is that derived or generatedclocks are phase-locked to their original clock, and problems of synchronization arelimited merely to considerations of skew and jitter.

When data have to be transferred from one clock domain to another, problemsarise which do not occur in a single-clock synchronous design. Specifically, whentwo clocks run at different rates, and are not derived from one another, any clockstate can be simultaneous with any other at a given component input. This meansthat data (or control) from one clock domain can be sampled in an intermediate logic


state, far enough away from a ‘1’ or ‘0’ that the gate’s response (slew rate) may bevery slow. Actually, intermediate-state sampling in general would be expected tooccur frequently, and at random, given typical GHz clock rates in modern designs.

A simple mechanical analogy may help conceptualization of this problem: If afencing foil is dropped 1 m at a random, almost vertical angle onto a hardenedsteel floor, there is almost no chance it will happen to balance upright on its tipfor one full second before falling over. However, if such a drop was repeated abillion times, by simple chance several of the foils will have balanced upright forone second. Repeated enough of times, a foil or two will be found to have balancedfor 10 seconds or even for a minute.

On a fine enough time scale, an input always can be sampled so that it leaves theoutput undetermined, no matter how long the wait. See Fig. 15.7 for an illustrationof oscilloscope-like, progressively increasing, time resolution.

Fig. 15.7 Each V spans a range such that the output will go to ‘1’ at the top and ‘0’ at the bottom.Figure not to scale. If ∆t0 = 100∆t1 and ∆t1 = 100∆t2, then V1 ∼= V0/100; V2 ∼= V1/100, whichimplies approximately that the probability p that the gate output will be intermediate during therespective ∆t will be p1 = 100p0 and p2 = 100p1

So, from time to time, probably many times per second, a gate in one domainwill sample an input voltage so exactly centered in its input range, that the gate willbe almost perfectly balanced and can not switch to propagate either a ‘1’ or a ‘0’before the next clock cycle in its own domain. As time passes after the balancedsampling, the gate eventually will switch one way or the other, but this may be toolate for the correct logic level to be propagated.

This problem is overcome by latching inputs in the receiving clock domain. Asshown in Fig. 15.8, an output flip-flop or latch is triggered by the sending clock,the data are latched, and the latched logic level then is sampled in a synchronizingflip-flop on a receiving-clock edge. However, as explained above, there is a remotechance that the synchronizing flip-flop will not have settled before its value is prop-agated, perhaps ambiguously, into the receiving logic. This chance is eliminated byusing two synchronizing flip-flops in place of one, creating a little two-stage shiftregister.

15.4 Hierarchy Lab 19 273

The input balance point of the second flip-flop almost certainly will not be atthe same voltage as the balanced output of the first. Then, if balanced, the old datain the first flip-flop has almost an entire clock cycle to settle before being sampledby the second one, which is in a well-defined state anyway. The probability thatthe second flip-flop will be balanced on the balanced output of the first one is re-mote enough to be ignored or to be compensated, if necessary, by occasional ECCoperations.

Fig. 15.8 Synchronization across two clock domains. Clocks A and B run at different and indepen-dent frequencies. Domain B must be prevented from clocking indeterminate data values from A

15.4 Hierarchy Lab 19


Lab Procedure

Step 1. Create a file named MiscModules.v and type in the following emptymodules (headers only):

module Wide(output[95:0] OutWide, input[71:0] InWide);endmodulemodule Narrow(output[1:0] OutNarrow, input[1:0] InNarrow);endmodulemodule Bit(output Out, input In);endmodule

Step 2. Connecting a 1-bit signal to a 2-bit port. Using the file in Step 1, instantiateBit in Narrow, and connect it to the Narrow I/O’s:

module Narrow(output[1:0] OutNarrow, input[1:0] InNarrow);Bit Bit1( .Out(OutNarrow), .In(InNarrow) );endmodule


Do not use a part-select. What do you think will happen? Only one bit in theNarrow busses can be connected: Which one?

Test your assumptions by implementing something trivial in Bit, for example,

assign Out = 1’b1;

Simulate MiscModules to see which bit is assigned.

Step 3. A hierarchy paradox. Without changing anything else in Step 2, instantiateNarrow in Bit; leave Narrow with no port map (I/O connection). Now you havetwo modules, each instantiated in the other. Do you think this is legal? Check it bycompiling MiscModules for simulation.

Step 4. Connecting a 2-bit signal to a 1-bit port. Comment out the Bit instan-tiation in Narrow from Step 2, leave Narrow instantiated in Bit, and connectthe Narrow ports analogously to the opposite connections in Step 2 (with no part-select or bit-select). Again, try to predict which bit will drive the output. Add asimple statement to Narrow, and simulate to test this.

Step 5. Repeat Step 4, using Narrow and Wide instead of Bit and Narrow.You should be able to predict the results. Optional: Simulate to see this result.

Step 6. Clock synchronization simulation. Set up a testbench module namedClockDomains which has two clock generators, one with half-period of 2.011 ns(the slow clock), and the other with half-period of 1.301 ns (the fast clock). To re-solve these periods in the simulator, set ‘timescale in the testbench to 1ns/1ps.

As shown in the schematic of Fig. 15.9, write a model of a 3-bit RTL up-counterin a module, Counter3. Be sure to include a reset, and assign all three bits tothe Counter3 outputs. For this exercise, make the counter count clock edges,not cycles, using always@(Clk) rather than @(posedge Clk). Instantiate thismodel twice in the testbench module. One instance will represent logic in the fastclock-domain, and the other, logic in the slower clock domain.

Fig. 15.9 Schematic representation of ClockDomains. Two uncorrelated clock domains arecombined raw in UnSyncAnd. A synchronizing latch on SyncAnd makes slow-domain datamore predictable when sampled in the faster domain. Delays are approximate


Clock one counter instance with the fast clock, and the other with the slowclock. Use a continuous assignment and-reduction operator to assign the and ofthe three counter bits of each counter to its own net variable. Attach a small de-lay to these two continuous assignment statements, maybe 200 ps or so. Thesetwo nets are to be compared in the fast clock domain; when both are ‘1’, thefast domain will have to do something (we leave the operation undefined in thisexample).A. To see the raw, unsynchronized coincidence of these ands, just and them bothin another continuous assignment onto a net called UnSyncAnd. Attach anothervery small delay to this and statement, perhaps 1/5 of the other delay. Use decimalfractions (1/5.0 rather than 1/5) to force evaluation as reals and not integers.Simulate your model to verify it. You should see the UnSyncAnd occasionallygoing to ‘1’ or glitching high. The different positive pulses will vary considerablyin width because of the incompatibility of the clock frequencies (both half-periodvalues are prime numbers).B. To add a synchronizing latch (=flipflop) in the fast domain, use a simple RTLstatement to sample the 3-input and from the slow clock domain when the lo-cal (fast) clock is high. Give this component a slightly shorter delay than that ofthe ands.

We don’t require a synchronizer for the fast clock and, but we do require a latchto hold its value; so, also write an RTL latch just as above for the fast-clock andvalue:

localparam AndDelay = 0.200; // 200 ps 3-input and gate delay.localparam LatchLagDelay = AndDelay/1.5;...reg HoldSlowAnd; // The synchronizing latch storage.always@(posedge FastClock)if (FastClock==1’b1)#LatchLagDelay HoldSlowAnd = SlowAnd; // Sample the slow-domain and.

...

Then, complete the exercise by anding both latched 3-input statements, one fromeach clock domain:

assign #(AndDelay/5.0) SyncAnd = HoldFastAnd & HoldSlowAnd;

Compare SyncAnd with UnSyncAnd in a simulation (see Figs. 15.10 and15.11). You will find that the SyncAnd data are much cleaner and are glitch-free. Latching the fast side of the UnSyncAnd input makes little difference in theglitches and irregular pulses: Garbage in; garbage out.


Fig. 15.10 The ClockDomains simulation, comparing the two ands

Fig. 15.11 Zoom in on ClockDomains narrow pulse

Given that the clock frequencies are fixed, tuning the delays can introduce oreliminate potential glitches in the synchronized data. Also, some glitch-like pulsesin the UnSyncAnd data can be admitted in the synchronized data by shifting ortuning the sampling windows. These are arbitrary issues when dealing with disjointclocks, because synchrony is essentially arbitrary when the clocks are independent.It is the interface that defines synchrony, not the clock domains taken either sepa-rately or jointly.


We have recommended keeping delays out of procedural blocks for synthesis rea-sons and, instead, putting the resultant delays in continuous assignments to moduleoutput or inout ports. How does this affect the partitioning practice of latchingall data outputs?



Review parameters, hierarchical names, and connection rules in Thomas and Moorby(2002) sections 3.6 and 5.1–5.2.

Palnitkar uses traditional module header declarations extensively; Thomas andMoorby (2002) describes the differences briefly in Preface pages xvii–xix.


16.1 Verilog Configurations

The majority of designs begin with C models or behavioral (bus-transfer) modelsand, after verification of the partitioning and interfaces, proceed to further detail atRTL or gate level. This means that a module’s implementation may change radicallyduring the design process. Obviously, somewhere, there has to be a way of obtaininga complete list of the files comprising a design, or the design could not be usedfor anything. To substitute different versions of a module, designers have relied onmakefiles or shell scripts. As an alternative to depending on the filing system or othernonverilog functionality, here we introduce a way of managing design versions andformats within the verilog language.

16.1.1 Libraries

The definition of a “library” of modules or technology-specific gates is vendor spe-cific: Synopsys has its own definition, and other EDA vendors have theirs. Althoughfile formats and byte-ordering make any compiled object nonportable, still, there hasbeen some demand for a way within the verilog language for organizing a design.Verilog provides for a library mapping file as a way of mapping a library to thefiling system. An example of this, for Synopsys synthesis, was given at the startof the generic DC compilation script provided at the beginning of this course. Inthe simplest case, the name of the library simply is paired with a directory contain-ing the library contents; all subsequent references to the contents then are by thelibrary name.

16.1.2 Verilog Configuration

The verilog configuration, new in verilog-2001 (IEEE Std 1364, section 13), ismeant to make module versions easily interchangeable, and to make libraries moreportable.



A configuration is bounded by the keywords config and endconfig. It de-scribes a library comprised of everything in a design useful for a regression test orother design activity. It is stored in a design file at the same level as a module.

The format of a config is as follows:

config config name;design design top name;default list of libraries;cell cell name use library name;instance inst name [use] instance liblist [.cell];endconfig

Keywords are in bold above; the other words indicate design-specific identifiersor lists. Only the design and the default are required. Notice that the keyworddefault (case statement) is reused here in a very different context. The referentsintroduced by the keywords in a config are as follows:

• config. This introduces an identifier for this config statement. The inten-tion here is that interchanging configs allows the design to be compiled orotherwise used with different collections of library elements or module versions,for different purposes. For example, a config early in a CPU design might benamed CPU BusXfer; later, a new config, with different references to thedesign library, might be named, CPU RTL; then, later, CPU FloorPlanned,and so forth. Any of the configs might be used to obtain information on thedesign at any config-identified stage.

As another example, different configs could be used for simulation than forsynthesis.

• design. This specifies the top-level module in the design assumed archivedin the library involved. Verilog hierarchical names are used when a library isinvolved. For example, if the library name was “IntroLib”, then the top-levelmodule might be identified by,

design IntroLib.Intro Top;

• default. This specifies the library, or libraries, in search order, from whichthe instances in the design shall be taken, unless otherwise specified in theinstance statement. For example, if we used the Synopsys synthesis class li-brary’s class.db file for everything, we could complete this config for thatexercise simply by stating,

default class;

• cell. This specifies the library from which the named cell (module) shall betaken. The use clause is used to name the library cell itself.

• instance. If more than one library was involved, and a given module namewas present in more than one, then the module or gate instances not to be foundin the default library list are specified individually here. A use clause optionally

16.2 Timing Arcs and specify Delays 281

may be used to pick a specific cell. The name starts with the name as given inthe design statement. For example, recall that we used an xor expression (ˆ) inthe Lab 1 exercise. Suppose we had a special synthesis model of an xor gate,in library file special.db, which we wanted to be configured for our presentpurposes. Then, the xor in a previously synthesized gate-level netlist of ourintroductory lab exercise might be specified this way:

instance IntroLib.Intro Top.XorNor.xor 01 special;

Using the synthesis output library, the instance statement could be just,

instance Intro Top.XorNor.xor 01 special;

Library and design configuration maintenance is a very tool-dependent and spe-cialized topic, and we shall not dwell more on it here.

The verilog configuration adds almost no functionality to the language, becausemodern designs always are configured already in the filing system. The language-based configuration thus may increase design failure by establishing conflictingmultiple points of configuration control. No tool known to the author has imple-mented the verilog config, and so there is no lab exercise on the topic of verilogconfigurations.

16.2 Timing Arcs and specify Delays

We move on to study timing arcs inside a module.

16.2.1 Arcs and Paths

Delays across a module in verilog are said to be distributed in space along tim-ing arcs; the arc refers to a phasor angle swept out during a clock period. Therationale is that eventually the module will correspond to a geometric object ona chip, and the timing between distinct, identifiable places on that chip will beimportant.

A timing arc is equivalent to a delay between two points in a module. Thesepoints are viewed as structural, so they can not be represented solely by expressionsor assignment statements; they have to be locations. Timing arcs represent scheduledverilog events and also a temporal distance between them. The distance may spanjust a single net, or it may span a network of multiple gates of combinational oreven sequential logic. A timing arc may exist between a clock and a data pin. In astructural model, timing arcs may be defined between ports or internal pins of anymodule. At any level, each of these timing arcs may be assigned a path delay, thedelay of the timing arc mapped to a path.


Fig. 16.1 Timing arcs are defined only between ports or pins. Assume A, B, E, and G are veriloginputs and the others are outputs

In Fig. 16.1, a timing arc may be defined between module ports A or B andC or D. However, in general, if there is an arc between A and C, there is no sin-gle arc between A and both C and D. Timing arcs also may be defined betweenA or B and E or G, between F or H and C or D, and between F and G. No arcis visible between E and F or G and H in the module shown; however, such arcsmay be defined in the modules which were instantiated as Instance 1 or Instance 2.The delay on the paths between E and F or G and H may be calculated to deter-mine arcs between, say, B and D and passing through one or both of the instancesshown.

Although not shown in Fig. 16.1, any path must be represented by a causal con-nection, usually a simple wire or cloud of combinational logic – but possibly bylayers of sequential logic, too.

The delay specifications we shall discuss here can be used by simulators or statictiming analyzers.

16.2.2 Distributed and Lumped Delays

In our labs, we so far have assigned both distributed and lumped delays. A dis-tributed delay is one which separately sums two or more delay values along atiming arc. Or, alternatively, a timing arc passing through two or more pins, eachassigned a delay on a subarc, represents a distributed delay. A lumped delay isone which is summed by the designer and assigned solely to the final point on atiming arc, the intermediate subarcs not being assigned any delay (also not beingassigned #0).


A simple comparison of distributed vs. lumped delays is as follows:

module DistributedDelay (output Z, input A, B);wire Node;assign #1 Z = Node; // Output port delay.and #(2,3) (Node, A, B); // Pin delay.endmodule//module LumpedDelay (output Z, input A, B);wire Node;assign #(3,4) Z = Node; // Total delay lumped on output port.and (Node, A, B);endmodule

The distinction between distributed and lumped delays is made only in the con-text of modules with substructure; the distinction is not very relevant to simple be-havioral or RTL models. As a more elaborate example, consider an RTL model of aregister composed of four three-state flip-flops and having no substructure:

‘timescale 1ns/100psmodule FourFlopsRTL #(parameter DClk = 2, DBuf = 1)

(output[3:0] Q, input[3:0] D, input Ena, Clk);reg[3:0] QReg;wire[3:0] Qwire; // Not used yet.//always@(posedge Clk)

#DClk QReg <= D;//assign #DBuf Q = (Ena==1’b1)? QReg: ’bz;//endmodule

There is a delay from Clk to QReg, and another from QReg to Q, but the onlytiming arcs in this model are from D, Ena, or Clk to Q.

The (parameter) default delay on the D−>Q path is 2+1 = 3 ns; the defaultdelay on the Ena−>Q path is 1 ns; and, the default delay on the Clk−>Q path is2+1 = 3 ns. Although the values can be summed, none of these defaults representeither distributed or lumped delay in any meaningful sense.

However, consider the more structural design below. This design,FourFlopsStruct, is functionally and timing equivalent to FourFlopsRTL;it contains some substructure implemented as arrayed instances:


module FourFlopsStruct #(parameter DClk = 2, DBuf = 1)(output[3:0] Q, input[3:0] D, input Ena, Clk);

wire[3:0] QWire;//DFF #(.DClk(DClk)) DReg[0:3](.Q(QWire), .D(D), .Clk(Clk));assign #DBuf Q = (Ena==1’b1)? QWire: ’bz;endmodule // FourFlopsStruct.// ------------------------------------------------------module DFF #(parameter DClk = 2) (output Q, input D, Clk);reg QReg;always@(posedge Clk) QReg <= D;assign #DClk Q = QReg;endmodule // DFF.

The FourFlopsStruct model has the same timing as FourFlopsRTL, butthe delays as calculated for FourFlopsStruct are distributed delays. The DReginstance output delay value DClk could be considered a lumped delay, viewed byitself.

It is possible to distribute delays on nets, too, by assigning a delay when each netis declared. This is analogous to the way we have seen a strength assigned to a net.For example, “wire #3 DataOut;” schedules delayed changes in DataOutthe same as though DataOut was being assigned from a temporary variable by acontinuous assignment statement with the given delay.

The author never has seen net delays used in a design, although, in layouts intoday’s deep submicron pitches, the net capacitances have come to account for asignificant fraction of the delay in a design, the remainder coming from the internalgate delays. Gates have locations, so practice has been to assign timing in the ver-ilog only to arcs between gates. The net delays in practice are accounted for whensimulating with back-annotation from a floorplanned or placed-and-routed netlist.

Using what we know from previous models, FourFlopsStruct could berewritten with lumped delays as follows:

module FourFlopsStructL #(parameter DClk = 2, DBuf = 1)(output[3:0] Q, input[3:0] D, input Ena, Clk);

wire[3:0] QWire;localparam DTot = DBuf + DClk;//DFF DReg[3:0] (.Q(QWire), .D(D), .Clk(Clk));assign #DTot Q = (Ena==1’b1)? QWire: ’bz;endmodule // FourFlopsStructL.// ------------------------------------------------------module DFF(output Q, input D, Clk);reg QReg;always@(posedge Clk)

QReg <= D;assign Q = QReg;endmodule // DFF.

Of course, as we might recall, the lumped delay DTot could be written to ex-press rise and fall transitions separately, this way: (DTotR, DTotF). Furthermore,


technology considerations might be taken into account to provide the verilog simula-tor with minimum, typical, and maximum delay estimates; from previous study, weknow this might be written, (DTotRmin:DTotRtyp:DTotRmax, DTotFmin:DTotFtyp:DTotFmax). But, this is as far as we can go with what we learned upto now.

16.2.3 specify Blocks

A specify block begins with the keyword specify and ends with the keywordendspecify. A specify block is at the same concurrent level in a module asan always or initial block. A specify block has no functionality, but it maybe used to determine the details of timing required for (a) an accurate gate-levelverilog library model or (b) any other module with precisely known, and preciselyrequired, timing.

Timing in a specify block can be more precisely and flexibly controlledthan timing added to declarations, statements, or instantiations. A specify blockmakes it possible to model timing without knowing or defining any functionality;functionality can be added later with no further timing editing at all. A specifyblock may be used for a static timing model of an encrypted IP block without re-vealing any functionality.

We shall study timing checks later in the course; but, for now, it’s important toknow that a specify block is the only place in a verilog model where a timingcheck may be stated. Possibly to prevent confusion over the ‘$’ which begins thename of every timing check, system tasks, which also start with ‘$’ ($display,$monitor, $dumpvars, etc.), are forbidden in a specify block.

A specify block may contain specparam definitions, timing checks, certainpulse filtering conditions, and module path delay specifications. We shall concen-trate on the specparams and module path delays for now.

Names of ports declared in a module are visible within a specify block inthat module; nothing in a specify block can reference declared reg variables orthe port pins of components instantiated in a module. However, a net, parametervalue, or localparam value declared in the same module may be used in aspecify block.

In summary, specify blocks are at the same level in a module as always,initial, or generate blocks and are delimited by the keywordsspecify . . . endspecify.

A specify block may contain:

• parameter or localparam references• module port or net references• specparam definitions• module delay specifications• timing checks.


A specify block may not contain:

• any other block• any declaration other than of a specparam• any assignment to a reg or net• any instance or other design structure• any task or function call, including system tasks or functions.

16.2.4 specparams

A specparam is a verilog parameter which usually is used only in a specifyblock. Although many tools will not allow it, a spacparam also may be declaredoutside a specify block.

This special kind of parameter exists for convenience of parsers and other toolswhich extract timing information from a model. There is only one unique featureto a specparam: It may be assigned multiple numerical values, for example in adeclaration of the form,

specparam Name = (x,y,z);

in which x, y, and z are numbers or timing triplets representing delays.When creating a timing specification for a module, good practice is not to assign

verilog literals to the delays; instead, the literals or other constants should be as-signed to specparams, which then may be referenced in timing expressions laterin the specify block. The specparams may be given mnemonic names and, ofcourse, used in any number of timing expressions within that specify block.

For example, we introduce our first path delay:

module ALU (output[31:0] Result, input[31:0] ArgA, ArgB, input Clk);...specifyspecparam tRise = 5, tFall = 4;...(Clk �> Result) = (tRise, tFall); // A simple full-path delay.

endspecify...endmodule

A specparam also may be assigned a timing triplet for simulator min-typ-maxalternatives. For example,

specifyspecparam tRise = 2:3:4, tFall = 1:3:5;...(other stuff ; maybe complicated)...

(Clk �> Q,Qn) = (tRise, tFall);endspecify


16.2.5 Parallel vs. Full Path Delays

There are two main kinds of path delay statement, full-path (“∗>”) and parallel-path (“=>”). A full-path delay applies to all possible arcs between all bits of theports named. A full-path delay may imply a timing-arc fanout (as in the Clk ∗>Result example above) or a fanin. Bit-select or part-select delays usually wouldbe specified by full path.

A parallel-path arc only can exist between endpoints with equal numbers of bits;no fanin or fanout of the arc delay is allowed. This kind of path is most usualbetween scalar (single-bit) ports. When the ports are vectors, the bit delays aremapped one-to-one, in parallel and in declared order.

Path delay specifications of any kind are illegal when fanned-in logic such as awor or wand net directly drives a port; such fanin’s must be replaced by logicallyequivalent gates with single outputs if path delays are to be assigned.

Examples of path delay specifications:

module FullPath (output[2:0] QBus, output Z, input A, B, C, Clock);... ( functionality omitted) ...specify

specparam tAll=10, tR=20, tF=21;(A,B,C ∗> QBus) = tAll;(Clock ∗> QBus) = (tR, tF);

endspecifyendmodule// --------------------------------------------------module ParallelPath (output Z, input A, B, C, Clock);... ( functionality omitted) ...specify

specparam tAll=10, tR=20, tF=21;(Clock => Z) = tAll;(A => Z) = (tR, tF);(B => Z) = tAll;

endspecifyendmodule

An interesting feature of specify delay values is that there may be as manyas six delay values for a path capable of turnoff. As stated in Thomas and Moorby(2002) section 6.6, a specify block assignment may assign six different delays toa path in the order, (0 1, 1 0, 0 z, z 1, 1 z, z 0). This level of precision rarely is usefulin modern design, because synthesizer or floorplanner back-annotation generallywill be more accurate and less effortful than this level of designer guesswork in thesource verilog.


16.2.6 Conditional and Edge-Dependent Delays

Any delay may be assigned to a path conditional on the nonfalsity of an expression.The usual verilog operators may be used in this expression, and the evaluation isnonfalse if a ‘1’, ‘x’, or ‘z’. If the expression is on a vector object, the (non)falsitydepends only on the LSB. However, only simple statements are allowed; no else,case, or other constructs with alternatives are allowed. The keywords posedgeand negedge have their usual meaning. For example,

// output[3:0] Z, input[3:0] A, input Clk, Clear are declared ports....specifyspecparam ClkR=2, ClkF=3, ClearRF=1, AThruR=4, AThruF=5;//if (Clk && !Clear)(A => Z) = (AThruR, AThruF);if (A[0] && A[3]) (A[1], A[2] �> Z) = AThruR; // Lists are OKif (A[1] && A[2]) (A[0], A[3] �> Z) = AThruF;if (!Clear) (negedge Clk �> Z) = ClearRF;if (!Clear) (posedge Clk)�> Z) = (ClkR, ClkF);

(posedge Clear �> Z) = ClearRF;endspecify

A path destination may be assigned a polarity, so that the delay is associated witha certain edge direction at the related port. This allows the data path to be used todetermine the delay, as well as the output change. For example,

// output Q, Qn, input D, Clk: Q1 and Q2 are logically equivalent....specifyspecparam tR Q = 5, tF Q = 6.5;...( posedge Clk => (Q1 +:D) ) = (tR Q, tF Q );( posedge Clk => (Q2 -:D) ) = (tR Q, tF Q);endspecify

In the first statement above, Clk clocks in D to Q; if Q1 was ‘0’ and D is ‘1’,then tR Q is the delay; if Q1 was ‘1’ and D is ‘0’, then tF Q is the delay.

In the second statement, the ‘−’ means that the D polarity is inverted, so theeffect of D, when compared with the explanation of the first statement, is to choosethe other delay. So, if Q2 was ‘0’ and D is ‘1’, then tF Q is the delay; if Q2 was ‘1’and D is ‘0’, then tR Q is the delay.

Similarly, both parallel and full path delays may be assigned with polarity depen-dence. The ‘+’ means noninversion; the ‘−’ means inversion. So, “−=>” refers toa change on the left which is in the opposite direction from the resultant, parallel-path delayed, change on the right. The same principle applies to full path statements,using “−∗ >” and “+∗ >”. For example,

16.3 Timing Lab 20 289

...specify(clk -�> Q1, Q2) = t ClkTog; // Delay when Q1 | Q2 goes opposite clk.(clk +�> Q1, Q2) = t ClkReg; // Delay when Q1 | Q2 goes same way as clk.endspecify

These special features might be useful in switch-level modelling.Delays may be given explicitly for transitions to and from ‘x’ or ‘z’ as well as

the other logic levels. This leads to timing specifications not just of the two value(rise, fall) or three value (rise, fall, turnoff ) variety, which we have seen, but also ofsix values (all transitions with ‘z’) or twelve values (all transitions with ‘x’ or ‘z’).See Thomas and Moorby (2002) appendix G.8 for a list of these specifications.

16.2.7 Conflicts of specify with Other Delays

As mentioned in Thomas and Moorby (2002) section 6.6 and specified in IEEE Std1364, section 14.4, when a specify block contains a delay assigned to a path alsodelayed outside the specify block (in a delayed assignment or delayed primitiveinstance statement), the greater of the two delays will be the one simulated.

This said, keep in mind that individual simulators are likely to be equipped withinvocation or configuration options to select among the different possible sources ofdelay, overriding the default specified by the verilog language.

16.2.8 Conflicts Among specify Delays

The verilog 1364 Std, section 14.3.3, says that when several active inputs changewhich individually would schedule the same event at different delays, the shortestdelay is the one used.

16.3 Timing Lab 20

Do this work in the Lab20 directory, using the verilog provided.

Lab Procedure

The verilog for this exercise has been provided, minus timing, in the Lab20/SpecIt subdirectory. Before beginning this lab, copy every file in Lab20/SpecIt up one level to Lab20. The top-level schematic for the SpecIt designis given in Fig. 16.2; the two lower-level schematics are given in Figs. 16.3 and16.4. As previously discussed, module names and instance names are in differentname spaces, so naming an instance exactly the same as its module, while not oftenrecommended, is allowed.


Fig. 16.2 SpecIt top-level schematic for timing path exercises. Port names and block instancenames (= module names) are shown

Fig. 16.3 The Comboschematic, showing gateinstance and port names.Inputs begin with ‘i’; outputsend with ‘o’

Fig. 16.4 The Flipperschematic, showing gateinstance and port names.Inputs begin with ‘i’; outputsend with ‘o’

Step 1. Instantiate the SpecIt module in a testbench and supply test vectors bymeans of a 4-bit up-counter. Use the count bits as stimuli on the SpecIt inputs inthis order: {MSB, . . ., LSB} = { Reset, Ena, Nor, And }. Clock thecounter with a 100 ns (10 MHz) clock. Verify correct syntax by loading the designin the simulator (see Fig. 16.5).

Fig. 16.5 SpecIt simulation to verify design correctness


Step 2. Lumped and distributed delays. In the Nand, Nor, and Nuf gate instan-tiations in Combo, assign output delays of 3, 5, and 7 ns, respectively. This willestablish a distributed delay totalling 12 or 15 ns between iNand and Nufo. Checkthis by simulating briefly (Fig. 16.6).

Fig. 16.6 SpecIt simulation with distributed delays

What happens when two inputs change at the same time and are distributed dif-ferent delays?

Try setting a lumped delay of, say, #10, on the Combo instance in SpecIt.Combo has just one output, so this delay should be unambiguous. What happens inthis potential conflict?

Step 3. Specified path and distributed delay conflict. Make a new copy ofCombo.v (with the distributed delays of Step 2) in a file named ComboSpec.v,but don’t change the module name. Change your SpecIt file list so ComboSpec.vis read by the simulator instead of Combo.v.

A. In ComboSpec.v, add a specify block to the Combo module which as-signs a full-path delay of 10 ns from any input to the output. Use a specparam forthe time value. Simulate to see what happens.

B. Change the specify block delay to 20 ns rise, 21 ns fall, 22 ns turnoff. Whathappens?

According to the verilog 1364 Std, when delays conflict this way, they must beresolved pessimistically: As previously mentioned, in a change between ‘1’ and ‘0’,the longest delay is used. As usual, in a change to ‘x’, the shortest delay is used;and, in a change from ‘x’, the longest delay is used. All the new specify delaysare longer than the distributed delays on the Combo instances.

C. Leaving the Step 3B specify values alone, define a Combo localparamnamed InstTime and assign it a value of 25. Use InstTime to assign each ofthe distributed delays on the gate instances to 25 ns. Simulate. What happens?

D. Leaving the Step 3C distributed delays as-is, define another localparamtZ= 30, and use it to assign the value to your specparam for turnoff time. Sim-ulate. This shows another benefit of using parameters to change the timing of amodel.

Note: Your simulator may refuse to use a generic named constant (parameteror localparam) to define a specparam; this is incorrect and a minor nuisance.A workaround might be to avoid specparams and use localparams instead.


E. In the Step 3D model, try changing the full-path assignment to a parallel-pathassignment without changing anything else. What happens? Your simulator shouldissue an error or at least a warning when you do this.

Step 4. Specified parallel and full delay conflict. Copy Flipper.v to a newfile FlipperSpec.v, keeping the module names the same. Use your originalCombo.v (no instance delay) in combination with FlipperSpec.v for simu-lation in this Step.

A. Use specparams in a specify block in the Flipper module (not theDFFC module) to assign a parallel-path delay to both outputs of Flipper: Use asum in the specify block to represent that Bufo adds a delay of 1 ns to any changeon its input or control, and that clocked delays are 5 ns to a rise on Flipo and 8 nsto a fall on Flipo.

For example,

specparam tClkQR=5, tClkQF=8, dBuf=1, tClr=2;...(posedge iClk => Bufo) = (tClkQR+dBuf, tClkQF+dBuf);...

B. But, assume that a negative clock edge turns off Bufo; add a turnoff delay of1 ns to the specify block. Use only rise and fall delays.

C. Also assign a parallel-path delay of 2 ns to a clear of Flipo, appropriatelyadding delay to Bufo for that clear. Simulate to see the result.

D. With the changes in A – C imposed, add a full-path delay of 30 ns inFlipper from iClk to both output ports (using a list). Simulate; 30 ns is largeenough that you easily should be able to see a 30-ns shift in the waveforms, if oneoccurs.

E. Replace your 30 ns full-path assignment in D with one to 0 ns. Simulate to seethe effect. Question: When there is a conflict between different delays to the sameport pin, which delay is used by the simulator, the shorter or the longer one?

Step 5. Using your Step 4E design for Flipper, reinstall the delays into Combothat you had at the start of Step 2, and simulate. You should be able to see waveformsthe same as in Fig. 16.7.

Fig. 16.7 A SpecIt Step 5 verilog source simulation


Synthesize SpecIt. Be sure to compile the correct files. Read the timing report;what happened to the timing in the modules?

Write out an SDF file (recall Week 1 Class 1) and look at it briefly in a text editor.Notice that there are timing triplets everywhere. We shall study this kind of file laterin the course. If you should simulate the resulting netlist, you would see waves aboutthe same as in Fig. 16.8.

Fig. 16.8 The SpecIt Step 5 verilog netlist simulation with SDF back-annotated timing


How does the simulator resolve conflicting delay scheduling times when the delaysdiffer but the final state is the same?

What about when the final states differ?


Read Thomas and Moorby (2002) section 6.6 on specify block delays.


Read sections 5.2 and 6.2 on gate-level and assignment-statement timing.Read chapter 10 on path and other delays.Try section 10.6, problems 1–3.


17.1 Timing Checks and Pulse Controls

17.1.1 Timing Checks and Assertions

Timing checks have the appearance of system tasks (both begin with ‘$’), but theyare not system tasks (IEEE Std 1364, section 15). System tasks are simulated inprocedural code; timing checks are concurrent and are allowed only in specifyblocks. However, some system tasks, such as $display, can be used to createassertions which resemble the result of a timing-check violation.

As we saw a while ago (Week 4 Class 2), simple system tasks can be used toconstruct assertion checks in the verilog. While assertion often is taken technicallyto refer to something specialized, and while in some languages such as VHDL orSystem Verilog there is a builtin assertion mechanism, the functionality is just tocheck some condition on variables or other design objects during a simulation, andto create a warning message, and perhaps an error condition, when the assertion isnot fulfilled. An error condition may stop the simulation.

In principle, a synthesizer or static timing analyzer could be provided with anassertion mechanism to issue a message or stop the program when some structureor other condition was encountered during a traversal of the verilog input or duringcreation of the output.

The Liberty library format used for netlist synthesis includes its own timingchecks very much the same as those in verilog. However, these checks can not berun by a verilog simulator; they are used (a) during synthesis and optimization toavoid violation of constraints and (b) during static timing verification.

The difference between assertions, debugging, and code coverage is that asser-tions represent a designer’s insight into what might go wrong; debugging proceedsafter a flaw has been uncovered; and code coverage estimates the need for debug-ging. In this context, a timing check can be viewed as a builtin, specialized assertionmechanism. All a timing check does is issue a message to the computer consolewhen some design constraint has been violated, or some device operating parameterhas gone out of range. Simulators typically include an option to copy timing checkmessages to a log file.



A timing check itself does not change simulator waveforms or the values as-signed to variables during simulation. However, there does exist a notifier mecha-nism in all verilog timing checks which may be used by the designer to change thecourse of the simulation as a result of a timing violation.

As already mentioned, timing checks are allowed only in verilog specifyblocks; they can not be put in always blocks or in procedural code such as tasksor functions. Timing checks sit in their specify blocks, wired into the restof the design, and they issue messages when the simulation goes wrong in somesubtle, timing-related, way. The conditions they report often would go unnoticed bysomeone looking for violations in the simulator output waveforms.

To summarize the features of timing checks:

• They all use the syntax, $name of timing check (argument list);• Functionally, they are predefined assertions.• They differ from procedural assertions:

◦ They are allowed only in specify blocks.◦ They include builtin triggering logic.◦ They run concurrently.◦ They are part of the simulator and so incur little runtime overhead.

17.1.2 Timing Check Rationale

Timing checks usually are the best ways of enforcing device hardware operatingspecification limits during simulation. In terms of the functionality of the designbeing simulated, all timing checks are based on a reference event and a data event.These events may be scheduled on one, or on more than one, variable in the simula-tion. There is no connection here between usage of the word data, and “data” in thedesign as contrasted with “control” or “clock”. The timing check imposes certainconditions on these two simulation events; and, so long as the conditions (equiva-lent to a logical expression) are true, the timing check does nothing. Generally, thereference event is conceived of as fixed in time, and the data event is conceived ofas varying within or beyond a violation limit.

Two related technical terms are timestamp and timecheck. In the timing check,whenever the first one of the events is scheduled in simulation time (it may be eitherthe reference or the data event), a timestamp is recorded. If and when the other eventis executed, a timecheck is done to check the conditions on the two times.

The time limit in these checks often defines an open interval in simulation timethe endpoints of which do not trigger a violation. A time limit of 0 provides a con-venient way to disable any timing check, with the sole exception of the skew-relatedchecks.

By default, a timing check triggers only once per timestamp event, even whenmore than one violating timecheck event has been simulated. Some of the timingchecks can be enabled optionally to print multiple messages for repeated violations,for example on different bits of a timechecked bus.

17.1 Timing Checks and Pulse Controls 297

The limits controlling timing checks must be constants and usually would be de-fined by specparams. The design variables may be vectors; if so, any bit change(s)in the vector mean(s) the same as that change in a one-bit variable; so, only one vi-olation can be triggered (by default) for each such vector change.

Parameters and other inputs may be passed to the timing check by position, only.

17.1.3 The Twelve Verilog Timing Checks

Here is a complete list, grouped according to expected applicability; however, thesechecks work the same way regardless of the design functionality of the variable(s)checked:

Clock-ClockChecks

Clock-DataChecks

Clock-ControlChecks

Data Checks

$skew $setup $recovery $width$timeskew $hold $removal $period$fullskew $setuphold� $recrem� $nochange

� Avoid these, if possible.

Each of these timing checks is detailed below. Only the required inputs are listedbelow, except for $width, which includes an optional glitch parameter; with thatone exception, all optional inputs such as notifiers follow the listed ones. Thenotifier is explained below, but the reader should refer to the IEEE 1364 Std forfull documentation of features available in timing checks, such as edge specifiers orremain-active flags.

In QuestaSim, timing checks are treated as assertions, and verilog warning asser-tions must be enabled explicitly for timing checks to be run.

17.1.3.1 Clock-Clock Checks

$skew. This check triggers on an excessively long delay between events on twovariables. The delay value must be nonnegative. Typically, the reference (timestamp)event is a clock, and so is the data (timecheck) event. If the data event never occurs,there is no timing violation.

For example,

ref. event data event limit expression$skew(posedge Clock, posedge GatedClock, MaxDly);

$timeskew. This differs from $skew in that, by default, if the specified timelapses, a violation occurs whether or not there ever is a data event.$fullskew. This check is the same as $timeskew, except that it allows twononnegative delays. The first delay specifies a limit when the data event follows thereference event; the second, when the reference event is second.


For example,

ref. event R data event D R-D limit D-R limit$fullskew(posedge Clock, posedge GatedClock, MaxRD, MaxDR);

17.1.3.2 Clock-Data Checks

$setup. This check triggers a violation when the reference event, usually a clock,occurs too soon after the data event. The time limit must be nonnegative. The trigger-time window begins with the data event, which is the timestamp event for this check.

For example,

data event ref. event time limit$setup( D, posedge Clk, MinD Clk );

$hold. This check triggers a violation when the data event occurs too soon afterthe reference event, which latter usually is a clock. The time limit must be nonneg-ative. The trigger-time window begins with the reference event, which thus is thetimestamp event for this check.

For example,

ref. event data event time limit$hold( posedge Clk, D, MinClk D );

$setuphold. This check requires two time limits, either of which may be neg-ative; otherwise, the syntax follows that of $hold, with the setup limit first. Itcombines the functionality of a $setup and a $hold check when both limits arenonnegative. Because of error-prone complexity and thus likely false alarms, routineuse of this timing check is not recommended. See explanation below.

17.1.3.3 Clock-Asynchronous Control Checks

Conceptually, these checks depend on internal delay characteristics of componentsto which several signals are applied; typically a clock train and an asynchronouscontrol signal are assumed. The timestamp event is the reference or data event,whichever occurs first in the simulation. Refer to the waves in Fig. 17.1.

$recovery. Recovery refers to recovery time from deassertion of an asyn-chronous control such as a set or clear, and effective occurrence of a clock edge.Given that the control has been deasserted, how much time must be provided torecover clock functionality? The time limit must be nonnegative.

For example,

ref. event data event limit$recovery(negedge Clr, posedge Clk, MinClr Clk);

$removal. Removal refers to time of occurrence of an effective clock edge anddeassertion of an asynchronous control. Given that a clock edge has occurred, how


long before that edge does an asynchronous control have to have removed itself topermit the clock to be effective? The time limit must be nonnegative.

For example,

ref. event data event limit$removal(negedge Clr, posedge Clk, MinClk Clr);

Note: $removal doesn’t seem to work in the Silos demo version.

Fig. 17.1 Recovery and removal time limits. Clear is asserted high; violating edges are shownwith ‘x’; passing edges are shown with arrows

$recrem. This check permits two time limits, recovery limit first and then re-moval. When both limits are positive, it has the same effect as a $recovery and a$removal check, each with its respective limit. It allows for negative time limits.Because of error-prone complexity, routine use of this timing check is not recom-mended. See the discussion of negative limits. below.

17.1.3.4 Data Checks

$width. This check verifies a minimum width of a pulse on a single variable.Whichever edge is provided is used as the timestamp event, and a timing violationis triggered unless enough time has lapsed before the opposite edge occurs. Thewidth value must be nonnegative. A second timing parameter is optional, the glitchthreshold, which suppresses the timing violation if the timestamped pulse is foundto be narrower than specified by the glitch threshold.

For example,

ref. edge width glitch thresh.$width(posedge Reset, MinWid, MinWid/10);

$period. This check triggers a timing violation if the same edge recurs on avariable within too short a time. The time value must be nonnegative.

For example,

ref. edge period$period(posedge Clk, MinCycle);

$nochange.This check requires an edge on the reference event (= timestamp) todefine a level during which no data event (= timecheck) should occur. If a data eventoccurs during the reference level, a timing violation is triggered. Two offset times


also are required; the first shifts the violation level start event (timestamp edge)and the second shifts the violation level end event (timecheck). The offsets may benegative; a positive value increases the duration of the violation window; a negativevalue decreases it.

Examples:

// Require DBus constant during entire positive phase:ref. edge data event lead shift lag shift

$nochange(posedge Clk, DBus, 0, 0);

// Violation starts 1 before negedge; ends 2 after posedge:specparam MinSetup = 1, MinHold = 2;$nochange(negedge Clk, EBus, MinSetup, MinHold);

// Shift the check 1 unit later:specparam MinSetup = -1, MinHold = 1;$nochange(negedge Clk, FBus, MinSetup, MinHold);

Note: $nochange apparently is not implemented in any simulator with whichthe author is familiar.

17.1.4 Negative Time Limits

We have recommended against any use of the two timing checks, setuphold andrecrem, which permit specification of negative limits. This is because of complexity:A check on design timing should not be more complicated than the design; because,if it were so, an error in the check might cause time lost over false violations or evenperhaps an overlooked malfunction in the design.

However, when including a block of IP accompanied by a verilog model whichlacks adequate internal timing checks, it may be necessary for the designer to addtiming checks on variables at the boundary of the IP block. The internal variablesmay not be accessible. Under these circumstances, a skewed or even negative timelimit may be necessary for the check.

To see why this might be so, as an example, compare the timing in regard to thereference event for simple setup and hold on an internal sequential element such asis shown in Fig. 17.2.

Fig. 17.2 An IP block withan inaccessible flip-flop.Delays within the block areshown as delta-delays. Anaccessible timing-check dataevent occurs on the blockboundary at time tData;a reference event (clock) attClk


There are three different conditions possible: (a) No differential delay throughthe IP logic up to the flip-flop; (b) additional delay on the flip-flop clock, perhapsbecause of a buffer tree; and, (c) additional delay on the flip-flop data, perhaps be-cause of combinational processing.

The effects for a mild skew are shown in Fig. 17.3.

Fig. 17.3 Rationale for skewed time limits. Shaded regions represent fixed-width requirements ofthe sequential component. If the additional delays are equal, normal setup and hold timing checkshave times equal to the limits. When the data are delayed more than the reference, the setup limittime must be increased and hold must be decreased; when reference is delayed more than data, thehold limit time must be increased and setup decreased

When the additional IP delay exceeds a setup or hold limit, one of the times goesthrough zero and becomes negative. This is shown in Fig. 17.4.

Fig. 17.4 Rationale for negative time limits. When the data or reference is delayed more than thesetup or hold time limit for the isolated sequential component, the time at the IP boundary goesthrough zero and becomes negative

The present author argues against using $setuphold or $recrem, eventhough they permit direct entry of negative limits: When the internal delay differ-ences are known, as they have to be to use $setuphold or $recrem negative val-ues, the delay difference simply should be cancelled by delaying the timing-checkdata or reference event on a temporary net to cancel the difference. For example, if∆tData >> ∆tClock, then, delay the clock to the timing check by the difference, thuscancelling it, and use the databook requirement directly in the timing check:

wire ClockToHold;assign #tCancel ClockToHold = Clock; // tCancel from IP vendor or experiment.specify$hold(posedge ClockToHold, DataIn, tHold); // tHold from data book....


The tailored, delayed net (in the example, ClockToHold) will not be usedelsewhere in the design and will be removed by the logic synthesizer (along withnotifier reg’s, if any – see below). A delayed net may be used with the recom-mended, nonegative value, timing checks even where it does not cancel the entireIP delay – just so long as the cancellation is enough to avoid the need for negativelimits.

17.1.5 Timing Check Conditioned Events

Certain logical conditions are allowed with the variables named in a timing check.These are applied by the operator &&&, which logically is the same as && in othercontexts. The event must be “true” by &&& if the check is to be run. The operators,==, !=, ===,!==, or ∼ are allowed in the expression applied by &&&.

In a timing check, the conditioning expression RHS must be a scalar (1-bit) vari-able expression. A vector is allowed, but only the LSB value will be expressed.

For example, here we don’t want a violation if Ena is low while D changes:

$setup( D&&&(Ena==1’b1), posedge Clk, MinD Clk );

17.1.6 Timing Check Notifiers

It is possible to make simulator behavior depend upon a timing check. Perhaps, onewould want the simulation to stop on a violation; or, maybe, a detailed messagemight be issued based on an assertion.

All timing checks allow at least one optional input in addition to those shownabove, and the first optional input not given above always is the name of a notifierreg. This is a one-bit reg type declared visible to the timing check specifyblock, which is to say, in the module containing that specify block. This reg,of course, becomes part of the simulation model of the design; thus, anything pos-sible on change of a reg value can be initiated by a timing violation through anotifier. Typically, a simulator system task such as $stop would be the onlyaction a designer would want; or, perhaps an assertion failure might be programmedon notifier change.

When a timing violation is triggered by a timing check, the notifier value istoggled between ‘1’ and ‘0’. If the notifier was ‘x’ when the violation occurred, itis toggled to ‘0’; if it was ‘z’, it remains ‘z’. This last feature provides a way todisable notification without disabling the timing check itself.

The reg passed to the timing check must have been declared, but it need not beused anywhere else in the design.


Example:

reg Notify;...always@(Notify) $stop;...specify...$setup( D, posedge Clk, MinD Clk, Notify );endspecify

17.1.7 Pulse Filtering

Timing checks do nothing to alter the simulation, unless perhaps by the optionalnotifier feature. However, pulse filtering is an essential part of the simulatorbehavior.

In normal, default inertial delay, pulses shorter than a gate delay are removedfrom the schedule of events on the gate input. In verilog, this default pulse filteringis viewed as occurring because two different time limit settings, the error limit andthe rejection limit, happen, by default, both to be equal to the gate delay.

The verilog error limit always is greater than or equal to the rejection limit. If therejection limit is different from (= less than) the error limit, three possible thingsmay happen to an input pulse with an effect delayed on an output:

(a) the pulse is wider than the error limit: It is delayed but otherwise is left un-changed;

(b) the pulse width is between the error limit and rejection limit: It is delayed andnarrowed to a width equal to the rejection limit; or,

(c) the pulse width is below the rejection limit: It is cancelled and never affects theoutput.

These limits may be set by simulator invocation options or in an SDF file; but,here, we ignore this and present only the specify block special specparam,PATHPULSE.

PATHPULSE is the only reserved word in verilog not in lower case. In a pecu-liar twist of syntax, it is prepended to the two variable names involved, which mustname a timing path, with ‘$’ delimiters. It changes the type of the timing path in-volved. The result is used as the name of a specparam. For example, to applyPATHPULSE to the path between ports Ain and Bout, one writes,

specparam PATHPULSE$Ain$Bout = (r limit, e limit);

in which r limit is the rejection limit value and e limit is the error limitvalue. An example is shown in Fig. 17.5. Notice that PATHPULSE, like otherspecparams, may be assigned more than one numerical value. Specifying the er-ror limit is optional; assignment of one value in the line of code above would assign


the rejection limit value only, the error limit value being taken to be equal to thegate delay involved. The names in the path must be declared names and may not bebit-selects or part-selects.

Fig. 17.5 Effect of PATHPULSE limits on pulse filtering. Assume PATHPULSE$ = (2, 3).With no PATHPULSE specification, all pulses would be rejected by simple inertial delay becauseof the gate delay of 5

It is possible to omit a path and write generically, “PATHPULSE$= (r limit,e limit);” or “PATHPULSE$ = r limit;”. This assigns the limit(s) to everypath in the entire module. When both this and a PATHPULSE naming a path arepresent, the one naming the path overrides the generic one on that path.

Example:

module NewInertia #(parameter r limit = 3, e limit = 4:5:6)(output Z, input A, B, C);

...specify... (delays; timing checks) ...specparam PATHPULSE$ = r limit ; // Module default.specparam PATHPULSE$B$Z = ( r limit, e limit );

endspecifyendmodule

A specify block may contain full path delay statements, omitted from theexample code above, which name multiple paths; when this is so, the “wildcarded”PATHPULSE$ is applied only to the first path (first input to first output) in any suchdelay statement. Attempting to apply PATHPULSE selectively to any other path insuch a delay statement, other than the first one, has no effect. Thus, PATHPULSE ismost easily used for individual paths when those paths are one bit wide and whenthe timing specifications are parallel-path rather than full-path.

Note: PATHPULSE does not seem to work in Silos or QuestaSim, and it produces‘x’ outputs instead of no change in VCS, when pulses are narrower than the rejectionlimit and the +pathpulse invocation option in force. Under these conditions, theVCS ‘x’ outputs are accompanied by warning messages.


17.1.8 Improved Pessimism

Recall the discussion of delay pessimism in Week 7 Class 2. Pessimism improveschances of success, but it can be wasteful.

In addition to PATHPULSE, there are four other relevant reserved specparamtypes: pulsestyle onevent, pulsestyle ondetect, showcancelled;and, the negation, noshowcancelled.

Unlike PATHPULSE, these are used by declaring lists of output variables as-signed to them which are to be simulated with improved pessimism, which is to say,with increased use of ‘x’ scheduling. They must be declared in the specify blockbefore any path which is assigned a delay including one of the variables listed.

pulsestyle onevent. This is the default behavior: When a contentionyielding an ‘x’ exists, the contending events are scheduled normally; and, at thesimulation time at which the contention first occurs, an ‘x’ is scheduled.

pulsestyle ondetect. This is more pessimistic: As soon as the simulatorhas computed the contention, an ‘x’ is scheduled. This shifts the leading edge ofthe ‘x’ level to some time earlier than when it would have been had the default hadbeen in force. Essentially, the output of a gate goes to ‘x’ as soon as an input eventarrives which would cause that ‘x’. The on-detect edge of the ‘x’ tells the designerhow early a possibly uncontrolled state has occurred in the simulated design. In a bigdesign, triggering a $stop assertion on detection, rather than on event occurrence,can save considerable debugging time.

showcancelled. Sometimes, different rise and fall delays may cause the lead-ing edge of an ‘x’ event to occur at the same time as the lagging edge, or even be-fore, creating a zero or negative duration of the ‘x’ level. Such events are cancelledsilently by the simulator, by default. Assigning an output pin to this specparammeans that the simulator will schedule a zero or negative-width output pulse of ‘x’at the original leading-edge output time; this pulse will be of the input-pulse width.If, in addition, this pin has been assigned to pulsestyle ondetect, the outputleading edge of the ‘x’ will be advanced to the detection time, widening the output‘x’ pulse.

For debugging purposes, it is possible to select variables in a showcancelledlist to be restored to default behavior by assigning them to the noshowcancelledspecparam.

Note: None of these pessimism improvements seems to work in Silos, Ques-taSim, or VCS. However, these simulators have invocation or interactive optionsproviding similar functionality.

17.1.9 Miscellaneous time-Related Types

The verilog standard includes data types intended to be used in specify blocks.In addition to specparams, which are commonly used, these are time andrealtime types.

A time is an unsigned reg type of predetermined width which is guaranteedto be at least 64 bits. It is meant to be used in testbenches or in conjunction with


long-duration timing checks. Assigning a time reg to a wide wire type yields avector value which may be referenced in a specify block.

A realtime reg is identical to a real in all respects except name. It is rem-iniscent of the tri net, which is identical to a wire except in name.

Both time and realtime may be used anywhere in a module where areg type is allowed. However, use of these types in design should be consideredcarefully. They add nothing to functionality and make the syntax a little morecomplicated. Instead of invoking a type with time-related associations, it is bet-ter to name the declared variables so that every use of their identifier recalls thatthey are time-related. Also, these types, having no unique functionality, may not beimplemented in all design tools.

17.2 Timing Check Lab 21

Work in the Lab21 directory. A subdirectory named PLLsync has been preparedfor you there; it contains the entire PLLsync design from Lab10 (Week 4 Class 1).

Lab Procedure

Be sure that your simulator is enabled to process PATHPULSE assignments andtiming checks.

Recall that the PLLsync design has the following component blocks: At the toplevel, PLLsync and Counter4; both in .v files named for the modules. There is atestbench in PLLsync.v namedPLLsyncTst. In a subdirectory named PLL, thePLL resides in five files: An include file and four others, the latter ones containingmodules named PLLTop, ClockComparator, MultiCounter, and VFO.

The VFO in this model will exhibit its delay-adjusting functionality even with aconstant clock frequency; we shall use this to demonstrate timing checks.

Step 1. Preliminary simulation. Change to the PLLsync subdirectory. Usingthe testbench provided, run the PLLsync simulation for a very long time, say25,000 ns. Notice that with a ‘VFO MaxDelta of 2 (in the PLL include file), theVFO period oscillates within a 4 ns range. If you display the VFO Delay variablevalue in VFO, you will see it switch in increments of 1 ns between 14 and 18 ns (thevalue should average 15.625 ns for a PLL clock with frequency of 32 MHz). Thismeans that the counter counts with a delay of between 28 and 36 ns.

Step 2. Setup check. Suppose we wish to use a positive edge on the Samplecommand to the VFO to sample the count value in the counter. We want to be sureto allow a setup time of 5 ns between these signals. Our recently renamed module,PLLsync, originates the Sample command as SyncPLL, and it also receives thecounter count output as Behavioral. So: Install a $setup check in PLLsyncwhich issues a violation at 5 ns. Make the limit depend on a specparam value.Simulate and watch the simulator console output window. Set the limit to 0 ns todisable this check.

17.2 Timing Check Lab 21 307

Step 3. Hold check. Add a hold check, requiring 8 ns, to the Step 2 timing prob-lem. Simulate. After seeing the violations, set the limit to 0.

Step 4. Recovery and removal. Modify your testbench to apply two successiveClearIn pulses to PLLsync; they should be separated by 50 ns, and each shouldlast 500 ns. The separation should be centered approximately on a clock edge arounda simulation time of 1000 ns or so.

Add a recovery check of 100 ns for ClockIn recovering from ClearIn on thefalling edge of ClearIn; add a removal check of 100 ns on these edges. These val-ues should trigger both timing violations. Simulate. Then, shorten the check timesso no recovery or removal violation is reported.

Step 5. Width and period checks. In PLLsync, add a width check to issue a vio-lation when ClearIn is low for less than 100 ns. Simulate to see the result; then,disable this check by setting the time to 0. Add a period check to be sure that thePLL clock output period is not less than 30 ns, based on a positive edge. Disable thischeck after viewing the result.

That’s all for the PLLsync design for now. For the rest of this lab, copy yourDFFC.v model (D flip-flop with clear) from the Lab08 exercises (Week 3Day 2) to the Lab21 directory.

Do the next Steps in order; they depend on one another:

Step 6. Pulse filtering. What happens when the input changes too rapidly?To answer this, first change the DFFC model:Remove the timescale setting from the DFFC file (there will be one in the test-

bench). Change the old continuous assignment delays of #1 on Q and Qn to negli-gible but nonzero ones of #0.001 (1 ps).

Add a specify block after the module declarations, setting delays as follows.Use mnemonically-named specparams:

Path delay from clock posedge to Q= 1000 ps rise and 800 ps fall; to Qn 1100 psrise and 850 ns fall.

Full delay from clear to Q or Qn = 700 ps rise and 900 ps fall.Instantiate the DFFC as a toggle flip-flop in the testbench (DFFC Tst) which has

been provided for you in the Lab21 directory: This testbench gradually decreasesthe clock period from 10 ns to 10 ps. It also provides a regular clear of 50 ns periodand 50% duty cycle. It sets ‘timescale 1ns/1ps to resolve events on a 1 psgranularity.

To make a toggle flip-flop, wire the Qn to the D input.A. Visual check. Simulate (the float calculations will make this take a noticeable

time), and examine the waveform. What is the clock frequency at which your D


flip-flop functionality becomes unreliable? Hint: Q and Qn both must work. Howdoes this relate to the delays in the model? See Fig. 17.6 and 17.7 for one result.

Fig. 17.6 Overview of the DFFC simulation with slowly increasing clock frequency

Fig. 17.7 Close-up of the first timing failure in this DFFC simulation

B. Reliability enforcement. Assume we want an error limit of 1100 ps and a re-jection limit of 500 ps for any pulse on any path to Q or Qn in our DFFC model. Addone or more PATHPULSE specparams to the DFFC specify block to accom-plish this. Simulate to see the result; then, comment out the PATHPULSE(s).

Step 7. Width. We would like to be sure that Q and Qn stay high for at least 1 nswhen they go high.

A. Add $width timing checks for this. Simulate.B. Change the minimum-width limit to 0.850 ns and simulate. This should pro-

duce a limited number of violations.C. Declare a new reg named Notify in DFFC and connect it to your $width

timing checks as the fourth calling parameter; assign a glitch reject of 0 as aplaceholder third calling parameter. Add the following always block below yourspecify block:

...$width(posedge Qn, twMinQQn, 0, Notify);endspecify//always@(Notify) $stop;

17.2 Timing Check Lab 21 309

Now run the simulation again. You can continue the simulation at your leisure,after each $width violation.

Change your minimum width specparam values to 0.500 ns to inhibit widthviolations, but leave in the width checks and Notify for now.

Step 8. Setup and hold. The pathological toggling at high clock frequency can berevealed several ways. One way is by adding setup and hold timing checks.

A. For the relationship between Clk and D, add a setup check for 1 ns and a holdcheck for 500 ps. Simulate (you may wish to stop early).

B. Connect the $setup and $hold notifiers to your Notify reg and simu-late again. Decrease your setup and hold specparam values until all setup andhold violations vanish. Obviously, if the clock is going down to 10 ps, then itwould be a good idea to start both limits here. What was the duration of the short-est one of each violation, and what was the earliest simulation time at which itoccurred?

C. In the testbench, change the timescale to 10ns/1ns. Run the simulation withsetup and hold at the lowest limits which previously were causing violations. Whathappens? Notice how quickly (in wall-clock time) the simulation finishes with sucha coarse time resolution.

Decrease the resolution limit to 10ns/100ps and then 10ns/10ps, simulat-ing each time. What is the effect of the resolution on timing violations and the waythey are reported?

Restore the timescale to 1ns/1ps, and disable the old $width check, as wellas $setup and $hold, before continuing.

Step 9. Skew check. Add a $skew check between clock and clear, so that a viola-tion will occur if clock ever goes high more than 49.99 ns after clear does. Simulate.Then, disable this check; to do this, you will have to comment out the check or setthe limit to 50 ns or more..

Step 10. Recovery and removal. These are perhaps the most easily misunder-stood timing checks. They are designed to check on the relation of an asynchronouscontrol, such as a set or clear, and a lower-priority synchronous control such as aclock.

A. Recovery check. Use $recovery to check when a posedge Clk hasnot been given at least 10 ps to recover after a negedge (deassertion) of Clr.Simulate. When does the first violation occur? Set the limit to 0 to disable thischeck.

B. Removal check. Use $removal to check that Clr has been removed(negedge) at least 10 ps before a subsequent posedge Clk. Simulate (seeFig. 17.8). Set the limit to disable the check when done.


Fig. 17.8 A $removal timing-check violation, close-up


Read Thomas and Moorby (2002) sections 8.1 and 8.4.4 for some insight into iner-tial delay.


Read the section 10.3 on timing checks.Do section 10.6, problems 6–8.


18.1 The Sequential Deserializer

Let’s review our serdes project and determine what remains to be done.We introduced our serdes project in Week 2 Class 2 and made significant progress

on the SerDes PLL in that Week’s PLL Clock Lab 6, including a serializationframe encoder in Step 8. We added serial frame synchronization in PLL Lock-In Lab10 (Week 4 Class 1). We also completed a FIFO in FIFO Lab 11; however, this FIFOcauses functionally incorrect synthesis because of unusual sensitivity lists and theresultant erroneous latch inference. Nevertheless, the FIFO does simulate correctly;and, because of this, we were able to refine the design of our deserialization decoder,DesDecoder, in Serial-Parallel Lab 16 (Week 7 Class 1).

In the present lab, we shall redesign the PLL for correct synthesis; we shallhold revision of the FIFO for later. We also shall assemble and simulate the entireDeserializer by the end of the present lab.

So far, Fig. 18.1 shows where we are in the overall design, mostly from a dataflow perspective.



Fig. 18.1 Data flow and clock distribution in the serdes project. The upper half is theSerializer; the lower half, the Deserializer. Overall design updated as in Lab 16.Hatched blocks were completed separately in previous labs. Neither the PLL nor the FIFO willsynthesize, but both simulate usably

We’ll first fix the PLL so it will synthesize correctly; then, we can complete theDeserializer for simulation and almost-correct synthesis. After the PLL, mostof the remaining tasks of today’s lab will be organizational.

18.2 PLL Redesign

The main problem with the current PLL is in the VFO: It depends on programmableverilog delays in an oscillator implemented by one, delayed nonblocking assignmentstatement. To get a predictable (but wrong) netlist out of this design, we added apreprocessor macro switch which forced the synthesizer to see a delayed blockingassignment instead of a delayed nonblocking one.

There is no really good way to design an analogue device such as a PLL in a dig-ital language such as verilog. We can not create a variable capacitor, something thatcharges gradually on each clock, to control a VCO (variable-capacitance oscillator)for our PLL. However, we can create a fast counter which adapts its count graduallyon each clock.

18.2 PLL Redesign 313

18.2.1 Improved VFO Clock Sampler

Recall that we used a Sample pulse input, originating external to the PLL, in our32x Lab06, Lab08, and Lab10 PLL designs to reduce the rate at which the VFOfrequency was adjusted. We did not use edge-averaging or any other technique forsmoothing or refining the ClockComparator’s output. The sampling pulse idea,shown in Fig. 18.2, was not essential, but it did prevent the VFO from changing itsfrequency because of every tiny clock misalignment.

Fig. 18.2 The old Lab06VFO comparator-samplingcommand

To make our PLL self-contained, we shall not any more use an external sam-pling pulse for the VFO: We can build into the PLL itself a sampling clock whichtriggers a Comparator-based adjustment, once every cycle of one of the available,approximately 1-MHz clocks.

Looking ahead to synthesis, to generate a properly set-up sampling edge, we mustallow our Comparator counters to settle at their current counts before each sam-pling. This set up may be provided easily by running the external input clock througha library delay cell (see Fig. 18.3) to retard the sampling edge for some reasonable,brief period. The counters then will be clocked by the other, VFO-generated, clock.

Fig. 18.3 The new VFOcomparator-samplingcommand

The verilog for the connection to such a library delay cell would be as follows:

module PLLTop (output ClockOut, input ClockIn, Reset);// ...Library DelayCell DelayU1 (.Z(SampleWire), .I(ClockIn));//// (dont touch DelayU1 synthesis directives)//VFO VFOU1 ( .ClockOut(MHz32), .AdjustFreq(AdjFreq)

, .Sample(SampleWire), .Reset(Reset) );


18.2.2 Synthesizable Variable-Frequency Oscillator

Attempting to synthesize the old, Lab06 PLL will produce nothing usable. As youmay recall, we used a verilog macro to prevent the synthesizer from seeing the de-layed nonblocking assignments and to substitute a nonoscillating delayed blockingassignment.

The netlist from the synthesized Lab06 VFO does have an oscillator, but it can’tmodify its frequency, and it will run at an unusably high frequency, depending onthe synthesis constraints. Typical constraints produce a nor gate oscillating at a con-stant 6 GHz. The delayed blocking assignment workaround in the old PLL VFOamounted to an erroneously inferred latch and was synthesized to logic correspond-ing to the schematic of Fig. 18.4.

Fig. 18.4 The Lab 6synthesized VFO netlist

For a redesign which will guarantee synthesis, we shall use the high speed com-ponents available in our synthesis target library to implement a very fast oscillatorbased on a chain of library delay cells. There will not be any verilog delay involved,except as characterized for the delay cells, plus the propagation delay of an invertinggate.

To control the fast-oscillator frequency and use it as the VFO, we can clock a fastcounter and use the counter overflow as the PLL output clock. By varying the countat counter overflow, we vary the VFO frequency. We require a lower frequency than6 GHz for a reliable fast counter.

Fig. 18.5 The new,synthesizable VFO internalFastClock

The frequency is controlled by a delay line, which can be created of any config-urable length by means of a verilog generate statement. Using several, calibratedlibrary delays of about 80 or 90 ps each will tend to reduce the delay variation infabrication of the required inverter. The oscillator output would be used internallyby the VFO and is named FastClock in the schematic of Fig. 18.5 (the nor gatereplaces an inverter, because it permits a clock phase Reset initialization).


The verilog for this oscillator follows:

reg FastClock;wire WireToDelay, WireFromDelay;assign WireToDelay = ∼FastClock; // oscillation here.// -----------------------------------------------------// The always block allows initialization:always@(WireFromDelay, Reset)if (Reset==1’b1)

FastClock <= 1’b0;else FastClock <= WireFromDelay;

// -----------------------------------------------------// The delays control the (fixed) fast oscillator speed:LibraryDelayCell Delay0( .Out(Wire1), .In(WireToDelay) );LibraryDelayCell Delay1( .Out(Wire2), .In(Wire1) );...LibraryDelayCell DelayN( .Out(WireFromDelay), .In(WireN) );//(synthesizer dont touch on all Delay� instances)

The delay line is shown unrolled in the code fragment above; it must be flaggedwith a synthesizer don’t-touch to prevent its removal during synthesizer optimization.Because the delay line is an integral part of the VFO, the don’t-touch is better doneby a comment in the verilog rather than by a command in the .sct file. Thisverilog is entirely synthesizable either as unrolled in the code shown above or asgenerated.

We can make a VFO based on the new oscillator by tapping the FastClockand using it to clock a fast, programmable counter, and using a selected count valueon VaryFreq to define the VFO clock edges; this is shown in Fig. 18.6.

Fig. 18.6 The new VFO is based on an internal fast oscillator (FastClock) and programmablecounter

The VFO internal comparator (not to be confused with the PLLClockComparator module) stores the latched value of a programmable limit


which is used to reset the VFO internal counters. A VaryFreq adjustment is al-lowed to increment or decrement the programmable limit. The reset also toggles aflip-flop which creates the actual VFO output clock for the PLL.

The verilog would be somewhat like this:

// Assume a VaryFreq vector declared which// sets the VFO frequency:reg[HiBit:0] Count;reg PLLClockOut;always@(posedge FastClock, posedge Reset)if (Reset==1’b1)

beginPLLClockOut <= 1’b0;Count <= ’b0;end

else beginif (Count>=VaryFreq) // Programmable limit.

beginPLLClockOut <= ∼PLLClockOut;Count <= ’b0;end

else Count <= Count + 1;end

For our 32 × PLL, we need a library fast enough to oscillate and count up to somesmall integer value with enough precision to vary the frequency of PLLClockOutreasonably around 32 MHz. If we assume a 16 ns half-period and 1-ns precision, weneed a counter which can count to about 20 in 20 ns, which implies a 5-bit counterclocked at around 1 GHz. This speed is attainable easily in 130-nm or lower librarytechnology.

We shall look at a synthesizable 1 × PLL later.

18.2.3 Synthesizable Frequency Comparator

We also require a comparator which does not have a strange, latching sensitivitylist. The current 32 × PLL is clocked by the input, approximately 1 MHz clockand samples the VFO (PLL output) clock independently in a change-sensitive blocksomewhat like this:


// OLD, unsynthesizable VFO:always@(ClockIn, Reset) // The input system clock.

if (Reset==1’b1)... (stuff) ...

else if (CounterClock==1’b1) // The PLL MultiCounter output clock.VarClockCount = VarClockCount + 2’b01;

else begincase (VarClockCount) // The comparator object.

2’b00: AdjustFreq = 2’b11;2’b01: AdjustFreq = 2’b01;

default: AdjustFreq = 2’b00;endcaseVarClockCount = 2’b00;end

Although this defined a PLL which simulated correctly, it is a classical case oflatch inference which the synthesizer can not interpret meaningfully.

We can avoid latch inference entirely by using edge sensitivity. A reasonable im-provement of the above, then, might be to write verilog for two very small counters,say, 2 bits each, one clocked by the external PLL 1 MHz input clock and the otherby the PLL internal approximately 1 MHz counter overflow clock; the counts couldbe compared on the edge of either clock to estimate the relative speed of the clocks.We could issue an adjustment of the VFO frequency only on a difference in the twocounts.

By delaying the external ClockIn as shown in Fig. 18.7, we can trigger a com-pare on the PLL internal clock edge, thus avoiding a clock-race condition and almostalways ensuring proper set up on every compare of the two, approximately 1 MHzclocks. Note that this delay is different functionally from the one used for the PLLinternal sample command.

Fig. 18.7 The synthesizable, edge-sensitive ClockComparator. Dotted bundaries indicatefunctional blocks within the one module. The external Reset distributed to all sequential logic isomitted for simplicity


Two 2-bit counters would count continuously the positive edges of the 32 × PLLMultiCounter output clock (about 1 MHz) and the input 1 MHz clock whichis used as a stimulus for the PLL. On every positive edge of the PLL clock, acomparison is made of the relative edge-counts of the two counters, generating acontinuously-updated adjustment decision to be used by the VFO. The Zeroermonitors the ClockIn count and reinitializes both edge counts simultaneously,every time it receives a count of 3.

The VFO adjustment decision logic may be implemented with nested casestatements located in an always block clocked by the PLL MultiCounter over-flow clock:

case (ClockIn EdgeCount) // The external ClockIn edges.2’b00: case ( VFO EdgeCount) // The MultiCounter overflow edges.

2’b00: AdjustFreq = 2’b01; // No change.default: AdjustFreq = 2’b00; // Slow the counter.endcase

2’b01: case (VFO EdgeCount)2’b00: AdjustFreq = 2’b11; // Speed up the counter.2’b01: AdjustFreq = 2’b01; // No change.

default: AdjustFreq = 2’b00; // Slow the counter.endcase

2’b10: case (VFO EdgeCount)2’b10: AdjustFreq = 2’b10; // No change.2’b11: AdjustFreq = 2’b00; // Slow the counter.

default: AdjustFreq = 2’b11; // Speed up the counter.endcase

default: case (VFO EdgeCount) // Includes 2’b11 for initialization:2’b11: AdjustFreq = 2’b10; // No change.

default: AdjustFreq = 2’b11; // Speed up the counter.endcase

endcase

18.2.4 Modifications for a 400 MHz 1 ××× PLL

All the above verilog is meant for the 1 MHz, 32:1 PLL of our serdes project. But,what about the 400 MHz, unsynthesizable 1 × PLL we discussed in Week 2, Class2 before Lab06?

The special concern is with the higher PLL output frequency. The internal coun-ters described above will work unchanged, except that the VFO FastClock mustbe speeded up by limiting its delay chain to fewer delay elements. Also, the fastestavailable 90-nm library inverter should be instantiated structurally as part of theVFO oscillator.

In verilog, with ‘NumElems set to 3 instead of 5, the new FastClock gener-ator becomes this:


reg FastClock;wire[‘NumElems:0] WireD;//generate

genvar i;for (i=0; i<‘NumElems; i = i+1)

begin : DelayLineDEL005 Delay85ps ( .Z(WireD[i+1]), .I(WireD[i]) );end

endgenerate//always@(Reset, WireD)

begin : FastClockGenif (Reset==1’b1)

FastClock = 1’b0;else // The free-running clock gets the output of the delay line:

FastClock = WireD[‘NumElems];end

// The instantiated inverter:INVD0 Inv75ps ( .ZN(WireD[0]), .I(FastClock) );

Our TSMC library DEL005 component provides a delay of about 85 ps. Therequired don’t-touch directives are omitted for brevity.

With the sampling setup delay element in the containing PLL module, the syn-thesizable VFO adjustment is as above:

always@(posedge ClockIn, posedge Reset) // ClockIn delayed for setup.begin : FreqAdjif (Reset==1’b1)

DivideFactor <= ‘InitialCount;else begin

case (AdjustFreq)2’b00: // Adjust f down (delay up):

if (DivideFactor < DivideHiLim)DivideFactor <= DivideFactor + ‘VFO Delta;

2’b11: // Adjust f up (delay down):if (DivideFactor > DivideLoLim)

DivideFactor <= DivideFactor - ‘VFO Delta;endcase // Default: leave DivideFactor alone.end

end // FreqAdj.

The ‘InitialCount is set to half of the maximum count in the FastClock-creating programmable counter.

In the faster 1 × design, the Comparator decision logic can to be biased to bemore sensitive to real changes:


always@(posedge PLLClock, posedge Reset) // The delayed PLL ClockIn.begin : EdgeComparatorif (Reset==1’b1) AdjustFreq = 2’b01;elsecase (ClockInN) // Count from the PLL external input clock.

2’b00: begincase (PLLClockN) // Count from the VFO output clock.

2’b00: AdjustFreq = 2’b01; // No change.2’b01: AdjustFreq = 2’b00; // Slow the counter.2’b10: AdjustFreq = 2’b00; // Slow the counter.2’b11: AdjustFreq = 2’b00; // Slow the counter.

default: AdjustFreq = 2’b01; // No change.endcaseend

2’b01: begincase (PLLClockN)

2’b00: AdjustFreq = 2’b11; // Speed up the counter.2’b01: AdjustFreq = 2’b01; // No change.2’b10: AdjustFreq = 2’b00; // Slow the counter.2’b11: AdjustFreq = 2’b00; // Slow the counter.



2’b00: AdjustFreq = 2’b11; // Speed up the counter.2’b01: AdjustFreq = 2’b11; // Speed up the counter.2’b10: AdjustFreq = 2’b10; // No change.2’b11: AdjustFreq = 2’b00; // Slow the counter.



2’b00: AdjustFreq = 2’b11; // Speed up the counter.2’b01: AdjustFreq = 2’b11; // Speed up the counter.2’b10: AdjustFreq = 2’b11; // Speed up the counter.2’b11: AdjustFreq = 2’b10; // No change.


default: AdjustFreq = 2’b10; // No change; allows initialization.endcaseend

In your Lab22 Ans directory, there is a subdirectory named PLL 1x Democontaining the synthesizable version of the optimized 400 MHz, 1 × PLL justdescribed. This model does not, however do any decision averaging, and so itcan not resolve delay times below that of one of its oscillator delay cells, (about80 ps). This means that it never can lock in to a drifting clock, but it can comeclose.

Some 400 MHz waveform results are given in Fig. 18.8–18.11.


Fig. 18.8 The source PLL 1× verilog model: Entire 20 us simulation

Fig. 18.9 The source PLL 1× verilog model: Lock-in detail near end of 20 us simulation

Fig. 18.10 The synthesized PLL 1× verilog netlist: Entire 20 us simulation

Fig. 18.11 The synthesized PLL 1× verilog netlist: Lock-in detail near end of 20 us simulation

18.2.5 Wrapper Modules for Portability

As a small digression, before starting this lab, it should be mentioned that mod-ule I/O names will be very important to keep straight for the rest of our project. Inyour work as a designer, it may be necessary to adapt I/O names arbitrarily. If it is


complicated or inconvenient to rename your own I/Os to match the required ones,a simple workaround is to instantiate your module in a ‘wrapper’ module with thecorrect I/O names. Keep the file name the same, but add the wrapper module dec-laration above the functional one. Just connect all ports together, and the wrappernames will be the only ones visible in the rest of the design.

For example, suppose you have implemented and tested a module declaredthis way:

module MyModule(output[31:0] OutBus, ..., input ClockIn);

But, suppose the project required that this output port be named “DataBus” andthe clock be named “Clock”. Just rename your module slightly and put a wrapperin your MyModule.v file. This is shown next.

// ---------------------------------------------------// This wrapper renames the MyModule ports as required:module MyModule (output[31:0] DataBus

, ..., input Clock);

MyModule WrapperU1 ( .OutBus(DataBus), ... (1-to-1 wiring) ..., .ClockIn(Clock) );

endmodule//// ------------------------------------// Begin original MyModule design (notice the underscore in the name)://module MyModule (output[31:0] OutBus, ..., input ClockIn);... (valuable, tested functionality) ...

endmodule

In a large design, a wrapper is one of the few exceptions to the rule of declaringno more than one module in any one verilog source file.

18.3 Sequential Deserializer I Lab 22


Lab Procedure

Step 1. Reorganize the Lab21 version of the PLL.Your Lab22 directory contains an answer subdirectory, and a symlink to a file

containing verilog models for the technology library we are using in this course.Change to your Lab22 directory; from the old Lab21/Lab21 Ans directory

provided, copy the PLLsync subdirectory and all its contents (cp -pr) to yournew Lab22 directory. The instructions in the present lab are somewhat complicated,so it will be best to use the answers provided rather than your own previous work.

18.3 Sequential Deserializer I Lab 22 323

The Lab21 file organization was as shown in Fig. 18.12.

Fig. 18.12 Old Lab21 PLLfile layout

In the copied files, move everything in the new Lab22/PLLsync subdirectoryup one level, into Lab22. In Lab22, delete Counter4.v, which won’t be usedany more, and remove the now-empty PLLsync subdirectory.

Rename PLLsync.v to PLLTopTst.v; rename PLLsync.vcs to PLLTopTst.vcs; also, move PLLTop.inc up to Lab22.

The Lab22 reorganized files should be as shown in Fig. 18.13.

Fig. 18.13 Lab22 PLLreorganized layout

Looking ahead, the design files in our Lab22/PLL directory will become thosein the PLL subdirectory of our deserializer design; so, “Lab22” mostly will become“Deserializer”. For this reason, rename PLLTop.inc to Deserial.inc,and change the reference in PLL/VFO.v from PLLTop.inc to Deserial.inc.For simulation in VCS, don’t put a path in VFO.v, because we plan to run VCSand DC from the Lab22 directory, so Deserial.inc will be in current contextwhenever we compile VFO.v.In Lab22, edit PLLTopTst.v as follows:

• Delete all "Step∗" defines, and change the ‘include to,

‘include "Deserial.inc"

• Delete the old PLLsync module, but keep its Sample-pulse generatingalways block, which should be moved into the testbench module.

• Rename the testbench module to PLLTopTst.• The device under test in the testbench now should be changed to an instance of

PLLTop, and it should have an output pin named .ClockOut; the 1-bit wiremapped to .ClockOut should be named PLLClockWatch.


• The PLLTop instance should have a reset pin named .Reset and a .Samplepin driven by the Sample-pulse generating always block now located in thetestbench.

Edit the PLLTopTst.vcs file so that it contains just the file namePLLTopTst.v and the (path)names of the verilog files in the PLL subdirectory.

You now should have a complete copy of the PLL design in your Lab22/PLLdirectory, with a PLL testbench in a separate file in Lab22. The includes for thePLL should be located in Lab22/Deserial.inc.

Verify briefly that you can simulate the entire PLL by invoking and running thesimulator of your choice in the Lab22 directory.

Step 2. Verify that the PLL is unsynthesizable to a correct netlist.Change to your PLL subdirectory. Copy one of your .sct files from an ear-

lier lab, say Lab15, into the Lab22/PLL directory. Modify the copy so that youcan use it to synthesize the PLL. The PLL design at this point should consist ofPLLTop.v, MultiCounter.v, ClockComparator.v, and VFO.v. Renameyour .sct file to PLLTop.sct and edit it so it saves its synthesized verilog netlistinto a file named PLLTopNetlist.v. Synthesize. The synthesis run should suc-ceed, but, of course, the netlist can not be simulated correctly.

Change back to your Lab22 directory to verify that the netlist does not simulatecorrectly. To do this, make a new simulator file named PLLTopNetlist.vcs inthe Lab22 directory with these contents:

PLLTopTst.vPLL/PLLTopNetlist.v-v verilog library 2001.v

in which verilog library 2001.v is a symbolic link to the library vendor’s verilogmodels (you may have to copy it from the CD-ROM misc directory); it has beenmodified for these labs to have nominally realistic delays. The -v option causes thefile to be processed as a library to resolve design netlist references; this means thatthe simulator will compile only those library modules used in the design. VCS willbe able to compile the resulting netlist; but, as we know, the synthesized VFO netlistcan’t adapt, because it will include a useless VFO more or less as in Fig. 18.14.

Fig. 18.14 Incorrect synthesisof the copied VFO withverilog delayed blockingassignment. The incompletesensitivity lists causedAdj Freq and SampleCmdsimply to remain unconnected


Step 3. Rewrite the PLL so it becomes synthesizable.Use the previous discussion in this chapter as a guide. You should rewrite sub-

stantially the ClockComparator, as described, and the VFO.The Sample input port to the PLL should be removed, and a Sample pulse

should be derived in PLLTop by passing ClockIn through a manually-instantiateddelay cell and thence directly to the VFO. You should choose a verilog library delaycell simulation model in the verilog technology library file. Use an embedded syn-thesis script with don’t-touch directives to preserve the delay cell instance(s), andtheir connections, from optimization.

The AdjustFreq code from ClockComparator to VFO should be modifiedso that 2’b00 requests a slowing down, 2’b11 a speedup, and other values nochange; this will add some inertia to the new VFO’s frequency adjustments.

Some things to be alert for are mixed edge and change sensitivity lists, and taskcalls which have to be implemented with change sensitivity, only. The synthesizerwill report this kind of violation very explicitly.

If you find this exercise to be consuming more than an hour or two, considerjust copying the files in the PLL subdirectory from Lab22 Ans to your newLab22/PLL subdirectory.

Your manually instantiated delay cell will have a delay of less than 100 ps; itwould be advisable to set ‘timescale in your .inc file to 1ns/1ps, for accu-rate simulation timing. The higher time resolution will slow the simulation some-what; however, resolution of the exact value of the delay-cell delay is required forgood verification.

Step 4. Verify that the PLL now synthesizes correctly.To do this, just synthesize PLLTop and simulate the resultant netlist by invoking

VCS in Lab22 again. If you have succeeded (see Fig. 18.15), you should find thatthe PLL netlist will generate a clock output about as good as did the verilog sourcedesign.

Fig. 18.15 Simulation of synthesized PLL netlist

Step 5. Organization of the Deserializer block. Our goal in the next fewSteps will be to implement a first-cut Deserializer as shown in Fig. 18.16.


Fig. 18.16 Schematic of our first complete Deserializer

Change to the Lab22 directory. The PLLTop� files there are no longer usefuland should be moved to the PLL subdirectory or deleted. In the Lab22 directory,create a new module named Deserializer in its own file. Prepare an emptytestbench, DeserializerTst, in a separate file.

In the Deserializer module, create empty-port instantiations of the FIFO(FIFO Top), the DesDecoder (of Lab16), and a new serial receiver modulenamed SerialRx.

The resulting file organization should be as shown in Fig. 18.17.

Fig. 18.17 Filereorganization forthe Deserializerimplementation

From the data flow Fig. 18.1 and the schematic Fig. 18.16, Deserializershould have one input named SerialIn, another input named ParOutClk, anda 32-bit output bus from the decoder to the FIFO Top named ParOut. Createthese ports.

Create an output port for the PLL’s decoded incoming clock, which originates atthe DesDecoder instance’s .ParClk pin and goes to the FIFO Top instance’sClocker pin; to avoid confusion of this sending-domain 1 MHz clock with theinput approximately 1 MHz parallel clock from the receiving clock domain, namethe Deserializer connecting wire between these pins, DecoderParClk (“de-coder’s parallel-bus clock”).


In view of the FIFO, we should require two more Deserializer outputs,FIFOFull and FIFOEmpty; these will not be useful to us in this lab, butthey would be important in any system of which our serdes was part. Add theseoutputs.

We should add an external SerValid input in Deserializer, routed to theDesDecoder module, for test purposes. This would flag valid serial data fromthe sender. We also should have some sort of checking to generate a ParValidoutput from Deserializer. Finally, a Reset input should be provided toDeserializer so that the testbench can reset all the submodules.

Prepare to set the depth of the FIFO in Deserializer by declaring aDeserializer header parameter, AWid (= “Address Width” = width of register-file address bus). Assign a default value of 5 (for 25 = 32 bits).

Set the width of the Deserializer output bus by another new header param-eter, DWid (“Data Width”), with default value 32. This parameter then should beused everywhere in the design where the width of the parallel output bus appears.There is no special reason to have a 32-bit output bus as opposed to a 16-bit or64-bit one, so, although we shall not alter the width in this course, our design willbe parameterized to allow DWid to be changed to any power of 2. We could, inprinciple, use a different value for our future Serializer input width, makingthe two ends of our serial lane independent in width. A very, very nice feature ofserial connection . . ..

The DeserializerTst.v testbench file should include Deserial.incfor global timing definitions. Add an include directive for this to the testbench file.

Step 6. The FIFO. Create a subdirectory FIFO in the Lab22 directory, and copyin the complete FIFO design from Lab11/Lab11 Ans; this should consist ofFIFO Top.v, FIFOStateM.v, and Mem1kx32.v. The Mem1kx32 should bethe one with separate input and output ports. Remove the timescale definitions fromall these files.

In FIFO Top, add AWid and DWid parameters as in Deserializer in theprevious Step. The DWid parameter may be used anywhere in the FIFO design todetermine the width of a register or of a parallel bus. Declare these parameters andreplace every width and depth in the FIFO and its submodules (FIFOStateM andMem1kx32) so that these parameters control the values.

Add 1-bit Full and Empty output ports to FIFO Top, too; these will bewired to the corresponding new Deserializer ports of Step 5, as soon as theFIFO can be instantiated. Also, to simplify the FIFO Top wiring, remove the oldE FIFO and F FIFO nets; instead, substitute the new output port names (Fulland Empty) into their continuous assignment expressions and the instance portmaps. Then, delete the three nets, ReadWire, WriteWire, and ResetFIFOand replace them with the net names already used in FIFO Top to assign tothem. If your Mem1kx32 instance is clocked with an inversion operator(“. . ., .ClockIn(∼Clocker),. . .”), remove the inversion operator (‘∼’ or‘!’), so that the memory responds to a rising-edge clock.


The FIFO Top.v file includes a testbench. Parameterize this testbench withAWid and DWid as above. Be sure that the FIFO Top instantiation has .Fulland .Empty output pins, properly mapped in the FIFO testbench module. The test-bench should have a ‘include of Deserial.inc. After these edits, separatethe testbench into its own file, FIFO TopTst.v, for possible future debugging.

You may have other testbenches commented out in Mem1kx32.v and inFIFOStateM.v; these testbenches should be deleted, because, if necessary, theFIFO TopTst testbench may be used to exercise these submodules. If you find abug in one of the FIFO submodules which requires a regression to a single submod-ule, you can recopy a local testbench from the earlier labs.

In Mem1kx32, add a Reset port and code which initializes all memory loca-tions to 0. This will suppress startup parity errors when we use the memory later insimulation.

Finally, simulate with FIFO TopTst briefly to verify that FIFO is complete andthat the include file is referenced correctly.

Step 7. The Deserializer file structure. Create a subdirectory DesDecoderin the Lab22 directory for the deserialization decoder, and another subdirectory,SerialRx, for the serial receiver.

Copy the DesDecoder.v from Lab16 Ans into the DesDecoder subdi-rectory. Remove the ‘timescale from DesDecoder.v, and pass the modulejust one parameter, DWid. Change the name of the output bus from ParBus toParOut, rename the ParRst port to Reset, and delete the testbench, if it stillis present. If they are present in your copy, remove the unused SHIFT and RESETlocalparams.

Then, create a new file, SerialRx.v, in a new Lab22 subdirectory namedSerialRx. This file should contain just the following:

module SerialRx(output SerClk, SerData, input SerLinkIn, ParClk, Reset);

assign SerData = SerLinkIn;//PLLTop PLL RxU1 ( .ClockOut(SerClk), .ClockIn(ParClk), .Reset(Reset) );//endmodule // SerialRx.

This finally shows where we shall instantiate our PLL: In the SerialRxmodule.

After all these changes, the resulting directory structure should put theDesDecoder, FIFO, PLL, and SerialRx blocks all directly under Lab22. Thisis shown in Fig. 18.18.

The Lab22 top level will become the Deserializer directory of our next labexercise.


Fig. 18.18 New Lab22 directory. Files are named near their directory (block)

In the Deserializer module, instantiate the FIFO Top module, and con-nect it with DesDecoder and SerialRx instances as well as possible for now,according to the design dataflow Fig. 18.1. Wire the SerialRx with port names asshown previously.

Because the width and name context of the DesDecoder and SerialRx in-stances strongly constrains their ports, create module templates for submodules ofDeserializer.v first by wiring up the instance port maps, postponing moduledeclarations. This way, you will have all connections immediately visible for yourwiring. The result will look something like this (before wiring):

module Deserializer...// -------------------------------------------------------// Structure://FIFO Top #( .AWid(AWid), .DWid(DWid) )FIFO U1( .Dout(), .Din(), .ReadIn(), .WriteIn(), .Full(), .Empty(), .Clocker(), .Reseter(Reset));

//DesDecoder #( .DWid(DWid) )DesDecoder U1

( .ParOut(), .ParValid(), .ParClk(), .SerClk(), .SerIn(), .SerValid(), .Reset(Reset));

//SerialRxSerialRx U1( .SerClk(), .SerData(), .SerLinkIn(), .ParClk(), .Reset(Reset));

//endmodule // Deserializer


Then, compose the module wiring declarations by including them (illegally)in the instance port maps, using ANSI format. For example, in the code above,you might next change the above FIFO Top .Din() to output[Dwid-1:0].Din(DecodeToFIFO), and the DesDecoder .ParOut() to output[DWid-1:0] .ParOut(DecodeToFIFO).

Do this any way you want, but developing some sort of systematic way to definemodule ports and declarations top-down is an important skill in VLSI design.

In other words, start by declaring modules where instances will be wired. Forexample, to instantiate the FIFO in Deserializer.v, start by cut-and-pastingthis into your Deserializer.v file:

module FIFO Top #(parameter AWid = 5 // FIFO depth = 2��AWid., DWid = 32 // Default width.)

( output[DWid-1:0] Dout(wire[DWid-1:0] FIFO Out), input[DWid-1:0] Din(wire[DWid-1:0] DecodeToFIFO), output Full(wire FIFOFull), Empty(wire FIFOEmpty)...

);

The port map wiring shown is illegal, because it includes width declarationscopied directly from the ports. However, this is just an intermediate step.

Next, copy these combined instance-wirings plus declarations into their separate,proper .v files in the Lab22 subdirectories. After copying, in the submodules theinstance wirings may be removed and the ports renamed or resized as necessary.

Finally, alter the code in the Deserializer module to change the module portdeclarations to simple instance pin-outs, leaving this in your Deserializer.vfile:

FIFO Top #( .AWid(AWid), .DWid(DWid) )FIFO Top U1 // Instance name.( .Dout(FIFO Out) // pin-out (= port map)., .Din(DecodeToFIFO), .Full(FIFOFull), .Empty(F IFOEmpty)

...);

Returning to the submodules, add reset inputs to all submodule declarations, aswell as any useful read, write, full or empty communication with the FIFO instance.Err on the side of adding ports; you always can remove unnecessary ports later.

Be sure to use parameters from the Deserializer declaration in all instancewidth declarations; pass the main submodules only DWid and/or the FIFO depth.

Finally, with only the FIFO and deserialization decoder actually defined, cre-ate a DeserializerTst.vcs file and load your mostly-unimplemented


Deserializer.v structural design into the simulator, just to compile and checkconnectivity. Correct connection problems before continuing. When in doubt, addanother port.

Step 8. The parallel output buffer (ParBuf) in Deserializer. For now, weshall implement the output register in Fig. 18.16 simply by declaring a reg namedParBuf in the top-level module, Deserializer.

Although our PLL generates a 1 MHz parallel-data clock from the serial stream,our packet format requires 64 bits per 32-bit word, so our serial line can deliverparallel-data words only at about 500 kb/s.

So, we shall assume an external 1/2 MHz clock ParOutClk to clock data intothis buffer this way,

always@(posedge ParOutClk, posedge Reset)begin : OutputBufferif (Reset==1’b1)

ParBuf <= ’b0; // To be wired to the ParOut port.else ParBuf <= FIFO Out;end

The ParOutClk may be generated in the testbench module. This code says thatwhen the FIFO is reset, only zeroes can be read from it. We also should provide aParValid flag; this can be set in the same always block with ParBuf.

Step 9. The deserialization decoder. Change to the DesDecoder subdirectory.The contents of your DesDecoder module should be copied from Lab16/Lab16 Ans, with the module I/O changes already made above (in this lab).

Make a second, new copy of the Lab16 Ans/DesDecoder.v file, renamingit to DesDecoderTst.v, and remove the design module; keep the testbench atthe bottom of the file. In the new testbench file, change the timescale to 1ns/1ps,remove the DC compiler directives, and change the parameter references to DWid.

Also, in the new DesDecoderTst.v file, if the serial data to be shiftedto the DesDecoder instance is shifted LSB first, reverse the order, for consis-tency, so that the data MSB goes out first. Remove all delay expressions fromthe testbench, except those in the main initial block and in the serial clockgenerator.

Now, modify DesDecoder.v as follows:

• In case we might want a command from here to the FIFO, add a 1-bit output portnamed WriteFIFO.

• Remove all #delay expressions and rephrase or remove all comments concerningthese delays.

• To put this module better under procedural control, change the runtime so that in-stead of four concurrent always blocks, there are just two: The ClkGen block,which should be sensitive to change in SerClk or Reset, and a new, secondalways block as follows:


always@(negedge SerClock, posedge Reset)beginShift1;Decode4;Unload32;end

The sensitivity to the negative edge of SerClock is to avoid a race with theserial data shift.

• The Unload32 task, as written for Lab16, includes an unnecessary edge-sensitive event control; replace this with a simple if test for ParClk==1’b0.

• The ClkGen task should be cleaned up a little more: Instead of having an if inthe nonReset condition controlling nothing but serial clock gating, put every-thing, including the other if, under control of if (SerValid==YES).

After these changes, make up a DesDecoderTst.vcs file and simulateDesDecoder alone, just to be sure the new DesDecoder testbench connectionsare correct. Because of the complexity of this module, some further debugging maybe useful at the unit level; but, this module should be almost usable without changesother than ones to permit synthesis, which will be made later in the lab.

Step 10. The DeserializerTst testbench. This may be based on the one fromthe DesDecoder of Lab16, although, if so, it will have to be modified to make thecomplete Deserializer work. The main problem is that there can be no serialclock in the Deserializer, whereas there was a testbench serial clock in theDesDecoder of Lab16. In the present lab, the Deserializer has to extractthe parallel clock from the serial input stream, and then use the extracted clock togenerate a serial clock synchronized with that same input serial stream.

So, we shall have to go easy on the Deserializer, at least at first. We shalluse the testbench to feed Deserializer a serial stream at a clock rate only veryslightly different from the base rate of the PLL. This will make extraction of theclock easy enough for us to validate the rest of the design. Later, we can see how farwe can go in providing a serial stream farther off the deserializer’s PLL base clockfrequency.

With all this, the testbench should create a padded serial packet just as it did inLab16; however, let us set the testbench serial clock half-period at 15.6 ns, whichis only slightly off the PLL base rate of 500/32 ns = 15.625 ns = 15 ns after verilogtruncation in integer division. The farther away from 15 ns we go, the more serialdata the Deserializer will have to drop because of loss of synchronization, atleast at this stage in the design.

Use the testbench to generate a separate 1/2 MHz clock to attempt reads from theDeserializer FIFO output register. Don’t worry now about synchronizationof Deserializer output reads with valid data in the ParBuf register. Then,instantiate Deserializer in this testbench and simulate it. It should be possible,


as shown in Fig. 18.19 and 18.20, to obtain some good data for the FIFO, althoughthere will be many losses of synchronization.

Fig. 18.19 Overview of source simulation of first working Deserializer

Fig. 18.20 Close-up of a parallel-bus unload of the first working Deserializer

Step 11. Tying up loose ends. When you have the Deserializer source ver-ilog design more or less working (it should be generating a parallel clock and at leastoccasionally correctly parallelizing the serial data), refine the design as follows:

A. Remove the DesDecoder’s WriteFIFO output; this was completely super-fluous. The serial-parallel input functionality will decide when to write to the FIFO,and it will be up to the external system to stop sending serial data if the FIFO shouldbecome full. Instead, in Deserializer, connect the DesDecoder instance’sParValid output to the FIFO Top instance’s WriteIn port.

B. Simulate. Your FIFO has no read requested, so it should fill up. If you usedyour own implementation, you may have to debug the FIFO to do this.

Run the simulation for enough time to fill the FIFO, and then to empty it at leastonce. To empty it, you will have to issue a read command in the testbench after theFIFO is full: Assert FIFOReadCmd to do this. See Fig. 18.21.


Fig. 18.21 Overview source simulation of a somewhat improved Deserializer

C. Check the simulated transferred parallel data against the Deserializertestbench input for correctness, allowing for dropped serial words, if any. One im-plementation yielded the result in Fig. 18.22.

Fig. 18.22 The first word out of this improved Deserializer was 32’h62ef 6263 on Dinat t = 63,942,550 ns; however, it was copied to the ParWordIn bus, and stored in the FIFO, as32’hb1ef 6263. Clearly, there is more work to be done here

Do not worry now about corner cases, but your FIFO should store data so thatthe value written to at least two different given addresses is the value later read out.Recalling Step 8 above, typical code in Deserializer to read into the paralleloutput buffer would be,

always@(posedge ParOutClk) // 1/2 MHz clock domain of the receiving system.if (Reset==1’b1 || F Empty==1’b1)

beginParBuf <= ’b0; // Zero the output buffer.ParValidReg <= 1’b0; // Flag its contents as invalid.end

else begin // Copy data from the FIFO:ParBuf <= (FIFOReadCmd==1’b1)? FIFO Out : ’b0;// Flag the buffer value validity:ParValidReg <= FIFOReadCmd && (∼FIFOEmpty);end


Two possible problems we shall address later: (a) Were FIFO addresses 0x00and 0x1f written and read? (b) Can the FIFO be read from and written to at thesame time, so that the FIFO doesn’t quickly fill up or go empty?

Step 12. Synthesis. Attempt to synthesize the entire Deserializer. At thispoint, the goal is just to get some kind of netlist written out. You may have to modifycertain parts of the design or conceal them from the synthesizer in ‘ifdef DCblocks. The DesDecoder may require several changes.

Recall that the PLL now is synthesizable, but that the FIFO is not. The FIFOpart of the netlist will be full of missing connections and dangling output pins.

To reduce the time, use a synthesis script which includes no design rule or speedconstraint; just optimize for minimum area.

The resulting hierarchical netlist should total over 35,000 transistor-equivalentsand may seem to be intact; however, it will not be functional, so don’t bother tryingto simulate it.

After this, use the same synthesis script again; but, just before the command torun the compile, insert a command to flatten the design. The synthesizer now will beable to remove unconnected logic across hierarchy boundaries, reducing the netlistsize further.

So, we have a reasonable idea of the gate-size of our Deserializer, and wecan see the extent to which flattening might reduce this size.


How can one constrain synthesis for minimum area vs. minimum delay?What is the difference between synthesis constraints and design rules?Think about local delay or area minima and the multidimensional nature of the

constraints.What is incremental optimization?


(Optional) Review parameter and instantiation features in Thomas and Moorby(2002) sections 5.1 and 5.2 (ignore defparam).


19.1 The Concurrent Deserializer

In this chapter, we’ll improve our deserializer implementation and shall assemble acomplete serial receiver with imperfect (incomplete) functionality.

The main change will be modification of the FIFO to permit concurrent (dual-port) read and write. Fig. 19.1 gives our working-sketch schematic of the top level.

Fig. 19.1 Schematic of the concurrent Deserializer (with FIFO in two clock domains)

To implement this part of our design, we shall modify our RAM to be dual-ported; also, we shall change the FIFO state machine (FIFO controller) so it findsthe difference between number of reads and writes on the current clock cycle andchanges state according to the preponderance of memory usage on that clock. Weread or write on one edge; we determine addresses and state transitions on the other.We’ll also fix various things to make the Deserializer synthesizable – althoughperhaps not entirely functional.



19.1.1 Dual-porting the Memory

This just requires enabling the memory to allow simultaneous (= same clock) readand write; of course, these will be to different addresses, as determined by the FIFOcontroller, not the memory. The FIFO may be in two different clock domains. Weshall call the new memory, “DPMem1kx32”.

To upgrade our memory, we shall do this:

• Wire our chip enable (Mem Enable) in FIFO Top to a constant ‘1’ so that theregister file always is enabled. This is because of our use of the memory, notbecause of the dual-port functionality.

• Install separate read & write address ports for the memory.• Supply separate read & write clock ports for the memory.• Make the memory read and write procedures independent.

19.1.2 Dual-clocking the FIFO State Machine

This is more complicated than dual-porting the memory; it may be summarized thisway:

• Supply independent read & write clocks (shared with the FIFO RAM).• Permit nonexclusive read and write commands to RAM.• Derive a single state-transition FIFO controller clock from the input read and

write clocks. This is because the FIFO read and write ports in general will be indifferent clock domains. Furthermore, it implies that an edge in either domainmight be effective as a state-transition clock.

• Safeguard against more than one read address or more than one write address perstate-clock in a command to the RAM.

• Determine state transitions on each clock solely from the resultant effect of allread and write requests made during that clock cycle.

The rest is explained in the lab instructions.

19.1.3 Upgrading the FIFO for Synthesis

The primary problems with synthesis of the FIFO as modified above for dual-portoperation will be:

• The design still may include delay expressions; these are useless for synthesis.

⇒ We shall remove all delay expressions.

• The incrRead and incrWrite tasks include edge-sensitive event controlsbut are called within a change-sensitive combinational block.

19.2 Concurrent Deserializer II Lab 23 339

⇒ We shall modify these tasks in two steps, as we do the lab: (a) First, we shallprevent these tasks from running more than once per state-machine clock cy-cle, but we shall leave the change vs. edge event control problem alone; (b)then, we shall remove the task event controls and break up the task function-ality to put some of it in an external always block. The final result willsynthesize correctly.

• The state machine clock generator includes an always block with a change-sensitive sensitivity list containing variables not used in the block; this causesdangling inputs to the synthesized (useless) clock-generator logic. The usual syn-thesizer latch-inference problem.

⇒ We shall derive the state machine clock directly from the mutually-independent FIFO read and write clocks.

• The counters are sequential elements in addition to the state register. Thus, thelogic in this particular machine never can be divided neatly into sequential logicfor the state register and combinational logic everywhere else. The combinationalblock also causes incorrect synthesis of latches.

⇒ We shall replace the state-machine combinational block with a sequentialblock clocked on the opposite edge from the state-transition block.

19.1.4 Upgrading the Deserialization Decoder for Synthesis

• We shall simplify the Shift1 always block by removing its temporaryregisters.

• We shall rewrite the ClkGen task as a rising-edge sensitive always block; thiswill prevent latch inference. Also, we shall remove the SerClock gate fromClkGen and put it in a combinational block.

• We shall replace all clocked blocking assignments, which are relicts of the oldDesDecoder tasks, with nonblocking assignments.

• We shall prolong the assertion of ParValid by adding a small counter(ParValidTimer) to the Unload32 always block. The duration ofParValid will be guaranteed to be at least 8 serial-clock cycles.

19.2 Concurrent Deserializer II Lab 23


Lab Procedure

Step 1. Duplicating the Lab22 Deserializer. Start by creating a new sub-directory, Deserializer, in the Lab23 directory. Copy the entire Lab22/Lab22 Ans/Lab22 Ans Step12 contents into the new Deserializer sub-directory.


After the copy, delete the copied synthesized netlists and netlist log files.This Deserializer should include a synthesizable PLL, a single-port RAM

which prevents FIFO simultaneous read and write, and a FIFO which, as it happens,will not synthesize correctly. You may wish to recreate the symlink to the verilogsimulation-model library (LibraryName v2001.v) in the Deserializer direc-tory, if it is not already valid. In Windows, a copy should be made.

Briefly simulate the copied Deserializer, using the copied Lab22 test-bench, to ensure a correct and complete copy. The PLL synchronization will notbe especially good, but the system should work.

Step 2. Deserializer corner cases and FIFO debugging. Change to the newDeserializer/FIFO subdirectory. You should have there the FIFO design, aFIFO testbench, and a simulation .vcs file.

Run the FIFO simulation, modifying it if necessary, so that long intervals ofwrites alternate with long intervals of reads. Verify that corner-case FIFO mem-ory addresses 0x00 and 0x1f both were read and written; don’t bother about thedata values, if any. Do not worry now about transition addresses 0x01 or 0x1e. Ifthe 0x00 and 0x1f addresses were not accessed both for read and write, fix theFIFOStateM module so that they are. Resimulate the FIFO again briefly, modify-ing, if necessary, the testbench so that both read and write are requested at the sametime. Then, exit the simulator.

We shall be changing the FIFO design in this lab. To prepare for a thoroughredesign, start by removing all #delay expressions from all FIFO design mod-ules. Also, to distinguish the new FIFO from the old, rename all the files to re-move the underscores, and change the module names accordingly. The result in theDeserializer/FIFO directory should be,

default.cfgFIFOTop.vFIFOTopTst.vFIFOTopTst.vcsFIFOStateM.vMem1kx32.v

Recall that all address parameters were renamed to AWid as part of the Lab22exercise. The default.cfg file is a VCS side file and is not essential to thedesign.

Step 3. Enabling memory simultaneous read and write. In the Deserializersimulation, you might have seen that write requests for the incoming serial streamwere ignored when the FIFO was full; write requests also were ignored when therealso was a read request from the parallel-bus Deserializer output. If you didnot notice this last, resimulate the FIFO briefly.

The original Mem1kx32 which we adapted to use with our FIFO has only oneaddress bus and thus can not execute a FIFO read and write simultaneously. The


single address made the RAM easier to implement and debug, but now it preventsgood FIFO operation.

It is the FIFO control logic that first of all enforces the precedence so that a readprevents a memory write. To prepare to remove this control, in FIFOTop.v, deletethe declaration and assignment to the Mem Enable net. Wire the ChipEna portof the Mem1kx32 instance in FIFOTop to a 1’b1, which will enable the memorypermanently. There is only one RAM in the FIFO, so we don’t need a chip enablecontrol, anyway.

Although now enabled at the chip level, the memory still can’t perform usefulread and write at the same time because it has only one address port, and because thememory internal logic itself makes read and write mutually exclusive by if-else(in my version of Mem1kx32). Furthermore, in FIFOTop, there is a continuousassignment which prioritizes read address over write address in determining theone possible memory address; this can’t be removed, given our current, one-addressmemory.

Clearly, we have to redesign the memory.In the following, keep in mind that the Deserializer will generate a FIFO

write request whenever ParValidDecode is asserted by the DesDecoder.

Step 4. Dual-porting the FIFO memory. The original module ports are shown inFig. 19.2.

Fig. 19.2 I/O’s of the original,single-port Mem1kx32

Start by renaming the Mem1kx32.v to DPMem1kx32.v (DP = “Dual-Port”).Then, change the module name to match the file name. Modify FIFOTop.v andthe .vcs file accordingly.

Next, edit DPMem1kx32 to add the second address port: Rename the existingaddress input port to AddrR, and add a second port named AddrW. Both shouldhave widths determined by the AWid parameter value.

We will want to clock data in and out with separate clocks, so likewise renamethe current, single clock to ClkR and add a second clock input port named ClkW.See the result in Fig. 19.3.


Fig. 19.3 Mem1kx32 con-verted to a dual-port memory

Internally, rename the old ChipClock to ClockR, and declare a new wirenamed ClockW.

Although we have disabled ChipEna in the FIFO, we still want the controlinside the memory. So, put both clocks under control of the ChipEna input bywiring them to internal nets and muxing them to 1’b0 this way:

assign ClockR = (ChipEna==1’b1)? ClkR : 1’b0;assign ClockW = (ChipEna==1’b1)? ClkW : 1’b0;

Fig. 19.4 ClkR clock gatelogic

The logic for the clock gating by the two preceding continuous assignments isthe same as in the ClkR schematic of Fig. 19.4. Also, keep the gating of Dreadyand DataO by ChipEna.

We still have read and write logically dependent. To remove this dependency,put the read functionality and the write functionality into two separate alwaysblocks as shown below. The continuous assignment gating is enough to control allactivity, so we don’t need a ChipEna either in the read or the write block. Becausememory reads will be clocked externally from the DPMem1kx32 output latches,reconvergence from the internally gated clocks will not be a concern.

There is no special reason to initialize memory in the hardware, but to have fail-safe parity checking, we should generate an error on any bit ‘x’ or ‘z’. This meansinitialization of all memory locations on Reset to prevent false parity errors duringsimulation. Reset should be applied both to the read and the write blocks, but thememory initialization should be included only with the write code. Be sure to avoid,always, assigning to any variable in more than one always block; the risk of a racecondition can not be overstated.


Thus, the separated read and write always blocks should be done this way:

...always@(posedge ClockR, posedge Reset)begin : Readerif (Reset==1’b1)

(init Parityr, Dreadyr, DataOr)else if (Read==1’b1)

(do parity check and read)end // Reader.

//always@(posedge ClockW, posedge Reset)begin : Writerif (Reset==1’b1) // Zero the memory:

for (i=0; i<=MemHi; i=i+1) Storage[i] <= ’b0;else if (Write==1’b1)

(do write)end // Writer.

...

As shown, name the newly separated blocks Reader and Writer. BecauseReader block has to do a parity check, it should initialize the parity error outputto 0, as in the old Mem1kx32 model. Writer probably should use its reset branchto do nothing but zero out the memory storage on Reset. The Writer blockshould not assign ‘z’ when it is inactive (this would be OK for a shared read-writebidirectional bus).

Finally, it’s about time we adopted a new reg naming convention. The syn-thesizer always renames design reg types, if they are preserved, to � reg in thenetlist; so, using our current convention of �Reg just creates �Reg reg names inthe netlist, which is redundant. So, in the DPMem1x32 and all other FIFO modules,change all of our declared reg names which end in �Reg to �r.

After these changes, modify FIFOTop to remove the SM MemAddr declarationand its assignment statement. Connect each of the two new DPMem1x32 addressinput pins directly to its proper state machine output address pin. Also, connect theone available clock to both of the DPMem1kx32 clock input pins. Then, verify thatthe FIFO simulates the same way with the new DPMem1kx32 instance as it didwith the original Mem1kx32.

After this, we can do no more in FIFOTop or DPMem1kx32. We have to modifythe state machine design, which itself includes code assuming single-port memoryoperation.

Step 5. Modifying the FIFO state machine I. This is going to be more complicatedthan the preceding, so we shall do it in two stages.


First, change to the Deserializer directory and change all references fromFIFO Top to the new FIFOTop, and all references from Mem1kx32 toDPMem1kx32. Simulate the Deserializer. You will notice that, during thememory reading, ParValid goes high, indicating a write from the incoming serialline. However, no write occurs, because the FIFO state machine has been designedto avoid “race conditions” by means of an if-else giving every read priority overwrite.

However, we now have separate read and write addressing of the memory, whichshould prevent data corruption. So, let us separate the state machine’s read and writefunctions, allowing them to occur concurrently.

Change to the FIFO subdirectory. In FIFOStateM, rename the Clk port toClkR and add a second clock input port ClkW. After this, in the same module,modify both of the address increment tasks with a latching semaphore which willprotect the address from being changed more than once per clock cycle. For exam-ple, the read-address task should be modified as shown next:

reg LatchR, LatchW; // Reset these to 1’b0 in SM comb. block....task incrRead; // Unsynthesizable!beginif (LatchR==1’b0)

beginLatchR = 1’b1;@(posedge ClkR)ReadCount = ReadCount + 1;

LatchR = 1’b0;end

endendtask

Revise both of the tasks, incrRead and incrWrite, in FIFOStateM to belike the one above.

We have a theoretical issue to address: A state machine can not be guaran-teed deterministic unless it has a unique clock. But, now we have two clocks intoFIFOStateM, so we must derive a single clock to determine the machine’s currentstate. This can be done by means of a simple, change-sensitive always block:

always@(ClkR, ClkW, Reset)if (Reset==1’b1)

StateClock = 1’b0;else StateClock = !StateClock;

This construct is not synthesizable (ClkR and ClkW go nowhere), and we shallrevise it later in the lab. For now, it will do as a quick simulation check; so, declare


a reg named StateClock and add the above to FIFOStateM. Change the statetransition clock name from Clk to StateClock. Also, change the state machinecombinational block’s enable to StateClock==1’b0. In the incrRead andincrWrite tasks, replace each posedge event control with one depending onStateClock.

Next, in the state machine combinational block, in the empty and full statecode, remove all mutual dependence of read and write. Specifically, in the emptystate, there is no reason for the if to test for a read request; in the full state, theif should test only for a read request.

Leave the a empty and a full state code alone for now.In the normal state code, the counters are checked, so state transitions can be

determined correctly on a simultaneous read and write in any order; therefore, in thenormal state, just separate the read and write conditions each into its own if by re-moving all dependence on the other request. Be sure to remove all statements whichdisable the opposite operation; for example, remove “WriteCmdr = 1’b0;”from the memory read statement.

Also, extract all transition logic from the two if’s, and collect it below the sec-ond if. Because the requests both may be processed now on one clock, we have towait for the last one to lapse before deciding on the transition to the next state.

The resulting transition logic should look about like this:

(read & write command logic)// Set the default:NextState = normal;//// Check for a full (R == W+1):tmpCount = WriteCount+1;if (ReadCount==tmpCount)

NextState = a full;//// Check for a empty (W == R+1):tmpCount = ReadCount+1;if (WriteCount==tmpCount)

NextState = a empty;

We now have independent read and write in the normal state. We are not finishedyet, but you may wish to simulate briefly to check for gross errors. With the changesabove, the FIFO register-file corner case addresses 0x00 and 0x1f again shouldbe found to be read and written, as should the neighboring 0x01 and 0x1e.

Step 6. Routing the new read and write clocks. We need two clock inputs forthe FIFO. In FIFOTop.v, rename the Clocker input to ClkR and add anotherinput named ClkW. For the FIFO in the deserializer, ClkR will be for the receiv-ing domain, and ClkW will be the parallel-data clock from the DesDecoder. InFIFOTop, connect the two clocks directly to the DPMem1kx32 and theFIFOStateM instances.


After this, in Deserializer.v, consistent with our FIFO naming convention,rename ParValidReg to ParValidr – in fact, now would be a good time tochange all remaining design names from �Reg to �r.

Also, rename the FIFOReadCmd input port to ReadReq; we want to be sureto differentiate a request for a read (originating external to the FIFO) from a readcommand to the register file (within the FIFO). Modify the DeserializerTsttestbench for the new port name.

In Deserializer, change the FIFOTop port map so that .Clocker is re-named to .ClkW, and a new .ClkR port is mapped to ParOutClk.

After this, simulate Deserializer briefly. The result should be changed, butthe result still should be reasonable, if only intermittently good. The FIFO I/O portsnow match those shown in the schematic of Fig. 19.1, but we haven’t finished yet.

Step 7. Modifying the FIFO state machine II. We should begin by modifying thea empty and a full states so they will work properly for independent read andwrite requests. Looking at the FIFOStateM combinational block, the problem isthat (a) as soon as one operation is triggered, the machine disables the opposite one;(b) the machine prioritizes one operation while mutually-excluding the other; and,(c) the machine does not check the counter values in these states and thus can notdetermine whether a read or write command, or both, has been issued on the currentclock.

So, open FIFOStateM.v and, for a empty and a full, remove the disablingassignment to ReadCmdr in the writing blocks and to WriteCmdr in the readingblocks. The rationale here is the same as previously, for the normal state, but leavethe transition rules alone for the moment.

For a empty, disentangle the transition logic from the memory command logicand collect the transition logic together at the end of the a empty block, so itcan be made to depend on the counter values rather than the memory commanddecision(s).

To ensure correct counter wrap-around, use an AWid-bit wide tempCountreg as shown here:

// In this state, we know W == R+1; ...// Memory command logic:if (WriteReq==1’b1) ...if (ReadReq==1’b1) ...//// Transition logic:// Set default:NextState = a empty;//// Check for change:tempCount = ReadCount+2; // Destination determines wrap-around.if (WriteCount==tempCount)

NextState = normal;else if (WriteCount==ReadCount)

NextState = empty;


Do similarly for the a full logic.At this point, let’s look again at the command logic in the combinational block.

Consider the normal state. In that state, the command logic should be somethinglike this:

// On a write:if (WriteReq==1’b1)

beginWriteCmdr = 1’b1;incrWrite; // Call task, which blocks on posedge StateClock.end

// On a read:if (ReadReq==1’b1)

beginReadCmdr = 1’b1;incrRead; // Call task, which blocks on posedge StateClock.end

(transition logic . . .)

With the code shown above, a write request would call incrWrite, whichwould block further execution until the next positive edge of the clock. This wouldcause loss of a read request on the low phase of the current clock.

A typical simulation solution to prevent this loss would be to put both the readand write requests together in a fork-join block; then, a simultaneous read andwrite both would be processed together, and both would block on the positive edge,which, when it occurs, would unblock the command processing and allow the tran-sition logic to update next state and the address counters for the next positiveedge. The result would be as shown next:

fork // unsynthesizable! But do it for now.// On a write:if (WriteReq==1’b1)

beginWriteCmdr = 1’b1;incrWrite;end

// On a read:if (ReadReq==1’b1)

beginReadCmdr = 1’b1;incrRead;end

join(transition logic . . .)

Make these changes in your code for the a empty, normal, and a full com-mand logic.

In our modified design, we already have made sure that both address-incrementtasks will exit simultaneously on concurrent read and write commands; we did this


by using only one posedge clock in the tasks, StateClock. From a previousStep, your tasks should have been rewritten this way:

task incrRead; // Still incorrectly synthesized (but not done yet).beginif (LatchR==1’b0)

beginLatchR = 1’b1;@(posedge StateClock) // Read must not block longer than write.ReadCount = ReadCount + 1;

LatchR = 1’b0;end

endendtask

The synthesizer doesn’t use fork-join blocks (netlists are concurrent any-way) and rejects them; so, each “fork” and each “join” in the combinationalblock should be surrounded by a DC macro test to prevent the synthesizer fromreading it.

One last refinement: Let’s change the state declarations to sized localparams.And, let’s rename the states from statename to statenameS, a final ‘S’ to makeclear that this is a state identifier and not a common English word. For example,“localparam[2:0] emptyS = 3’b000;”, etc. These name changes can bedone by a quick search-and-replace.

Simulate your Deserializer again. In this simulation, be sure at some pointfirst to disable reads until the FIFO goes full; then, read it long enough so that itgoes empty. The design should be much closer to being entirely correct than whenwe started this lab.

Step 8. Changing the FIFO state machine for correct synthesis. The current designwill synthesize to a netlist, but the netlist still will be incorrect. The new PLL willbe fine, but the FIFO remains a synthesis problem for two reasons:

• The StateClock generator is an oscillator implemented as an always blockwith a change-sensitive event control containing nets not expressed in the block.This risks that the synthesizer will allow these nets to dangle.

• The address-updating tasks include embedded edge-sensitive event controls.Although edge-sensitive, the tasks are not called on that edge; and, so, the em-bedded controls imply complex latches in the state machine combinational logic,which will not be synthesized correctly (as we know).

We shall resolve these problems in this Step.

Step 8A. Testbench for source and netlist. Before anything else, let’s be sure ourFIFO testbench is adequate for unmodified use either with our FIFO source verilogor with a synthesized FIFO netlist.

At the start of the tests, we want pulsed write and read requests, as in previousFIFO testbench versions. After this, we want a run of back-to-back writes whichfills the FIFO, followed by a similar run of reads which forces it empty. Finally, we


want a run of mixed writes and reads which includes at least a few simultaneousread and write requests.

Modify your FIFO testbench to achieve these goals. Use two identical clock gen-erators, one clocking the FIFOTop ClkR pin, and the other ClkW. If time is shortin the lab, consider just copying all or part of the FIFOTopTst.v testbench inLab23 Ans/Lab23 AnsStep08A. Don’t bother about details of design func-tionality at this point.

Step 8B. Synthesizable StateClock generator. In FIFOStateM, we can re-place the always block with one which is sensitive only to the two input clocks;this will be synthesizable. For example,

reg StateClockRaw;wire StateClock; // See code example below.//always@(ClkR, ClkW)StateClockRaw = !(ClkR && ClkW);

An xor (ˆ) would be more symmetrical than the and (&&), but it would not workif ClkR and ClkWwere closely in phase; so, we use a nand expression. An and alsowould be fine, but in CMOS technology, nand usually is simpler and faster than and.

The clock is named StateClockRaw, and not StateClock, for the follow-ing reason:

A possible new problem is that there is no HDL delay here any more, andtherefore no more inertial-delay filtering. This means that as the read and writeclocks drift in phase independently, the simulator will be allowed to produceStateClockRaw glitches of arbitrarily short duration – and there will be no clearrelationship to the performance in the synthesized netlist. This actually has been asimulation problem here, all along, ever since we deleted all the delay expressionsin previous lab Steps.

The solution is to connect the StateClockRaw to a library component delaycell which will filter out the narrowest glitches; the delay cell then can be used todrive the StateClock. The same glitch filtering then will occur during simulationboth in the source and in the synthesized netlist. For a delay cell, we might as welluse the same library cell as we used in Lab22 for the synthesizable PLL.

We can preserve the delay cell from removal during synthesizer optimization bymeans of an embedded TcL script near the verilog instantiation. The added codemay be written this way:

// Glitch filter:DEL005 SM DeGlitcher1 ( .Z(StateClock), .I(StateClockRaw) );////synopsys dc tcl script begin// set dont touch SM DeGlitcher1// set dont touch StateClock∗

//synopsys dc tcl script end


After commenting out the StateClock generator of previous Steps and de-riving it from the read and write clocks as above, you will have to add a verilogsimulation model of the delay cell to your .vcs file list to simulate the FIFO. Dothis by referencing the (modified) library file, LibraryName v2001.v, copied orlinked in your Lab23 directory. Precede the file name with -v in your .vcs file,so that the simulator will search for and compile the one model, instead of compilingthe whole library.

Don’t bother synthesizing yet; but, try simulating the FIFO briefly. You may haveto modify the testbench a little to get the FIFO to run all the way to full and to emptyat least once during the simulation. Don’t worry about simulation details; we shallbe changing the FIFO, and thus perturbing the functionality, again.

Step 8C. Synthesizable FIFO address-generating tasks. We have to reexamine andchange every state in our FIFO state machine combinational block.

We know that the event controls inside the old tasks will not synthesize correctly.So, let us separate the event controls from the task logic into new always blocks.Instead of embedding an edge-sensitive event control in each task, we can use in-dependently StateClock’ed always blocks, clocked on the positive edge, toupdate the address counters. An edge-sensitive always block can update a counterconditionally and be synthesized correctly.

With counter updates isolated this way, the tasks can be rewritten fully change-sensitive, so that they might synthesize correctly in the state machine combinationalblock.

Our new tasks (shown below) will latch address state and issue read or writecommands to the FIFO register file; the always blocks synchronously will changethe addresses used by the register file.

With this implementation, the new always blocks also should be made to per-form the fullS and emptyS address initializations on reset. Thus, all manipula-tion of register-file addresses can be encapsulated in the new posedge alwaysblock logic.

Now let’s get down to implementation details:Let’s adopt the convention that a task called with an input argument of 1’b1 will

request an address increment, unless one is scheduled already on the current clock.A task called with a 1’b0 input argument will deassert the command controlled bythat task. One way to write the required always blocks and their tasks would bethis way:

First, we declare the following reg’s for use by the tasks and by the next-statelogic:

reg[AWid-1:0] ReadAr, WriteAr // Address counter regs., OldReadAr, OldWriteAr // Saved posedge values., tempAr; // For address-wrap compares.

The ReadAr and WriteAr also should be used on the right sides of theusual continuous assignments to the corresponding FIFOStateM output ports,ReadAddr and WriteAddr. The Old∗ are to retain state across calls, becauseof loss of the sequencing capability we had with the discarded @ edge-sensitiveevent controls.


After this, we may declare each new always block and its respective task. Thetasks generally have to treat FIFO empty and full states specially.

For a register-file read, the always block:

always@(posedge StateClock, posedge Reset)begin : IncrReadBlockif (Reset==1’b1)

ReadAr <= ’b0;else begin

if (CurState==emptyS)ReadAr <= ’b0;

else if (ReadCmdr==1’b1)ReadAr <= ReadAr + 1;

endend

For a register-file read, the new task:

task incrRead(input ActionR);beginif (ActionR==1’b1)

beginif (CurState==emptyS)

beginReadCmdr = 1’b0;OldReadAr = ’b0;end

else beginif (OldReadAr==ReadAr) // Schedule an incr.

ReadCmdr = 1’b1;else begin // No incr; already changed:

ReadCmdr = 1’b0;OldReadAr = ReadAr;end

endend

else begin // ActionR is a reset:ReadCmdr = 1’b0;OldReadAr = ’b0;end

endendtask

If the FIFO is empty, its first value can be written anywhere, so the emptySstarting value of ReadAr in principle need not be specified. However, the counterarithmetic we have adopted for state transitions requires that both counters start froma known value in the empty state. Therefore, the transition to emptyS must initial-ize both pointers to the same value, which might as well be 0. The write pointer mustbe initialized in its own always block, if we want to conform with the synthesisrequirement that no reg may be assigned in more than one always block.


When the FIFO is full, its data occupies all storage locations, and the address ofthe first datum to be read is predetermined. Therefore, the writing always block,the only one allowed to modify the value of WriteAr, must be the one to initializeWriteAr to bring it from its undefined state to its one, correct value on the firstread in fullS. That value equals the value of ReadAr, which, of course, will beincremented on the next posedge of the StateClock.

So, for a register-file write, the new always block:

always@(posedge StateClock, posedge Reset)begin : IncrWriteBlockif (Reset==1’b1)

WriteAr <= ’b0;else begin

case (CurState)emptyS: WriteAr <= ’b0; // Set equal to read addr.fullS: WriteAr <= ReadAr; // Set equal to first valid addr.endcaseif (CurState!=fullS && WriteCmdr==1’b1)WriteAr <= WriteAr + 1;

endend

And, finally, for a register-file write, the new task:

task incrWrite(input ActionW);beginif (ActionW==1’b1)

beginif (CurState==fullS)

beginWriteCmdr = 1’b0;OldWriteAr = ReadAr;end

else beginif (OldWriteAr==WriteAr) // Schedule an incr.

WriteCmdr = 1’b1;else begin // No incr; already changed:

WriteCmdr = 1’b0;OldWriteAr = WriteAr;end

endend

else begin // ActionW is a reset.WriteCmdr = 1’b0;OldWriteAr = ’b0;end

endendtask


All task calls in the FIFOStateM combinational block now have to be modifiedto pass a 1’b1 or 1’b0. A call to both of incrRead and incrWrite shouldbe added to the reset block in emptyS and each passed 1’b0; all other task callsshould be passed 1’b1. LatchR and LatchW should be removed from the design.

All reference in the combinational block to ReadCount, WriteCount, ortmpCount should be replaced by ReadAr, WriteAr, or tempAr, which latteralso should be used in the continuous assignments to FIFOStateM output ports;none of these new identifiers should appear in the combinational block except intransition logic expressions.

We still have changes to make in the combinational block. First, why not deleteall the fork-joins? Consider them a temporary simulation kludge, and removethem all, with their ‘DC controls.

Because we are eliminating fork-joins, we have to simulate a simultaneousread and write request somehow. The individual task calls would be enough; but, wealso have to deassert read and write commands on clocks not requesting either. Wecan do all this explicitly by using a case statement in each branch of the combina-tional block, as shown below.

Our new tasks don’t block on simulation-scheduled events any more, so we cancall them individually if there is just one of read or write requested, or we can callthem both (procedurally). Here is how to do this in any of a emptyS, a fullS,or normalS:

...case ({ReadReq,WriteReq})2’b01: begin

incrWrite(1’b1); // On a write.ReadCmdr = 1’b0;end

2’b10: beginincrRead(1’b1); // On a read.WriteCmdr = 1’b0;end

2’b11: // On a read and write:beginincrRead(1’b1);incrWrite(1’b1);end

default: begin // No request pending:ReadCmdr = 1’b0;WriteCmdr = 1’b0;end

endcase...

In emptyS or fullS, there is only one operation possible, so we call justthe one task for those; but, we also should deassert the command when it is notrequested.


Finally, we have to break the rules to get the FIFO to synthesize. Normally, averilog state machine is designed with a clocked block to determine state transitionsand a combinational block to operate the transition logic and the rest of the machine.This usually works best. But, a FIFO is a strange animal.

Notice that we already have to use clocked logic to update the address counters,and that this is inconsistent with everything but state register updates in a purelycombinational block. Attempting synthesis with only the preceding changes stillwill result in pathological latch synthesis and incorrect functionality. We must pre-vent latches; we can do this by clocking what up to now was our “combinational”block.

To avoid race conditions, modify the combinational block to be sensitive to thenegedge of StateClock. This, incidentally, will allow us to associate an asyn-chronous reset to the erstwhile combinational block. Your new “combinational”block now should resemble something like this:

...always@(negedge StateClock, posedge Reset)

beginif (Reset==1’b1)

begin// Reset conditions:NextState = emptyS;incrRead(1’b0); // 0 -> reset counter.incrWrite(1’b0); // 0 -> reset counter.end

elsecase (CurState)emptyS:// (The other previously combinational logic goes here)...endcase

...

After these changes, simulate the FIFO again until it works well with your FIFOtestbench.

Optional: If time permits, synthesize a netlist from this design and sim-ulate the resulting netlist. For correct synthesis, some setup and holdproblems will have to be prevented; to see how to do this, examine theFIFOTop.sct file provided on the CD-ROM in Lab23/Lab23 Ans/Lab23 AnsStep08C/Deserializer/FIFO. This synthesis will takesome time, perhaps a half-hour.

To reduce the time, let the synthesizer run until it has been in the “Area Re-covery Phase” for about 5 minutes; then, press {control-C} once and use thetext-based menu (which will appear after a little while) to abort optimization.The synthesis script will run to completion, leaving you with a suboptimal butfunctionally correct netlist.


The FIFO netlist should simulate more or less as well as the source FIFO. Theremay remain timing issues, but don’t bother perfecting your FIFO testbench now; goon to the rest of the lab.

Step 8D. (Optional) Verifying that the Deserializer from Step 8C does notyet synthesize correctly. If time permits, try the synthesis described in this Step;otherwise, just read the description here of the result.

Without any attempt to simulate the Deserializer from the top, change tothe Lab23/Deserializer directory, run the synthesizer using the Lab23/Deserializer.sct script provided, and examine the waveforms resulting froman attempt to simulate the resulting netlist.

The netlist simulation will fail. You should be able to see that ParValid doesnot control FIFO writes correctly. The parallel-bus clock from the sending domainconsists of approximately 160-ps wide pulses, repeated at about 2 ns intervals.The problem therefore is in the DesDecoder, which generates this parallel clock.Even though the main DesDecoder synthesis problem, edge-triggering task calls,was fixed, and the source verilog simulated correctly, the synthesized netlist is notcorrect.

Step 9. Synthesizing the DesDecoder. Change to the DesDecoder subdirec-tory and read through the source code in DesDecoder.v. The ClkGen task defi-nitely is a problem: It is called in a selectively change-sensitive always block andthus implies nonstandard latching. As usual, such a latch is too complex to expectthe synthesizer to be able correctly to fulfill the design intent.

We have to rewrite the parallel-clock generator. We can start by cleaning up therest of the DesDecodermodule: Rename all ∗Reg to ∗r, if not done already; then,remove any remaining commented-out task remnants.

The clock-generator problem can be solved by rewriting the ClkGen task as analways block named ClkGen and sensitive to the edge of SerClock oppositethat of the other always blocks in this module; this means ClkGen should besensitive to the positive edge. For example,


always@(posedge SerClock, posedge Reset)begin : ClkGenif (Reset==1’b1) // Respond to external reset.

beginParClkr <= 1’b0;Count32 <= ’b0;end

else begin // If not a reset:if (SerValid==YES)begin// Resynchronize this one:if (doParSync==YES)

beginParClkr <= 1’b0; // Put low immediately.Count32 <= ’b0;end

else beginCount32 <= Count32 + 1;if (Count32==5’h0) ParClkr <= ˜ParClkr;end

end // if SerValid.end // not a reset.

end // ClkGen.

Edge sensitivity will eliminate complex latches, but further improvement is re-quired. Start by removing the SerClk gate from the ClkGen block and putting itin a conditional continuous assignment,

assign SerClock = (SerValid==YES)? SerClk : 1’b0;

After this, if the Shift1 still uses a temporary shift register, make it more com-pact by changing it to shift the FrameSR directly. The following should work well:

always@(negedge SerClock, posedge Reset)begin : Shift1// Respond to external reset:if (Reset==YES)

FrameSR <= ’b0;else begin

FrameSR <= FrameSR<<1;FrameSR[0] <= SerIn;end

end

Let’s get a little fancy here, like someone designing configurable IP. In thesimulation, you may have noticed the narrow, glitch-like pulses produced on theParValid output. This net should be asserted for longer durations, severalSerClock cycles, at least.


We cannot effectively program a pulse digitally, but, we can introduce a small,fast SerClock cycle-counting timer which can deassert ParValid after any con-venient count. Add this to the beginning of the DesDecoder module:

localparam ParValidMinCnt = 8; // Minimum number of SerClocks// to hold ParValid asserted.

localparam ParValidTWid = // Width of ParValidTimer reg.( 2 > ParValidMinCnt )? 1

: ( (1<<2) > ParValidMinCnt )? 2: ( (1<<3) > ParValidMinCnt )? 3: ( (1<<4) > ParValidMinCnt )? 4: ( (1<<5) > ParValidMinCnt )? 5: ( (1<<6) > ParValidMinCnt )? 6: 7; // Thus, width is declared automatically.

//reg[ParValidTWid-1:0] ParValidTimer;// ...

This approach will make it possible to adjust the width of the ParValid asser-tion, should system considerations require it.

Then, the Unload32 block may be rewritten along these lines:

always@(negedge SerClock, posedge Reset)begin : Unload32if (Reset==YES)

beginParValidr <= NO; // Lower the flag.ParOutr <= ’b0; // Zero the output.ParValidTimer <= ’b0;end

else beginif (UnLoad==YES)

beginParOutr <= Decoder; // Move the data.ParValidr <= YES; // Set the flag.ParValidTimer <= ’b0;end

else beginif (ParValidTimer<ParValidMinCnt)

ParValidTimer <= ParValidTimer + 1;if (ParValidTimer==ParValidMinCnt && ParClk==1’b0)

ParValidr <= NO; // Terminates assertion.end

end // UnloadParData.end

After simulating the rewritten DesDecoder, synthesize it and verify that thenetlist simulates the same way as did the verilog source. Do not synthesize or sim-ulate the entire Deserializer yet.


Step 10. (Optional) Synthesizing and simulating the Deserializer netlist. Wecomplete this lab by demonstrating that the Deserializer from Step 9 now canbe synthesized to a netlist which functions correctly at least occasionally.

Under proper constraints, with a 32-word FIFO, the Deserializer will syn-thesize as-is in about 15 hours to some 75,000 transistor-equivalents. However, thisnetlist probably will not simulate correctly. If you wish to run this synthesis, there isa .sct file available in the answer directory for your reference. It might be a goodidea to start this synthesis just before leaving for the day.

A preliminary synthesis already has been done for you at this stage, producing abad netlist in about an hour (ten or fifteen minutes with incorrect constraints – seethe Step 8C box for a trick to get a quick and dirty netlist in 5 minutes or so). Theresult is in a BadNetlist subdirectory in your answer directory for Step 10.

There probably will be some debugging to do before the Deserializer willsimulate reasonably even before synthesis. The main simulation problem probablywill be synchronization of the serial clocks: The frequencies must stay within about1/32 (close to 3%) before a complete 64-bit arriving packet will be aligned correctlyand decoded; therefore, the testbench sending-domain serial clock should be setfairly close to the free-running Deserializer PLL base frequency. I have useda half-period delay of 15.5 ns successfully.Some things to consider:

• The DesDecoder synthesizable parallel clock generator now is called onjust one edge, instead of on any change; therefore, it will run at an averageof half of the previous speed. The receiving-domain clock frequency must bechecked.

• The DesDecoder may call doParSync too often, thus perhaps causing theextracted parallel clock to fail because of frequent erroneous resynchronizationon zero data rather than on PAD0.

• The PLL ‘VFO MaxDelta possibly should be different for source simulationthan for synthesis of a correctly-simulating netlist.

• The VFO operating limits in VFO.v probably should be defined simply as‘DivideFactor ± ‘VFO MaxDelta, if not already so

• The number of delay stages (‘NumElems) in VFO should be set properly forcorrect source and netlist simulation. I have found 5 to be a good value here.

• In the top-level testbench, the total duration of the simulation and the sending-domain serial clock speed also may have to be tuned.

It is not important to have the Deserializer completed in this lab; there willbe time for tuning later in the course. However, all the major submodules shouldsimulate well by now on the unit level (PLL, DesDecoder, and FIFO), both asverilog source and as synthesized verilog netlist.

Success with the entire Deserializer system at this point would be indicatedby just a couple of sporadic unloads of data from the FIFO onto the output parallelbus. Playing with the DesDecoder or the PLL probably will make more differ-ence here than anything else. An example of the waveforms is given in Figs. 19.5,19.6 & 19.7.


Fig. 19.5 First successful simulation result for synthesized Lab23 Deserializer netlist (notSDF back-annotated). The 32-word FIFO clearly goes from empty to a normal state; and, uponreceiving-domain ReadCmdStim, the stored data are read out onto the parallel bus

Fig. 19.6 Closeup of the stored data which are being read out from the Deserializer FIFO

Fig. 19.7 Closeup of the arriving serial data during netlist simulation



Think about the possible kinds of verilog debugging technique.


Optional; Review fork-join functionality in Thomas and Moorby (2002)section 4.9.


20.1 The Serializer and The SerDes

Figure 20.1 is an update of the SerDes block diagram which was presented at thestart of Week 9 Class 2. Notice the slight change in the location of the clock divide-by-two, which now is in the Serialization Encoder:

Fig. 20.1 Current status of the SerDes project. Hatched areas have been implemented

Although the serializer looks almost as complex as the deserializer, it is muchsimpler. For one thing, there is a well-defined 1 MHz clock with no need for extrac-tion from a stream. The serializer PLL only has to do a phase-locked 32x frequencymultiplication to provide the serial output clock. Also, the serializer’s FIFO is in asingle clock domain.



The Serializer will use exactly the same PLL and FIFO modules as theDeserializer; however, we shall have to create new SerEncoder andSerialTx modules for the sending functionality.

20.1.1 The SerEncoder Module

This module, implementing the Serialization Encoder, reads from the serializer’sFIFO to create serial packets.

It shifts out each packet serially to the SerialTx.

20.1.2 The SerialTx Module

This module, implementing the Serial Transmitter, is what transmits the data on theserial line.

Because it contains the sending-side PLL, it is used to clock the SerEncodershift register.

20.1.3 The SerDes

With a working Serializer, assembling the complete serdes is trivial: Onemerely needs instantiate a Serializer and a Deserializer in a containingSerDesmodule, connect them with a wire, and supply a testbench for the SerDes.This lab will complete our class project, although we shall use the design later whenwe study design-for-test.

20.2 SerDes Lab 24


Lab Procedure

Step 1. Reuse the previous design. Start by creating a new SerDes subdirectoryin the Lab24 directory with a Deserializer subdirectory under it.

Make a new, complete copy of the provided solution from Lab23/Lab23Ans/Lab23 AnsStep10/Deserializer into the new Lab24/SerDes/Deserializer directory. If you wish to debug your own Lab23 work now, youmay use your own files, but the details of the instructions below assume you areusing the provided Lab23 answers.

20.2 SerDes Lab 24 363

There is no reason to copy over the Lab23 unit-test or synthesis files at this time;if useful, they may be copied from the Lab23 subdirectories later.

Step 2. Set up the SerDes files. Because the PLL and FIFO designs will be thesame throughout the serdes (but different instances may be parameterized differ-ently), some reorganization is appropriate.

The directory structure should be changed as shown in Fig. 20.2.

Fig. 20.2 File reorganization for Lab 24

Right next to the Deserializer subdirectory, create a new Serializersubdirectory in SerDes. Also, move the deserializer PLL and FIFO subdirectoriesup one level, so that Serializer, Deserializer, FIFO, and PLL all are inthe same SerDes directory.

Move Deserial.inc into the SerDes directory, renaming it SerDes.inc.Make up a simulator file list in the SerDes directory; name it SerDes.vcs,

and verify the new arrangement by using SerDes.vcs to load the new designhierarchy into the simulator.

Step 3. Prepare the Serializer module. Under Serializer, create a newSerialTx subdirectory, and install a SerialTx.v file there, with an emptySerialTx module, instantiating the PLL. Use the header from SerialRx.vtemporarily, if you want.

Fig. 20.3 Overall schematic of the completed Serializer. Not all Resets shown


Likewise create a SerEncoder subdirectory in Serializer; it should con-tain a SerEncoder module in SerEncoder.v. Copying the DesDecoderheader into SerEncoder.v may be a good way to reuse your previous work.A schematic of the design is provided in Fig. 20.3.

The serial packet format will use the same framing on both ends of our SerDes,so, copy the pad localparam assignments and other shared information to anew include file, SerDesFormats.inc, located in the SerDes directory. InLab23, these were in the DesDecoder module. Copy the comments explain-ing the format, too, in case debugging may become necessary. However, keep inmind that the specific format of the packets and their padding is parsed in detailfrom the DesDecoder verilog, and is formed by detailed verilog coding in theSerEncoder, so this new include file has no portability function. The file con-tents other than comments should be the following:

// SerDesFormats.inc:localparam[0:0] YES = 1’b1; // For general readability.localparam[0:0] NO = 1’b0;localparam[7:0] PAD3 = 8’b000_11_000;localparam[7:0] PAD2 = 8’b000_10_000;localparam[7:0] PAD1 = 8’b000_01_000;localparam[7:0] PAD0 = 8’b000_00_000;

The value of factoring out the pad format this way is that if the pad format shouldbe changed in the future, it will be changed the same way both for encoding anddecoding, thus avoiding possible maintenance problems. However, any pad formatchange will require detailed rewriting in both the Ser and the Des.

In the Serializer directory, declare a Serializer module in a file namedSerializer.v, and instantiate in it FIFOTop, SerEncoder, and SerialTx.

Use the corresponding Deserializer modules as guides; you probably won’thave to add new ports or connections, although names and directions will change.Be sure to retain parameters to size the FIFO and the address and data ports; theDeserializer parameters should be used directly for this. Use the same param-eter names and defaults for the new serializer as you did for the Deserializer;the values always can be overridden differently, if desired, during instantiation.

Your Serializer ports should be:

Outputs: SerOut, SerValid, FIFOEmpty, FIFOFull, SerClk.Inputs: ParIn (32 bits), InParValid, ParInClk, SendSerial (the

request to send), Reset.

In the Serializer, you will want a control input to assert a request to read the(parallel) FIFO output into the serial encoder. Recall that in the Deserializer,you had instead a control input to read the FIFO output into the receiving system.By controlling FIFO read at both ends, the FIFO is most properly used as a bufferduring continuous communication. Of course, on a FIFO-full or FIFO-empty con-dition, write may have to be controlled, too; but, we shall leave these problems tothe protocol of the system containing our serdes.


The Serializer should compose a FIFO write command to move new datainto the FIFO from the parallel bus. Thus, the Serializer controls FIFO readand write requests, in addition to routing the 1-MHz parallel-data input clock(ParInClk) to the FIFO.

The Serializer also should provide a FIFO-valid control (F Valid) for theParValid input pin of the SerEncoder, to flag usable parallel data from theFIFO; this can be done very simply by a continuous assignment,

assign F Valid = !F Empty && !Reset;

The other controls just described may be provided this way:

assign F_ReadReq = !F_Empty && SerEncReadReq && SendSerial;assign F_WriteReq = !F_Full && InParValid;

Here, F∗ refer to FIFO pins, SerEncReadReq is FIFO ReadReq from theSerEncoder, and InParValid and SendSerial are Serializer inputports.

After a first cut at the Serializer, create a do-nothing testbench in a separatefile named SerializerTst.v in the Serializer directory. Use it to simulatebriefly to check your connections and file locations.

Also, create empty placeholder files named SerDes.v and SerDesTst.v inthe SerDes directory.

The resulting file locations should be as shown in Fig. 20.4.

Fig. 20.4 File locations forthe Serializer

Step 4. Complete the serial transmitter. This module, SerialTx, should be verysimple, like the SerialRx module of the Deserializer. All it need contain is(a) a simple continuous assignment statement passing the serial data through frominput port to output port and (b) a properly connected PLL instance.

Step 5. Complete the serialization encoder. Refer to Fig. 20.3 above, which givesa Serializer block diagram, and to the schematic Fig. 20.5, which gives connectivitydetails.

We shall design this module for synthesis.The SerEncoder module has to use the serial and parallel clocks to frame the

data and transfer it serially to the SerialTxmodule. The clocks are independent ofthis module’s functionality and are guaranteed by the SerialTx’s PLL submoduleto be phase-locked, so there is no need to extract phase or frequency information.


However, the serialization has to be buffered so that it is not interrupted unless theFIFO, which provides the parallel-side input data, has become empty.

Recall that the 2:1 divided clock on the deserializer end of the serdes (Lab22)could be provided by the Deserializer testbench; this was because the Des wasoperating at 1 MHz and was providing correct parallel data all by itself; it was up tothe receiving domain to use the latched ParOut data properly. This is not feasibleon the Ser end, because the Ser not only has to operate at 1 MHz, too, but it alsohas to determine the rate at which parallel data will be copied in and serialized. Thismeans that a 1/2 MHz derived clock has to be part of the Serializer design; weshall derive it in the SerEncoder.

We shall obtain the 1 MHz clock (ParClk) provided to the Serializerfrom an external source (initially, your SerializerTst verilog testbench). TheSerEncoder therefore must contain a block doing a simple frequency-halving ofthe ParClk input and using it to request FIFO reads.

Fig. 20.5 TheSerEncoder. Dotted out-lines indicate blocks of code,not hierarchy

In addition to this, we shall define three other functional blocks within theSerEncoder module, as shown in Fig. 20.5:

• Loader block: Read on every posedge of the half-speed ParClk(HalfParClk), this always block will do two things:

1. Load the SerEncoder’s 32-bit buffer with a new word from the FIFO, orwith all 0’s if the FIFO was empty.

2. Assert or deassert SerValid, depending on whether the FIFO was empty.

• Shifter block. This always block will be read on every posedge ofSerClk, and exclusively will control the serial data values.Shifter simply lets its counter wrap at 64 and uses the count on each clock

to determine (mux) whether a PAD bit, or a data bit from SerEncoder’s inputbuffer, is applied to the serial line out.

• FIFO reader block. This block asserts a FIFO read request on every otherParClk to make available a new 32-bit data word for framing. It depends uponthe 1/2 MHz clock, HalfParClk, derived from ParClk. There is no reasonnot to implement this block as simple combinational logic, for example as,


assign FIFO ReadReq= HalfParClkr && !F Empty && ParValid && !Reset;

The complete block diagram of SerEncoder is shown in Fig. 20.5.

Step 6. Use SerializerTst to simulate the Serializer until it operatescorrectly. One thing to check is that the FIFO should stop accepting writes whenit is full. Also, parallel data should be flagged as invalid when the FIFO is empty.When the FIFO goes empty, the serial data should be flagged invalid when the lastvalid packet has been shifted out.

Typical Serializer simulation results are shown in Figs. 20.6 through 20.8.

Fig. 20.6 Overview of the first good Serializer verilog source simulation. The FIFO clearlygoes full and empty under reasonable conditions

Fig. 20.7 Zoom in on first good Serializer source simulation, showing proper handling of theindividual parallel-bus words


Fig. 20.8 Very high zoom in on the Serializer source simulation, showing individual serial-line clock cycles

Step 7. SerDes simulation. Connect the Serializer serial output to theDeserializer serial input by instantiating both in a new SerDes module. Youprobably will have to rename some wires, for example to make collateral outputsfrom both halves distinct.

Simulate, using at first your SerializerTst testbench (renamed toSerDesTst). Any serial transfer at all should be considered a complete successat this point. Depending on your FIFO state machine design, you may have to asserta reset twice to initialize everything at the beginning of the run.

Some typical SerDes simulation results are shown in Figs. 20.9 and 20.10.

Fig. 20.9 First complete SerDes verilog source simulation. The sending and receiving FIFOsseem to be operating reasonably


Fig. 20.10 Closer zoom in on the first complete SerDes verilog source simulation, showingindividual parallel-bus words in the receiving domain

Optional: After your SerDes is working, modify the Deserializer to havetwo different parallel data outputs, one clocked out as previously, on the receiving-domain clock, and a new one clocked out on the deserializing PLL embedded clock(ParClk), which is in a different domain.

Then, simulate. Notice the skew in the appearance of the 32-bit, clocked-outdata. This is an illustration of the digital side of the clock-domain synchronizationproblem. Of course, a digital simulator can not display the more serious analogueproblem of intermediate-value hangups which we discussed in Week 8 Class 1.

Step 8. SerDes unit-level synthesis. A synthesis of the full SerDes system willtake too long for classroom scheduling at this point, because system timing refine-ments may require several full SerDes synthesis iterations to get the completedesign working.

For this lab session, just verify that each major component of the SerDes can becompleted on the unit level: Synthesize, under reasonable constraints, the following,each in its own, preexisting subdirectory:

• PLL• FIFO• Deserializer.DesDecoder• Serializer.SerEncoder

The Deserializer.SerialRx and the Serializer.SerialTx are es-sentially empty wrappers, containing just a PLL instance, so there is no reason tosynthesize them now: Just do PLLTop alone.

You may wish to look at the synthesis scripts in the answer subdirectories. Al-though the DesDecoder is a relatively small design, it requires tight setup and


hold constraints for a perfect netlist, so the synthesizer will take hours (adjusting forthe design rules) to finish. Read the comments in the answer script provided to seehow to sidestep this long wait.

Likewise, the FIFO, a large design for a single module, will take a very longtime. Check the comments in the FIFO answer synthesis script for advice on ob-taining a usable netlist early in the run.

After synthesis of the individual units listed above, use the same simulation test-bench as you did for the source, to simulate each synthesized netlist. You shouldconsider reusing simulation testbenches from previous labs as starting points. Theunit-level netlists should simulate almost as well as, or better than, the source verilogfor them.

Optional: During each unit synthesis, write out an SDF file; then, use the origi-nal TSMC verilog simulation core library (tcbn90ghp.v, not the timing-kludged∗ v2001.v used so far everywhere) to simulate the netlist using the best possi-ble wireload-model delays. The delays won’t make much difference, but the TSMCtiming checks may indicate setup or hold problems requiring attention in a designintended for fabrication.

You may leave the synthesized netlists and side files in the individual subdirec-tories; they will not be used in higher-level simulation or synthesis file lists.

Figures 20.11 through 20.19 show some typical netlist simulation results, usingthe approximated timing in the provided verilog-2001 library. SDF back-annotationmakes for a slight difference in the simulation waveforms.

The PLL:

Fig. 20.11 Overall view of synthesized PLL netlist simulation

Fig. 20.12 Zoomed-in view of synthesized PLL netlist simulation, showing individual serial clockcycles


The FIFO:

Fig. 20.13 Overall view of synthesized FIFO netlist simulation

Fig. 20.14 Zoomed-in view of synthesized FIFO netlist simulation, showing individual 32-bitword transfers

The DesDecoder:

Fig. 20.15 Deserialization decoder (DesDecoder) overall view of synthesized netlist simulation


Fig. 20.16 DesDecoder netlist simulation, zoomed in to resolve individual parallel-bus words

Fig. 20.17 DesDecoder netlist simulation, zoomed in to resolve individual serial clock cycles

The SerEncoder:

Fig. 20.18 The SerEncoder: Overall view of synthesized netlist simulation


Fig. 20.19 Zoomed-in view of SerEncoder netlist simulation, showing individual serial clockcycles


What might be possible incompatibilities of the Ser vs. Des devices?What about assertions and timing checks for the serdes?


Optional: Compare your shift register with the switch-level MOS shift register inThomas and Moorby (2002) section 10.1. Why would you want to write a switch-level model if a logic synthesizer was available?


21.1 Design for Test (DFT)

21.1.1 Design for Test Introduction

Design for Test (DFT) is more an orientation, or perhaps a methodology, than adesign technique. DFT means that the hardware will be testable, which is to say, itsfunctionality will be verifiable, after implementation. Testability usually is plannedon the assumption of two major sources of malfunction, design errors and hardwarefailures.

Design errors can be found during testing only by observing a failure of function-ality. Most of these errors will be found and fixed during simulation or synthesis;this means testing both of the design source and of the back-annotated netlist fol-lowing floorplanning or completed layout. Errors of logic or of timing can be foundand corrected before tape-out for mask creation.

Hardware failures, of course, can be discovered only after implementation andproduction of the hardware. The most serious of these are fabrication quality orphysical design errors in which one or more individual gates become stuck in oneor another nonfunctional state in every unit produced. This kind of highly localizedfailure cannot be detected unless it has an observed effect during testing of a designprototype or of the hardware units affected. Less serious are random fabricationerrors which affect occasional units in a production run. Defective units must beeliminated before delivery to a customer; but, for this to be done, the defect in eachunit must be observed during hardware testing.

Other hardware errors may occur after bonding and packaging, in the form ofshort- or open-circuit defects; these latter are serious but not difficult to discover,because they are on the boundary of the IC and thus can be observed easily.

There also are “soft” errors caused temporarily by thermal or mechanical stress,unexpected external electric or magnetic fields, or exposure to ionizing radiation.An error is “soft” if it spontaneously recovers itself permanently, or vanishes, aftera reset or other change of device state. An intermittent hardware failure at a specificgate does not represent a soft error but rather a hard defect which has degraded thatgate. In this course, we shall not be concerned further with soft errors.



21.1.2 Assertions and Constraints

The first line of defense against hardware failure is good software. This means notjust good design specifications, but also meticulous attention to warning messagesfrom the simulator, synthesizer, static timing verifier, or other tool. In addition, goodsoftware means good design insight into possible problems, by use of assertions andwell-chosen simulation unit-test vectors. The functional specifications must be fullyvalidated before attempting to create a hardware representation. Also very importantare the timing checks in the verilog or synthesis library which monitor fulfillmentof the library-level design constraints during simulation.

We have discussed the software side already; now let us look more closely at thehardware.

21.1.3 Observability

This refers to the capability to measure functioning of logical transformations anddata transfers in the hardware. A chip would be 100% observable if every internalstate could be measured externally. Ignoring sequences of tests to be applied, thisreduces to measurement of every internal storage device Q (flip-flop; latch; memorycell) and every input pin I in the chip. The number of inputs then may be written asNI and the number of internal storage devices as NQ. Each test measurement, in adigital device, amounts to one bit of information, which makes for a factor of 2 inthe number of states; so, the total number N of possible states to test then must be,

N = 2NQ+NI

Equipping a device with a simple thing such as an 8 kB cache adds as much as28∗8

∗1024= about 1020,000 new states to observe. This factor then is multiplied bythe number of states calculated without the cache. It is easy to see why testabilityusually is expressed in terms of the number of pins and registers (a logarithm), ratherthan number of states.

A small board-level design can be made highly observable by providing electricaltest points on the traces on the board. A “bed of nails” automated tester then canbring its probes in contact with these test points, stimulate the board, and observethe effects. Any failure ideally can be detected and localized for manual rework, ordiscard, of a board found to be defective.

Such an approach is impractical for large digital IC’s, which have to be manu-factured protected against electrical influences except through their I/O pads. Evenprobing the I/O pads without risking damage is mechanically difficult when theymay number in the thousands for a modern ball-grid chip package.

Thus, observability has to be designed into the chip itself; it can not be left as anafterthought for someone to worry about after they receive the untested package orwafer in its final form.

21.1 Design for Test (DFT) 377

21.1.4 Coverage

Coverage is a metric (statistic) which describes the quality of a sequence of testvectors during simulation or hardware testing. Coverage is related to observability.Given the total number of points observable under the test assumptions (I/O’s only;or, all registers; or, whatever), coverage describes the fraction of them exercised bythe given sequence of test vectors.

Of course, all points are observable in a software simulation. For coverage pur-poses, the test protocol may include the restriction that only I/O’s will be monitored;this permits the software test vectors to be reused by a hardware tester. In such reuse,the simulated results would be compared with the hardware tester measurements inorder to validate the hardware functionality.

The idea behind coverage metrics is to verify the absence of failed design pins,ports, or gates; failures of these usually can be described as “stuck-at” faults. If thetest vectors can toggle every observable point, then coverage by those vectors is100%. A fault simulator is a specialized logic simulator which checks for togglingof observable points.

In hardware testing, a sequence of test vectors may be more or less efficientat achieving a given coverage. Reducing (“collapsing”) the number of vectors toachieve a given coverage is a desirable goal in testing, because it reduces the timerequired for the hardware tester to achieve the coverage. Time is money duringmanufacture.

Coverage can be used to describe a set of test vectors with reference to all points,not just those observable under the given assumptions; in this case, the maximumachievable hardware coverage usually will be less than 100%.

Be aware through all this, that ECC in the hardware can correct many defectswhich are missed in testing; however, this only works for stored data, not for controllogic. The goal of testing is to detect and eliminate all defects possible.

Coverage Summary:

Coverage is a metric representing thoroughness of testing of observable states:

• Coverage is a measure on a set of test vectors.• 100% coverage means all observable states have been tested.• Low coverage means new or better vectors have to be used.• Coverage usually is considered poor until it reaches 95% or more.

Coverage can refer to software (simulation) vectors:

• Represents lines of code or statements executed.• Highly order-dependent. Non-Markovian.• Independent of functional verification.

Coverage can refer to hardware vectors:

• Represents random defects checked.• Generally only hardware stuck-at fault detection.• Represents hardware functional verification.


21.1.5 Corner-Case vs. Exhaustive Testing

Hardware testing is a sophisticated field, and we shall touch on just some of itsprinciples here.

Recalling the 8-kB cache computation above, it is completely unrealistic to as-sume 100% hardware coverage of an entire chip of any practical size. To test allcombinations of states possible, looking for random defects and assuming a testvector of length 1024 bits, and a vector application rate of 1 GHz, the 8 kB cacheexample above would take about 1019,980 years to complete. This is from 107 sec-onds/year = 109+7 vectors/year = 109+7+3 = 1019 bits/year. The age of the knownuniverse is only about 1010 years.

Happily, one can assume that not all states must be observed to be sure of thehardware functionality. For example, interaction between adjacent storage cells ina memory array on chip is very possible; so, a defect might be observed in onecell only when adjacent cells were in a certain state. However, such interaction isvery unlikely among cells separated even by one other cell. For this reason, apply-ing an alternating ’b010101. . . vs. ’b101010. . . pattern to a cell array (regis-ter) generally reveals any single-bit defect or defect dependent on any two adjacentcells. Combining this with a “walking 1” and “walking 0” test (shifting an isolated‘1’ and then shifting an isolated ‘0’), one can assume at a certain level of confi-dence that all states which would reveal a defect in a hardware register had beenobserved.

Let’s look at the 8 kB cache example again, assuming the memory is organized(in at least two dimensions on the die, remember) so that only adjacent bits in astored byte can interact. Then, alternating patterns as above, twice applied each,require 4 vectors per byte. Walking ‘1’ and ‘0’ require, say, 8 vectors each, for atotal of 20 8-bit vectors to test one byte. Thus, 8 kB would require 8× 1024× 208-bit vectors, or a little over 106 bits of state. Assuming 1024-bit hardware testvectors, this is only 1.3 k vectors; at 1 MHz, time to test one cache memory thisway would take only a millisecond or two. In practice, a millisecond easily wouldbe available to test individual 8 kB cache memory dice on a wafer; so, more elab-orate test vectors, for example vectors validating adjacent pairs of bytes, could beaccommodated.

Corner case usually refers to the spatial or temporal boundary of a range of val-ues. The latter cache test vector example above in a sense was a corner case, becauseit was intended to test adjacency, whereas, only a small fraction of all memory bitscan be adjacent to those being tested at any one time.

More usually, corner case testing refers to selection of isolated values to test.For example, in a design including a verilog for loop that iterated through i=0 toi=127, one would pay special attention to the values 0, 1, 126, and 127 as thesoftware corner cases of the loop. And, watch out for −1 and 128!

Spatially, physical design problems should be sought especially on the bound-aries of voltage islands or clock domains. And, in defining hardware test vectors,vectors of all-‘0’, or all-‘1’ should be applied as externally-defined corner cases.


Temporally, within the tested device, states immediately following chip-enable,chip-disable, read, write, and so forth would be given greater coverage than others.Look for something to go wrong every time you turn a corner!

21.1.6 Boundary Scan

Boundary scan provides for increased observability of the pin-outs of chips on aboard; these are the boundaries of the chips. Rather than provide numerous directelectrical contact points for a bed-of-nails automated board tester, or for manualprobing, dedicated traces on the board are connected to the chip I/O’s; and, theboard I/O’s themselves are used to apply stimuli and observe results. This makespossible observation of inter-chip communication on wires or traces not normallyrouted to a board I/O.

For boundary scan, each scanned chip is equipped with a test port, the TAP(Test Access Port), for controlling the scan. The test logic of the TAP controllermay be very simple, as in the internal scan exercise we did in Week 2 Class 1,or it may include a device-specific, programmable state machine which automatessome of the test mode shifting to save time or computation by the external hardwaretester.

Very much the same as the JTAG internal scan port of Week 2 Class 1, the TAPhas five pins defined: TDI (Test Data In), TDO (Test Data Out), TCK (Test Clock),TMS (Test Mode Select), and TRST (Test controller Reset). Except TMS, the TAPpins physically may be pins having other chip functions, such as control, address,or data pins, the test functionality being selected by asserting TMS.

A generic boundary-scan is shown in Fig. 21.1. In test mode, the scan latches (orflip-flops) are chained together, stimuli may be shifted in by TCK, and outputs maybe shifted out. Each scan latch in a cell is connected to input and output muxes so itcan be bypassed during normal chip operation. At any time during normal operation,TMS and TCK may be asserted to store the output states in the latches for subsequentshift-out and inspection.


Fig. 21.1 Boundary scan logic on a board-mounted IC. Other IC’s on the board are omitted, as arethe TCK and TMS distributed to each of the scan cells. The IC is shown in test mode, with the scanchain the dotted line connecting the scan cells. Boundary scan makes points internal to the board,but not internal to the IC, observable

Boundary scan may be combined with internal scan or BIST (or both) in thesame chip.

21.1.7 Internal Scan

We have presented the rationale for muxed flip-flop internal scan in Week 2 Class 1.Briefly, in internal scan, the registers in the IC are replaced by scan cells which canbe configured as one or more chip-wide shift-registers in test mode. Several differentways of designing for internal scan are described in the readings recommended atthe end of this chapter.

Sometimes the terms “full scan” and “internal scan” are used interchangeably.However, because it is possible to omit some registers (for example, an entire mem-ory) from an internal-scan chain, internal scan does not imply full scan. Therefore,we prefer to use the two terms differently.

Full internal scan connects every register in the IC in the chain. Because inputsare observable already, this means that full internal scan in principle permits ob-servability of all 2NQ+NI possible internal states. This observability allows a levelof confidence which may be required in certain life-critical systems, or systems inspace or undersea vehicles, which can not be maintained or repaired during use.Such systems often are simple enough to make 100% coverage feasible.


Fig. 21.2 A generic design without internal scan. Outputs are latched, and internal regions ofcombinational logic are separated by sequential elements. A clock is shown distributed to all thesequential logic

Internal scan does not include pad cells, if they are present in the scannedmodule. Internal scan logic inserted in the design of Fig. 21.2 is shown in Fig. 21.3.

Fig. 21.3 The same design as above, after internal scan insertion. Every sequential element isreplaced with a scan cell. The scan data chain is shown (each SI to SO). Scan clock and mode, aswell as other controls, are omitted

Recall that almost all IC’s will have outputs latched for synchronization reasons.Outputs, then, will be in the scan chain. Other logic may exist which is not observ-able; but, with full internal scan, this unobservable logic can not affect functional-ity of the chip and thus usually may be ignored for test purposes. Such logic maybe present because of design errors or oversights, or because of design or produc-tion work-arounds. A famous example of the latter was the 386-SX microprocessor:Whenever a manufacturing defect was found in the onboard cache of a 386 die, thecache was disabled by tying off a pin, and the chip was packaged and sold at a lowerprice as a perfectly usable, cacheless, 386-SX. These 386 processors had thousandsof nonfunctional gates.

Unobservable logic may affect operating parameters of a chip, because unob-servable gates still may draw clock current and may leak the same way as functionalgates, causing additional power dissipation merely because they exist on the chip.


It should be mentioned again that internal scan and boundary scan are not mutu-ally exclusive and can be combined in the same chip.

21.1.8 BIST

Built-In Self-Test is an idea not restricted to IC design. Every PC has a ROM-basedmemory check executed automatically when it is turned on; likewise, when a harddisc is formatted at low level, the software, often in a disc-controller ROM, auto-matically verifies read and write access to every bit.

One advantage of BIST is that it requires no external test apparatus and only onecontrol input, as well as one result output of some kind. In almost all cases, theseadditional I/O requirements can be met by assigning multiple functions to preexist-ing design pins. Even more important, any number of BIST-equipped chips can betested concurrently during manufacture or operation. So, BIST allows design sizeand complexity to grow, and to be verified, with little additional hardware produc-tion testing time.

Thus, the cost of BIST in terms of chip I/O is negligible. However, the coresilicon required for BIST may be substantial, because the self-test controller andprogram must be stored somewhere; also, usually the BIST must be executed uponthe functional logic by means of additional, dedicated internal routing. If the designis complicated and includes many internal registers or modes of operation, the BISTmust be very elaborate to generate vectors to achieve even minimal confidence thatthe hardware is fully functional; this implies significant design overhead. Further-more, test results must be evaluated on-chip, which requires yet more storage andlogic.

The process of BIST insertion is illustrated in Figs. 21.4 and 21.5.

Fig. 21.4 A chip and a BIST test controller IP block

Fig. 21.5 Typical BISTinsertion

21.2 Scan and BIST Lab 25 383

Because of this, BIST is most efficient for IC’s of large size and regular, repetitivestructure – in other words, for memory IC’s. If the memory is fabricated with sparecells or words, BIST can be used during manufacture to find defective cells andreplace them with intact ones, increasing the production yield of the manufacturingprocess.

BIST usually depends on a standard TAP controller interface. BIST may be com-bined with scan logic. The BIST may accept a test-mode input, and, as a result putthe chip into test mode, scan in preprogrammed patterns, scan out the latched states,and evaluate the result, terminating with a go/no-go signal to the rest of the IC or tothe containing system. A random-logic state machine or a ROM may be involved.LFSR’s may be used to generate pseudorandom test patterns efficiently.

BIST often is used in systems which require verification without direct access –for example, in the space program or in underseas devices. The currently activeMars rovers are equipped with elaborate BIST, and associated redundancy, for errorrecovery. The high radiation exposure requires special accommodation to soft errorscaused by the cosmic rays against which Mars has no magnetic or atmosphericshielding.

In summary, DFT is a methodology which implies use of a certain collectionof design techniques. The methodology includes attention to corner cases and testcoverage of points of probable failure, both in software, during design, and in hard-ware, after implementation. The techniques encompass simulation, fault simula-tion, insertion of special hardware devices for internal or boundary scan, inclusionof built-in self-test apparatus, and use of I/O’s to provide observability of internalfunctionality.

21.2 Scan and BIST Lab 25

Do these exercises in the Lab25 directory.

Lab Procedure

Step 1. Internal scan exercise. Create a subdirectory named IntScan in theLab25 directory. If you have not yet done Step 9 of Lab05 (Week 2 Class 1),copy in the purely-combinational design and do Step 9. Add the FF’s, renaming thedesign to Intro TopFF, insert the scan cells using the synthesizer, examine theresult, and proceed to the next Step in this lab.

If you have done this Step already, just proceed to the next Step.

Step 2. Synthesizer boundary scan insertion. Boundary scan is intended for en-tire chips on a board; like BIST, it is inefficient for a small design. In this lab,we shall use the synthesizer to insert boundary scan in the original Lab01 designjust to see the result. We choose the Lab01 design because automatic insertionhas certain requirements which would be unnecessarily complicated to meet in ourlargerSerDes design.


A. Start by creating a new directory BScan in the Lab25 directory. Copy in theoriginal Lab01 Intro Top design, and simulate it briefly with its TestBench.v,to verify connectivity. After this, copy in the verilog pad model wrapper file and thesynthesis script file, both of which will be found named in the Lab25 directory.If you have not recreated the symlinks, you will have to copy the files from theCD-ROM misc directory.

B. Add a test port. Automatic boundary scan requires a preexisting TAP, soadd one new output, ScanOut, to Intro Top, and four new inputs, ScanIn,ScanMode, ScanClk, and ScanRst. These are just the usual internal-scan JTAGport names for the TAP.

C. Instantiate pads. For the synthesizer to insert boundary scan, we must havepads in the design. We need three different kinds of pad: input, output, andthree-state output. A three-state output pad is required for the TAP TDO(= ScanOut).

In a verilog design, pads are added inside the top-level module ports, as indi-vidual components within the top-level module; the module port names are signalnames, not hardware pin contacts, in this context. Of course, a top-level wrappermodule normally would be used; but, regardless, putting the pads inside the top de-sign module permits the same testbench to be used for a design before and after padshave been added.

The TSMC pad library that matches our core library can not be used for syn-thesis, because the pads each are multipurpose and may be controlled by inputs tobe input, output, or bidirectional. The models include switch-level elements and aretoo complicated both to be accurate in simulation and to be synthesizable.

For use in this lab, three pads have been selected from the pad library andprovided with verilog wrapper modules for simulation, only. The wrappers maybe seen in the verilog file, tcpadlibename 3PAD.v, linked in the Lab25directory.

The names of the wrappers, only, should be used to instantiate pads for theLab25 boundary scan exercises. The names are: PDC0204CDG 18 Out for out-put pads, PDC0204CDG 18 In for input pads, and PDC0204CDG 18 Tri for theone three-state output pad. The “0204” represents a drive strength (02 = 2 mA),and the “18” means 1.8V pad I/O (for 1.0V core logic). The port declarations forthese library components are typical for verilog:

module PDC0204CDG 18 Out (output PAD, input I);module PDC0204CDG 18 Tri (output PAD, input I, OEN);module PDC0204CDG 18 In (output C, input PAD);

The OEN output enable pin is asserted low in the library; however, the logichas been inverted to be asserted high for the three-state component in the verilogwrapper.

Remember that the ports of the Intro Top module were named X , Y , and Z(outputs) and A, B, C, and D (inputs), and that each port pin in verilog is associated


with an implied wire of the same name. After declaring a few new, obviously-namedto- and from- wires, the resulting pad structure for Intro Top should look some-thing like the following:

PDC0204CDG 18 Out Xpad1( .PAD(X), .I(toX) ); // X is port; toX is wire.PDC0204CDG 18 Out Ypad1( .PAD(Y), .I(toY) );PDC0204CDG 18 Out Zpad1( .PAD(Z), .I(toZ) );PDC0204CDG 18 In padA1( .C(fromA), .PAD(A) );PDC0204CDG 18 In padB1( .C(fromB), .PAD(B) );PDC0204CDG 18 In padC1( .C(fromC), .PAD(C) );PDC0204CDG 18 In padD1( .C(fromD), .PAD(D) );

The TAP port has to connect to its pad cells. The core I/Os of the pads should beleft dangling, so the synthesizer can determine how to connect to them:

PDC0204CDG 18 Tri TDOpad1( .PAD(ScanOut)/*, .I(),.OEN()*/ );PDC0204CDG 18 In padTMS1( /*.C(),*/ .PAD(ScanMode) );PDC0204CDG 18 In padTDI1( /*.C(),*/ .PAD(ScanIn) );PDC0204CDG 18 In padTCK1( /*.C(),*/ .PAD(ScanClk) );PDC0204CDG 18 In padTRST1( /*.C(),*/ .PAD(ScanRst) );

The pad instances will be treated as any other instances during synthesis andnetlist optimization. To protect the pad wiring from being modified during synthe-sis, add a comment directive to Intro Top.v which flags all pad instances asdont touch. You may use a wildcard, for example, “∗pad∗”, for this.

The synthesis .sct script will be used to tell the synthesizer which ports (seeStep 2B in this lab) we want it to use for the TAP.

D. Synthesize the boundary scan logic. After instantiating and wiring in the pads,you may use the BScan.sct synthesis script in the Lab25 directory to insert theboundary scan cells and TAP controller. After compiling, write out the hierarchyand view it in design vision. Read in the file, Intro TopNetlistBSD.v.You may wish to back-track in the schematic from the tdo pad to the boundaryscan tap controller, for example.

This exercise merely showed how to handle TAP ports; the controller has notbeen programmed, so this netlist can not be simulated.

The remainder of this lab is a memory BIST design.

Step 3. Design a BIST for our DPMem1kx32 RAM.A. Set up a working location. Create a new subdirectory named BIST, and copy

into it your DPMem1kx32.v RAM model from the Lab24 FIFO directory. Namethe new copy Mem.v and rename the module correspondingly.

B. Generate a benchmark synthesis result for later use; then set up a simulationtestbench. This preparation will reacquaint you with the (already simulated) mem-ory functionality and thus speed development and lessen the likelihood of designerrors.


First, synthesize the renamed RAM of A, in Mem.v, sized for 32 words of 32 ad-dressable bits each, optimizing for area only. As usual, impose simple logical-netlistdesign rules such as maximum fanout, etc. Make a copy of the synthesizer’s areareport in a text file for later reference. Also be sure you keep a copy of the synthesisconstraints you used to obtain this area. Do not simulate this netlist – we are justinterested in the approximate size of it, to compare sizes when the BIST is added.

For simulation of the verilog (source) design, instantiate the renamed RAM of Ain a new MemBIST Tst.v file, and create a small testbench there that exercises theRAM to demonstrate both write and read. Use a common clock for read and write.Set parameters to size the memory to be 32 addressable bits wide and 32 wordsdeep. Simulate briefly, just to verify basic functionality. Your setup should be thesame as in Fig. 21.6.

Fig. 21.6 Lab 25, Step 3B. The two verilog files are one testbench file and one memory modelfile. Mem is instantiated in MemBIST Tst

C. Prepare a BIST plan which describes what the built-in self-test should do.Of course, it will be for random hardware defects, only. We shall assume that onlyisolated defects can occur; this is solely for lab convenience; but, even so, in a realproduction run, on this assumption, we would catch the majority of hardware fabri-cation defects.

We must be sure that our tests accommodate the memory parameters of widthand depth, so we shall not assign numerical values to width or depth anywhere inthe BIST module. Our memory is equipped with parity, so we should monitor paritythroughout testing, in case of a single-bit failure during testing – even a soft one.

Here’s what our plan for this lab says the BIST should do:First test pattern: Validate the addressability. Write a different value to each ad-

dress word; then, read the values out to verify that the memory hardware can storethe expected number of different data. This will detect permanently shorted or openaddress lines, or obvious address decoder failures.

We shall write a value which counts from 0 to the max address, this count beingreplicated at several offsets in each word, and easily recognizable visually. An ex-ample of such a pattern sequence, counting from address 0 to address 31 (5’h0 to5’h1f), would be,

32’he0e0 e0e0, 32’he1e1 e1e1, ..., 32’hffff ffff

Second test pattern: Write ‘1’ to every bit at all memory addresses, then read outand check the value. Repeat with ‘0’. This will find any bit or bits stuck indepen-dently at either level and missed by the previous counting test.


We shall stop with these two patterns, although more efficient, more thorough,or more exotic tests could be devised. For example, alternating checkerboard pat-terns on adjacent words might be written and read, or walking ‘1’ tests run. Testingof a large RAM can be very elaborate, and thus there is an advantage to program-ming the test into the RAM chip, so that all RAMs in a manufacturing run, or in-stalled in a system, might be tested at once, with minimal need for external testapparatus.

Step 4. Plan the BIST interface. We shall put the BIST logic in a separate module.The layout almost certainly will require that the memory core be a regular cell arrayin its own block; so, the BIST logic will have to be kept in a separate partition ofthe design.

Our BIST will have to address the memory by its address bus, and it will have toread and write on the memory data bus. The clocks can be shared. There will haveto be a new input in the test-equipped memory for a test-start signal, and at least onenew output to report test status. We shall keep these ports separate and not worryabout sharing test functions on preexisting pins. During self-test, the Mem modulemust be unresponsive to external addressing or read or write requests, and it willhave to disable its output drivers.

The best way to accomplish all this is to instantiate the Mem in a wrapper modulewhich also instantiates a module containing the BIST logic. During self-test, theBIST-Mem system then can be isolated from the external world by the wrapper logic.

Step 5. Create the BIST wrapper: Create a new module named MemBIST Topin a file named MemBIST Top.v. by making a copy of Mem.v. Give the newMemBIST Top module the same I/O’s as Mem, except for one new input,DoSelfTest, and two new outputs, Testing, and TestOK.

After modifying the MemBIST Top header for the new I/O’s as just described,instantiate Mem in the MemBIST Top module, and, declaring explicit wires for ev-ery I/O, directly connect all Mem I/O’s to the corresponding MemBIST Top onesusing continuous assignment statements. Explicit wires in this exercise are impor-tant and will simplify interconnection of BIST and Mem later in the lab. There is noarea or timing overhead for such wires, because the synthesizer will convert implicitwires to explicit ones anyway.

Your setup should be as in Fig. 21.7.

Fig. 21.7 Lab25 Step 5. The MemBIST Top wrapper has almost the same module ports as Mem.Mem is shown instantiated inside MemBIST Top


We are done with MemBIST Tst.v of Step 3. After instantiating Mem inMemBIST Top as above, and completing the wiring, rename MemBIST Tst.v toMemBIST TopTst.v, changing the testbench module name accordingly. Changethe Mem instantiation to MemBIST Top. Then, simulate the new two-module mem-ory (MemBIST Top containing Mem) briefly to check connectivity before goingfurther.

Step 6. Define the BIST interface. Create a new BIST.v file for the built-in self-test module, BIST, by making a copy of MemBIST Top.v.

But, before changing anything in the new file, instantiate BIST in the originalMemBIST Top.v by making a copy of the existing Mem instance and editing itas will be described. In this design, because Mem already has a complete inter-face to guide us, it will be easiest to connect a BIST instance first, and then toedit the copied module file in BIST.v to work out declarations of the requiredBIST I/O’s.

The setup for this Step thus is as shown in Fig. 21.8, and the work will be tomodify the BIST instance so we will know how to change BIST.v.

Fig. 21.8 Lab25 Step 6. Using wrapper MemBIST Top to help determine the new BIST moduleports

So, looking at the new BIST instance in MemBIST Top.v:First, we shall have to pass the AdrHi and DWid parameters to BIST; so, we

should retain these exactly as copied from Mem.We can eliminate Dready and ChipEna from the BIST I/O’s.Second, we require control of the Mem data input and output, so we should re-

tain these I/O’s in BIST. Soon, we shall multiplex these Mem busses so they canbe directed either to the MemBIST Top ports, or to the BIST ports. We shall con-nect the DataO output port of the Mem instance to the DataI input of the BISTinstance, and vice-versa. So, we retain the DataO and DataI ports in the BISTinstance.

Third, we should add a BIST address output bus, as well as read request(ReadCmd) and write request (WriteCmd) outputs. These also will have to be mul-tiplexed with Mem inputs. Because none of our planned tests involves simultaneous


read and write, a single BIST address bus can be supplied both to the RAM readand write address input ports.

Fourth, we require a BIST clock input and a hardware Reset input. We shallrequire that testing be clocked by the RAM read clock (ClkR); so, this is the onewe shall supply to BIST.

Finally, we should add the special BIST I/O’s (the DoSelfTest input, andthe Testing and TestOK outputs), all of which will be routed directly to theMemBIST Top module ports. We also should monitor the RAM’s ParityErroutput using a corresponding BIST input during testing.

At this point, after the above described changes were made, the verilog code foryour BIST instance in MemBIST Top should look something like this:

wire ClkRw, Resetw, ...; // Assigned from MemBIST Top module inputs....BIST #( .AdrHi(AdrHi), .DWid(DWid) )

BIST U1( .DataO(), .Addr(), .ReadCmd(), .WriteCmd() // outputs., .Testing(), .TestOK() // outputs., .DoSelfTest(), .ParityErr(), .DataI() // inputs., .Clk(ClkRw), .Reset(Resetw) // inputs.);

Step 7. Implement the BIST controls in MemBIST Top. After adding port namesas above, complete the wiring in MemBIST Top to BIST U1, including the mul-tiplexed data and other busses. You can do the muxes most easily as continuousassignments with conditional operators; you’ll have to declare new wires for theBIST instance to do this. If, initially in Step 5 above, you used continuous assign-ment statements to wire the connections between the MemBIST Top header andthe Mem instance, only the new wire declarations and the conditional expressionswill have to be added.

Briefly, when the DoSelfTest input goes high, the edge will initiate the test.Muxes in MemBIST Top will be put in test mode by the assertion of Testing,which will be kept at ‘1’ while self-test is in progress; TestOK will be as-serted after a self-test if the test found no defect (otherwise, TestOK willstay at ‘0’).

After this, go to the new file named BIST.v, which you created above, removeeverything but the header and parameters from your new BIST module, changethe port names and directions to match the code fragment above, and simulateMemBIST Top briefly, using MemBIST TopTst, just to check connectivity.

Step 8. (partly optional) Implement the BIST functionality. Before anythingelse, the test mode control should be defined. Here is one way to do this:


...reg AllDoner // Flag completion of testing for internal BIST use.

, Testingr; // Sets test mode for the BIST.//assign Testing = Testingr; // Testing is a BIST output port.//always@(posedge Clk, posedge Reset)

begin : TestSequencerif (Reset==1’b1)

beginTestingr <= 1’b0;AllDoner <= 1’b0; // Normal level (noncommittal)....end

else // Must be a clock:begin : TestClockedif (DoSelfTest==1’b0)

begin // Init, but leave TestOK alone:Testingr <= 1’b0;AllDoner <= 1’b0;...end

else Testingr <= 1’b1; // Entering test mode for this clock.... ...

All reg names end in ‘r’. A rising level on Testingr then can be used to startthe self-test; the test routines should determine when the testing is finished.

A clocked always block can be used as a simple state machine to sequencethe BIST through its tests. Our tests in Step 3 above address all of memory forwrite and then for read. It seems reasonable to implement each test as a separatealways block containing an address counter and clocked on the BIST input Clk.The always for each test would be run selectively because of a flag set in the test-sequencing block. Actually, each always block used to write memory could bedifferent from the one verifying the stored results. The test sequencer could checktest status on every clock; when a test is complete, the current always block couldset a flag reporting results and telling the test sequencer it is done.

Implementation of this BIST makes a very good optional project; but, it is verytime-consuming and probably is at least a day’s work. Therefore, a complete im-plementation is provided for you in BIST Done.v in the Lab25/Lab25 Ansdirectory.

To continue this lab, copy BIST Done.v into your BIST directory. Copy yourown BIST.v to a different name to save it. After looking through BIST Done.v,copy or link it to BIST.v to replace your empty interface model. Simulate brieflyto check connections.

If you want to spend some lab time on the BIST module, it might be interestingto rewrite part of the answer provided as a verilog task: The test sequencer code forphases 1 through 6 is very repetitive; this suggests replacing it with six calls to atask with a single number as input.

Simulate to validate your changes: A good simulation might exercise the memorya little; then, run a self-test; then, exercise it a little more. See Fig. 21.9.


Fig. 21.9 Typical MemBIST verilog source simulation

Step 9. Synthesize the completed MemBIST Top design, optimizing it for areawith the same constraints as you used in Step 3 B. The synthesis will take about 10minutes with mild constraints. With a quick netlist, don’t bother with simulation;the netlist may not simulate because of lack of clock constraints, but its size will beabout right. Compare the size of the design with and without the BIST.

Optional: After comparing areas, you may add clock definitions and a maximumoutput delay constraint to your synthesis script, constrain to fix hold violations, andresynthesize (see the answer directory for specific values). This netlist will takea half-hour or more to synthesize, but the result, as in Fig. 21.10, will simulatecorrectly.

Fig. 21.10 Simulation of the synthesized MemBIST verilog netlist, using the same testbench asfor the source


21.2.1 Lab postmortem

How big was the BIST netlist, compared with the one for our bare DPMem1kx32RAM?

Would you expect the use or arrangement of the BIST tasks to make anydifference in synthesis?

21.3 DFT for a Full-Duplex SerDes

We’ll finish up today with a lab on test insertion for our SerDes. First, we’ll mod-ify our Lab24 SerDes design to have full-duplex functionality, as would a PCIExpress lane. Then we’ll add test logic.

21.3.1 Full-Duplex SerDes

A full-duplex lane is just a dual serial line, with one serdes sending in one direction,and a second serdes sending in the other. The two serial lines being independent,this lane can send and receive simultaneously in both directions.

Fig. 21.11 Two instances of the Lab24 SerDes assembled into a full-duplex lane. A simulatortestbench can represent the two communicating systems, A and B. The FIFO depths differ for thetwo systems

For variety, we shall assume that system A can provide a FIFO only with 8 words,but that system B can provide one 16 words deep. See Fig. 21.11. Such a differencewould not be unusual, assuming that our duplex serial lane was between differentchips on a board. For example, the communicating chips might be from differentsuppliers, in different technologies, or fabricated on availability of different kindsof IP.

21.4 Tested SerDes Lab 26 393

It is important to understand the organization of the full-duplex lane proposed.Each SerDes device model can be used to span the opposite sending and receivingends of a single serial line; or, each SerDes could be interpreted as referring to oneend (send and receive) of a full-duplex lane. The difference is illustrated in the topand bottom of Fig. 21.12.

Fig. 21.12 Two ways of usinga pair of SerDes devicesbetween two systems, A and B

Our choice is the upper one shown; it permits immediate instantiation of ourSerDes design. We assume that our two serdes will reside in a system (maybe on asingle chip) which will floorplan each Ser some distance away from its Des, so thata serial data transfer between A and B would be useful.

21.3.2 Adding Test Logic

Before proceeding to the lab, some thought might be given to the following points:

• How should we equip our serdes for testability?• How good is observability without test logic?• Where should we add assertions?• How much internal scan?• Would boundary scan be useful?

21.4 Tested SerDes Lab 26

Work in a new subdirectory of the Lab26 directory for each of these exercises. Theinstructions will say how to name the subdirectories.

Lab Procedure

Step 1. Gather the parts of the full-duplex serdes. Create a directory namedFullDup in Lab26, and create in it a subdirectory named SerDes. Make a


copy of the entire contents of the SerDes directory from Lab24/Lab24 Ans/Lab24 Step08 in it. For now, use the provided answer design; there will be op-portunity later to return to your own SerDes implementation, if you should wantto do so.

What we’ll do now is to create place-holder files for our full-duplex serdes bymodifying the files for the Lab24 unidirectional SerDes.

Leave SerDes.v in the new SerDes directory, but move the other filesup one directory, into the new FullDup directory. These files should include:SerDes.inc, SerDesFormats.inc, SerDes.vcs, and SerDesTst.v.The only things remaining in the SerDes directory should be SerDes.v and thefour subdirectories originally there.

Rename the moved design files by changing “SerDes” in their names to“FullDup”. For example, you would have in the FullDup directory,FullDup.inc, and FullDupTst.v, among others. Don’t change the contentsof these files yet.

Step 2. Use a testbench copy as a template for FullDup.v. Copy the file you justrenamed to FullDupTst.v, to a new file named FullDup.v in the FullDupdirectory.

Open FullDup.v in a text editor and delete everything in it except the oldSerDes instance, with its parameter map.

Copy-and-paste a second, identical copy of this instance into FullDup.v.Name the upper instance in the file SerDes U1 and the lower instance SerDes U2.

Then, open the SerDes.v file (in the SerDes subdirectory), and copy-andpaste the SerDesmodule header declarations into the top of FullDup.v. Changethe module name to “FullDup”.FullDup.v, which once was a copy of FullDupTst.v, now should consist

of the module port declarations from SerDes.v, and two instances of SerDeseach with different instance names. The top module name in FullDup.v shouldbe FullDup.

Next, we shall edit FullDup.v to make minor changes in the SerDes part ofthe design; then, we shall complete the full-duplex design of FullDup.

Step 3. Resize the SerDes FIFOs. It would be fun arbitrarily to vary both widthsand depths of the FIFOs for the two systems, A and B, shown in Fig. 21.11. However,the frame encoding would have to be altered to change the width, so we shall leaveall widths at 32 bits (33 including parity), and just modify the depths. Also, we haveused exact counter wrap-arounds in the FIFO state machine, so each FIFO musthave a depth in words equal to a power of two.

As shown in Fig. 21.11, instead of 32 words, each A FIFO should include 8words, and each B FIFO 16 words. We need separate depth parameters for the Aand B sides of each SerDes.

To accomplish this in the FullDup.v file, remove the old AWid FIFO depthparameter and replace it in the FullDup header with four new ones, one for eachFIFO instance in the full duplex design. The result should look something like this:


module FullDup #(parameter DWid = 32, RxLogDepthA = 3, TxLogDepthA = 3 // 3 -> 8 words., RxLogDepthB = 4, TxLogDepthB = 4 // 4 -> 16 words.)

... (port declarations) ...

The new parameters should be passed to the SerDes instances; but, first wehave to decide which instance in the verilog corresponds to which one in Fig. 21.11above. Let’s take the first SerDes instance in the file, which we have namedSerDes U1, as the top serial line in Fig. 21.11 (“Serial Link 1”).

Then, postponing port-mapping details, the parameters should be passed thisway:

...SerDes #( .DWid(DWid)

, .RxLogDepth(RxLogDepthB), .TxLogDepth(TxLogDepthA))

SerDes U1 ( .ParDataOutRcvr ... // A sender; B receiver....

SerDes #( .DWid(DWid), .RxLogDepth(RxLogDepthA), .TxLogDepth(TxLogDepthB))

SerDes_U2 ( .ParDataOutRcvr ... // B sender; A receiver....

At this point, the SerDes module can not use the new parameters, so we mustmodify it to accept them.

A SerDes module in our FullDup design gets two FIFO depth parameter dec-larations, one for the receiving (Deserializer) FIFO, and one for the sending (Serial-izer) FIFO; of course, at the SerDes level, there is no distinction between ‘A’ and‘B’. In the SerDes subdirectory, in SerDes.v, pass in the new parameters, andfind every occurrence of AWid and rename it to RxLogDepth or TxLogDepthappropriately. Do not rename the FIFO parameter, just change the value mappedto it. The Serializer .AWid gets the Tx parameter; the Deserializer.AWid gets the Rx parameter. Assign 5 (32 words) to be the module header de-fault for both depth values; this will be overridden to 3 (8 words) or 4 (16 words)during instantiation.

There is no need for further editing of parameters lower in the design; theSerializer and Deserializer will pass the proper values to their submod-ules as they did before.

In particular, each FIFO Top instance now properly will pass on the depth val-ues required to define FIFO addressing and number of memory words. Because


of our previous planning, there is no reason to rename the parameter declared inthe FIFO Top module or to change anything in the FIFO or PLL parts of thedesign.

Step 4. Complete the connection of the full-duplex lane. Back again inFullDup.v, we have to decide what should be our module I/O’s, and how theyshould connect to the SerDes instances.

The SerDes instances are independent; there should not be any communicationbetween them except over the serial line, and this simplifies our decisions. What weshall do next, is decide what to call the nets connecting to the SerDes instances,and what should be the FullDup module I/O’s. We shall try to winnow away allbut the minimum required FullDup I/O’s.

Let’s start with a simplification of the system relationships: SerDes U1 (link#1 in Fig. 21.11) transmits data from A to B; SerDes U2 transmits from B toA. Therefore, on the A side, the only SerDes U1 serializer output port would bethe serial line (SerLineXfer); the A serializer inputs would be ParDataIn,InParClk, InParValid, Reset, and TxRequest;. On the SerDes U1 Bside, the B deserializer, the outputs would be ParOutRxClk (in the B clock do-main) and ParOutTxClk (in the A clock domain). The B deserializer inputs wouldbe OutParClk, Reset, and RxRequest.

The opposite I/O relations must hold for SerDes U2.That takes care of the instance pins. As for the nets, to avoid elementary errors

or confusion, we shall start by renaming all of them. Initially, we’ll prepend ‘A’or ‘B’; later, when decisions have been made, we shall rename the nets by movingthe prepended letters to the end of the net name, following the example above of“RxLogDepthA”, etc.

So, let’s begin by adding an ‘A’ prefix to every SerDes U1 and SerDes U2A net, and a ‘B’ prefix on every other net connected to the SerDes instances. Forexample, in SerDes U1, we would have AInParStim, AInParClkStim, etc.At this point, the only net lacking an ‘A’ or ‘B’ would be ResetStim, which, asshown above, we shall share between A and B.

After this, we can go up to the module port declarations and make a complete,duplicate copy of the current FullDup module ports. Do this by simple copy-and-paste of the entire module header port declarations, just doubling the originalnumber of FullDup I/O’s.

Once the FullDup module ports are duplicated, doubling the number of I/O’sin the FullDup header, prepend an ‘A’ to every net name in one copy and a ‘B’ toevery net name in the other.

We could declare a few new nets and stop here, but let’s simplify things a bit:Simplification A. Eliminate one FullDup module input port by declaring just

one common Reset in the module header to be routed (later) to both SerDesinstances.

Simplification B. We want to transfer data between A and B, which are bussedsystems; we don’t really require the serial lines to be visible outside of FullDup.


So, declare wires for the serial lines named SerLine1 and SerLine2, andsubstitute them for the nets on the SerLineXfer pins of SerDes U1 andSerDes U2, respectively. These output nets will remain otherwise unused, andthis allows us to delete the two corresponding module output ports in the headerof FullDup.

Simplification C. We may assume that A and B correspond to distinct clock do-mains. Therefore, the A clock for clocking in data to SerDes U1 should be thesame clock as the A clock used to clock out SerDes U2 data. Declare a singlemodule input named ClockA, and connect it to SerDes U1.InParClk andSerDes U2.OutParClk. Then, declare ClockB, connect it correspondingly,and delete the module input ports for all other clocks. This eliminates a total ofanother two FullDup module ports.

Simplification D. It was suggested optionally at the end of Lab24 to cre-ate two parallel-bus output ports from the SerDes; data from one would beclocked out by the receiving-domain’s clock, and data from the other would beclocked out using the parallel clock embedded in the serial framing. These portswere named ParDataOutTxClk and ParDataOutRxClk. The purpose wasto see the way the data delivery differed because of the different clockdomains.

Thus, you may have two parallel output ports on each SerDes. We don’t wantthese two ports in FullDup; we only want outputs clocked in the receiving domain.So, if you have them, delete both ParDataOutTxClk ports, even if they werecommented out in your Lab24 SerDes instance. This leaves just one parallel-dataoutput port in each SerDes instance and allows us to delete as many as two moreFullDup module output ports.

Simplification E. Let’s look at the RxRequest and TxRequest inputs. Theywere convenient for debugging our SerDes and demonstrating FIFO functionality;but, now they can be reduced. However, to preserve flexibility in case our reductionplan doesn’t work out, let’s keep the port connections and declare four request wiresto assign them explicitly; this will be shown in a code example below. So, we’ll tieRxRequest high permanently; we’ll control everything with the input ParValidsignals at both ends: If ParValid is asserted in either system, it also will assertTxRequest in that system. This allows us to remove four FullDup ports andsimplifies the SerDes operational protocol.

Simplification F. Wrap-up. Except ClockA, ClockB, and Reset, renamethe FullDup module ports so each remaining input begins with “In” and eachoutput with “Out”. Also, except Reset, rename all ports so that every A portends with ‘A’, and every B port with ‘B’. This moves the ‘A’ and ‘B’ prefixesto the end of each name, to become postfixes; the whole renaming sequence wasan accounting technique to help keep the port renaming free of confusion orerrors.

After all this, your FullDup.v should contain something close to the codeshown below:


‘include "FullDup.inc" // timescale & period delays.//module FullDup

#(parameter DWid = 32 // 32 bits wide., RxLogDepthA = 3, TxLogDepthA = 3 // 3 -> 8 words deep., RxLogDepthB = 4, TxLogDepthB = 4 // 4 -> 16 words deep.)( output[DWid-1:0] OutParDataA, OutParDataB, input[DWid-1:0] InParDataA, InParDataB, input InParValidA, InParValidB, ClockA, ClockB, Reset);

//wire SerLine1, SerLine2

, RxRequestA, RxRequestB, TxRequestA, TxRequestB;//assign RxRequestA = 1’b1;assign RxRequestB = 1’b1;assign TxRequestA = InParValidA;assign TxRequestB = InParValidB;//SerDes #( .DWid(DWid)

, .RxLogDepth(RxLogDepthB), .TxLogDepth(TxLogDepthA))

SerDes U1 // Ports reordered:( .ParOutRxClk(OutParDataB), .SerLineXfer(SerLine1), .RxRequest(RxRequestB), .ParDataIn(InParDataA), .InParValid(InParValidA), .TxRequest(TxRequestA), .InParClk(ClockA), .OutParClk(ClockB), .Reset(Reset));

(similarly for SerDes U2; but, with SerLine2, and with ‘A’ & ‘B’ suffices reversed)

Simulate briefly to verify connectivity, and file names and locations. Rememberthat by default some simulators will attempt to open ‘include files on a pathrelative to their invocation context.

You may reuse your SerDesTst.v testbench (which was renamed toFullDupTst.v in Step 1) for this after changing names and adding the new FIFOparameters. Or, to expedite the testbench, which should be quite elaborate for thisdesign, you might consider just copying the FullDupTst.v testbench file pro-vided for you in the Lab26 Ans/FullDup Step4 directory.

Use the simulator to check the hierarchy tree to be sure that both SerDes in-stances, and all FIFO instances, are present. Make sure all warnings about buswidths, assignment widths, and so forth are corrected or well-understood beforeproceeding. Fig. 21.13 and 21.14 show some FullDup simulation results.


Fig. 21.13 The first full-duplex serdes (FullDup) simulation. Serdes U1 waves are on top; U2below. Data are random and different in each direction; the A and B FIFO’s are of different depth

Fig. 21.14 Close-up showing clock skew between the two, independent parallel-bus domains

Step 5. Add assertions. Create a new subdirectory FullDupChk in the Lab26directory and copy everything in FullDup into it. Do this Step in the new directory.

We’ve ignored assertions and timing checks in our serdes design, except for oneparity check assertion in the FIFO memory model. Let’s correct this deficiency now.

No FIFO is visible at the FullDup level of the design, but we still would liketo know when the FIFO’s go full or empty. The first appearance of the full andempty flags in the design occurs at the SerDes level, in the Serializer andDeserializer instance port mappings.

Add four assertions to SerDes, two in Serializer and two inDeserializer , that check that the four FIFO flags are false. Each assertion


should print a warning message to the simulator console. The assertions may beyour own, or they may be modelled after our generic assertion in Week 4 Class 2(Lab 11). Use %m so that the assertion will print the module instance name in whichit was triggered.

Our design will not fail under these warning conditions, but they imply that pos-sibly the containing A and B systems will have to resend lost data, the loss beingdetermined from the data rather than from FullDup hardware. We could use theseassertion warnings to decide whether the design word depth of the FullDup FIFO’sshould be changed.

Simulate. There should be at least a few FIFO-empty states to exercise your newassertions.

If you have not done so already, add testbench code to enable transmission (Tx)from both Serializers; your testbench also should make ClockA and ClockBindependent. You might consider copying the testbench provided in Lab26 Ans/FullDup Step4, to save time in lab. This testbench is fairly sophisticated andincludes independent clock-frequency drift in both domains, as well as differentrandom input data for each serdes.

Step 6. Add timing checks to monitor the DesDecoder control pulse widths.Do this Step in FullDupChk. Using a simulation testbench from Lab26 Ans,or one very similar, display the signals in a DesDecoder. There is only oneSerDes DesDecoder module, so both instances will implement your checks.Our simple approach to control in this module has allowed ParValid, ParClk,doParSync, and SyncOK to be asserted apparently as pulses of wildly varyingwidth.

Let us study these pulse widths. We wish to determine the narrowest positivepulse that occurs during simulation. To do this, add $width timing checks inDesDecoder to report a violation whenever the width of a positive pulse on oneof these signals is less than the value given by a specparam. Declare a differentspecparam for each timing check.

Vary the specparam values while repeatedly running simulation. For example,after setting up four specparams and four $width timing checks, set all thespecparams but one to 0; set the other one to 100 (=100 ns); this will disable allbut one. Then, run the FullDup simulation for a relatively long time, say 500k to1M ns.

If there is no violation reported, increase the nonzero criterion specparamvalue by 50 or 100 ns, or more, and repeat the simulation until violations occur(you will know a violation message when you see one). Use the simulator [Stop]button if violations are numerous. Going by the message numbers or guesses, thendecrease the criterion and hunt until you have found the maximum criterion value atwhich no violation occurs. Leave the specparam there.

Repeat the procedure for the other three specparams, separately or simultane-ously, until all are at their maximum no-violation level.

Now, any unusual state causing a shorter pulse will be revealed by these checks.Record your timing-check widths in the table below.


Variable Maximumno-violationwidth (ns)

ParValidParClkdoParSyncSyncOK

It pays to be able to find unexpected brief pulses. As the old naval shipyard sayinggoes,

“Loose glitch sinks chips”.Step 7. For this Step, change out of FullDupChk and return to the FullDupdirectory. Make up a synthesis script file, FullDup.sct, using as a template, forexample, a script in one of the subdirectories at Lab24/Lab24 Ans.

With this script, synthesize the design while imposing only zero-area and just onetiming constraint: Set a stringent maximum delay limit on all outputs, one which thesynthesizer can meet only with a little slack. Do not define a clock. Flatten the designbefore running compile with all defaults (no option); this will make the resultingnetlist comparable with the scan-inserted netlist we shall create next. Be sure to logan area report into a text file for later reference.

The purpose of this Step is just to prove quickly that the design verilog is entirelyin synthesizable format. Without any clocking, clearly it should not be expectedto simulate correctly. However, with clocks and other timing constrained, and withhold violations fixed, this FullDup design can be made to synthesize to a correctlysimulating netlist. See the CD-ROM answers for various ways of applying workablesynthesis constraints. For curiosity’s sake, Fig. 21.15 and 21.16 give waveforms fora fully-constrained, completed class serdes project:

Fig. 21.15 Netlist simulation of the completed full-duplex serdes, synthesized with hold-fixing forall clocks


Fig. 21.16 Close-up of the completed full-duplex serdes netlist simulation, demonstrating that theA domain input parallel word 0x0fd2 8f1f (time ∼193 ms) is transferred correctly to the Bdomain output bus (time ∼235 ms)

Step 8. Insert scan logic. Create a new subdirectory named FullDupScan andcopy the entire contents of the FullDup directory into it. Simulate briefly at thenew location to verify file locations and connectivity. Do the rest of this lab inFullDupScan.

Use the synthesizer to insert internal scan as follows:A. First, rename your copied synthesis script to something like FullDup

Scan.sct; edit it to change its output file names (netlist and log) so they willbe identifiable as scanned outputs.

In the script file, delete the existing compile command and substitute in its placethe scan-insertion commands used in Lab25. These may be found atLab25 Ans/IntScan/DC Scanned/Intro TopFFScan.sct. You willhave to change the Clk and Clr names and the scan clock period. You also willhave to edit the FullDup module header in FullDup.v to add two new inputports, tms and tdi, and one output port, tdo, for the scan circuitry.

B. Run the synthesizer to insert scan elements. This will take only a little longerthan did plain synthesis. Be sure to log an area report in a separate text file so youcan compare it with the one in the previous Step.

The same constraints should not be difficult to meet with scan inserted, becauseprevious compilation should have shown they could be met without scan. This mighthave been guessed by recalling that scan insertion replaces sequential elements withscan elements, and that the extra gate count from (effectively) one mux per flip-flopreally can’t amount to very much. Without dwelling on library issues, inspection ofthe library we are using for synthesis indeed shows that the area of a typical scan flip-flop is only about 30% greater than that of an ordinary flip-flop. This exercise pri-marily was to show the effect of scan on netlist size. Do not bother with simulationat this point; the TAP controller will not be functional because the synthesis designconstraints did not include a clock, preventing creation of a working scan test mode.


C. Browse through the scan-inserted netlist file in a text editor to see where thescan elements were substituted.


Where in FullDup might other timing checks be added?What was the area impact of scan insertion in this design?How hard would it be to provide a hardware test engineer with a key to the

scanned design? In other words, define a FullDup scan test vector to be a vectorthe length of the entire FullDupScan scan chain. How wide would this vector be?Then, how would one go about providing a key giving the location of each designbit in the test vector?


Read The NASA ASIC Guide: Assuring ASICs for SPACE, Chapter Three:Design for Test (DFT). At: http://klabs.org/DEI/References/design guidelines/content/guides/nasa asic guide/Sect.3.3.html (updated 2003-12-30).

If you want a little more detail on LFSR pseudorandom number generation, trythe Koeter article, What’s an LFSR?, at http://focus.ti.com/lit/an/scta036a/scta036a.pdf (2007-01-29).

(Optional) Retrieve your BIST.v empty interface file of Lab25, Step 8, andcomplete the BIST functionality your own way.


22.1 SDF Back-Annotation

Today, we’ll look into SDF just enough to understand how it works.

22.1.1 Back-Annotation

Back-annotation refers to the modification of a netlist with newly assigned orupdated attributes, usually delay times. A tool is run; and, the result goes back tothe netlist; whence, back-annotation. The netlist itself is not modified; the back-annotations instead are stored in a side file.

Although the synthesizer can estimate the effect of trace length on delay timewhile optimizing a netlist, these delays initially are in the synthesis library’s wire-load model, and they do not appear as such in the synthesized netlist. The netlistmerely contains component instances with certain timing and drive strengths cho-sen during optimization to be adequate to meet the timing constraints, given the ex-pected trace lengths. These component choices are fairly inaccurate when comparedwith actual performance after placement and routing. A good estimate of many ofthe layed-out delays based solely on the initially synthesized netlist might be off by10%; a typical estimate would be off by more.

After a netlist has been used to implement a floorplan or placed-and-routedlayout, accurate delay times can be associated with the placement of individualcomponent instances. The verilog library propagation delays are ignored and back-annotated delays are used instead. This means that a back-annotation side file shouldcontain a description of the timing of every component instance in the netlist.

22.1.2 SDF Files in Verilog Design Flow

The standard file format for netlist delay back-annotation is called SDF (StandardDelay Format). It is described in IEEE Std 1497; its use with a verilog netlist isdescribed in section 16 of the verilog language standard, IEEE Std 1364. Various



other back-annotation file formats, for example SPEF (Standard Parasitic ExchangeFormat), are used for a variety of other physical layout purposes not included in thepresent course. SDF contains delay times; SPEF contains capacitances. In any case,the flow is as in Fig. 22.1.

Fig. 22.1 Back-annotation in a typical design flow. The estimated delays or physical parametersare kept in a side file associated by tools with the design netlist

Timing in an SDF or other back-annotation file will override verilog libraryspecify block path delays, including conditional delays. The net (interconnect)delays attributable to trace length also can be back-annotated in SDF, although in theverilog, such delays generally would be lumped to component outputs. Verilog pathdelays are associated with the IOPATH keyword in SDF; net delays are associatedwith the INTERCONNECT keyword. The keyword CELL is used for any componentor module instance. Except for colons in timing triplets, the only SDF punctuationis by parentheses (‘( ‘and’ )’); so, the language looks a bit like lisp or EDIF at firstglance.

In an SDF file, the SDF keywords are written in upper case and are not case-sensitive; references to netlist objects are copied literally from the netlist and arecase-sensitive.

22.1.3 Verilog Simulation Back-Annotation

This process can be very simple and straight-forward. A verilog netlist is createdsomehow, for example by synthesis, and an SDF file is obtained somehow for thatnetlist. Typically, after the netlist has been floorplanned, an SDF file may be written

22.2 SDF Lab 27 407

out by the flooplanning tool. Another way to obtain an SDF file is simply to writeone out from our synthesizer after saving the netlist in verilog format.

Once a verilog netlist and corresponding SDF file have been obtained, the ver-ilog system task, $sdf annotate (“sdf file name”), is inserted into the netlist atwhichever level of hierarchy the SDF file was written. When the simulator then isinvoked, those modules which were back-annotated (usually, the whole design) willbe simulated using the SDF timing instead of the verilog component library timing.

Complexity can arise if timing is to be back-annotated from several SDF files. Ifthe files are nonoverlapping, the ABSOLUTE keyword (which appears everywherein our SDF files in the lab below) may be used to make the last file read overwritea common delay with its final value. It is possible to use an INCREMENT keywordinstead; in that case, subsequent delay values are added to previous ones.

In summary:

SDF (Standard Delay Format) is defined in IEEE Std 1497. It is

• A netlist representation of timing triplets.• Written by the synthesizer, static timing tool, or layout tool.• Hierarchical; an SDF file may be rooted at any verilog module level.• Associated with a design netlist by the following verilog system task in a

design module initial block:$sdf annotate (“file name.sdf”);

22.2 SDF Lab 27


Lab Procedure

Step 1. Simulate a design. Create three new subdirectories, orig, ba1, and ba2,in Lab27. Change to the orig directory and copy in the verilog files for our oldfriend, the Intro Top design from Lab01. Use the original files provided, includ-ing the original TestBench.v file.

In TestBench.v, comment out the ‘include of Extras.inc, invoke thesimulator, and simulate the design. Do not exit; but, rather, leave up the waveformwindow displaying the simulation of the top-level primary inputs and outputs.

If you are using VCS and save your VCS configuration to a file, this file may beused to invoke VCS for the rest of this lab with the same exact window sizes andwaveforms displayed.

NOTE: The simulator process must be left active; otherwise, the wave win-dow will not resize or redraw if moved. The easiest way to guarantee this is todo each step in this lab in a new terminal window.


Step 2. Simulate a netlist. After simulation in Step 1, replace your copied Lab01Intro Top.sct with the modified one provided in your Lab27 directory. Thedesign rules in the Lab27 version have been changed so that the gate count inthe netlist is minimized. You may wish to use a text editor to compare the originalIntro Top.sct with the new one to see how this was done.

After copying in the new Intro Top.sct, synthesize the design with it. Thenew script will write out the synthesized netlist and an SDF file automatically. TheSDF file will be based solely on the synthesis library’s wireload model, but its timingwill be far more accurate than that of the original verilog design.

Make up a new simulator file list containing TestBench.v, the netlist file justcreated, and the LibraryName v2001.v verilog component library file provided.Simulate the netlist and compare the waveforms with those of the original design inStep 1.

Fig. 22.2 Timing differences between the verilog source (above) and its synthesized netlist

As shown in Fig. 22.2, the netlist-simulation wave shapes should be somewhatdifferent from those of the source verilog, because the actual library gate delays arevery short (approximated in the LibraryName v2001.v file) compared with thedelays in the source verilog. Most noticeably, a glitch in the Xwatch output wavewill be missing in the netlist simulation.

After examining the simulation timing, leave up the netlist simulation waveformwindow, but close the wave window, or exit the simulation session, of the originaldesign (Step 1).

Step 3. Simulate the back-annotated netlist. Copy TestBench.v, the netlist file,the simulator file list, and the SDF file of Step 2 into the ba1 directory you createdabove.

22.2 SDF Lab 27 409

Open the netlist file in a text editor and find the module Intro Top. Add aninitial block in that module to reference the SDF file. For example,

initial $sdf annotate(‘‘Intro TopNetlist.sdf’’);

Simulate again, and compare the SDF back-annotated waveforms with those ofthe original netlist of Step 2. All things considered, the shapes should be the same,and the timing of the outputs should be close to the same. There will be small tim-ing differences (see Fig. 22.3), because the original netlist used timing from theverilog component library, LibraryName v2001.v, which was hand-created withapproximate timing, whereas the back-annotated netlist uses more accurate SDF de-lays created by the synthesizer from the original LibraryName.lib (compiled bySynopsys as LibraryName.db) database.

Fig. 22.3 Netlist timing for an approximated verilog library (above) and synthesizer SDF back-annotation

When you have examined the delays, close the simulator waveform display win-dow from Step 2 but leave up the display from this Step.

Step 4. Modify the SDF timing and simulate to see the difference. Copy all filesfrom ba1 of Step 3 to the ba2 directory you created previously.

Let’s modify the timing of the Intro Top.Z output port to see how this mightbe done by back-annotation by a floorplanner or layout editor. This is just for in-structional purposes: A designer never would edit manually an SDF file under nor-mal conditions.

Open the verilog netlist file in a text editor, find the Intro Top module, andlocate the driver of the Z output port; this probably will be an inverter gate of somekind. Just find the last driver cell of the Z output and note the component type ctypeand the instance name inst name of that directly-connected driver cell.


Then, open the SDF file in a text editor and look for this statement,

(CELL(CELLTYPE ‘‘ ctype’’)(INSTANCE inst name)

Below the instance name, there should be a DELAY and IOPATH statement. It isthe IOPATH statement which determines the delay on that path in inst name whenthe netlist is simulated with the SDF back-annotation. The first triplet is rise delay;the second is fall. The correct instance will be at the top of the design representedin the SDF file, and it will not be preceded with a hierarchical path.

You might like to look in the LibraryName v2001.v file to see how the hand-entered verilog delay differs from the synthesizer’s delay for this component. Thisdifference will account for our simulator display differences. However, in general,an SDF delay will be instance-specific; whereas the library delay can be only com-ponent (module) specific. The actual database used by the synthesizer to calculatethe SDF delays can be seen in LibraryName.lib, in your Synopsys library instal-lation directory area.

Anyway, to see how the SDF file controls timing, edit the IOPATH rise delay forthe Z output to be about 20 times the original value, and the fall delay to be about 10times. You may wish to comment out the original line; use a verilog “//” to do this.Then, save the file and repeat the simulation. The effect of the new delays shouldbe fairly obvious: As visible in Fig. 22.4, the final Z positive pulse in the simulationshould be displaced to a greater time and should become narrower than it was withthe unaltered SDF back-annotation.

Fig. 22.4 Back-annotated netlist timing: Original SDF (above) vs. hand-modified SDF

22.2 SDF Lab 27 411

Optional. You may consider repeating Step 4 but entering different min vs. typvs. max values in the SDF file, and then invoking the simulator to use any one ofthese sets of delays.


What if the $sdf annotate task call is located in the wrong module?How might just part of a design netlist be back-annotated?


Optional. Recall Step 2 of today’s lab. Explain in detail why a brief transition to 0on the top-level X output occurred during simulation of the original design but notfor the original synthesized netlist.


23.1 Wrap-up: The Verilog Language

23.1.1 Verilog-1995 vs. 2001 (or 2005) Differences

We have continuously exercised one major difference between verilog-1995 andstandards later than verilog-2001: The ANSI-C module header declaration format.Some other differences new in 2001 were: Implementation of generate, multi-dimensional arrays (verilog-1995 allowed only 1 dimension), the signed arith-metic keyword (only integers or reals were signed in verilog-1995), and theautomatic keyword for recursion in tasks or functions. In addition, there wasstandardization of the use of SDF, and there were VCD dump file enhancements.There were a host of other, perhaps less important changes, all explained in theSutherland paper referenced in today’s Additional Study.

23.1.2 Verilog Synthesizable Subset Review

By now, all this may be well understood. However, here’s a brief review:The synthesizer can not synthesize:

• delay expressions,• initializations (initial blocks),• timing checks,• verilog system tasks or functions;

or, generally, anything else which is transient in, or depends on, simulation time.The component delays used by the synthesizer for timing optimization and timingconstraint violations come from its own library models and not from specifyblocks or other verilog constructs.

The synthesizer may reject, or issue a severe warning about, procedural blockscontaining even one nonblocking assignment with a delay. However, it will



synthesize correctly any sequence either of blocking or nonblocking assignmentswhich are not associated with delays. Delays in the verilog simply are ignored.

The synthesizer will synthesize generate blocks, looping or conditional.The synthesizer will enforce certain good coding practices by rejecting assign-

ments to the same variable from different always blocks, and by rejecting mixedblocking and nonblocking assignments to the same variable or in the same alwaysblock.

The synthesizer has great difficulty synthesizing correctly any construct whichis equivalent functionally to a complicated latch – which is to say, to a complicatedlevel-sensitive or transparent latch. To avoid latch problems, (a) do not omit vari-ables from an always block with a change-sensitive event control list; and (b) donot omit logic states from cases assigned in a change-sensitive block, unless thecode represents a simple latch with just one output bit.

23.1.3 Constructs Not Exercised in this Course

We have indeed covered the entire verilog language, including some constructs notuseful in VLSI design; but, we have omitted a few details which differ unimportantlyfrom ones covered, or which are not frequently implemented in any tool. Here is acomplete list:

• All the verilog language keywords are listed in Appendix B of IEEE Std 1364.Thomas and Moorby (2002) section 8.5, also contains a table of them, a miscel-laneous few of which we have ignored in the present work.

• File input/output manipulations. These include $readmemb and $readmemh.Note that the correct verilog filesystem name divider in Windows is ‘/’, not ‘\’!

Thomas and Moorby (2002) sections F.4 and F.8 present the available file-related system functions. There also is some relevant discussion of file I/O inBhasker (2005), section 3.7.4.

• System tasks and functions. For the most part, consistent with the synthesisorientation of the course, we have called attention to these constructs only whennecessary. The many builtin tasks and functions we have ignored are reason-ably self-explanatory, given our practice in the few we have used. Some goodexplanation may be found in Bhasker (2005), section 10.3. See a list of thembelow.

• Many compiler directives (‘directive). These also are quite easily understood,for the most part, from their names. See a list of them below.

• Attributes. These are Boolean expressions in a scopeless block delimited by thetwo tokens, “(∗” and “∗)”, the same tokens as used in Pascal for comments.For example, (∗ dont touch U1.nand002 ∗). Verilog attributes are tool-specific and are meant to control synthesis or simulation behavior. Current toolsuse comment directives (“//synopsys . . .”) for this purpose and generallywill not recognize verilog attributes.

23.1 Wrap-up: The Verilog Language 415

• defparam, force-release, assign-deassign. As has been explained,these constructs should be avoided in modelling code, although force-release sometimes might be useful for special testbench functionality or fordebugging.

• Declaration of module output ports as reg. This is permitted in verilog-2001, as a compatibility carryover from verilog-1995. It is not recommendedas a design practice, because internal wires no longer can be connected to suchports (bringing back the error-proneness of the 1995 module header format), andmodule delays on such ports are difficult to make visible. Also, the presenceof mixed wire and reg ports requires perpetual attention to mutually exclu-sive port assignments (assign vs. always), negating any small savings oftime because of omission of internal reg declarations paired with continuousassignments.

• The verilog PLI, which is discussed below.

Closely related, omitted topics which are not part of verilog:

• Many synthesizer constraints have not been exercised. The command referencemanual for the Synopsys DC synthesizer contains hundreds of commands, thou-sands of options, and exceeds 2,000 printed pages.

• The Synopsys DC Shell dcsh constraint syntax of dc shell or designvision command mode. This is very similar to Tcl syntax, but its use is depre-cated now by Synopsys.

• The Synopsys Design Constraint language (SDC), which is an adaptation of Tcland which is licensed free by Synopsys. SDC is meant to provide scripting andconstraint portability among tools such as synthesizers, static timing verifiers,power estimators, coverage estimators, and formal verifiers. It can be used withVHDL or any other language such as System Verilog or SystemC, if the tool willaccept it.

23.1.4 List of all Verilog System Tasks and Functions

These are documented in the IEEE 1364 Std document, section 17. They fallinto ten categories; a complete list is given in the next table. The functional-ity usually is indicated by the name, but the Std or other help documentationshould be consulted for full explanation. Many are intended solely for testbenchuse.

No tool set known to the author has implemented all the system tasks andfunctions; however, all tools known to the author have implemented the subsetpresented in this course. The ones we have exercised are in bold in the tablebelow.


In choosing a new task or function for some special purpose, it is advisable toverify that all tools involved have implemented that choice correctly (or at least inexactly the same way).

Type Task and Function Names

Display $display $displayb $displayh $displayo $monitor$monitorb $monitorh $monitoro $monitoroff $strobe$strobeb $strobeh $strobeo $write $writeb $writeh$writeo $monitoron

Time $time $stime $realtime

Sim. Control $finish $stop

File I/O $fclose $fdisplay $fdisplayb $fdisplayh $fdisplayo$fgetc $fflush $fgets $fmonitor $fmonitorb$fmonitorh $fmonitoro $readmemb $swrite $swriteo$sformat $fscanf $fread $fseek $fopen $fstrobe$fstrobeb $fstrobeh $fstrobeo $ungetc $ferror$rewind $fwrite $fwriteb $fwriteh $fwriteo$readmemh $swriteb $swriteh $sdf annotate $ssacf$ftell

Conversion $bitstoreal $itor $signed $realtobits $rtoi$unsigned

Time Scale $printtimescale $timeformat

Command Line $test$plusargs $value$plusargs

Math $clog2 $ln $log10 $exp $sqrt $pow $floor $ceil $sin$cos $tan $asin $acos $atan $atan2 $hypot $sinh$cosh $tanh $asinh $acosh $atanh

Probabilistic $dist chi square $dist exponential $dist poisson$dist uniform $dist erlang $dist normal $dist t$random

Queue Control $q initialize $q remove $q exam $q add $q full

VCD $dumpfile $dumpvars $dumpoff $dumpon $dumpports$dumpportsoff $dumpportson $dumpall $dumpportsall$dumplimit $dumpportslimit $dumpflush$dumpportsflush $comment $end $date $enddefinitions$scope $timescale $upscope $var $version $vcdclose

PLA Modelling $async$and$array $async$nand$array $async$or$array$async$nor$array $sync$and$array $sync$nand$array$sync$or$array $sync$nor$array $async$and$plane$async$nand$plane $async$aor$plane $async$nor$plane$sync$and$plane $sync$nand$plane $sync$or$plane$sync$nor$plane

23.1 Wrap-up: The Verilog Language 417

23.1.5 List of all Verilog Compiler Directives

A complete list follows below. No tool known to the author has implemented allthese directives, but many tools which have not implemented a directive are likelyto ignore it or just issue a warning. As suggested before, if feasible, use ‘undef atthe end of any file in which a macro name has been ‘defined.

‘begin keywords ‘celldefine ‘default nettype ‘define ‘else‘elsif ‘endcelldefine ‘endif ‘end keywords ‘ifdef ‘ifndef‘include ‘line ‘nounconnected drive ‘pragma ‘resetall‘timescale‘unconnected drive ‘undef.

23.1.6 Verilog PLI

PLI stands for “Programming Language Interface”. The PLI strictly speaking is notpart of the verilog language, so we have not discussed it previously in this course.However, the PLI is specified in the same IEEE standard document (Std 1364) whichspecifies the verilog language.

The PLI language is C, and the PLI is a library of routines callable from C. ThePLI is a feature of the verilog simulator which allows users to design and run theirown system tasks and even to create new verilog-related applications to be run by thesimulator. Examples of such applications would be fault-simulators, timing calcula-tors, or netlist report-generators. Like the builtin system tasks such as $display,the user-defined ones also must be named beginning with ‘$’.

The PLI generally is not especially useful for a designer, but it may be valuableto someone creating tools for a design team. Another use might be to implementcertain simulator features not available in the current release by a particular toolvendor.

Examining the messages printed by the VCS simulator when it compiles a newmodule, one can see that it simply is a specialized compiler compiling and linkingan executable C program. The VCS GUI is a different, precompiled program whichoptionally creates an interprocess communications link with this executable whenthe latter runs; the executable’s I/O then is formatted and displayed by the GUI. Tosave run time and memory, VCS can be invoked in a text mode, with no GUI; theVCS help menu shows how. QuestaSim, Silos, Aldec and other verilog simulatorswork the same way as VCS, although the GUI and the simulator may be more orless tightly coupled.

In the past, the verilog simulation executable was interpreted (like a BASIC pro-gram or shell script) rather than being compiled. Except for running slower than acompiled executable (but being modifiable faster), the interpreted version workedessentially the same way as the compiled version.


Fig. 23.1 Sketch of PLI relationship to simulation and synthesis flows

The PLI operates on the simulator itself (actually, on its internal data structuresand operational runtime code), as shown in Fig. 23.1. Two independently accessibledatabases representing the verilog design are the verilog design source code, whichmay be accessed by shell scripts or by visual inspection in a text editor, and thesynthesized netlist, which, if written out in verilog format, also may be accessibleby shell script or text editor. The simulator internal data structures generally are ina proprietary binary format; but, if a design is simulatable, the internal data mustconsist of objects representing the design structure, component models (includingprimitives), nets, and attributes of the corresponding verilog objects. Busses andports must be represented independently for every bit, and, for netlist simulation, allloops must be represented unrolled. Over and above the objects in the design, thesimulator database includes state, delay, and simulation time information.

As of the 2005 Std, the PLI consists solely of the VPI or verilog procedural in-terface routines. Past versions of the PLI included two other collections of routines,TF or task-function routines and ACC or access routines, which now are absorbedinto the VPI. All three collections are overviewed in chapter 13 of Palnitkar (2003),and we won’t go into their properties further here.

23.2 Continued Lab Work (Lab 23 or later)


On the enhancements introduced in verilog-2001, read Stuart Sutherland’s excel-lent summary, “The IEEE Verilog 1364-2001 Standard: What’s New, and Why You

23.2 Continued Lab Work (Lab 23 or later) 419

Need it”. Based on an HDLCon 2000 presentation. http://www.sutherland-hdl.com/ papers/2000-HDLCon-paper Verilog-2000.pdf (2005-02-03).

Read the summary of verilog-2001 enhancements in Thomas and Moorby (2002)Preface pp. xvii–xx.


For file I/O, see sections 9.5.1 and 9.5.5 and the Palnitkar CD example Memory.v.Read section 14.6 on coding for logic synthesis.Read through all of chapter 14 on logic synthesis.Read chapter 13 on the PLI.


24.1 Deep-Submicron Problems and Verification

24.1.1 Deep Submicron Design Problems

We restrict this discussion to electrical factors of interest to a verilog designer. Deepsubmicron effects related strictly to fabrication are ignored (ion implantation; detailsof optical proximity correction; shift to shorter, ultraviolet mask exposure wave-lengths; immersion lithography; etc.).

The term “deep submicron” was adopted in the late 1990s to refer to layoutpitches below about 250 nm; or, below about 1/4 of a micron. At this scale, tracedelays begin to overtake gate delays as primary considerations in design timing.For our purposes here, we include nanoscale technology in this deep-submicroncategory (Fig. 24.1).

Fig. 24.1 Linear proportional gate schematic sizes and trace widths. Routing represented across achip region of constant width

Some generalities about the deep-submicron problem: Call the linear pitch sizeL. The major issue is that trace delays along distances L begin to contribute signifi-cantly to total delays, and anything introducing variation in trace timing becomes amajor problem:



• The rate of electric charge transfer to or from a transistor is about proportionalto its area, or L2. But the delay on a trace is more or less proportional to its RCtime constant. The resistance R increases linearly as trace width decreases, andcapacitance C to ground decreases about the same way, so the time constant of atrace because of this reactivity remains about the same as pitch shrinks. To chargeto a given voltage, capacitance C of a trace decreases linearly as one power of L.Thus, the time to charge a trace of a given length increases about proportionallyto the one remaining power of L. Individual trace lengths become important fordelay calculation.

• As gate sizes shrink, distances on chip begin to increase relative to gate size (seeFig. 24.1). So, delays begin to grow because distances connecting gate pins donot shrink as rapidly as L, which is proportional to the distance between traces.

• As a greater and greater number N of gates is fabricated on a given chip area, thenumber of possible connections (traces) grows as something between 2N and N2.Thus, on the average, as L shrinks (and chip complexity grows), the final fanoutrequired of the average gate grows. For this reason, too, loading caused by tracesbegins to dominate timing estimation on the average.

In addition to propagation delay timing problems, pitches at and below 90 nmhave introduced new problems related to fabrication-layer current leakage and noise.In particular, shrinking distance between traces has increased capacitance betweenthem and therefore has increased the average cross-coupled noise.

Larger designs, lower supply voltages, and smaller pitches imply that librariesmust contain a greater variety of different gates, including scan cells. Characteriza-tion must be more elaborate because of the smaller potentials and closer spacing ofthe gates on chip.

Library simulation and synthesis models can not simply store gate delays; in-stead, delays have to be calculated according to gate connectivity context, usinginput slew rates, output capacitance loading, current source capability, and chip op-erating conditions (process quality, voltage, and temperature – “PVT”). A sophis-ticated scalable polynomial, composite current source, or other complicated modelmust be used.

Also, random manufacturing process variations tend to be greater, the smaller thepitch. So, electromigration has become an issue for aluminum metal traces, requir-ing the more complex process of fabrication of copper traces.

As pitches have shrunk, clock speeds have increased; and, with leakage, this hasput power dissipation in the forefront of all design problems. To mitigate power loss,and to reduce switching slew rates, voltage supplies on chip have dropped from 5 Vor more to around 1 V.

Power loss, which goes more or less as the square of supply voltage, can be re-duced by optimizing different chip areas for different operating voltages, so voltagelevel shifters have become commonplace. All clock and data traces generally mustbe level-shifted between different voltage domains. Furthermore, power consump-tion of many battery-operated digital devices has to be minimized by isolating differ-ent chip areas, using specialized data isolation cells, for low-speed or sleep modes

24.1 Deep-Submicron Problems and Verification 423

of operation. Sleep modes may require retention of state by additional scan-likesequential cells, or by register duplication in always-on retention latches.

In addition to random fabrication variations, the inevitable but predictable diffrac-tion of light at near-wavelength dimensions becomes a major factor in mask design,requiring corrective reshaping of masks (optical proximity corrections) to obtainpattern images adequate to fabricate working devices in the silicon. The growingrequirement for diffractive corrections is shown in Fig. 24.2.

Fig. 24.2 The growth of optical proximity correction in digital design (Reproduced courtesy ofW. Staud, Invarium, Inc.)

Diffraction problems are equivalent to quantum-effect problems. For light, thequanta have the wavelength of the masked photons; for fabricated structures, thequanta have the wavelength of the logic-gate electrons. Low-k dielectric leakage isdirectly calculable in terms of quantum tunnelling of the electrons which operatethe field-effect transistor components of each on-chip CMOS device.

In a different concern, huge designs have made simple full internal scan chainsimpractically slow. Deep submicron internal scan has to be hierarchical, with sub-chains nested and muxed to the TAP port, requiring complicated TAP controllerlogic to select subchains for the hardware tester.

At modern clock speeds, RF interactions also are becoming important, especiallyfor serial transfer domains, such as those of ethernet, wireless antenna interfaces,and PCI Express.

All these new problems can be solved by proper physical design. While physicaldesign is beyond the scope of this course of study, good design, especially intelligentpartitioning, at the verilog source level can ease greatly the work of finding physicalsolutions to these, and to other, new, deep submicron problems as they arise.

Design pitches will continue to shrink until quantum uncertainty finally brakesprogress in this direction. It is unclear what will be the scale at which fundamental


limits prevent further reductions in digital structures, but it probably will be below10 nm, the linear extent of 50 to 100 average atoms in the crystal structure of a solid.

24.1.2 The Bigger Problem

One should be aware that the deep-submicron problems definitely will increase aspitches shrink. However, based on industry surveys, even at about 130 nm, the pri-mary causes of chip failures resulting in respin still are functional failure (logicaldesign error) or clock-timing design error. According to these data, at 90 nm onewould expect that perhaps only 25% of chip failures specifically will be because ofdeep submicron factors.

Therefore, the verilog designer can prevent the majority of chip respin fail-ures by entering correct functionality, with accurate assertion limits, and bypreparing error-free clocking.

24.1.3 Modern Verification

Palnitkar (2003) and others group assertions with verification tools; however, thepresent author would prefer to view assertions as part of the design, along withtiming checks and synthesis constraints.

The present author views verification as an operation independent of design en-try; verification is a practice which finds design errors. Assertions should be enteredby the designer and usually are based solely on design considerations (how the de-sign should simulate). Of course, as a result of design verification, QA assertionsmay be added, after the fact, to catch errors by preventing them from recurringsilently.

So, we shall not discuss assertions here, with verification.Verification of a verilog design should be understood as beyond the simple use

of a compiler for a simulator or synthesis application. Compilation only checks thesyntax of the language in which the design which was entered.

Usually, functional verification is done during design by use of a simulator. Be-cause the synthesized netlist is optimized by netlist library component timing, andnot verilog source code delays, simulation must be repeated, at least on a design-unitlevel, using a netlist and back-annotated delays.

Although a full chip, nowadays easily over 5 million equivalent gates, can notbe debugged functionally on the typical workstation, designers resort to simulationof a netlist of arbitrary size by using distributed computation, hardware simulationacceleration, or hardware emulation in FPGA’s.


Timing verification is a major step in signoff of a physical design before tape-outfor manufacturing. Any one delay out of specified limits, on a chip with twentymillion traces, can cause failure and loss of millions of dollars in redesign andrefabrication costs. Potentially, such failures can cause additional, far greater, saleslosses because of late entry in the product market window.

Timing verification rarely is done by simulation, because of infeasibility of mean-ingfully simulating tens of millions of traces, and because of inability to evaluatesimulation results even if such exhaustive simulations could be done. Instead, statictiming verifiers are used which estimate delays from physical library data and back-annotated netlists. A static verification run can be completed on a modern design ina day.

24.1.4 Formal Verification

An important tool for functional verification is formal verification. A formal verifi-cation tool runs statically, thus making possible the complete checking of very largedesigns.

There are several distinct kinds of formal verification:

• Equivalence Checking. This is the most popular type of formal verification;tools implementing equivalence checking of a new netlist against a “golden stan-dard netlist” (functionally known to be correct) are mature and reliable. Typi-cally, a verdict is reached of “equivalent”, “functionally different”, or “cannotdetermine equivalence”.

Tools which check a netlist against an RTL model are not so well perfectedand often impose coding style requirements in addition to those for synthesis. Ofcourse, any RTL design which can be synthesized can be submitted to a netlistequivalence checker.

There also exist equivalence checkers for system-level designs against RTLimplementations; however, such tools currently are experimental and are not yetwidely used.

• Model Checking. These tools are analogous to tools which check library com-ponent characterization in ALF or Liberty against a SPICE model: A model isdefined in a specialized language (for example, PSL; or, Property SpecificationLanguage; IEEE Std 1850), and an implementation in an HDL such as verilogthen is checked against the model. Often, such a tool can be considered a re-quirements checker as well as an implementation checker. Bugs can be foundthis way, and specific design units can be verified; but, usually, a complete ver-ification of an entire implementation is not practical. Many of these tools canhandle simple sequential logic.

• Formal Proving. This approach describes an implementation in logical state-ments and attempts to find inconsistencies. It can be applied fairly easily toblocks of combinational logic, but it can be more complex and difficult thanan HDL when sequential logic is involved.


Some tools combine formal verification with partial simulation, allowing thedesigner to step through formally difficult parts by simulation, and to obtain com-plete verification of those parts of the design most amenable to formalmethods.

24.1.5 Nonlogical Factors on the Chip

We have studied only digital design in this course. Digital design requires correctfunctionality and correct timing for correct logical operation. However, independentof these two primary factors are those implied by physical interactions among logicgates and by location on the chip. A working chip must have a proper distributiongrid for gate power supply (VDD and VSS), and, it must be designed so that gateactivity can not interfere with that of other gates:

• Power Distribution. As already mentioned, chips can be designed with regionsoperating at different voltages. The usual reason for this is conservation of power(in battery-operated devices) or limitation of high temperatures caused by powerdissipation. Also, adoption of IP blocks in a big chip can be facilitated if the chipcan meet special power-supply requirements of the IP.

Many devices can operate in a “normal” vs. a sleep mode. In the sleep mode,clocking simply may be suspended; however, power also may be reduced orswitched off to reduce leakage in unused regions. In many designs, state canbe saved on entering sleep mode and restored when normal function is resumed.

A chip typically is bounded by its I/O pads; just inside these pads, for chips witha single power supply, power and ground are distributed in two rings. A specificvoltage region in a multivoltage chip also will be ringed by power and groundsupplies. The rings provide noise isolation in addition to supplying power.

Within any voltage region, the high and low potentials will be supplied in a gridcovering the interior of the region. Each row of cells will get vias connecting itspower pins to the high and the low supplies. The grid provides multiple currentsources and sinks so that brief surges are averaged out, reducing variation at eachindividual gate.

• Noise Estimation. Before being taped out, a chip must be checked by simulationor static verification for noise. No possible gate activity should be capable of cou-pling unconnected logical states from one gate to another. Coupling simply maybe capacitive; however, when changes are rapid, and gate distances and switchingtransients are fast enough to cause metal traces to maintain significant potentialdifferences along themselves, inductive coupling also must be taken into account.Switching changes also can radiate electromagnetically at trace corners, causingpower loss in addition to noise.

Coupled noise can cause changes in logic level. More common is the problemof alteration of timing: A coupled potential in the direction of a changing levelin a nearby trace can speed up the change; in the opposing direction, it can retard


the change. Either of these effects can cause failure of the nearby operation bycreating a setup or hold error.

To prevent noise, the design must include shielding (unused power or groundtraces between switching logic traces) or must allow enough distance betweennearby traces that the coupling is not significant under normal operating con-ditions. Long parallel runs of possibly interfering logic must be rerouted. Thethickness (perpendicular to chip surface) of the traces also may be varied toreduce coupling, because thick traces cross-couple capacitively better than thinones; however, in many fabrication processes, metal or polysilicon thickness ispredetermined and can not be varied locally.

In any event, the noise of contiguous logic, especially traces, must be estimatedand controlled if the chip is to be fabricated reliably.

24.1.6 System Verilog

System Verilog is an Accellera standard recently adopted by IEEE. The last Ac-cellera version was 3.1a; this draft is available as a free download from the Accelleraweb site, but the only authoritative document is from IEEE.

This language is a superset of verilog-2001 which has been extended to includemany C ++ design features as well as a standalone assertion sublanguage.

An important goal of System Verilog is to make porting of code to and fromC ++ easier than it has been with verilog-2001. Another important goal is to makecomplex, system-level design and assertion statements directly expressible.

Some features of System Verilog are

• User-defined types (typedefs as in C + +), and VHDL-like type-castingcontrols.

• Various pointer, reference, dynamic, and queue types.• Simulator event scheduler extensions.• Pascal-like nested module declarations, VHDL-like packages for declara-

tions, and C ++-like class declarations (with single inheritance, only).• New basic variable type (logic) may be assigned procedurally or concurrently;

new basic two-state bit type.• Timing-check-like gated event control by iff:

always@(a iff Ena==1’b1) ...;

• timeunit and timeprecision declarations instead of ‘timescale.• C-like break, continue, and return (no block name required).• Module final block; synthesis always block variants: always latch,

always ff, always comb.• New interface type to make reuse of module I/O declarations easier.• A self-contained assertion language subset permitting assertions in module scope

or in procedural code.


• covergroup for functional code coverage.• New programming interface, the DPI (Direct Programming Interface).

In the present author’s opinion, the most important of these enhancements are:

• packages• break, continue, and return• interface types• assertion language.

24.2 Continued Lab Work (Lab 23 or later)


(Optional) Read Thomas and Moorby (2002) section 11.1 for a project in whichtoggle-testing is used to represent power dissipation as a function of adder structure.You may wish to implement the verilog models suggested.


Read section 15.1 on simulation, emulation, and hardware acceleration as aids todesign verification. Also read the summary in section 15.4.

Read section 15.3 on formal verification.

Index

*, in event control, 46-> (event trigger), 177$display, 34, 146, 165$display, example, 44$finish, 11, 165, 195, 196$fullskew timing check, 297$hold timing check, 298$monitor, 34, 174$nochange timing check, 299$period timing check, 299$readmemb, 414$readmemh, 414$recovery timing check, 298$recrem timing check, 299$removal timing check, 298$sdf annotate, 407$setup timing check, 298$setuphold timing check, 298$skew timing check, 297$stop, 165, 195, 302$strobe, 34, 174$time, 34$timeskew timing check, 297$width timing check, 297, 299

adder vs. counter, 104ALF, library format, 114always block, 24, 177, 196always, event control syntax, 32always, for concurrency, 203always, scope, 269arithmetical shift, 121array, addressing, 86array, multidimensional, 85array, select, 86array, verilog, 84arrayed instance, 213

assertion, defined, 34assertion, example, 58assign, continuous, 4, 14assign-deassign, 415assign-deassign, to avoid, 116assignment, blocking, 32, 119, 178assignment, nonblocking, 32, 119, 178asynchronous control, priority, 48asynchronous controls, 48, 49automatic, 206automatic task or function, 147

back-annotation, 19, 405Backus-Naur Format (BNF), 45BASIC programming language, 417behavioral, 102, 104behavioral flowchart, 134behavioral synchronization, serial clock, 132behavioral synthesis, 140BIST (Built-In Self-Test), 382bitwise operators, 12BNF, 45boundary scan, 379bufif1, 122, 185, 217bufif1, switch-level model, 251Built-in self-test (BIST), 382

case, 31, 120case equality, 121, 200case, example, 122, 199case, expression match, 199case-sensitivity, verilog, 11casex, expression match, 200casex, to be avoided, 201casez, expression match, 201casez, wildcard match, 202cell, configuration keyword, 280

429

430 Index

charge strengths, 253checksum, 89chip failures, causes, 424clock domains, 108clock domains, 2-stage synchronizing ffs, 272clock domains, independent, 235, 271clock domains, serdes, 130, 131clock domains, synchronizing latches, 272clock generator, always, 33clock generator, concurrent, 196clock generator, forever, 33clock generator, restartable, 197clock, serdes embedded, 232, 235, 237, 239,

240clocked block, 48clocks, implementing, 33cmos, 252cmos primitive, 251CMOS, switch-level primitive, 250collapsing test vectors, 377comment region, macro, 24, 27comment tokens, verilog, 11comment, synthesis directive, 24comment, verilog, 23concatenation, 87concurrent block, 44concurrent block names, scope, 269conditional, 121conditional operator, 31conditional, expression match, 200config, 280config, configuration keyword, 280config, scope, 269config, to be avoided, 281constant, verilog, 43contention, 115, 124contention in verilog, 114continuous assignment, 4, 12, 14, 25, 48corner case testing, 378counter, 101, 104counter, carry look-ahead, 106counter, gray code, 107counter, one-hot, 102counter, ring, 107counter, ripple, 106counter, synchronous, 106counter, unsigned binary, 101counter, verilog, 71, 77coverage summary, 377coverage, hardware testing, 377coverage, in software, 377

D flip-flop, from nands, 189, 191, 192D flip-flop, verilog, 37

D latch, verilog, 37DC macro, predefined, 27decoder, example, 220decoder, tree example, 220decoder, verilog, 122deep submicron effects, 421default, configuration keyword, 280default nettype, 186, 187, 216define, 215define, scope, 269defparam, 415defparam, to be avoided, 262delay pessimism, 171delay pessimism, moderated in specify, 305delay triplet, example, 285delay value, units, 13delay, #0, 172delay, 6-value in specify, 287delay, blocking, 172delay, conditional in specify, 288delay, conflict within specify, 289delay, declared on net, 284delay, distributed, 282delay, in nonblocking, 169delay, intra-assignment, 169delay, lumped, 282delay, lumped example, 284delay, min and max, 248delay, multivalued, 171, 247delay, nonblocking, 172delay, not in UDP, 243delay, overlap with specify, 289delay, pessimism, 246delay, polarity in specify, 288delay, procedural, 23, 171, 178delay, procedural avoided, 233delay, regular, 169delay, scheduled, 169–171delay, to x, 171delay, transport (VHDL), 170delay, triplet, 249delay, trireg to ‘x’, 253delay, with strength, 247DesDecoder, project synthesizable, 339Deserializer, concurrent schematic, 337Deserializer, project schematic, 326Design Compiler, flattening logic, 10Design Compiler, script functionality, 7Design for Test (DFT), 375design partitioning, for synthesis, 271design partitioning, rules, 270design vision, netlist viewer, 7design, configuration keyword, 280DFT (Design for Test), 375

Index 431

DFT, summarized, 383disable statement, 130disable, example, 133disable, task or function, 147dont touch, in script vs verilog, 78

ECC, 88–90, 92, 93ECC, finite element, 91ECC, parity, 90ECC, serial data, 94ECO, example, 236edge, functional defined, 246edge, timing defined, 246endconfig, configuration keyword, 280equivalence checking verification, 425error limit, pulse filter, 303error-handler, generic, 164event (keyword), 177event control, 24event control @, 177event control wait, 178event control, inline, 32event queue, stratified, 172, 173event queue, verilog, 116event, active, 170, 173, 174event, declared, 177event, future, 174event, inactive (#0), 173, 174event, monitor, 174event, nonblocking, 174event, queue example, 175event, regular, 170event, vs. evaluation, 173exponentiation, 121expression, defined, 46

fault simulator, 377FIFO, 131, 151FIFO bubble diagram, 159FIFO dataflow, 152FIFO parts, 153FIFO project states, 158FIFO schematic, 158FIFO state logic, 159FIFO transition logic, 160, 161, 163FIFO, dual-port RAM, 338FIFO, project dual-clocked, 338FIFO, project synthesizable, 338for, 31, 195, 197–199, 217for, examples, 198force-release, 415force-release, to avoid, 116forever, 195, 196fork-join, 149, 170, 202, 203

formal proving verification, 425format specifier, example, 35format specifiers, 34frame, serdes project, 69full-duplex serdes, 392full-path delay, 287function, 146function declaration, 146, 147function, automatic, 206function, example, 148, 233, 234function, scope, 269function, width indices, 207

gate-level, 104generate, 213–215generate, block declarations, 218generate, conditional, 215generate, decoder tree, 222, 223generate, loop example, 217generate, loop scope quiz, 224generate, looping, 216, 218generate, no nesting, 216generate, scope, 269generate, simple decoder, 219generate, unrolled naming, 218genvar, 218genvar, in looping generate, 216

hard macro, defined, 78hierarchy, in verilog, 211

identifier, ASIC library component, 244identifier, escaped, 44identifier, verilog, 44if, 31, 195if, expression match, 199ifdef, 215ifdef example, 28ifdef, example, 74include, example, 73inertial delay, 109, 119, 303inertial delay example, 303inertial delay, simulators, 172initial, 196initial block, 11, 24initial block, example, 2, 12initial, cautions, 33initial, scope, 269inout, 188instance arrays, 213instance, configuration keyword, 280, 281instance, of module, 13integer, 43interface, in System Verilog, 270

432 Index

interface, partitioning, 269internal scan, 380IP Block, 183

JTAG, 50, 52, 53, 59, 379

keywords lower case, 11

lane, defined, 392lane, PCIe, 67large, charge strength, 253latch, 47latch error, examples, 47latch synthesis, 47LFSR, 89, 91, 92, 383LFSR polynomial, 90Liberty library timing checks, 295Liberty, library format, 114LIFO, 151literal, 43literal expression, syntax, 13literal syntax, 26literal, syntax example, 15localparam, 227, 259localparam, conditional example, 357localparam, example, 275logic levels in verilog, 12logical operators, 12

macro (compiler directive), 45macro, examples, 73macro, recommended usage, 28medium, charge strength, 253memory ECC, 88Mentor proprietary information, xximessaging tasks, 34model checking verification, 425module, 11module header, 11, 12module header formats, 21module instance, scope, 269module, ANSI header, 259module, ANSI header example, 261module, contents, 2module, output reg ports, 415module, scope, 269module, traditional header, 260module, traditional header example, 262modules, for concurrency, 204MOS, resistive strength rules, 250MOS, switch-level primitives, 250mux, schematic, 39mux, switch-level model, 255mux, verilog, 39

named block, 129, 130, 147nand, switch-level model, 256nmos, 251nmos primitive, 250noise estimation problems, 426none (implied net default), 187nor, switch-level model, 256noshowcancelled inertia, 305noshowcancelled specparam, 305not, 217not, switch-level model, 251, 252notif1, 217notifier, 297notifier reg example, 303notifier, in timing check, 296, 302

observability, 376operator precedence, verilog, 121operators, bitwise vs. logical, 128operators, verilog table, 120

packet, serdes, 131packet, serdes project, 69parallel block (fork-join), 149parallel-path delay, 287parallel-serial converter, 78parameter, 22, 27, 44, 80, 188, 227, 259parameter declaration, 188parameter override, 188, 189parameter real, 259parameter signed, 259parameter, in ANSI header, 260parameter, index range, 259parameter, not in literals, 77parameter, override by name, 260parameter, override by position, 261parameter, real, 73parameter, signed, 73, 261parity, memory, 88partitioning, analog-digital example, 237pass-switch primitives, 252path delays, full and parallel, 287PATHPULSE conflict rules, 304PATHPULSE example, 304PATHPULSE specparam, 303PATHPULSE, inertial delay control, 303PCI Express (PCIe), 67PCIe lane, 67PLL, 61PLL 1x, 61PLL comparator, synthesizable, 318PLL, 1x schematic, 62PLL, 1x synthesizable, 318–320PLL, 32x, 70

Index 433

PLL, 32x blocks, 71PLL, 32x schematic, 72PLL, clock extraction, 133PLL, digital lock-in, 64PLL, synthesizable, 314, 325pmos, 251pmos primitive, 250port connection rules, 187port map. of instance, 13power distribution problems, 426primitive, 243primitive, scope, 269procedural, 102procedural assignment, 14procedural block, 45procedural block names, scope, 269pulldown, 186pulldown primitive, 252pullup, 186pullup primitive, 252pulse filtering limits, 303pulsestyle ondetect inertia, 305pulsestyle ondetect specparam, 305pulsestyle onevent inertia, 305pulsestyle onevent specparam, 305

race condition, 49, 116, 117, 173race condition, defined, 116race, initial blocks, 118RAM, bidir wrapper, 98RAM, Mem1kx32 schematic, 96RAM, simple verilog, 87RAM, size issues, 83rcmos primitive, 252real variable, 62realtime reg type, 306reconvergent fanout, 36reg, 12, 14reg , in output port, 23reg vs trireg, 253reg, input port illegal, 23rejection limit, pulse filter, 303relational expression, of ‘x’, 119repeat, 197replication, 121rnmos primitive, 250rounding of decimals, 73rpmos primitive, 250RTL, 104, 132RTL, defined, 103rtran primitive, 252rtranif0 primitive, 252rtranif1 primitive, 252

scan chain, 56, 57scan, boundary, 52, 379scan, internal, 50, 380scheduled conflicts, 172SDC, Synopsys Design Constraint format, 415SDF file, 18SDF summary, 407SDF syntax, 406, 407SDF, delay override, 406SDF, in verilog flow, 405SDF, net delays, 406SDF, path delays, 406SDF, use with simulator, 406serdes FIFO, 68serdes project block diagram, 361serdes, class project, 69serdes, packet, 81serdes, project block diagram, 312serdes, project to full-duplex, 392serial-parallel converter, 231Serializer, project schematic, 363shift register, 35shift register, example, 234, 356shift register, RTL, 40shift register, schematic, 36, 38, 40showcancelled inertia, 305showcancelled specparam, 305simulators, strength spotty, 123small, charge strength, 253soft errors, hardware, 375, 383source switch-level models, 252specify block, 285specify block summary, 285specify block, 6-value delays, 287specify, scope, 269specparam, 285, 286specparam example, 286specparam, with timing triplets, 286SPEF, 406SPICE, 251SR latch, 5, 189, 190state machine, design, 150state machine, verilog, 151statement, defined, 46static serial clock synchronization, 141strength, assigning, 115strength, charge, 114strength, charge values, 253strength, drive, 113strength, resistive MOS rules, 250strength, table, 114strength, with delay, 247string, verilog, 44strings, verilog storage, 33

434 Index

structural, 102–104supply0 (net type), 187supply1 (net type), 187switch level, 114switch-level model, 249switch-level primitive logic, 185Synopsys proprietary information, xxisystem function, 45system task, 45System Verilog, 270, 415System Verilog summary, 427SystemC, 415

T flip-flop, 105, 143table, in UDP, 243–245TAP, 423TAP controller, 52, 379, 383task, 146task data sharing, 147task declaration, 146task, automatic, 206task, concurrency example, 203task, example, 233task, exercise, 164task, for concurrency., 202task, scope, 269testbench, Intro Top example, 2three-state buffer, 122, 124time reg type, 305timescale, 16, 216timescale macro, 13timescale specifier, 4timing arc, defined, 281timing arc, examples, 282timing check, 45timing check as assertion, 295timing check feature summary, 296timing check negative limits, 300–302timing check notifier, 296, 302timing check, conditional event, 302timing check, data event, 296timing check, limits must be constant, 297timing check, reference event, 296timing check, table of all 12, 297timing check, time limits, 296timing check, timecheck event, 296timing check, timestamp event, 296timing checks vs system tasks, 295timing checks, in QuestaSim, 297timing path and arcs, 281timing path, causality, 282timing triplet, example, 285timing triplets, 249toggle flip-flop, 105

tran primitive, 252tranif0 primitive, 252, 255tranif1 primitive, 252, 255transfer-gate primitives, 252tri (net type), 186tri0 (net type), 186tri1 (net type), 186triand (net type), 186trior (net type), 186trireg (net type), 186trireg switch-level net, 253trireg vs tran primitive, 252trireg, example, 253TSMC proprietary information, xxi

UDP, 243UDP, combinational example, 244UDP, sequential example, 245UDP, summary, 246use, configuration keyword, 280

VCD file, 17vector, 13, 25vector index syntax, 14vector, bit significance, 15vector, example, 16vector, logical operator, 26vector, negative index, 30vector, select, 86vector, sign bit, 25vector, width (type) conversion, 26verification, forrnal, 425verification, functional, 424verification, timing, 425verilog tutorial, from Aldec, xxiiiverilog, 1995 vs 2001, 413verilog, ACC C routines, 418verilog, arrayed instance, 213verilog, attributes, 414verilog, clocked block, 178verilog, coding rules, 177, 178verilog, comment directives, 414verilog, compiler directives, 414, 417verilog, conditional compile, 215verilog, configuration, 279verilog, declaration ordering, 205verilog, declaration regions, 205verilog, hierarchical names, 212, 213, 268verilog, hierarchy path, 211verilog, keywords, 414verilog, named block, 129verilog, PLI, 45, 415, 417verilog, primitive gates, 185verilog, scope of names, 269

Index 435

verilog, simulator file I/O, 414verilog, synthesizable, 50verilog, synthesizable summary, 413verilog, system tasks and functions, 414–416verilog, TF C routines, 418verilog, UDP (primitive), 243verilog, variable, 43verilog, VPI C routines, 418VFO, project FastClock oscillator, 315VFO, synthesizable, 315, 316VHDL, 415

wand (net type), 186watch-dog device, 208while, 197, 198while, examples, 198width specifier, literals, 4wire, 12wire (net type), 186wire, implied names, 186wire, implied types, 186wire, other net types, 186wor (net type), 186wrapper module methodology, 322

digital vlsi design with verilog: a textbook from silicon valley technical institute

Documents