wolfgang roesner verification tools development ibm corp austin, tx logic simulation : languages,...
TRANSCRIPT
Wolfgang RoesnerVerification Tools DevelopmentIBM Corp Austin, TX
Logic Simulation :Languages, Algorithms, Simulators
Outline
Hardware Design Languages
Modeling Levels - A Taxonomy of HDL Constructs
General Purpose HDL Simulators - Event-Driven Sim
Improving Simulator Performance
Synchronous Design Methodology and Cycle-Based Sim
Let's Design a Cycle-Based Simulator
Simulation of Hardware Design Languages
Logic or "functional" simulation today is done mostly with HDLs (Hardware Design Languages)
Most popular languages today (both are IEEE standards) Verilog VHDL
Verilog: logic modeling and simulation language started in EDA industry (start-up) in the 80's was acquired by Cadence donated to IEEE as a general industry standard approx. 60% market share in U.S. EDA market
VHDL: committee-designed language contracted by U.S. (DoD) (ADA-derived) functional/logic modeling and simulation language approx. 40% market share in U.S. EDA market
Modeling Levels - Highest Level : Interface
Let's look at the common constructs we use to specify the functionality of a piece of hardware:
Model
inputs
outputs
t
input behavior over time
output behavior over time
Modeling Levels - Major Dimensions (I)
Temporal Dimension: continous (analog) gate delay (psec?) clock cycle instruction cycle events
Data Abstraction: continuous (analog) bit : multiple values bit : binary abstract value composite value ("struct")
discrete time
discrete value
Modeling Levels - Major Dimensions (II)
Functional Dimension: continuous functions (e.g. differential equations) Switch-level (transistors as switches) Boolean Logic Algorithmic (eg. sort procedure) Abstract mathematical formula (e.g. matrix
multiplication) Structural Dimension:
Single black box Functional blocks Detailed hierarchy with primitive library elements
A good VHDL-centric taxonomy can be found at: http://rassp.scra.org
Modeling Levels - Major Dimensions (III)
Continuous Gate Delay Clock Cycle Instruction Cycle Events
Continuous Multivalue Bit Bit abstract value "struct"
Continous Switch Level Boolean Logic Algorithmic Abstract Mathematical
Single Black Box Functional BlocksDetailed
Component Hierarchy
Temporal
Data
Functional
Structural
Coverage of Modeling Levels - Verilog
Continuous Gate Delay Clock Cycle Instruction Cycle Events
Continuous Multivalue Bit Bit abstract value "struct"
Continous Switch Level Boolean Logic Algorithmic Abstract Mathematical
Single Black Box Functional BlocksDetailed
Component Hierarchy
Temporal
Data
Functional
Structural
Coverage of Modeling Levels - VHDL
Continuous Gate Delay Clock Cycle Instruction Cycle Events
Continuous Multivalue Bit Bit abstract value "struct"
Continous Switch Level Boolean Logic Algorithmic Abstract Mathematical
Single Black Box Functional BlocksDetailed
Component Hierarchy
Temporal
Data
Functional
Structural
*
* extremely inefficient compared to Verilog
In HDL Terms
VHDL, Verilog were both defined as simulation languages
Big emphasis on the structural refinement VHDL: entity/architecture/component/port/signal Verilog: module/instance/port/signal/reg
Specification of function General: programming language constructs
VHDL : user-defined data type, package, procedure, function, sequential code
Verilog: function, task, sequential code Parallelism:
VHDL : process, signal update, wait Verilog: "always" block, fork/join, wait, event construct
Special purpose H/W constructs VHDL : concurrent assignment, delayed assignment, signal, Boolean logic Verilog : continuous assignment, 4-value Boolean logic, switch-level
support
General HDL Simulators
Before we can look at architectures of simulators we need to understand the execution model that the HDLs imply:
VHDL's execution model is defined in detail in the IEEE LRM (Language Reference Manual)
Verilog's execution model is defined by Cadence's Verilog-XL simulator ("reference implementation")
Event-Driven Execution
An event-driven VHDL example
process (count)begin my_count <= count; trigger <= not trigger;end process;
process (trigger)begin if (count<=15) then count <= count + 1 after 1ns; else count <= 0 after 1ns; end if;
end process;
Block 1
Block 2
Each process:- loops forever- waits for change in signal from other process
A more hardware-oriented example
s(0) <= a(0) xor b(0) after 2ns;c(0) <= a(0) and b(0) after 1ns;
s(1) <= a(1) xor b(1) xor c(0) after 2ns;c(1) <= (a(1) and b(1)) or (b(1) and c(0)) or (c(0) and a(1)) after 1ns;
sum_out(1 to 0) <= s(1 to 0);carry_out <= c(1);
and
a
band
xor
s(0)
c(0)
xorxor =>
s(1)sum_out(1)
and
and
or
or
=>
=>
sum_out(0)
carry_out
A more hardware-oriented example
s(0) <= a(0) xor b(0) after 2ns;c(0) <= a(0) and b(0) after 1ns;
s(1) <= a(1) xor b(1) xor c(0) after 2ns;c(1) <= (a(1) and b(1)) or (b(1) and c(0)) or (c(0) and a(1)) after 1ns;
sum_out(1 to 0) <= s(1 to 0);carry_out <= c(1);
and
a
band
xor
s(0)
c(0)
xorxor =>
s(1)sum_out(1)
and
and
or
or
=>
=>
sum_out(0)
carry_out
Let's simulate:a=11b=01
Time = 0ns (step1)
Red Boxes : evaluate in current step
1
1
1
0
A more hardware-oriented example
s(0) <= a(0) xor b(0) after 2ns;c(0) <= a(0) and b(0) after 1ns;
s(1) <= a(1) xor b(1) xor c(0) after 2ns;c(1) <= (a(1) and b(1)) or (b(1) and c(0)) or (c(0) and a(1)) after 1ns;
sum_out(1 to 0) <= s(1 to 0);carry_out <= c(1);
and
a
band
xor
s(0)
c(0)
xorxor =>
s(1)sum_out(1)
and
and
or
or
=>
=>
sum_out(0)
carry_out
Let's simulate:
a=11b=01
Time = 0ns (step2)
1
1
1
0 1
A more hardware-oriented example
s(0) <= a(0) xor b(0) after 2ns;c(0) <= a(0) and b(0) after 1ns;
s(1) <= a(1) xor b(1) xor c(0) after 2ns;c(1) <= (a(1) and b(1)) or (b(1) and c(0)) or (c(0) and a(1)) after 1ns;
sum_out(1 to 0) <= s(1 to 0);carry_out <= c(1);
and
a
band
xor
s(0)
c(0)
xorxor =>
s(1)sum_out(1)
and
and
or
or
=>
=>
sum_out(0)
carry_out
Let's simulate:
a=11b=01
Time = 0ns (step3)
1
1
1
0 1 1
A more hardware-oriented example
s(0) <= a(0) xor b(0) after 2ns;c(0) <= a(0) and b(0) after 1ns;
s(1) <= a(1) xor b(1) xor c(0) after 2ns;c(1) <= (a(1) and b(1)) or (b(1) and c(0)) or (c(0) and a(1)) after 1ns;
sum_out(1 to 0) <= s(1 to 0);carry_out <= c(1);
and
a
band
xor
s(0)
c(0)
xorxor =>
s(1)sum_out(1)
and
and
or
or
=>
=>
sum_out(0)
carry_out
Let's simulate:
a=11b=01
Time = 0ns (step4)
1
1
1
0 1 1 1
A more hardware-oriented example
s(0) <= a(0) xor b(0) after 2ns;c(0) <= a(0) and b(0) after 1ns;
s(1) <= a(1) xor b(1) xor c(0) after 2ns;c(1) <= (a(1) and b(1)) or (b(1) and c(0)) or (c(0) and a(1)) after 1ns;
sum_out(1 to 0) <= s(1 to 0);carry_out <= c(1);
and
a
band
xor
s(0)
c(0)
xorxor =>
s(1)sum_out(1)
and
and
or
or
=>
=>
sum_out(0)
carry_out
Let's simulate:
a=11b=01
Time = 1ns (step1)
1
1
1
0 1 1 1
1
1
1
1
A more hardware-oriented example
s(0) <= a(0) xor b(0) after 2ns;c(0) <= a(0) and b(0) after 1ns;
s(1) <= a(1) xor b(1) xor c(0) after 2ns;c(1) <= (a(1) and b(1)) or (b(1) and c(0)) or (c(0) and a(1)) after 1ns;
sum_out(1 to 0) <= s(1 to 0);carry_out <= c(1);
and
a
band
xor
s(0)
c(0)
xorxor =>
s(1)sum_out(1)
and
and
or
or
=>
=>
sum_out(0)
carry_out
Let's simulate:
a=11b=01
Time = 1ns (step2)
1
1
1
0 1 0 1
1
1
1
1
1
A more hardware-oriented example
s(0) <= a(0) xor b(0) after 2ns;c(0) <= a(0) and b(0) after 1ns;
s(1) <= a(1) xor b(1) xor c(0) after 2ns;c(1) <= (a(1) and b(1)) or (b(1) and c(0)) or (c(0) and a(1)) after 1ns;
sum_out(1 to 0) <= s(1 to 0);carry_out <= c(1);
and
a
band
xor
s(0)
c(0)
xorxor =>
s(1)sum_out(1)
and
and
or
or
=>
=>
sum_out(0)
carry_out
Let's simulate:
a=11b=01
Time = 1ns (step3)
1
1
1
0 1 0 0
1
1
1
1
1
1
A more hardware-oriented example
s(0) <= a(0) xor b(0) after 2ns;c(0) <= a(0) and b(0) after 1ns;
s(1) <= a(1) xor b(1) xor c(0) after 2ns;c(1) <= (a(1) and b(1)) or (b(1) and c(0)) or (c(0) and a(1)) after 1ns;
sum_out(1 to 0) <= s(1 to 0);carry_out <= c(1);
and
a
band
xor
s(0)
c(0)
xorxor =>
s(1)sum_out(1)
and
and
or
or
=>
=>
sum_out(0)
carry_out
Let's simulate:
a=11b=01
Time = 1ns (step4)
1
1
11
0 1 0 0
1
1
1
1
1
1
A more hardware-oriented example
s(0) <= a(0) xor b(0) after 2ns;c(0) <= a(0) and b(0) after 1ns;
s(1) <= a(1) xor b(1) xor c(0) after 2ns;c(1) <= (a(1) and b(1)) or (b(1) and c(0)) or (c(0) and a(1)) after 1ns;
sum_out(1 to 0) <= s(1 to 0);carry_out <= c(1);
Let's simulate:
a=11b=01
Time = 1ns (step5)
and
a
band
xor
s(0)
c(0)
xorxor =>
s(1)sum_out(1)
and
and
or
or
=>
=>
sum_out(0)
carry_out
1
1
11
0 1 0 0
1
1
1
1
1
11
A more hardware-oriented example
s(0) <= a(0) xor b(0) after 2ns;c(0) <= a(0) and b(0) after 1ns;
s(1) <= a(1) xor b(1) xor c(0) after 2ns;c(1) <= (a(1) and b(1)) or (b(1) and c(0)) or (c(0) and a(1)) after 1ns;
sum_out(1 to 0) <= s(1 to 0);carry_out <= c(1);
and
a
band
xor
s(0)
c(0)
xorxor =>
c(1)carry_out
and
and
or
or
=>
=>
sum_out(0)
carry_out
Let's simulate:
a=11b=01
Time = 1ns (step5)
1
1
11
0 1 0 0
1
1
1
1
1
11
That's enough - you get the point
The statement-(process-) dependencies define a network.
Changes are dynamically propagated through the network
Event-Driven Simulation
The simulator maintains a list of all atomic executable blocks a data structure that represents the interconnect of the
blocks via signals a value table that holds all current signal values
At start time the simulator schedules all executable blocks of the models
Core-Architecture of an Event-Driven Simulator
Block Code
Schedule Signal Updates
Scheduler Data- Signal Sink Lists
- Event Queue
Select next scheduled block
More Blocks?
Signal - Updates
More Blocks?
IncrTime
Done
Improving Simulation Speed
The most obvious bottle-neck for functional verification is simulation throughput
There are two basic ways to improve throughput Simulator performance Running many simulations in parallel
Parallelization Hard : parallel simulation algorithms
much parallel event-driven simulation research has not yielded a breakthrough hard to compete against "trivial parallelization"
Simple: run independent testcases on separate machines Workstation "SimFarms" 100s - 1000s of engineer's workstations run simulation in the background ideal parallelization factor
Improving Simulator Performance (I)
Full-HDL support
If full cover of VHDL/Verilog is important
Optimizing compiler techniques treat sequential code constructs like general programming
language all optimizations for language compilers apply:
– data/control-flow analysis– global optimizations– local optimizations (loop unrolling, constant propagation)– register allocation– pipeline optimizations– etc. etc.
Global optimizations are limited because of model-build turn-around time requirements
Example: modern microprocessor is designed w/ ~1Million lines of HDL
Imagine the compile time for a C-program w/ 1M lines!
Improving Simulator Performance (II)
Full-HDL support
Better scheduling algorithms scheduling is clearly the bottle-neck in all event-simulators
Use hybrid techniques: some of the simplifications discussed in the following can be
applied to localized "islands" of HDL. requires an HDL compile process that automatically analyzes the
structure of the model and uses "speed-up" modes for sub-partitions
Problem:– assume 50% of the model has such islands– even if we could speed up simulation of those parts to take 0 time, we
would only gain a speedup factor of 2x
Improving Simulator Performance (III)
Simplifications
Use higher-level HDL specification
s(0 to 2) <= ('0' & a (0 to 1)) + ('0' & b(0 to 1) );
sum_out(0 to 1) <= s(1 to 2);carry_out <= s(0);
s(0) <= a(0) xor b(0) after 2ns;c(0) <= a(0) and b(0) after 1ns;
s(1) <= a(1) xor b(1) xor c(0) after 2ns;c(1) <= (a(1) and b(1)) or (b(1) and c(0)) or (c(0) and a(1)) after 1ns;
vs.
Improving Simulator Performance (IV)
Principles of using higher-level HDL specification Common theme: cut down of number of scheduled events
create larger sections of un-interrupted sequential code use less fine-grain granularity for model structure
– -> smaller number of schedulable blocks use higher-level operators use zero-delay wherever possible
– methodology implications: timing verification is not done together with functional simulation
Data abstractions use binary over multi-value bit values
– multi-value : use only for bus contention situations to resolve several drivers with different strengths (strong, resistive, high-impedance)
use word-level operations over bit-level operations
NEXT : Most Powerful - methodology-based subset of HDLs
Synchronous Design Methodology
Clock the design only so fast the longest possible combinational delay path settles before cycle is over
Cycle time depends on the longest topological path Hazards/Races do not disturb function
Longest topological path can be analytically calculated w/o using simulation -> stronger result w/o sim patterns
and
xorxor
and
and
or
or
clock : dependent on critical path
Logic Design Groundrules Synchronous design LSSD : design for testability Critical delay path defines the clock frequency
Behavioral Function and Timing Correctnesscan be verified independently
Design can be verified independently of its implementation -> Logic Synthesis -> Custom Design -> Synthesizable, high-level HDL as main vehicle for functional
verification IF: Boolean Equivalence Checking proofs closure between functional
an implementation view
Functional Verification can use zero-delay functional simulation --> Cycle-Based Simulation, FSM-based Formal Verification
Functional View of Synchronous Designs
Logic mapped to a non-cyclic network of Boolean functions Network also constains state-holding primitives : latch/reg/array HDL does not contain any timing information Function can be evaluated by zero-delay signal evaluation --> Speed of Simulation + Simplicity of Tools
LatchesArrays Boolean
LogicNetwork
Cycle Simulation Algorithm
LevelizedCombinational
LogicLat
che
s
A
B
C
Logic is ordered into levels so thatorder of evaluation is correct.E.g., A and B are computed before C.
Why Is This Faster?As an example, let's assume we compile our circuit into a stream of pseudo instructions....
and
a
band
xor
s(0)
c(0)
xorxor =>
s(1)sum_out(1)
and
and
or
or
=>
=>
sum_out(0)
carry_out
Load temp1, a(0)Load temp2, b(0)Xor temp1, temp2, temp3Store temp3, s(0)And temp1, temp2, temp3Store temp3, c(0)Load temp1, a(1)Load temp2, b(1)Xor temp1, temp2, temp3Load temp4, c(0)Xor temp3, temp4, temp5Store temp5, s(1)
We can cover every Boolean function into a minimal set
(~4 or better)instructions
Methodology FlowRTL
(VHDL, Verilog)
Language Compile Model Build
Physical VLSI Design Tools / Custom Design
Cycle-BasedSimulation Model
Formal Verification :
Boolean Equivalence
CheckVERITY
Cycle-BasedSimulator
Storage Elements
Latch Inference for sequential HDL if there is a possible set of signal evaluations that leave
the value of a signal undefined -> that signal is assumed to be a storage element
Verification & Logic Synthesis must interpret HDL the same way (check with Boolean Equivalence Checking)
HDL Compilation Options For Cycle Sim
Preserve the HDL structures Compile HDL like a programming language Preserve design hierarchy, processes, modules, functions,
procedures Implementation process very similar to programming language
compilers Incremental processing is trivial Model optimization is hard (cross-functional boundary) and limited
Map HDL to library of primitive functions (e.g. IBM "Texsim")
Crush design hierarchy to increase optimization potential Synthesis-like process, but simpler because of missing physical
constraints Incremental model build is very hard Designer view of hierarchy must be preserved for model debug
process
Evaluation Sequence Options For Cycle Sim
Oblivious Evaluation For every cycle, evaluate all Boolean logic Minimal amount of book-keeping and runtime control data
structures Large amount of redundant evaluation for large models with low
change activity
Event-Driven Evaluation Evaluate changes only Minimize redundant work for low-activity models Book-keeping overhead (But: use synchronicity constraint)
Hybrid Techniques (example: Texsim) Use design partitions as base granularity for event-driven
evaluation Use “key controling signals” as guards for event-driven evaluation
Function Evaluation Options For Cycle Sim
Interpreted Map design network into an efficient data structure which a
simulator “walks” at run-time “Model” is data, highly portable Few successfull examples where interpretation ofn a
datastructure didn’t add significant performance loss due to indirection
Compiled (example: Texsim) Map design network into a sequence of instructions Generating “C-source” is not an industrial-strength option “Model” consists of machine-instructions, platform dependent Many programming language compiler optimizations apply, but
the problem is simpler (more constrained)
IBM's Texsim simulator
Flattening of the hierarchy Network optimizations - using levelized network
Constant propagation DeMorgan’s law and other Boolean optimizations Equivalent function removal Merging of functions with only one fanout
Final levelization Structural analysis and logic partitioning (e.g. latch-partitioning,
below Create model symbol table and allocate value table Compile logic by partition and level
Generate reference history for use by register allocation Generate machine independent code & perform peephole optimization Perform instruction scheduling Generate target machine object code
Orders of Magnitude in Speed
Event Simulator 1
Cycle Simulator 20
Event driven cycle Simulator
50
Acceleration 1000
Emulation 100000
These numbers are supposed to give you intuitive feel:
The actual numbers depend on many other factors: Model activity rate Test bench activity & implementation language Multiple clock domains
We could write a simulator now...
But we have not talked about many areas:
Model-Build software principles (how to make gigantic model in minutes)
Simulator user interface (how to talk to user and to programs) How a cycle simulator deals with multi-value signals
do we need those ? Yes, mainly for bus logic and power-on-reset sim
How do we take the cycle-sim algorithm and implement in special purpose hardware : simulation acceleration, emulation
Where is there still a need for pure event-simulation?
Good research: only few papers in the field during the last 10 years Good place to start is Peter Maurer at Florida State. Extremely
interesting for folks who want to explore different algorithms