To bug or not to bug: Lessons learned developing
bug-compatible software
22 June 2005
Eric Keiter
Dept. 9237: Electrical and Microsystems Modeling
Sandia National Laboratories
Albuquerque, NM, USA
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000.
Outline
Context: What is circuit simulation? What is Xyce? What do you mean, “bug-compatible”?
Lessons learned
Development tools: Bugzilla, Bonsai, CVS, autotools, etc.
The release process
Analog Circuit Simulation (SPICE or Xyce)
System of coupled differential algebraic equations (DAEs).
Kirchhoff’s laws, arbitrary network formulation: Modified Nodal Analysis (modified KCL).
Most equations are Kirchhoff Current Law (KCL) equations.
Most solution variables are nodal voltages.
Most currents are obtained via an Ohm’s law relationship.
Nonlinear system of equations: implicit time integration, with Newton’s method for the nonlinear solve at each time step.
Linear system of equations (Ax=b) solved at each Newton iteration.
Kirchhoff’s Current Law (KCL)
Kirchhoff’s Voltage Law (KVL)
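As a minimal sketch of the solve loop described above, the following C++ applies backward Euler (an implicit integrator) and Newton’s method to a single-node circuit: a voltage source feeding a resistor, with a capacitor and a diode from the node to ground. The circuit, its parameter values, and all names (`DiodeRcCircuit`, `kclResidual`, `kclJacobian`, `stepBackwardEuler`, `simulate`) are assumptions chosen for illustration; none of them come from Xyce or SPICE. With one unknown, the linear system Ax=b at each Newton iteration reduces to a scalar division.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical single-node test circuit: source -> resistor -> node,
// with a capacitor and a diode from the node to ground.
struct DiodeRcCircuit {
    double vs = 5.0;      // source voltage [V]
    double r  = 1.0e3;    // series resistance [ohm]
    double c  = 1.0e-6;   // node capacitance [F]
    double is = 1.0e-14;  // diode saturation current [A]
    double vt = 0.02585;  // thermal voltage [V]
};

// KCL residual at the node after backward-Euler discretization:
// capacitor current + resistor current + diode current = 0.
double kclResidual(const DiodeRcCircuit& ckt, double v, double vPrev, double h) {
    return ckt.c * (v - vPrev) / h
         + (v - ckt.vs) / ckt.r
         + ckt.is * (std::exp(v / ckt.vt) - 1.0);
}

// Derivative of the residual with respect to the node voltage (a 1x1 Jacobian).
double kclJacobian(const DiodeRcCircuit& ckt, double v, double h) {
    return ckt.c / h + 1.0 / ckt.r + (ckt.is / ckt.vt) * std::exp(v / ckt.vt);
}

// One implicit time step: Newton iteration started from the previous solution.
double stepBackwardEuler(const DiodeRcCircuit& ckt, double vPrev, double h) {
    double v = vPrev;
    for (int iter = 0; iter < 50; ++iter) {
        // "Ax = b" for a single unknown: J * dv = -f.
        const double dv = -kclResidual(ckt, v, vPrev, h) / kclJacobian(ckt, v, h);
        v += dv;
        if (std::fabs(dv) < 1.0e-12) break;
    }
    return v;
}

// Integrate from v(0) = 0 to tEnd with a fixed step h; returns the final node voltage.
double simulate(const DiodeRcCircuit& ckt, double tEnd, double h) {
    assert(h > 0.0 && tEnd >= 0.0);
    const int steps = static_cast<int>(std::round(tEnd / h));
    double v = 0.0;
    for (int n = 0; n < steps; ++n) {
        v = stepBackwardEuler(ckt, v, h);
    }
    return v;
}
```

Calling `simulate(DiodeRcCircuit{}, 1.0e-2, 1.0e-5)` integrates 10 ms in 1000 steps; with these assumed values the node voltage settles at roughly 0.69 V, the diode’s forward drop. A production simulator would add adaptive time stepping and residual-based convergence checks, which this sketch omits.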
Simulation Hierarchy
Analog circuit simulation is just one of many types of simulation used by electrical designers.
Tradeoff between fidelity and speed/problem size.
Digital (VHDL) simulation: very fast, but low fidelity.
TCAD device simulation: very slow, but very high fidelity.
Circuit simulation: in-between.
[Figure: simulation hierarchy along a speed/fidelity axis. Software; digital/mixed-signal simulation (digital, VHDL); analog (ODE) simulation (like SPICE; Xyce); device-scale (PDE) simulation (TCAD device; Charon). Also labeled: board and device parasitics.]
History of analog circuit simulation (SPICE)
SPICE = Simulation Program with Integrated Circuit Emphasis. Developed at UC-Berkeley.
1969: Class project, called CANCER = Computer Analysis of Nonlinear Circuits, Excluding Radiation. 6000 lines of Fortran.
1971: SPICE 1. Released into public domain.
1975: SPICE 2. PhD thesis of Larry Nagel. 8000 lines of Fortran.
1989: SPICE 3. PhD thesis of Tom Quarles. 135,000 lines of C.
Xyce, excluding solver libraries, is ~400,000 lines of C++. Progress!
SPICE was popular because it was public domain, and it was quickly adopted by universities and by industry. SPICE became an industry standard for analog circuit simulation.
Around 1990, most circuit simulation innovation became commercialized, mostly with codes built on the SPICE3 engine: PSpice, HSPICE, SmartSpice. Sporty Spice, Scary Spice, Posh Spice.
More recently, EDA companies have been buying each other, and retiring a lot of SPICE-related products.
More Miscellaneous Context
10 years ago, the conventional wisdom was that iterative matrix solvers don’t work for circuits. Wrong!
In 2000, there were over 500 PSpice licenses at Sandia.
Many SPICE models are terrible, yet they are part of the standard. Simplifications based on 1960s computers. Discontinuities, bad derivatives.
As SPICE is a design tool, a bad model can be OK. Sometimes a designer just wants a generic behavior, rather than a precise one. “Bug-compatible” is expected.
One Perspective on SPICE
Q: If you could go back over your career and change something within your field, what would you change?
Pease: “I’d shoot the guys who were going to invent SPICE.”
Robert A. Pease, ad insert from EDN “Movers and Shakers” 2003 supplement.
• Taken from a talk by Larry Nagel, the inventor of SPICE
Xyce Requirements
Xyce is an ASC project, started in 1999.
Requirements:
SPICE-compatible
Massively parallel
Include special Sandia physics models (i.e., radiation)
Useful to unsophisticated users
Another way to put it:
“Be the same as SPICE, but also be much better.”
“I never want to see the ‘Time step too small’ error again.” – Ken Marx, Sandia Electrical Analyst
Xyce Status, circa 2005
We’ve mostly succeeded:
SPICE-compatible (sort of)
Radiation models
Demonstrated parallel scalability on “real” problems (see chart)
Real customers using Xyce on real DSW problems.
We still have a lot to do!
[Chart: NWCC/Spirit simulation time. Time (s) versus number of processors.]
Photograph of Permafrost ASIC (PA)
PA Scaling study (~800,00 devices)
[Chart: scaled problem speedup, 14,000 devices/processor. Scaled speedup versus number of processors (0 to 1024). Curves: optimal speedup, total solution.]
Largest Analog Circuit Calculation Ever Performed!
Xyce™ massively parallel circuit analysis code
Nonlinear transmission line problem
14 million electrical devices modeled
6 million unknowns
Over a factor of 500 speedup using 1024 processors of LLNL’s ASCI White supercomputer
14,000 electrical devices per processor
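The figures on this slide can be cross-checked with two lines of arithmetic. The second line takes parallel efficiency E as speedup S divided by processor count P, which is a common definition assumed here rather than one stated on the slide.

```latex
\[
14{,}000 \ \tfrac{\text{devices}}{\text{processor}} \times 1024 \ \text{processors}
  = 14{,}336{,}000 \approx 14 \ \text{million devices}
\]
\[
E = \frac{S}{P} > \frac{500}{1024} \approx 0.49
\]
```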
How did we do it? Lessons learned
Hire good people.
Get everyone on the same page. Difficult, but REALLY important!
Keep an open mind about legacy (i.e., SPICE) codes.
Support the low end as well as the high end. Most Xyce simulations are not massive.
Use development tools (within reason): Bugzilla, Bonsai, UML, etc.
Regression testing: nightly, weekly, monthly.
Rigorous requirement-tracking, issue-tracking, and bug-tracking.
Formal release process.
Lesson: Hire good people
“Hire the best people you can and then get out of their way!” – Bill Camp, circa 1993.
This may seem obvious, but in practice it is difficult. Look for:
Lots of programming experience, even if not SPICE (or whatever).
Programs for fun, and has done so for years (maybe as a kid).
Background in numerical programming.
Handholding a bad developer is (eventually) a waste of time.
One great developer is worth 5 lousy developers.
Lesson: Get everyone on the same page
Note: Make sure you have a page.
Good people often disagree.
Good people, when they are new, are going to do “dumb” things. This includes everyone, even the best people. Don’t get upset - plan for it!
Have an established code-development “business model”.
Interact with new developers a LOT. Correct misconceptions early.
Have documentation ready: developer guide, theory guide, etc.
This is the most important issue.
The Same Page: Coding standard/guidelines
This is the first issue for Xyce where getting all the developers to agree was difficult.
C++ or C-style comments?
Philosophy of pointers and/or references?
Curly bracket usage?
Standard class and function comment headers.
Should we use STL? (ASC Red didn’t support it)
Directory structure?
The important thing was not so much WHAT we chose, but that we chose SOMETHING and STUCK TO IT.
Many of these things are a matter of taste.
In general, to achieve consensus, final choices were: Acceptably Imperfect Something everyone was willing to live with
The Same Page: Plan for new developers in the code.
Some parts of the code benefit a LOT from a very sophisticated structure, especially modules with just one developer (Xyce topology).
However, be careful: some parts of the code will be touched by many developers, with varying skill levels. Example: the Xyce Device module (>10 developers since 2000).
Continual refactoring and abstracting in the multi-developer sections can be problematic
Part of my initial development of the Xyce device package was getting other developers working in it and giving me feedback. This included (among others) 2 undergrad student interns.
Lesson: Open mind about legacy code
Initially, we thought all of SPICE was lousy.
Slamming it was popular (by users and developers): “I’d shoot the people who were going to invent SPICE”.
Much of it looked strange, “out of the mainstream”.
The last “free” version of SPICE came out in the late 1980s.
Reality: some of it was lousy, some of it very clever: Kundert sparse direct solver, voltage limiting.
As the project has evolved, we’ve referred to the SPICE3 source a lot. It is the standard we are attempting to emulate, after all. To an extent, we have to be “bug-compatible”!
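To illustrate why voltage limiting counts as one of the clever parts, here is a hedged C++ sketch. It solves the DC operating point of a source-resistor-diode loop with Newton’s method, with and without a junction-voltage limiter. The limiter is loosely modeled on the logarithmic step compression used in SPICE-family simulators; it is a simplification written for this example, not a copy of any SPICE or Xyce routine. The circuit values and the names `limitJunctionVoltage`, `solveDiodeDc`, and `NewtonResult` are assumptions.

```cpp
#include <cassert>
#include <cmath>

struct NewtonResult {
    bool converged;
    int iterations;
    double v;
};

// Simplified junction-voltage limiter: compresses a large proposed Newton
// step logarithmically so exp(v/vt) is not evaluated far past the diode knee.
double limitJunctionVoltage(double vNew, double vOld, double vt, double vCrit) {
    if (vNew > vCrit && std::fabs(vNew - vOld) > 2.0 * vt) {
        if (vOld > 0.0) {
            const double arg = 1.0 + (vNew - vOld) / vt;
            return (arg > 0.0) ? vOld + vt * std::log(arg) : vCrit;
        }
        return vt * std::log(vNew / vt);
    }
    return vNew;  // small steps pass through unchanged
}

// DC operating point of a hypothetical source-resistor-diode loop,
// solved by Newton's method from an initial guess of 0 V.
NewtonResult solveDiodeDc(bool useLimiting, int maxIterations) {
    assert(maxIterations > 0);
    const double vs = 5.0, r = 1.0e3, is = 1.0e-14, vt = 0.02585;
    const double vCrit = vt * std::log(vt / (std::sqrt(2.0) * is));
    double v = 0.0;
    for (int iter = 1; iter <= maxIterations; ++iter) {
        const double f  = (v - vs) / r + is * (std::exp(v / vt) - 1.0);  // KCL residual
        const double df = 1.0 / r + (is / vt) * std::exp(v / vt);        // its derivative
        double vNew = v - f / df;
        if (useLimiting) vNew = limitJunctionVoltage(vNew, v, vt, vCrit);
        const double change = std::fabs(vNew - v);
        v = vNew;
        if (change < 1.0e-9) return {true, iter, v};
    }
    return {false, maxIterations, v};
}
```

With these assumed values, `solveDiodeDc(true, 50)` converges in well under 30 iterations. The unlimited solve overshoots to about 5 V on its first step and then creeps back by roughly one thermal voltage per iteration, so `solveDiodeDc(false, 50)` runs out of iterations before converging.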
Prototype
Make your initial mistakes in a code that you plan to throw away.
Summer, 2000 - Xyce prototype was written. Some of the structure was retained for Xyce; some was thrown away.
Developer tools
Rational Rose - UML
CVS/ Bonsai
Bugzilla
Regression testing
Certification testing
Xyce UML Diagrams
[Figures: package diagram; class diagram; “Abstract Factory” design pattern; C++ code.]
We used Rational Rose for initial code generation.
For a larger project, round-trip engineering may have been worth doing - for us, it wasn’t.
One danger - spending months developing a UML diagram before attempting code generation - result: un-compilable code.
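For readers unfamiliar with the Abstract Factory pattern named on this slide, the following is a minimal C++ sketch. The slides do not say which Xyce classes used the pattern, so every class here (`SolverFactory`, `LinearSolver`, `Partitioner`, and the serial/parallel variants) is hypothetical. The point is only that client code asks one factory object for a consistent family of related objects and never names a concrete class.

```cpp
#include <memory>
#include <string>

// Two hypothetical product interfaces that must be chosen as a matching pair.
class LinearSolver {
public:
    virtual ~LinearSolver() = default;
    virtual std::string name() const = 0;
};

class Partitioner {
public:
    virtual ~Partitioner() = default;
    virtual std::string name() const = 0;
};

// Concrete products for a serial build.
class DirectSolver : public LinearSolver {
public:
    std::string name() const override { return "direct"; }
};
class NoPartitioner : public Partitioner {
public:
    std::string name() const override { return "none"; }
};

// Concrete products for a parallel build.
class IterativeSolver : public LinearSolver {
public:
    std::string name() const override { return "iterative"; }
};
class GraphPartitioner : public Partitioner {
public:
    std::string name() const override { return "graph"; }
};

// The abstract factory: one creation method per product in the family.
class SolverFactory {
public:
    virtual ~SolverFactory() = default;
    virtual std::unique_ptr<LinearSolver> createLinearSolver() const = 0;
    virtual std::unique_ptr<Partitioner> createPartitioner() const = 0;
};

class SerialFactory : public SolverFactory {
public:
    std::unique_ptr<LinearSolver> createLinearSolver() const override {
        return std::make_unique<DirectSolver>();
    }
    std::unique_ptr<Partitioner> createPartitioner() const override {
        return std::make_unique<NoPartitioner>();
    }
};

class ParallelFactory : public SolverFactory {
public:
    std::unique_ptr<LinearSolver> createLinearSolver() const override {
        return std::make_unique<IterativeSolver>();
    }
    std::unique_ptr<Partitioner> createPartitioner() const override {
        return std::make_unique<GraphPartitioner>();
    }
};

// Client code depends only on the abstract interfaces.
std::string describe(const SolverFactory& factory) {
    return factory.createLinearSolver()->name() + "+" + factory.createPartitioner()->name();
}
```

Swapping `SerialFactory` for `ParallelFactory` changes every object the client receives without touching `describe`, which is the property that makes the pattern attractive for a code that has to serve both serial and parallel builds.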
CVS/Bonsai
As with most codes, configuration management is very important.
We use CVS/Bonsai.
Hyperlinked to Bugzilla.
Bugzilla
We use Bugzilla extensively. Over 700 issues since 2000.
We use it for tracking all sorts of project planning - it is not just for user-reported bugs.
Most Bugzilla issues are entered by developers.
A typical usage: In FY05, we’ve used it to track all the development needed for our Advanced Development (AD) project.
One “blocker” issue for AD sits on top of a large number of specific issues (like “develop the SOI device model”).
Bugzilla plays a major, major role in our release process.
Testing: Regression and Certification
Regression testing - last chance “safety net”: nightly tests, weekly tests, 49 different builds.
Certification testing - used for release process. Each test is tied to a specific Bugzilla issue. Many certification tests ultimately become regression tests later, after a release.
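One way to picture a certification test tied to an issue is sketched below in C++. It is an assumption-laden illustration, not the Xyce test harness: the `CertificationTest` record, the issue numbers, the tolerance scheme, and the functions `waveformsMatch` and `failingIssues` are all invented here. Each test stores accepted reference output for one issue, and a round of testing reports the issue IDs whose output drifted outside tolerance.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical record tying one certification test to one tracked issue.
struct CertificationTest {
    int issueId;                    // illustrative issue number
    std::vector<double> reference;  // accepted "gold" output samples
    double relTol;                  // relative tolerance
    double absTol;                  // absolute tolerance
};

// True when every sample is within absTol + relTol * |reference| of the reference.
bool waveformsMatch(const std::vector<double>& actual, const CertificationTest& test) {
    if (actual.size() != test.reference.size()) return false;
    for (std::size_t i = 0; i < actual.size(); ++i) {
        const double allowed = test.absTol + test.relTol * std::fabs(test.reference[i]);
        if (std::fabs(actual[i] - test.reference[i]) > allowed) return false;
    }
    return true;
}

// Returns the issue IDs whose certification test failed; empty means the round passed.
std::vector<int> failingIssues(const std::vector<CertificationTest>& tests,
                               const std::vector<std::vector<double>>& outputs) {
    assert(tests.size() == outputs.size());
    std::vector<int> failed;
    for (std::size_t i = 0; i < tests.size(); ++i) {
        if (!waveformsMatch(outputs[i], tests[i])) failed.push_back(tests[i].issueId);
    }
    return failed;
}
```

Keeping the issue ID inside the test record is what lets a failure point straight back to the tracked issue, and it makes the later migration from certification test to regression test a matter of moving the record to another suite.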
The release process
Planning: identify features, and Bugzilla issues, for this release.
QA: To start QA, every fixed issue must have a certification test demonstrating the issue.
The code stays in “QA” until every certification test, for every supported platform, passes.
If any test fails, we drop out of QA, fix code, and restart QA, from the beginning.
Production builds: created if Xyce gets through QA w/o failure, for all platforms.
Certification: Signed by a manager.
Regression: Certification tests are migrated to regression tests, post-release.
The release process: Planning
The Planning phase happens first.
Developer meeting.
Decisions made:
Which Bugzilla issues will be fixed (addressed) for the release, and which ones will be deferred?
Which issues are “deal killers”, and which ones are expendable?
Who is assigned to each issue? (We usually know already, but sometimes assignments change.)
What date do we freeze/branch the code?
For each issue, developers commit to produce a “certification test” that proves the issue is fixed and/or addressed.
These certification tests are used in QA.
Note: Of course, some issues can’t easily be tested - those issues simply get documented. (see Dilbert cartoon)
The release process: Branching
Branching the code is crucial.
On the freeze date, tag and branch to create a “release branch”.
From that point onward:
New features are only developed in the “development (main) branch”.
The release branch will be changed minimally, for small bugfixes.
Later in the release process, we sometimes branch the code again, to plan for patch releases.
After the release process is over, the branches are merged back together.
The release process: QA
The QA phase dominates the release process, in terms of time.
Typically, we plan for 3 rounds of QA.
In our last release, we needed 6 (too many!)
We typically plan for each QA round to take 2 weeks.
Once the code is in a round of QA, it should be completely frozen.
If the code needs to be fixed (certification test fails): the current round of QA stops, the release branch code is fixed, and QA restarts from the beginning.
QA is complete only when the code can pass every test, without any code modifications.
In general, the release branch code should change minimally over all the QA cycles.
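The restart rule above can be sketched as a small loop. This C++ fragment is illustrative only: `runReleaseQa`, `QaOutcome`, and the two callbacks are hypothetical names, and real QA rounds involve people and platforms rather than function calls. It captures the one property that matters: a release is declared only after a full round passes with no code change made during that round.

```cpp
#include <cassert>
#include <functional>

struct QaOutcome {
    bool released;   // true if some round passed every test
    int roundsUsed;  // number of QA rounds that were started
};

// runFullRound(round) runs every certification test on every platform against
// the frozen release branch and returns true only if all of them pass.
// fixReleaseBranch(round) represents the minimal fix made after a failed round.
QaOutcome runReleaseQa(const std::function<bool(int)>& runFullRound,
                       const std::function<void(int)>& fixReleaseBranch,
                       int maxRounds) {
    assert(maxRounds > 0);
    for (int round = 1; round <= maxRounds; ++round) {
        if (runFullRound(round)) {
            return {true, round};  // clean pass with no code modifications
        }
        fixReleaseBranch(round);   // any fix invalidates the round; start over
    }
    return {false, maxRounds};
}
```

Because every fix forces a complete new round, the cost of a late failure is a whole round rather than a single re-run test, which is why the release branch should change as little as possible during QA.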
The release process: Post-QA
Once we get through a round of QA with no certification test failures, we can move to the next phase(s).
Code is branched again, to plan for a possible “patch” release.
Production builds
Documentation: Users/Reference guides, Release notes.
Certification: This includes setting up and filing paperwork, which requires getting signatures from:
Manager
HPEMS PI
Technical lead
Testing team lead
Announcement/Distribution
Release branch is merged back into the development branch. If the release process has taken a long time (months), this is a big job.
Certification tests are migrated to regression tests, within reason.
The End