Sparse Linear Solver for Power System Analysis using FPGA∗
J. R. Johnson† P. Nagvajara‡ C. Nwankpa§
1 Extended Abstract
Load flow computation and contingency analysis are the foundation of power system analysis. Numerical solutions to the load flow equations are typically computed using Newton-Raphson iteration, and the most time-consuming component of the computation is the solution of the sparse linear system needed for the update in each iteration. When an appropriate elimination ordering is used, direct solvers are more effective than iterative solvers. In practice these systems involve a large number of variables (50,000 or more); however, when the sparsity is exploited effectively they can be solved in a modest amount of time (seconds). Despite the modest computation time for the linear solver, the number of systems that must be solved is large, and current computation platforms and approaches do not yield the desired performance. Because of the relatively small granularity of the linear solver, a coarse-grained parallel solver does not provide an effective means to improve performance. In this talk, it is argued that a hardware solution, implemented in an FPGA and using fine-grained parallelism, provides a cost-effective means to achieve the desired performance.
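To make the cost structure concrete, here is a minimal sketch of the Newton-Raphson load flow loop in C. It is illustrative only: every type and function in it (PowerSystem, build_jacobian, sparse_lu_solve, and so on) is a hypothetical placeholder, not an interface from this work.

#include <stdlib.h>

/* All types and helpers here are hypothetical placeholders used only
 * to show the shape of the computation. */
typedef struct SparseMatrix SparseMatrix;
typedef struct PowerSystem  PowerSystem;

extern SparseMatrix *alloc_jacobian(const PowerSystem *ps);
extern void   eval_mismatch(const PowerSystem *ps, const double *x, double *f);
extern void   build_jacobian(const PowerSystem *ps, const double *x, SparseMatrix *J);
extern void   sparse_lu_solve(SparseMatrix *J, const double *rhs, double *dx);
extern double inf_norm(const double *v, int n);

/* Newton-Raphson load flow: drive the power mismatch F(x) to zero,
 * where x holds the bus voltage magnitudes and angles.  Each pass
 * requires one sparse LU factorization plus forward and backward
 * substitution; that linear solve dominates the run time and is the
 * step targeted for FPGA acceleration. */
void newton_raphson(const PowerSystem *ps, double *x, int n, double tol)
{
    double *f  = malloc(n * sizeof *f);   /* mismatch F(x) */
    double *dx = malloc(n * sizeof *dx);  /* update step   */
    SparseMatrix *J = alloc_jacobian(ps);

    eval_mismatch(ps, x, f);
    while (inf_norm(f, n) > tol) {
        build_jacobian(ps, x, J);   /* assemble J(x) on the host */
        sparse_lu_solve(J, f, dx);  /* solve J dx = F(x)         */
        for (int i = 0; i < n; i++)
            x[i] -= dx[i];          /* x <- x - dx               */
        eval_mismatch(ps, x, f);
    }
    free(f);
    free(dx);
}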
Previous work [1, 2, 3] has shown that FPGAs can be used effectively for floating-point intensive scientific computation. It was shown that high MFLOP rates could be achieved by utilizing multiple floating-point units, and that an FPGA could outperform PCs and workstations running at much higher clock rates on dense matrix computations. The current work argues that a similar benefit can be obtained for the sparse matrix computations arising in power system analysis. These conclusions are based on operation counts and system analysis for a collection of benchmark systems arising in practice.
∗This work was partially supported by DOE grant #ER63384, PowerGrid - A Computation Engine for Large-Scale Electric Networks.
†Department of Computer Science, Drexel University, Philadelphia, PA 19104. email: [email protected]
‡Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA 19104. email: [email protected]
§Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA 19104. email: [email protected]
Benchmark data indicates that between 1 and 3 percent of peak floating-point performance was obtained using a state-of-the-art sparse solver (UMFPACK) running on a 2.60 GHz Pentium 4. The solve time for the largest system (50,092 unknowns and 361,530 non-zero entries) was 1.39 seconds.
A pipelined floating-point core was designed for the Altera Stratix family of FPGAs. An instantiation of the core on an Altera Stratix with speed rating (-5) operates at 200 MHz for addition and multiplication and 70 MHz for division. Moreover, there is sufficient room for 10 units. Assuming 100% utilization of eight FPUs, the projected performance for the FPGA implementation is 0.069 seconds, which provides a 20-fold improvement (1.39 s / 0.069 s ≈ 20). While it is optimistic to assume perfect efficiency, hard-wired control should provide substantially better efficiency than is available with a standard processor. Moreover, analysis of the LU factorization shows that the average number of updates per row throughout the factorization is 20.3, which provides sufficient parallelism to benefit from 8 FPUs. An implementation, and a more detailed model, is being carried out to determine the attainable efficiency.
References
[1] J. Johnson, P. Nagvajara, and C. Nwankpa. High-Performance Linear Algebra Processor using FPGA. In Proc. High Performance Embedded Computing (HPEC 2003), Sept. 22-24, 2003.
[2] K. Underwood. FPGAs vs. CPUs: Trends in Peak Floating-Point Performance. In Proc. International Symposium on Field-Programmable Gate Arrays (FPGA 2004), Feb. 22-24, 2004.
[3] L. Zhuo and V. K. Prasanna. Scalable and Modular Algorithms for Floating-Point Matrix Multiplication on FPGAs. In Proc. International Parallel and Distributed Processing Symposium (IPDPS 2004), 2004.
HPEC 2004 Presentation Slides

Sparse Linear Solver for Power System Analysis using FPGA
Jeremy Johnson, Prawat Nagvajara, Chika Nwankpa
Drexel University

[Diagram: PowerGrid. An Ultra-Fast Bus connects the Economic Analysis Station, the Decision Making Station, the Engineering Analysis Station, and the Hardware Power System Model, exchanging Tie-Line Flows.]
Goal & Approach

- To design an embedded FPGA-based multiprocessor system to perform high-speed Power Flow Analysis.
- To provide a single desktop environment to solve the entire package of the Power Flow Problem (Multiprocessors on the Desktop).
- Solve Power Flow equations using Newton-Raphson, with hardware support for sparse LU.
- Tailor HW design to systems arising in Power Flow analysis.
Results

- Software solutions (sparse LU needed for Power Flow) using high-end PCs/workstations do not achieve efficient floating-point performance and leave substantial room for improvement.
- Coarse-grained parallelism will not significantly improve performance due to the granularity of the computation.
- An FPGA, with a much slower clock, can outperform PCs/workstations by devoting space to hardwired control and additional FP units, and by utilizing fine-grained parallelism.
- Benchmarking studies show that significant performance gain is possible.
- A 10x speedup is possible using existing FPGA technology.
Algorithm and HW/SW Partition

[Flowchart: Data Input and Extract Data / Ybus run on the HOST and feed the Jacobian Matrix. LU Factorization, Forward Substitution, and Backward Substitution solve for the update (the step given hardware support). The Jacobian matrix is then updated; if Mismatch < Accuracy (YES) control returns to the HOST for Post Processing, otherwise (NO) the loop repeats.]
Benchmark

Obtain data from power systems of interest:

Source   # Bus    Branches/Bus   Order    NNZ
PSSE     1,648    1.58           2,982    21,682
PSSE     7,917    1.64           14,508   108,024
PJM      10,278   1.42           19,285   137,031
PJM      26,829   1.43           50,092   361,530
System Profile

# Bus    # Iter   #DIV      #MUL         #ADD         NNZ L+U
1,648    6        43,876    1,908,082    1,824,380    108,210
7,917    9        259,388   18,839,382   18,324,787   571,378
10,279   12       238,343   14,057,766   13,604,494   576,007
26,829            770,514   90,556,643   89,003,926   1,746,673
System Profile

More than 80% of rows/cols have size < 30.
[Histogram: ROW_SIZE distribution, bins from 0.0 to 110.0; Std. Dev = 21.15, Mean = 20.3, N = 2981.]

[Histogram: COL_SIZE distribution, bins from 0.0 to 90.0; Std. Dev = 15.68, Mean = 16.6, N = 2981.]
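The mean row size of 20.3 appears to be the same figure cited in the extended abstract as the average number of updates per row, i.e. the evidence that roughly eight FPUs can be kept busy.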
Software Performance

Software platform:
- UMFPACK
- Pentium 4 (2.6 GHz), 8KB L1 Data Cache
- Mandrake 9.2, gcc v3.3.1

# Bus    Time       FP Eff
1,648    0.07 sec   1.05%
7,917    0.37 sec   1.33%
10,278   0.47 sec   0.96%
26,829   1.39 sec   3.45%
Hardware Model & Requirements

- Store row & column indices for non-zero entries.
- Use column indices to search for the pivot. Overlap the pivot search and division by the pivot element with row reads.
- Use multiple FPUs to do simultaneous updates (enough parallelism for 8 to 32, avg. col. size).
- Use a cache to store updated rows from iteration to iteration (70% overlap, memory ≈ 400KB for the largest system). Can be used for prefetching.
- Total memory required ≈ 22MB (largest system).
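As a rough consistency check on the 22MB figure, under the assumption (mine, not the slide's) of one 64-bit value plus one 32-bit index per stored non-zero: the largest system has 1,746,673 non-zeros in L+U, and 1,746,673 x 12 bytes ≈ 21MB, in line with the estimate above.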
Architecture

[Block diagram: the FPGA contains the Processing Logic, an SDRAM Controller, and a Cache Controller, connected to external SDRAM Memory and SRAM Cache respectively.]
Pivot Hardware

[Block diagram of the pivot logic: a memory read streams the pivot column; physical indices are translated to virtual indices (with an index-reject path) by reading the colmap; an FP compare unit scans the column values and outputs the pivot index, pivot value, and pivot column.]
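In software terms, the pivot stage is a scan of the current column for the entry of largest magnitude. A minimal sketch under an assumed compressed-column layout (the Column struct and its fields are hypothetical, not taken from the design):

#include <math.h>

/* Hypothetical compressed column: the non-zero entries of one column
 * of the active submatrix. */
typedef struct {
    int     nnz;   /* number of stored entries (assumed >= 1) */
    int    *row;   /* row index of each entry                 */
    double *val;   /* value of each entry                     */
} Column;

/* Partial pivoting: position of the largest-magnitude entry.  In the
 * hardware this scan streams through the FP compare unit and is
 * overlapped with the row reads and the division by the pivot. */
int find_pivot(const Column *c)
{
    int    best = 0;
    double bmag = fabs(c->val[0]);
    for (int k = 1; k < c->nnz; k++) {
        double m = fabs(c->val[k]);
        if (m > bmag) {
            bmag = m;
            best = k;
        }
    }
    return best;
}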
Parallel FPUs

[Diagram.]
Update Hardware

[Block diagram of the update datapath: a memory read streams the submatrix row; merge logic aligns it with the update row using the colmap update; FMUL and FADD units compute the update; select/write logic writes the result back one column word at a time.]
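Functionally, each update is a sparse "subtract a multiple of the pivot row" with fill-in, i.e. a merge of two sorted index lists feeding multiply-add pairs; independent target rows can be handed to different FPUs. A minimal sketch under an assumed sorted-row representation (the Row struct and names are hypothetical):

#include <limits.h>

/* Hypothetical sparse row, sorted by column index. */
typedef struct {
    int     nnz;
    int    *col;
    double *val;
} Row;

/* One elimination update: out = dst - m * piv.  The index merge plays
 * the role of the slide's merge logic, and the multiply-add maps onto
 * the FMUL/FADD pair.  out->col/out->val must have room for up to
 * dst->nnz + piv->nnz entries (fill-in). */
void row_update(const Row *dst, const Row *piv, double m, Row *out)
{
    int i = 0, j = 0, k = 0;
    while (i < dst->nnz || j < piv->nnz) {
        int cd = (i < dst->nnz) ? dst->col[i] : INT_MAX;
        int cp = (j < piv->nnz) ? piv->col[j] : INT_MAX;
        if (cd < cp) {                    /* entry only in dst      */
            out->col[k] = cd;
            out->val[k++] = dst->val[i++];
        } else if (cp < cd) {             /* fill-in from pivot row */
            out->col[k] = cp;
            out->val[k++] = -m * piv->val[j++];
        } else {                          /* overlap: FMUL + FADD   */
            out->col[k] = cd;
            out->val[k++] = dst->val[i++] - m * piv->val[j++];
        }
    }
    out->nnz = k;
}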
Performance Model

A C program which simulates the computation (data transfer and arithmetic operations) and estimates the architecture's performance (clock cycles and seconds).

Model Assumptions:
- Sufficient internal buffers
- Cache write hits 100%
- Simple static memory allocation
- No penalty on cache write-back to SDRAM
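The authors' model is a detailed simulator; as a rough illustration of the kind of first-order estimate it produces, the toy sketch below combines the single-unit cycle counts from the GEPP breakdown slide later in the deck under the optimistic assumption that update work splits evenly across units. This is not the authors' simulator.

#include <stdio.h>

/* Toy first-order model: pivot search and division are serial, and
 * the row-update work is assumed to split perfectly across n_units
 * update units. */
static long lu_cycles(long pivot, long divide, long update, int n_units)
{
    return pivot + divide + (update + n_units - 1) / n_units;
}

int main(void)
{
    /* 1-update-unit cycle counts from the GEPP breakdown slide. */
    long pivot = 259401, divide = 96439, update = 2409312;
    int  units[] = {1, 4, 16};

    for (int i = 0; i < 3; i++) {
        long c = lu_cycles(pivot, divide, update, units[i]);
        printf("%2d update units: %7ld cycles (%.4f s at 200 MHz)\n",
               units[i], c, c / 200e6);
    }
    return 0;
}

A perfect split is optimistic: the simulated counts on the GEPP breakdown slide (642,662 cycles for 4 units and 248,295 for 16) are higher than 2,409,312/4 and 2,409,312/16, reflecting overheads this toy model ignores.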
Performance

[Chart: Speedup vs. P4 as a function of # Update Units (2, 4, 8, 16), by matrix size (Jac2k, Jac7k, Jac10k, Jac27k); the speedup axis runs from 0.00 to 6.00.]
GEPP Breakdown

Relative portion of LU solve time, by number of update units:

Update Units   Pivot   Divide   Update
1              10%     4%       86%
4              26%     10%      64%
16             43%     16%      41%

Cycle count, by number of update units:

         1           4          16
Pivot    259,401     259,401    259,401
Divide   96,439      96,439     96,439
Update   2,409,312   642,662    248,295
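Reading the cycle counts against the breakdown: pivot and divide are serial and fixed at 259,401 + 96,439 = 355,840 cycles, so going from 1 to 16 update units cuts the total from 2,765,152 to 604,135 cycles, about a 4.6x reduction, while the fixed stages grow from roughly 13% to 59% of the time. By Amdahl's law, adding update units beyond this point yields diminishing returns unless the pivot search is accelerated as well.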