Sparse Linear Solver for Power System Analysis using FPGA∗
J. R. Johnson† P. Nagvajara‡ C. Nwankpa§
1 Extended Abstract
Load flow computation and contingency analysis are the foundation of power system analysis. Numerical solutions to the load flow equations are typically computed using Newton-Raphson iteration, and the most time-consuming component of the computation is the solution of the sparse linear system needed for the update in each iteration. When an appropriate elimination ordering is used, direct solvers are more effective than iterative solvers. In practice these systems involve a large number of variables (50,000 or more); however, when the sparsity is exploited effectively they can be solved in a modest amount of time (seconds). Despite the modest computation time for the linear solver, the number of systems that must be solved is large, and current computation platforms and approaches do not yield the desired performance. Because of the relatively small granularity of the linear solver, a coarse-grained parallel solver does not provide an effective means to improve performance. In this talk, it is argued that a hardware solution, implemented in an FPGA and using fine-grained parallelism, provides a cost-effective means to achieve the desired performance.
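To make the cost structure concrete, here is a minimal sketch of the Newton-Raphson load flow loop in C. It is illustrative only: every type and function in it (PowerSystem, build_jacobian, sparse_lu_solve, and so on) is a hypothetical placeholder, not an interface from this work.

#include <stdlib.h>

/* All types and helpers here are hypothetical placeholders used only
 * to show the shape of the computation. */
typedef struct SparseMatrix SparseMatrix;
typedef struct PowerSystem  PowerSystem;

extern SparseMatrix *alloc_jacobian(const PowerSystem *ps);
extern void   eval_mismatch(const PowerSystem *ps, const double *x, double *f);
extern void   build_jacobian(const PowerSystem *ps, const double *x, SparseMatrix *J);
extern void   sparse_lu_solve(SparseMatrix *J, const double *rhs, double *dx);
extern double inf_norm(const double *v, int n);

/* Newton-Raphson load flow: drive the power mismatch F(x) to zero,
 * where x holds the bus voltage magnitudes and angles.  Each pass
 * requires one sparse LU factorization plus forward and backward
 * substitution; that linear solve dominates the run time and is the
 * step targeted for FPGA acceleration. */
void newton_raphson(const PowerSystem *ps, double *x, int n, double tol)
{
    double *f  = malloc(n * sizeof *f);   /* mismatch F(x) */
    double *dx = malloc(n * sizeof *dx);  /* update step   */
    SparseMatrix *J = alloc_jacobian(ps);

    eval_mismatch(ps, x, f);
    while (inf_norm(f, n) > tol) {
        build_jacobian(ps, x, J);   /* assemble J(x) on the host */
        sparse_lu_solve(J, f, dx);  /* solve J dx = F(x)         */
        for (int i = 0; i < n; i++)
            x[i] -= dx[i];          /* x <- x - dx               */
        eval_mismatch(ps, x, f);
    }
    free(f);
    free(dx);
}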
Previous work [1, 2, 3] has shown that FPGAs can be used effectively for floating-point intensive scientific computation. It was shown that high MFLOP rates could be achieved by utilizing multiple floating-point units, and that an FPGA could outperform PCs and workstations running at much higher clock rates on dense matrix computations. The current work argues that a similar benefit can be obtained for the sparse matrix computations arising in power system analysis. These conclusions are based on operation counts and system analysis for a collection of benchmark systems arising in practice.
∗This work was partially supported by DOE grant #ER63384, PowerGrid - A Computation Engine for Large-Scale Electric Networks.
†Department of Computer Science, Drexel University, Philadelphia, PA 19104. email: [email protected]
‡Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA 19104. email: [email protected]
§Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA 19104. email: [email protected]
Benchmark data indicates that between 1 and 3 percent of peak floating-point performance was obtained using a state-of-the-art sparse solver (UMFPACK) running on a 2.60 GHz Pentium 4. The solve time for the largest system (50,092 unknowns and 361,530 non-zero entries) was 1.39 seconds.
A pipelined floating-point core was designed for the Altera Stratix family of FPGAs. An instantiation of the core on an Altera Stratix with speed rating (-5) operates at 200 MHz for addition and multiplication and 70 MHz for division. Moreover, there is sufficient room for 10 units. Assuming 100% utilization of eight FPUs, the projected performance for the FPGA implementation is 0.069 seconds, which provides a 20-fold improvement (1.39 s / 0.069 s ≈ 20). While it is optimistic to assume perfect efficiency, hard-wired control should provide substantially better efficiency than is available with a standard processor. Moreover, analysis of the LU factorization shows that the average number of updates per row throughout the factorization is 20.3, which provides sufficient parallelism to benefit from 8 FPUs. An implementation, and a more detailed model, is being carried out to determine the attainable efficiency.
References
[1] J. Johnson, P. Nagvajara, and C. Nwankpa. High-Performance Linear Algebra Processor using FPGA. In Proc. High Performance Embedded Computing (HPEC 2003), Sept. 22-24, 2003.
[2] K. Underwood. FPGAs vs. CPUs: Trends in Peak Floating-Point Performance. In Proc. International Symposium on Field-Programmable Gate Arrays (FPGA 2004), Feb. 22-24, 2004.
[3] L. Zhuo and V. K. Prasanna. Scalable and Modular Algorithms for Floating-Point Matrix Multiplication on FPGAs. In Proc. International Parallel and Distributed Processing Symposium (IPDPS 2004), 2004.
HPEC 2004 Presentation Slides

Sparse Linear Solver for Power System Analysis using FPGA
Jeremy Johnson, Prawat Nagvajara, Chika Nwankpa
Drexel University

[Diagram: PowerGrid. An Ultra-Fast Bus connects the Economic Analysis Station, the Decision Making Station, the Engineering Analysis Station, and the Hardware Power System Model, exchanging Tie-Line Flows.]
Goal & Approach

- To design an embedded FPGA-based multiprocessor system to perform high-speed Power Flow Analysis.
- To provide a single desktop environment to solve the entire package of the Power Flow Problem (Multiprocessors on the Desktop).
- Solve Power Flow equations using Newton-Raphson, with hardware support for sparse LU.
- Tailor HW design to systems arising in Power Flow analysis.
Results

- Software solutions (sparse LU needed for Power Flow) using high-end PCs/workstations do not achieve efficient floating-point performance and leave substantial room for improvement.
- Coarse-grained parallelism will not significantly improve performance due to the granularity of the computation.
- An FPGA, with a much slower clock, can outperform PCs/workstations by devoting space to hardwired control and additional FP units, and by utilizing fine-grained parallelism.
- Benchmarking studies show that significant performance gain is possible.
- A 10x speedup is possible using existing FPGA technology.
Algorithm and HW/SW Partition

[Flowchart: Data Input and Extract Data / Ybus run on the HOST and feed the Jacobian Matrix. LU Factorization, Forward Substitution, and Backward Substitution solve for the update (the step given hardware support). The Jacobian matrix is then updated; if Mismatch < Accuracy (YES) control returns to the HOST for Post Processing, otherwise (NO) the loop repeats.]
Benchmark

Obtain data from power systems of interest:

Source   # Bus    Branches/Bus   Order    NNZ
PSSE     1,648    1.58           2,982    21,682
PSSE     7,917    1.64           14,508   108,024
PJM      10,278   1.42           19,285   137,031
PJM      26,829   1.43           50,092   361,530
System Profile

# Bus    # Iter   #DIV      #MUL         #ADD         NNZ L+U
1,648    6        43,876    1,908,082    1,824,380    108,210
7,917    9        259,388   18,839,382   18,324,787   571,378
10,279   12       238,343   14,057,766   13,604,494   576,007
26,829            770,514   90,556,643   89,003,926   1,746,673
System Profile

More than 80% of rows/cols have size < 30.
[Histogram: ROW_SIZE distribution, bins from 0.0 to 110.0; Std. Dev = 21.15, Mean = 20.3, N = 2981.]

[Histogram: COL_SIZE distribution, bins from 0.0 to 90.0; Std. Dev = 15.68, Mean = 16.6, N = 2981.]
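The mean row size of 20.3 appears to be the same figure cited in the extended abstract as the average number of updates per row, i.e. the evidence that roughly eight FPUs can be kept busy.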
Software Performance

Software platform:
- UMFPACK
- Pentium 4 (2.6 GHz), 8KB L1 Data Cache
- Mandrake 9.2, gcc v3.3.1

# Bus    Time       FP Eff
1,648    0.07 sec   1.05%
7,917    0.37 sec   1.33%
10,278   0.47 sec   0.96%
26,829   1.39 sec   3.45%
Hardware Model & Requirements

- Store row & column indices for non-zero entries.
- Use column indices to search for the pivot. Overlap the pivot search and division by the pivot element with row reads.
- Use multiple FPUs to do simultaneous updates (enough parallelism for 8 to 32, avg. col. size).
- Use a cache to store updated rows from iteration to iteration (70% overlap, memory ≈ 400KB for the largest system). Can be used for prefetching.
- Total memory required ≈ 22MB (largest system).
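As a rough consistency check on the 22MB figure, under the assumption (mine, not the slide's) of one 64-bit value plus one 32-bit index per stored non-zero: the largest system has 1,746,673 non-zeros in L+U, and 1,746,673 x 12 bytes ≈ 21MB, in line with the estimate above.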
Architecture

[Block diagram: the FPGA contains the Processing Logic, an SDRAM Controller, and a Cache Controller, connected to external SDRAM Memory and SRAM Cache respectively.]
Pivot Hardware

[Block diagram of the pivot logic: a memory read streams the pivot column; physical indices are translated to virtual indices (with an index-reject path) by reading the colmap; an FP compare unit scans the column values and outputs the pivot index, pivot value, and pivot column.]
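In software terms, the pivot stage is a scan of the current column for the entry of largest magnitude. A minimal sketch under an assumed compressed-column layout (the Column struct and its fields are hypothetical, not taken from the design):

#include <math.h>

/* Hypothetical compressed column: the non-zero entries of one column
 * of the active submatrix. */
typedef struct {
    int     nnz;   /* number of stored entries (assumed >= 1) */
    int    *row;   /* row index of each entry                 */
    double *val;   /* value of each entry                     */
} Column;

/* Partial pivoting: position of the largest-magnitude entry.  In the
 * hardware this scan streams through the FP compare unit and is
 * overlapped with the row reads and the division by the pivot. */
int find_pivot(const Column *c)
{
    int    best = 0;
    double bmag = fabs(c->val[0]);
    for (int k = 1; k < c->nnz; k++) {
        double m = fabs(c->val[k]);
        if (m > bmag) {
            bmag = m;
            best = k;
        }
    }
    return best;
}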
Parallel FPUs

[Diagram.]
Update Hardware

[Block diagram of the update datapath: a memory read streams the submatrix row; merge logic aligns it with the update row using the colmap update; FMUL and FADD units compute the update; select/write logic writes the result back one column word at a time.]
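Functionally, each update is a sparse "subtract a multiple of the pivot row" with fill-in, i.e. a merge of two sorted index lists feeding multiply-add pairs; independent target rows can be handed to different FPUs. A minimal sketch under an assumed sorted-row representation (the Row struct and names are hypothetical):

#include <limits.h>

/* Hypothetical sparse row, sorted by column index. */
typedef struct {
    int     nnz;
    int    *col;
    double *val;
} Row;

/* One elimination update: out = dst - m * piv.  The index merge plays
 * the role of the slide's merge logic, and the multiply-add maps onto
 * the FMUL/FADD pair.  out->col/out->val must have room for up to
 * dst->nnz + piv->nnz entries (fill-in). */
void row_update(const Row *dst, const Row *piv, double m, Row *out)
{
    int i = 0, j = 0, k = 0;
    while (i < dst->nnz || j < piv->nnz) {
        int cd = (i < dst->nnz) ? dst->col[i] : INT_MAX;
        int cp = (j < piv->nnz) ? piv->col[j] : INT_MAX;
        if (cd < cp) {                    /* entry only in dst      */
            out->col[k] = cd;
            out->val[k++] = dst->val[i++];
        } else if (cp < cd) {             /* fill-in from pivot row */
            out->col[k] = cp;
            out->val[k++] = -m * piv->val[j++];
        } else {                          /* overlap: FMUL + FADD   */
            out->col[k] = cd;
            out->val[k++] = dst->val[i++] - m * piv->val[j++];
        }
    }
    out->nnz = k;
}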
Performance Model

A C program which simulates the computation (data transfer and arithmetic operations) and estimates the architecture's performance (clock cycles and seconds).

Model Assumptions:
- Sufficient internal buffers
- Cache write hits 100%
- Simple static memory allocation
- No penalty on cache write-back to SDRAM
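The authors' model is a detailed simulator; as a rough illustration of the kind of first-order estimate it produces, the toy sketch below combines the single-unit cycle counts from the GEPP breakdown slide later in the deck under the optimistic assumption that update work splits evenly across units. This is not the authors' simulator.

#include <stdio.h>

/* Toy first-order model: pivot search and division are serial, and
 * the row-update work is assumed to split perfectly across n_units
 * update units. */
static long lu_cycles(long pivot, long divide, long update, int n_units)
{
    return pivot + divide + (update + n_units - 1) / n_units;
}

int main(void)
{
    /* 1-update-unit cycle counts from the GEPP breakdown slide. */
    long pivot = 259401, divide = 96439, update = 2409312;
    int  units[] = {1, 4, 16};

    for (int i = 0; i < 3; i++) {
        long c = lu_cycles(pivot, divide, update, units[i]);
        printf("%2d update units: %7ld cycles (%.4f s at 200 MHz)\n",
               units[i], c, c / 200e6);
    }
    return 0;
}

A perfect split is optimistic: the simulated counts on the GEPP breakdown slide (642,662 cycles for 4 units and 248,295 for 16) are higher than 2,409,312/4 and 2,409,312/16, reflecting overheads this toy model ignores.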
Performance

[Chart: Speedup vs. P4 as a function of # Update Units (2, 4, 8, 16), by matrix size (Jac2k, Jac7k, Jac10k, Jac27k); the speedup axis runs from 0.00 to 6.00.]
GEPP Breakdown

Relative portion of LU solve time, by number of update units:

Update Units   Pivot   Divide   Update
1              10%     4%       86%
4              26%     10%      64%
16             43%     16%      41%

Cycle count, by number of update units:

         1           4          16
Pivot    259,401     259,401    259,401
Divide   96,439      96,439     96,439
Update   2,409,312   642,662    248,295
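Reading the cycle counts against the breakdown: pivot and divide are serial and fixed at 259,401 + 96,439 = 355,840 cycles, so going from 1 to 16 update units cuts the total from 2,765,152 to 604,135 cycles, about a 4.6x reduction, while the fixed stages grow from roughly 13% to 59% of the time. By Amdahl's law, adding update units beyond this point yields diminishing returns unless the pivot search is accelerated as well.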