hptf 2240 final


DESCRIPTION

Case studies of performance analysis and capacity planning

TRANSCRIPT

Page 1: Hptf 2240 Final


Hanging By a Thread: Using Capacity Planning to Survive

Session 2240 Surf F 08:00 Wednesday

Paul O’Sullivan

Page 2: Hptf 2240 Final

Topics Up for Discussion

• Introduction
• Current Status
• Case Study 1 – Capacity Planning
• Case Study 2 – Performance Analysis
• Findings
• Future

Page 3: Hptf 2240 Final


Introduction

• Paul O’Sullivan
• Capacity Management Consultant
• Capacity Planning/Performance Analyst since 1994
 − Infrastructure and Fixed Income
• Investment Banking/Insurance applications
• PerfCap Corporation

Page 4: Hptf 2240 Final


Current State of Performance Analysis and Capacity Planning

• Capacity Planning
 − A different climate today than even 5 years ago
• Massive proliferation of servers
• Multi-platform and multi-tier
• Management disinterest
 − High-level data only
 − Capacity Planning:
   • ‘too difficult to do, so we will not bother’
   • Buy more servers (not any more)

Page 5: Hptf 2240 Final


Issues

• Lack of specialists
• Too much data to collect
• Hard to correlate different platforms and treat the application as an entity
• Top-down approach
 − Processes first, data later
• Diffused responsibility
• …and….


Page 7: Hptf 2240 Final


Falling hardware costs

• The following is a quotation for a typical 4-way database server:
 − 4 x CPU: GBP 8,000
 − 1 x Storage Array: GBP 13,235
 − 3 x Power supplies: GBP 750
 − 15 x Drives for array: GBP 4,500
 − 2 x 1GB Memory: GBP 10,000
 − Total: GBP 35,500
 − Year: 2000
 − Refurbished!

Page 8: Hptf 2240 Final


OK anyone can complain….

• …But how can we fix it?
• Two examples of recent work
 − Capacity Planning
   • Itanium
 − Performance Analysis
   • SQL Server and EVA
• Futures

Page 9: Hptf 2240 Final

Capacity Planning: Oracle RAC on Itanium Linux

Page 10: Hptf 2240 Final

A Sample Study: Oracle RAC Capacity Planning

•Currently 3-node RAC running on IA64 Linux

•Expect 3x workload on current Oracle RAC within next two years.

•Must evaluate capacity of current cluster.

•Examine upgrade alternatives if current configuration not capable of sustaining expected load.

Page 11: Hptf 2240 Final


RAC Node CPU Utilizations, July-Sept 2008

Page 12: Hptf 2240 Final


Selection of Peak Benchmark Load

Page 13: Hptf 2240 Final


CPU by Image / Disk I/O Rate

Page 14: Hptf 2240 Final


CPU Utilization by Core

Reasonable core load balance at heavy loads.

Page 15: Hptf 2240 Final


Overall Disk I/O Rates

Page 16: Hptf 2240 Final


Overall Disk Data Rate

Page 17: Hptf 2240 Final


Disk Response Times

Page 18: Hptf 2240 Final


Memory Allocation

Page 19: Hptf 2240 Final


eCAP Workload Definition

Page 20: Hptf 2240 Final

[Diagram: primary response time components (CPU and disk I/O) for the oracleNDSPRD1, oracleLockProcs, oracleProcs and asmProcs workload classes]

Workload Characteristics:

Workload Class    | Process Count | Multi-Processing Level | Process Creation Rate (/sec) | CPU Utilization | Disk I/O Rate (/sec)
oracleNDSPRD1     | 1110          | 547.1                  | 0.925                        | 73%             | 639
oracleLockProcs   | 8             | 3.2                    | 0.007                        | 5%              | 277
oracleWorkProcs   | 46            | 31.8                   | 0.038                        | 1%              | 14
ASM processes     | 20            | 9.7                    | 0.017                        | 0.2%            | 10
daemons           | 6             | 2.4                    | 0.005                        | 0.05%           | 4
data collector    | 1             | 0.4                    | 0.001                        | 0.3%            | 26
root processes    | 1161          | 266.0                  | 0.968                        | 3%              | 233
other processes   | 774           | 47.5                   | 0.645                        | 2%              | 311
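A minimal Python sketch that re-totals the CPU utilization and disk I/O columns of the table above; the per-class figures are copied from the table, while the overall totals (roughly 85% CPU and about 1,500 IO/sec at the benchmark peak) are computed here rather than taken from the slide:

```python
# Per-class CPU utilization (%) and disk I/O rate (/sec) from the workload table.
workloads = {
    "oracleNDSPRD1":   (73.0,  639),
    "oracleLockProcs": (5.0,   277),
    "oracleWorkProcs": (1.0,    14),
    "ASM processes":   (0.2,    10),
    "daemons":         (0.05,    4),
    "data collector":  (0.3,    26),
    "root processes":  (3.0,   233),
    "other processes": (2.0,   311),
}

total_cpu = sum(cpu for cpu, _ in workloads.values())
total_io = sum(io for _, io in workloads.values())
print(f"Total CPU: {total_cpu:.2f}%, total disk I/O: {total_io}/sec")
# -> Total CPU: 84.55%, total disk I/O: 1514/sec
```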

Page 21: Hptf 2240 Final


Current System Response Time Curve

[Chart: response time vs workload growth; headroom 9%]

Page 22: Hptf 2240 Final

Current System Headroom

Capacity = 100%; Headroom = 9%

headRoom = 100 × percentGrowthToKnee / (100 + percentGrowthToKnee)

With percentGrowthToKnee = 10: headRoom = 100 × 10 / (100 + 10) ≈ 9%
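A minimal Python sketch of this headroom relation, assuming the knee of the response-time curve is treated as 100% of capacity:

```python
def headroom_pct(percent_growth_to_knee: float) -> float:
    """Headroom as a percentage of knee capacity, given how far load can grow to the knee."""
    return 100.0 * percent_growth_to_knee / (100.0 + percent_growth_to_knee)

# The current system can grow ~10% before hitting the knee, i.e. ~9% headroom.
print(f"{headroom_pct(10):.1f}%")  # -> 9.1%
```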

Page 23: Hptf 2240 Final


Findings - Current System

•At peak sustained load, 9% headroom

•CPU is primary resource bottleneck

• Possible solutions:
 − Horizontal scaling

−Integrity upgrade

−Alternate hardware platform

Page 24: Hptf 2240 Final


Platform Alternatives (3 or 4 nodes)

• HP rx7620 (1.1 GHz, Itanium 2) – current configuration
• HP rx8640 (1.6 GHz, 24MB L3 cache), 16 core
• HP rx8640 (1.6 GHz, 24MB L3 cache), 32 core
• IBM p570 (2.2 GHz, Power 5), 16 core
• IBM p570 (2.2 GHz, Power 5), 32 core
• IBM p570 (4.7 GHz, Power 6), 16 core
• Sun SPARC Enterprise M8000 (2.4 GHz), 16 core
• Sun SPARC Enterprise M8000 (2.4 GHz), 32 core

The configuration must support 200% workload growth.

Page 25: Hptf 2240 Final


Response Time vs Workload Growth: 3-node RAC

[Chart: relative response time vs % workload growth from benchmark (−100% to +400%) for HP rx7620 (1.1 GHz Itanium 2), 16-core; HP rx8640 (1.6 GHz, 24MB, Itanium 2), 16-core and 32-core; IBM p570 (2.2 GHz, Power 5), 16-core and 32-core; IBM p570 (4.7 GHz, Power 6), 16-core; Sun SPARC Enterprise M8000 (2.4 GHz), 16-core and 32-core]

Note: CPU is the primary resource bottleneck; disk and memory will support 200% growth.
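These curves have the classic queueing shape: response time stays nearly flat until utilization approaches the knee, then climbs steeply. The sketch below uses a generic M/M/c (Erlang C) approximation with illustrative arrival and service rates, not the eCAP/PAWZ model behind these charts, to show how relative response time grows with workload and how adding cores pushes the knee to the right:

```python
import math

def mmc_response_time(arrival_rate: float, service_rate: float, servers: int) -> float:
    """Mean response time of an M/M/c queue (Erlang C); returns inf past saturation."""
    a = arrival_rate / service_rate          # offered load
    rho = a / servers                        # per-server utilization
    if rho >= 1.0:
        return math.inf
    erlang_b = 1.0
    for k in range(1, servers + 1):          # recursive Erlang B, then convert to Erlang C
        erlang_b = a * erlang_b / (k + a * erlang_b)
    p_wait = erlang_b / (1.0 - rho + rho * erlang_b)
    return p_wait / (servers * service_rate - arrival_rate) + 1.0 / service_rate

# Illustrative only: baseline of 10 requests/s on 16 cores serving 1 req/s each,
# compared with 32 cores of the same speed. Values are relative to the baseline.
base = mmc_response_time(10.0, 1.0, 16)
for growth in range(0, 201, 50):
    lam = 10.0 * (1 + growth / 100.0)
    r16 = mmc_response_time(lam, 1.0, 16) / base
    r32 = mmc_response_time(lam, 1.0, 32) / base
    print(f"{growth:>3}% growth: 16-core x{r16:.2f}, 32-core x{r32:.2f}")
```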

Page 26: Hptf 2240 Final


Response Time vs Workload Growth: 4-node RAC

[Chart: relative response time vs % workload growth from benchmark (−100% to +400%) for HP rx7620 (1.1 GHz, Itanium 2), 16-core; HP rx8640 (1.6 GHz, 24 MB L3 cache), 16-core and 32-core; IBM p570 (2.2 GHz, Power 5), 16-core and 32-core; IBM p570 (4.7 GHz, Power 6), 16-core; Sun SPARC Enterprise M8000 (2.4 GHz), 16-core and 32-core]

Page 27: Hptf 2240 Final


Qualifying Platforms

• Four configurations (across three platforms) support the required growth:
 − HP rx8640 (1.6 GHz, 24MB L3 cache), 32 core
 − IBM p570 (2.2 GHz, Power 5), 32 core
 − IBM p570 (4.7 GHz, Power 6), 16 core
 − Sun SPARC Enterprise M8000 (2.4 GHz), 32 core
• Horizontal scaling to 4 nodes will not change the qualifying platforms.

Page 28: Hptf 2240 Final

Response Time vs Workload Growth (reduced-core, 3-node configurations)

[Chart: relative response time vs % workload growth from benchmark (−100% to +300%) for Sun SPARC Enterprise M8000 (2.4 GHz), 32 cores; HP rx8640 (1.6 GHz, 24MB L3 cache), 30 cores; IBM p570 (2.2 GHz, Power 5), 26 cores; IBM p570 (4.7 GHz, Power 6), 12 cores]

Page 29: Hptf 2240 Final

Response Time vs Workload Growth (reduced-core, 4-node configurations)

[Chart: relative response time vs % workload growth from benchmark (−100% to +300%) for Sun SPARC Enterprise M8000 (2.4 GHz), 24 cores; HP rx8640 (1.6 GHz, 24MB L3 cache), 24 cores; IBM p570 (2.2 GHz, Power 5), 20 cores; IBM p570 (4.7 GHz, Power 6), 10 cores]

Page 30: Hptf 2240 Final

Optimized Configurations (cores per node)

Platform                              | 3-node | 4-node
Sun SPARC Enterprise M8000 (2.4 GHz)  | 32     | 24
HP rx8640 (1.6 GHz, 24MB L3 cache)    | 30     | 24
IBM p570 (2.2 GHz, Power 5)           | 26     | 20
IBM p570 (4.7 GHz, Power 6)           | 12     | 10

Final choice based on cost and management issues.

Page 31: Hptf 2240 Final

Performance Analysis: SQL Server on HP Blades and EVA

Page 32: Hptf 2240 Final

Performance Analysis 1

• Large insurance firm acquisition
• Migrating applications
• Requirement of 10x growth
• Much new hardware purchased
• 160 servers in the environments
• Application still slow
 − SQL developers under the microscope

Page 33: Hptf 2240 Final

Performance Analysis

• Asked to examine a SQL Server application
• The theory was that the EVA 6000 could not cope with the IO load generated by SQL Server
• Used the PAWZ Performance Analysis and Capacity Planning tool to find performance issues
• EVA performance data was ‘unavailable’, so we used the SAN modeling capability of the PAWZ Capacity Planner

Page 34: Hptf 2240 Final

Hardware Configuration

− 16-way quad-core HP Blade 460c
− 2 x 4Gb FC fibre cards
− SQL Server 2000
− EVA 6000
  • 96-disk disk group, 300GB 15k drives
  • Shared with other Windows servers

Page 35: Hptf 2240 Final

Initial Analysis

• SQL Server processes were experiencing very high response times on the SAN drives
• SQL Server processes were themselves paging (flushing data to disk) at regular intervals
• Overall IO rates were low: ~1,000 IO/sec
• CPU usage was low (10%) for a server of this type (?)
• Memory usage was low (15%) for a server of this type (?)

Page 36: Hptf 2240 Final


Not really high IO counts these days….

IO Rates

Page 37: Hptf 2240 Final


Very high D: drive response time….

Disk Response Time

Page 38: Hptf 2240 Final


Very high D: drive response time….

IO Sizes

Page 39: Hptf 2240 Final


SQL Server process generating all the IO

Obviously, something wrong with the application, right?

Process-based IO Rates

Page 40: Hptf 2240 Final


1.7GB. Excuse me?

But the server has 24GB of memory

SQL Server Memory

Page 41: Hptf 2240 Final


Soft paging into the free list

SQL Server paging

Page 42: Hptf 2240 Final


Soft paging into the free list: a huge IO load is generated as data is moved to and from the SQL Server process

SQL Server paging

Page 43: Hptf 2240 Final

So what happened?

• Although SQL Server Enterprise can be configured to use all available memory, it will not use more than 1.7GB of actual memory until Address Windowing Extensions (AWE) is enabled.
• AWE is configured through the sp_configure utility (‘show advanced options’ must be turned on first).
• AWE has to be enabled and then given the required memory size.
• AWE will not operate if there is less than 3GB of free memory on the server: SQL Server will disable it.
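A minimal sketch of the sp_configure sequence described above, driven from Python via pyodbc; the server name, connection string and the 20,480 MB memory ceiling are illustrative assumptions rather than values from this study:

```python
import pyodbc

# Assumed connection details; requires sysadmin rights on the SQL Server 2000 instance.
conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=myserver;Trusted_Connection=yes;",
    autocommit=True,  # let each RECONFIGURE run outside an explicit transaction
)
cur = conn.cursor()

# 1. Expose advanced options so 'awe enabled' becomes visible.
cur.execute("EXEC sp_configure 'show advanced options', 1; RECONFIGURE;")
# 2. Enable AWE (takes effect only after the SQL Server service is restarted).
cur.execute("EXEC sp_configure 'awe enabled', 1; RECONFIGURE;")
# 3. With AWE, memory is not released dynamically, so set an explicit ceiling in MB.
cur.execute("EXEC sp_configure 'max server memory', 20480; RECONFIGURE;")
```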

Page 44: Hptf 2240 Final


Production: IO before

Page 45: Hptf 2240 Final


Production: IO After

Page 46: Hptf 2240 Final


Production: IO Q Before

Page 47: Hptf 2240 Final


Production: IO Q After

Page 48: Hptf 2240 Final


Production: Disk Busy Q Before

Page 49: Hptf 2240 Final


Production: Disk Busy Q After

HUGE reduction in disk busy

Page 50: Hptf 2240 Final

Result

• CPU utilization increased
• Application could handle more concurrent users in test
• Customer very happy
 − No hardware purchase, no project, no application change
 − Rapid resolution of the problem
   • Took 2 hours to work it out
   • The problem had been present since January
• Relieved pressure on the SAN
 − Until another SQL Server with the same problem….

Page 51: Hptf 2240 Final

Lessons

• Even with a performance tool already in place, few people were using it well
• Blame game without looking at the facts (data)
• Need to improve fault-finding capabilities
 − Better ways to correlate data
 − Automatic methods of alerting to the real problem and its nature
• A classic case of the ‘cause behind the cause’

Page 52: Hptf 2240 Final

So what do we need?

• 1st hurdle overcome – obtaining data
• 2nd hurdle overcome – presenting data efficiently
• 3rd hurdle overcome – scalability of performance data from clients
• 4th hurdle overcome – automatic capacity planning data
• 5th hurdle – to do – making sense of the data
 − Expert reports
 − Just showing the issues
 − Removing the need for manual analysis

Page 53: Hptf 2240 Final

Want to know more?

•Booth Number 631

•http://www.perfcap.com

• [email protected]
• [email protected]
• [email protected]