ERCOT Project Update ERCOT Outage Evaluation Phase 2 (SCR745) TDTWG May 7, 2008

Upload: ethan-hamilton

Post on 04-Jan-2016


TRANSCRIPT

Page 1: ERCOT Project Update ERCOT Outage Evaluation Phase 2 (SCR745) TDTWG May 7, 2008

ERCOT Project Update

ERCOT Outage Evaluation Phase 2 (SCR745)

TDTWG May 7, 2008

Page 2: ERCOT Project Update ERCOT Outage Evaluation Phase 2 (SCR745) TDTWG May 7, 2008

PR60006_01 Phase 2 ERCOT Update - Overview

Background:

SCR 745: To achieve improved Market performance and reliability through a reduction of ERCOT Retail Systems unplanned outages.

This effort was planned to be implemented in two subprojects:

PR60006_01: ERCOT Outage Evaluation Phase I and Phase II

• Phase I, NAESB and Proxy Clustered (Delivered 02/2007, Goal Achieved)

• Phase II, Paperfree Clustered environment with File Server Redundancy and High Availability

PR60006_02: Phase III, Database Clustered environment (Cancelled per recommendations at 04/02/2008 TDTWG)

Phase II Status:

02/10/2007 – Implementation of the Veritas clustered solution resulted in rollback due to unsuccessful failover.

03/08/2008 – Implementation of the Polyserve clustered solution resulted in rollback due to performance and stability issues. (This solution would have delivered Redundancy and Failover.)

05/07/2008 – Seeking recommendations from TDTWG for Next Steps

Page 3: ERCOT Project Update ERCOT Outage Evaluation Phase 2 (SCR745) TDTWG May 7, 2008

PR60006_01 Phase 2 ERCOT Update – Next Steps

Recommendations from HP for performance improvement will require architectural changes, server rebuilds, and testing.

ERCOT recommends pursuing one of the following options:

1) Place project “On Hold” (preferred) due to the following:

• Stabilization of SAN Switch Replacement Project (Polyserve known issue with loss of connectivity to SAN)

• Test Environment lockdown until December 2008 due to Ts and Cs, MarkeTrak, and Nodal

• Resource constraints due to Ts and Cs, MarkeTrak, and Nodal

• Placing the project on hold eliminates additional Finance charges

• Allows the project to move forward in 2009 with an implementation that will deliver Failover capabilities (High Availability and Redundancy Goal of SCR)

2) Close project and complete effort as O&M:

• Additional funding will be required for remaining efforts

• Total Project estimated at $1M, approved by Board in 2005

• Committed approximately $885K; additional funding will require Board approval

Page 4: ERCOT Project Update ERCOT Outage Evaluation Phase 2 (SCR745) TDTWG May 7, 2008

PR60006_01 Phase 2 ERCOT Update – Outages

Retail Transaction Processing Unplanned Outages by # of Incidents

Year   NAESB  Seebeyond / TIBCO  Paperfree  Siebel  TML  Retail Databases
2004      15                  8          5       3    6                 7
2005       8                  6          2       1    1                 0
2006       2                  1          2       0    0                 1
2007       3                  4          4       1   11                 2
2008       1                  0          3       0    6                 0
Total     29                 19         16       5   24                10

* Based on IT Incident Report on 04/02/2008 and Metrics in SCR745
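The Total row of the incident table can be re-derived from the yearly rows. The following is a minimal sketch of that arithmetic, with the counts transcribed from the table; the names `SYSTEMS`, `INCIDENTS`, `totals_by_system`, and `totals_by_year` are illustrative and not part of the ERCOT material.

```python
# Incident counts transcribed from the slide; column order follows SYSTEMS.
SYSTEMS = ["NAESB", "Seebeyond / TIBCO", "Paperfree", "Siebel", "TML", "Retail Databases"]

INCIDENTS = {
    2004: [15, 8, 5, 3, 6, 7],
    2005: [8, 6, 2, 1, 1, 0],
    2006: [2, 1, 2, 0, 0, 1],
    2007: [3, 4, 4, 1, 11, 2],
    2008: [1, 0, 3, 0, 6, 0],
}


def totals_by_system(table):
    """Sum each column (one per system) across all years."""
    return [sum(column) for column in zip(*table.values())]


def totals_by_year(table):
    """Sum each row (one per year) across all systems."""
    return {year: sum(row) for year, row in table.items()}


if __name__ == "__main__":
    for system, total in zip(SYSTEMS, totals_by_system(INCIDENTS)):
        print(f"{system}: {total}")
    print(totals_by_year(INCIDENTS))
```

The per-system sums reproduce the Total row (29, 19, 16, 5, 24, 10). The per-year sums are not shown on the slide; they are simply the row totals of the same data.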

Page 5: ERCOT Project Update ERCOT Outage Evaluation Phase 2 (SCR745) TDTWG May 7, 2008

PR60006_01 Phase 2 ERCOT Update – Outages

Retail Transaction Processing Unplanned Outages by Approx. # of Minutes

Year   NAESB  Seebeyond / TIBCO  Paperfree  Siebel  TML  Retail Databases
2004    6778               5842        434     450   80              5600
2005     528                104        648     120  120                 0
2006      55                540        847       0    0               502
2007     232                310        680     210 1031               820
2008     131                  0        492       0  743                 0
Total   7724               6796       3101     780 1974              6922

* Based on IT Incident Report and SCR Metrics
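Outage minutes can be turned into an approximate availability percentage for a full calendar year. The sketch below assumes a 24x7 measurement window, which the slides do not state; the actual SLA window may differ, so the results are illustrative only. The function names are hypothetical, and 2008 is left out because the report covers only part of that year.

```python
import calendar


def minutes_in_year(year: int) -> int:
    """Total minutes in a calendar year, assuming a 24x7 service window."""
    days = 366 if calendar.isleap(year) else 365
    return days * 24 * 60


def availability_pct(outage_minutes: int, year: int) -> float:
    """Approximate availability as a percentage of the full year."""
    return 100.0 * (1 - outage_minutes / minutes_in_year(year))


# Paperfree unplanned outage minutes transcribed from the table (full years only).
PAPERFREE_MINUTES = {2004: 434, 2005: 648, 2006: 847, 2007: 680}

if __name__ == "__main__":
    for year, minutes in PAPERFREE_MINUTES.items():
        print(year, f"{availability_pct(minutes, year):.3f}%")
```

Swapping in another column of the table gives the same estimate for the other systems, under the same 24x7 assumption.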

Page 6: ERCOT Project Update ERCOT Outage Evaluation Phase 2 (SCR745) TDTWG May 7, 2008

PR60006_01 Phase 2 ERCOT Update – PF Outage Details (3yrs)

PaperFree Availability Metrics Prior to March 2008 as a result of 2007 Intermediate Resolutions

• Previous logged incident for PaperFree file server – 02/2007.

• Until March 2008 – Paperfree Application was 100% available due to intermediate solutions (meeting SCR Goal for reliability).

Issue Date | Duration (mins) | SLA Impacted | Application Impacted | Issue Description | Root Cause | Service Impact | Service Impact Detail
9/25/06 | 829 | Retail | Paperfree | Paperfree File Server not responding | Infrastructure | Outage | Unplanned Outage
10/2/06 | 18 | Retail | Paperfree | Paperfree File Server network outage | Infrastructure | Outage | Unplanned Outage
1/3/07 | 130 | Retail | Paperfree | Memory failure in the clustered environment | Infrastructure | Outage | Unplanned Outage
1/5/07 | 270 | Retail | Paperfree | Problem pulling data from NAESB | Infrastructure | Outage | Unplanned Outage
1/8/07 | 195 | Retail | Paperfree | Attempted to replace the Paperfree architecture as identified by the on-going Paperfree issues analysis | Infrastructure | Outage | Unplanned Outage
2/7/07 | 85 | Retail | Paperfree | Connectivity issue between application and SAN | Infrastructure | Outage | Unplanned Outage
3/19/08 | 147 | Retail | Paperfree | SAN Hardware failure | Infrastructure | Outage | Unplanned Outage
3/20/08 | 105 | Retail Market | Retail Market | Degradation Issues Post SCR745 Phase 2 solution | Polyserve Application/PF | Outage | Unplanned Outage
3/22/08 | 240 | Retail Market | Retail Market | Rollback from SCR745 Phase 2 implementation | Polyserve Application/PF | Outage | Unplanned Outage
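The incident durations listed above can be cross-checked against the yearly Paperfree figures in the outage-minutes table (847 for 2006, 680 for 2007, 492 for 2008). The following is a minimal sketch of that check, with dates and durations transcribed from the detail rows; the names `PF_INCIDENTS` and `minutes_by_year` are illustrative.

```python
from collections import defaultdict

# (issue date as M/D/YY, duration in minutes), transcribed from the detail rows.
PF_INCIDENTS = [
    ("9/25/06", 829),
    ("10/2/06", 18),
    ("1/3/07", 130),
    ("1/5/07", 270),
    ("1/8/07", 195),
    ("2/7/07", 85),
    ("3/19/08", 147),
    ("3/20/08", 105),
    ("3/22/08", 240),
]


def minutes_by_year(rows):
    """Group outage durations by calendar year (two-digit years taken as 20YY)."""
    totals = defaultdict(int)
    for date, minutes in rows:
        year = 2000 + int(date.split("/")[2])
        totals[year] += minutes
    return dict(totals)


if __name__ == "__main__":
    print(minutes_by_year(PF_INCIDENTS))
```

The grouped sums agree with the yearly Paperfree minutes reported earlier in the deck, and the row counts per year (2, 4, 3) agree with the Paperfree incident counts.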

Page 7: ERCOT Project Update ERCOT Outage Evaluation Phase 2 (SCR745) TDTWG May 7, 2008

PR60006_01 Phase 2 ERCOT Update – TDTWG Recommendations

Discussion