ERCOT Project Update
ERCOT Outage Evaluation Phase 2 (SCR745)
TDTWG May 7, 2008
PR60006_01 Phase 2 ERCOT Update - Overview
Background:
SCR 745: To achieve improved Market performance and reliability through a reduction of ERCOT Retail Systems unplanned outages.
This effort was planned to be implemented in two subprojects:
PR60006_01: ERCOT Outage Evaluation Phase I and Phase II
• Phase I, NAESB and Proxy Clustered (Delivered 02/2007, Goal Achieved)
• Phase II, Paperfree Clustered environment with File Server Redundancy and High Availability
PR60006_02: Phase III, Database Clustered environment (Cancelled per recommendations at 04/02/2008 TDTWG)
Phase II Status:
02/10/2007 – Veritas clustered solution was implemented, then rolled back due to an unsuccessful failover.
03/08/2008 – Polyserve clustered solution was implemented, then rolled back due to performance and stability issues (this solution would have delivered Redundancy and Failover).
05/07/2008 – Seeking recommendations from TDTWG for Next Steps
PR60006_01 Phase 2 ERCOT Update – Next Steps
Recommendations from HP for performance improvement will require architectural changes, server rebuilds, and testing.
ERCOT recommends pursuing one of the following options:
1) Place project “On Hold” (preferred) due to the following:
• Stabilization of SAN Switch Replacement Project (Polyserve has a known issue with loss of connectivity to the SAN)
• Test environment lockdown until December 2008 due to Ts and Cs, MarkeTrak, and Nodal
• Resource constraints due to Ts and Cs, MarkeTrak, and Nodal
• Eliminates additional Finance charges while the project is on hold
• Allows the project to move forward in 2009 with an implementation that will deliver Failover capabilities (High Availability and Redundancy goal of the SCR)
2) Close project and complete effort as O&M:
• Additional funding will be required for remaining efforts
• Total project estimated at $1M, approved by the Board in 2005
• Approximately $885K committed; Board approval will be required for additional funding
PR60006_01 Phase 2 ERCOT Update – Outages
Retail Transaction Processing Unplanned Outages by # of Incidents
Year | NAESB | Seebeyond / TIBCO | Paperfree | Siebel | TML | Retail Databases
2004 | 15 | 8 | 5 | 3 | 6 | 7
2005 | 8 | 6 | 2 | 1 | 1 | 0
2006 | 2 | 1 | 2 | 0 | 0 | 1
2007 | 3 | 4 | 4 | 1 | 11 | 2
2008 | 1 | 0 | 3 | 0 | 6 | 0
Total | 29 | 19 | 16 | 5 | 24 | 10
* Based on IT Incident Report on 04/02/2008 and Metrics in SCR745
PR60006_01 Phase 2 ERCOT Update – Outages
Retail Transaction Processing Unplanned Outages by Approx. # of Minutes
Year | NAESB | Seebeyond / TIBCO | Paperfree | Siebel | TML | Retail Databases
2004 | 6778 | 5842 | 434 | 450 | 80 | 5600
2005 | 528 | 104 | 648 | 120 | 120 | 0
2006 | 55 | 540 | 847 | 0 | 0 | 502
2007 | 232 | 310 | 680 | 210 | 1031 | 820
2008 | 131 | 0 | 492 | 0 | 743 | 0
Total | 7724 | 6796 | 3101 | 780 | 1974 | 6922
Based on IT Incident Report and SCR Metrics
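The totals row in the minutes table can be reproduced by summing each system's column across the five years. The following is a minimal Python sketch of that check; the figures are transcribed from the slide, and the names `SYSTEMS`, `MINUTES`, `REPORTED_TOTALS`, and `column_totals` are illustrative, not part of any ERCOT tooling.

```python
# Approximate unplanned-outage minutes per system, transcribed from the slide.
# Column order follows the slide's header row.
SYSTEMS = ["NAESB", "Seebeyond / TIBCO", "Paperfree", "Siebel", "TML", "Retail Databases"]
MINUTES = {
    2004: [6778, 5842, 434, 450, 80, 5600],
    2005: [528, 104, 648, 120, 120, 0],
    2006: [55, 540, 847, 0, 0, 502],
    2007: [232, 310, 680, 210, 1031, 820],
    2008: [131, 0, 492, 0, 743, 0],
}
# Totals row as printed on the slide.
REPORTED_TOTALS = [7724, 6796, 3101, 780, 1974, 6922]


def column_totals(rows):
    """Sum each system's minutes across all years."""
    return [sum(column) for column in zip(*rows.values())]


totals = column_totals(MINUTES)
for name, computed, reported in zip(SYSTEMS, totals, REPORTED_TOTALS):
    status = "matches" if computed == reported else "differs from"
    print(f"{name}: {computed} minutes ({status} the slide)")
```

The same function applies unchanged to the incident-count table if its rows are substituted for `MINUTES`.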
PR60006_01 Phase 2 ERCOT Update – PF Outage Details (3yrs)
PaperFree Availability Metrics Prior to March 2008 as a result of 2007 Intermediate Resolutions
• Previous logged incident for PaperFree file server – 02/2007.
• Until March 2008 – Paperfree Application was 100% available due to intermediate solutions (meeting SCR Goal for reliability).
Issue Date | Duration (mins) | SLA Impacted | Application Impacted | Issue Description | Root Cause | Service Impact | Service Impact Detail
9/25/06 | 829 | Retail | Paperfree | Paperfree File Server not responding | Infrastructure | Outage | Unplanned Outage
10/2/06 | 18 | Retail | Paperfree | Paperfree File Server network outage | Infrastructure | Outage | Unplanned Outage
1/3/07 | 130 | Retail | Paperfree | Memory failure in the clustered environment | Infrastructure | Outage | Unplanned Outage
1/5/07 | 270 | Retail | Paperfree | Problem pulling data from NAESB | Infrastructure | Outage | Unplanned Outage
1/8/07 | 195 | Retail | Paperfree | Attempted to replace the Paperfree architecture as identified by the on-going Paperfree issues analysis | Infrastructure | Outage | Unplanned Outage
2/7/07 | 85 | Retail | Paperfree | Connectivity issue between application and SAN | Infrastructure | Outage | Unplanned Outage
3/19/08 | 147 | Retail | Paperfree | SAN Hardware failure | Infrastructure | Outage | Unplanned Outage
3/20/08 | 105 | Retail Market | Retail Market | Degradation Issues Post SCR745 Phase 2 solution | Polyserve Application/PF | Outage | Unplanned Outage
3/22/08 | 240 | Retail Market | Retail Market | Rollback from SCR745 Phase 2 implementation | Polyserve Application/PF | Outage | Unplanned Outage
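The incident durations listed above can be cross-checked against the Paperfree column of the minutes summary (847, 680, and 492 minutes for 2006, 2007, and 2008). The sketch below groups the detail rows by year and compares the sums. It assumes the two March 2008 "Retail Market" rows are counted under Paperfree in the summary, which is inferred only from the figures agreeing; the names `INCIDENTS`, `SUMMARY`, and `minutes_by_year` are illustrative.

```python
from collections import defaultdict

# (issue date, duration in minutes) for each incident in the detail table.
INCIDENTS = [
    ("9/25/06", 829), ("10/2/06", 18),
    ("1/3/07", 130), ("1/5/07", 270), ("1/8/07", 195), ("2/7/07", 85),
    ("3/19/08", 147), ("3/20/08", 105), ("3/22/08", 240),
]
# Paperfree minutes per year as reported in the summary table.
SUMMARY = {2006: 847, 2007: 680, 2008: 492}


def minutes_by_year(incidents):
    """Total outage minutes per calendar year from M/D/YY dates."""
    totals = defaultdict(int)
    for date, minutes in incidents:
        year = 2000 + int(date.rsplit("/", 1)[1])  # two-digit year, assumed 20YY
        totals[year] += minutes
    return dict(totals)


by_year = minutes_by_year(INCIDENTS)
for year in sorted(by_year):
    print(f"{year}: detail rows sum to {by_year[year]}, summary reports {SUMMARY[year]}")
```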
PR60006_01 Phase 2 ERCOT Update – TDTWG Recommendations
Discussion