RMS Workshop: Retail Systems Disaster Recovery (ERCOT, May 6th, 2014)

Page 1

RMS Workshop
Retail Systems Disaster Recovery

ERCOT
May 6th, 2014

Page 2

Addressing Retail DR Needs
Mandy Bauld

Page 3

Addressing Retail DR Needs

1. Understand the Current State

2. Define & Understand the Requirement

3. Close the Gap (Next Steps)

Page 4

ERCOT’s Primary Objective for Today

1. Understand the Current State

2. Define & Understand the Requirement

• What is the RMS position on a Recovery Time Objective (RTO) for the retail systems in the event of retail system outages?

• This information is required in order for ERCOT to effectively take any “next step” actions regarding DR for retail systems.

3. Close the Gap (Next Steps)

Page 5

RMS Feedback Summarized (April 1, 2014 RMS Meeting)

1. Understand the Current State
• Market needs education on what it takes for ERCOT to fail over, in order to level-set expectations of the current state
• Market needs to understand the dependencies and decision-making process around a failover
• Market needs information about ERCOT’s failover testing and planning for returning to the primary site

2. Define & Understand the Requirement
   1. Various opinions were expressed regarding downtime (need a limit on the downtime; need full operation within 24 hours; 24 hours of downtime is unacceptable)
   2. Even after systems were “back up,” there were still configuration issues in the DR environment and a backlog to work through, more than was acceptable; service was not fully restored
   3. Cleanup work was more than expected
   4. The market’s manual work-around capability is limited, is not sustainable, and is not without risk
   5. Incomplete or delayed communication has a direct impact on the market’s ability to make decisions and manage customer expectations through the event

3. Close the Gap (Next Steps)
• Identify options that improve ERCOT’s and the market’s ability to meet the defined requirements

Page 6

Agenda

• March 11th Outage Timeline
• Overview – Retail DR Capability & Process
• System Outage Communications
• System Outage Requirements
• Work-Around Processes
• Planned Failover Communications

Page 7

March 11th Outage Timeline
Dave Pagliai

Page 8

System Outage on March 11, 2014

Impacted Services: Retail Transaction Processing, MarkeTrak, eService (service requests and settlement disputes), Retail Data Access & Transparency through MIS and ercot.com, Settlement & Billing processes, ercot.com, Retail Flight Testing (CERT), MPIM, Texas Renewables website (REC), NDCRC through the MIS

Page 9

System Outage on March 11, 2014

Infrastructure Failure – Tuesday 03/11/14 @ 9:27 AM

Restoration of Market Facing Systems

Tuesday 03/11/14
Texas Renewables (REC) – www.texasrenewables.com – 03/11/14 @ 4:01 PM
Settlement & Billing processes – 03/11/14 @ 4:36 PM
ercot.com – 03/11/14 @ 5:20 PM

Wednesday 03/12/14
Registration (Siebel) – 03/12/14 @ 1:15 PM
MPIM – 03/12/14 @ 1:15 PM
Retail Transaction Processing – 03/12/14 @ 2:10 PM
MarkeTrak API – 03/12/14 @ 2:10 PM
MarkeTrak GUI – 03/12/14 @ 2:10 PM
MIS Retail Applications – 03/12/14 @ 2:10 PM
NDCRC through the MIS – 03/12/14 @ 3:02 PM
Retail Transaction Processing backlog complete – 03/12/14 @ 10:15 PM

Page 10

System Outage on March 11, 2014

Restoration of Market Facing Systems (continued)

Thursday 03/13/14
CERT – 03/13/14 @ 3:13 PM. This was actually unrelated to the infrastructure outage.

Friday 03/14/14
eService (service requests and settlement disputes) – 03/14/14 @ 5:30 PM
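The restoration timeline above can be turned into elapsed outage durations with simple date arithmetic. The sketch below is illustrative only: the timestamps are copied from the timeline slides, but the names `FAILURE`, `RTO`, `RESTORED`, `hours_down`, and `restored_after_rto` are hypothetical, and it assumes the "generally, 24 hour" RTO is measured from the 9:27 AM infrastructure failure, which the slides do not state. CERT is left out because the slides describe it as unrelated to the infrastructure outage.

```python
from datetime import datetime, timedelta

# From the timeline slides: time of the infrastructure failure.
FAILURE = datetime(2014, 3, 11, 9, 27)

# Assumption: the 24-hour RTO clock starts at the infrastructure failure.
# The slides do not say when the clock starts.
RTO = timedelta(hours=24)

# Restoration times as listed on the timeline slides (CERT omitted).
RESTORED = {
    "Texas Renewables (REC)": datetime(2014, 3, 11, 16, 1),
    "Settlement & Billing processes": datetime(2014, 3, 11, 16, 36),
    "ercot.com": datetime(2014, 3, 11, 17, 20),
    "Registration (Siebel)": datetime(2014, 3, 12, 13, 15),
    "MPIM": datetime(2014, 3, 12, 13, 15),
    "Retail Transaction Processing": datetime(2014, 3, 12, 14, 10),
    "MarkeTrak API": datetime(2014, 3, 12, 14, 10),
    "MarkeTrak GUI": datetime(2014, 3, 12, 14, 10),
    "MIS Retail Applications": datetime(2014, 3, 12, 14, 10),
    "NDCRC through the MIS": datetime(2014, 3, 12, 15, 2),
    "Retail Transaction Processing backlog complete": datetime(2014, 3, 12, 22, 15),
    "eService": datetime(2014, 3, 14, 17, 30),
}


def hours_down(name):
    """Hours between the infrastructure failure and the listed restoration."""
    return (RESTORED[name] - FAILURE) / timedelta(hours=1)


def restored_after_rto():
    """Names of items restored more than RTO after the failure."""
    return [name for name, t in RESTORED.items() if t - FAILURE > RTO]


if __name__ == "__main__":
    late = set(restored_after_rto())
    for name in RESTORED:
        flag = "over 24 h" if name in late else ""
        print(f"{name:48s} {hours_down(name):6.2f} h  {flag}")
```

Under these assumptions, Texas Renewables comes out at roughly 6.6 hours and Retail Transaction Processing at roughly 28.7 hours after the failure. Whether a given item counts as a "core" system for RTO purposes is not defined in the slides, so the 24-hour comparison is a reading aid rather than a compliance check.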

Page 11

Communications

• 1st Market Notice: Tuesday, 3/11 at 12:27 PM
• 1st Retail Market Conference Call: Tuesday, 3/11 at 1:00 PM
• Regular Market Notices and Retail Market Conference Calls through Friday, 3/14
• Final notice on Monday, 3/17

Page 12

Challenges

• The outage impacted various internal ERCOT applications and tools required to communicate both internally and externally, and to access procedural documentation

• ERCOT’s registration system failed to initialize in the alternate data center, requiring application servers to be rebuilt and delaying the restoration of several other Retail systems/services

Page 13

Overview – Retail DR Capability & Process
Aaron Smallwood

Page 14

Retail/Commercial Systems DR Background

• Pre-2011
– No DR environment or failover capability; strategy was to utilize the iTest environment if an extended outage occurred

• December 2010
– Retail/Commercial Systems DR environment delivered and tested by the data center project
– Evolving state of maturity
– Strategy: failover in a major outage event when primary system cannot be restored within required timeframe
– Recovery Time Objective (RTO): generally, 24 hour recovery of core systems and services
– Recovery Point Objective (RPO): 0 data loss
– Testing Strategy: recovery site operability tested annually

Page 15

Retail/Commercial Systems – Current DR Capability

• Capable of operating out of the primary or alternate data center

• Capable of a 24 hour Recovery Time Objective (RTO)
– Moving operations to an alternate data center within 24 hours
– Historically have accomplished in less than 24 hours

• Capable of a 0 data loss Recovery Point Objective (RPO)

• Historical use of DR capability:

December 2012 – Unplanned failover
• First “real” use of the DR environment

June 2013 – Planned failover
• First transition back to the primary environment

March 2014 – Unplanned failover

Page 16

ERCOT Retail DR - Context
For security reasons ERCOT cannot provide specific details regarding environments or business continuity plans.

• Outage events are like fingerprints… each is unique.

• Issues are initially handled as an IT Incident

– First Priority – find the issue

– Identify the scope of impacted systems and functions

– Identify options and limitations for restoration and recovery

– Make recommendations and coordinate restoration

– Engage Business SMEs and Market Communications team

• Potential for an extended outage?

– Mobilize the Disaster Management Team (DMT)
  • Executive management and Director-level leadership across the company

– The DMT monitors the situation throughout the event to understand the scope of the problem and its impacts, and to make “the big decisions”

– Examples: the 12/3/2012 and the 3/11/2014 outages

Page 17

ERCOT Retail DR - Process

• The systems are not currently capable of “real-time” failover

• Once the failover decision is made, IT follows procedures to:
– Verify readiness of the alternate environment
– Control data replication streams during the transition
– Configure integrated systems to point to the alternate environment

• Business follows procedures to:
– Prioritize recovery efforts
– Work with IT to determine where processes left off and where they should start after recovery
– Determine and mitigate potential for data loss
– Determine if/what work-arounds may be necessary upon recovery
– Verify ability to use systems in the alternate environment
– Support Market Communications

Page 18

System Outage Communications
Ted Hailu

Page 19

System Outage - Communications

• Initial and Subsequent Communications (when/how)

• Contingency Plans for email communication

• Communication outside the Market Notice process

Page 20

System Outage Requirements
Dave Michelsen

Page 21

System Outage Requirements

• What is the market’s priority for recovery of retail services?
– What is the quantitative and/or qualitative impact of the unavailability of each service?

• What is the RTO for each? May need to consider:
– Operating timelines
– Importance of time of day (AM vs PM) or day of week and time of day
– Tolerance level relative to invoking safety net procedures
– Cost to support increases as the RTO decreases

Page 22

System Outage Requirements

| Retail Service | Priority | RTO |
| Retail Transaction Processing | | |
| Disputes & Issues | | |
| Flight Testing | | |
| Data Access & Transparency (reports/extracts) | | |
| Data Access & Transparency / Ad-Hoc Requests | | |

Page 23

Work-Around Processes
Dave Michelsen

Page 24

Workaround Process

• Are the current processes sufficient?
– Move-In(s)
  • Switch Hold Removals
– Move-Out(s)
– Others?

Page 25

Planned Failover Communications
Dave Michelsen

Page 26

Planned Failover Communications

• Communication for a planned transition between sites
– i.e., transition from the alternate data center back to the primary data center
– Follow normal processes regarding outages/maintenance communication
– Transition would be performed in a Sunday maintenance window

Page 27

The Parking Lot

Page 28

Parking Lot Items

• TBD