Redo Configuration in a Real World Environment
Redo log configuration for a high-transaction environment.
1
Redo Configuration for Sustained and Scalable Performance in a
Real World Environment
Krishna Manoharan
2
Background
During routine capacity planning of an Operational Data Store, it was noticed that large extracts were not scaling as predicted.
Among other things, the analysis showed elevated log-related waits.
A study was performed to reduce or eliminate these log-related waits.
3
The Environment
Oracle 10.2.0.3 Enterprise Edition 64bit on Solaris 9
Database Size – 4 TB
SunFire E4800 with 12 CPUs, 48GB RAM and 2 IO boards.
Work Load – Numerous small transactions from replication plus large batch processes through ETL. All operations in logging mode.
Commit frequency ranges from every row change (replication) to once per million+ row changes (ETL).
4
Work Load Profile (redo) of the instance (Peak Hours)
Stat                              Value
Session commit wait requested     72/sec
Session commit wait performed     72/sec
Session redo entries generated    13326 entries/sec
Session redo size                 9.45 MB/sec
LGWR redo synch writes            72 sync writes/sec
LGWR redo writes                  74 writes/sec
LGWR throughput                   9.8 MB/sec
5
Objectives of the study
Deliver predicted scalability by
Reducing/eliminating log-related waits.
Improving log-related statistics.
Eliminating LGWR/ARCH as a bottleneck.
Performance as measured by
Improved transaction rates.
Meeting pre-defined thresholds for waits and stats.
Build a standard for an optimal and scalable redo/archive configuration.
This is a performance tuning exercise, not capacity planning.
Limited to configuration changes only; no code changes.
6
Symptoms from Oracle
Top log related wait statistics (Peak Hours)
Event                                      Existing waits/sec   Average wait   Threshold (average wait)
latch: redo allocation                     Negligible           Negligible     Eliminate
latch: redo writing                        Negligible           Negligible     Eliminate
log buffer space                           3.5 waits/sec        28.6 ms        Eliminate
log file parallel write                    74 waits/sec         11.8 ms        < 2 ms
log file sequential read                   10 waits/sec         7.02 ms        < 10 ms
log file switch completion                 0.5 waits/sec        20.82 ms       ?
log file sync                              72 waits/sec         26.45 ms       < 5 ms
log file switch (checkpoint incomplete)    0.5 waits/sec        0.38 ms        Eliminate
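A rough equivalent of this table can be pulled directly from v$system_event (a sketch, not necessarily the method used for this study; the view is cumulative since instance startup, so per-second rates require differencing two snapshots, and 10g reports average_wait in centiseconds):
select event,
       total_waits,
       time_waited * 10  as time_waited_ms,
       average_wait * 10 as avg_wait_ms
from   v$system_event
where  event like 'log%'
   or  event like 'latch: redo%'
order  by time_waited desc;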
7
Symptoms from Oracle – contd.
Top log related instance statistics (During peak hours)
Statistic                                  Existing                                    Threshold
LGWR redo sync response time               0.27 ms/sync write                          < 0.1 ms
LGWR redo write response time              0.12 ms/write                               < 0.1 ms/write
Session redo buffer allocation retries     0.0002 retries/entry, 3.8 retries/sec       Eliminate
Session redo log space requests            0.000081/entry, 1 request/sec               Eliminate
Session redo log space wait time           0.12 ms/wait for space                      < 0.1 ms/wait for space
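The underlying counters live in v$sysstat; a minimal sketch of the kind of query behind these numbers (the view is cumulative, so the per-second and per-entry figures above are derived by differencing two snapshots over the sample interval):
select name, value
from   v$sysstat
where  name in ('redo entries', 'redo size', 'redo writes', 'redo synch writes',
                'redo synch time', 'redo write time',
                'redo buffer allocation retries',
                'redo log space requests', 'redo log space wait time');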
8
Symptoms from the System
Top system related statistics (During peak hours)
Area      Parameter                                          Before                          Threshold
CPU       Average run queue                                  1.2                             No change
CPU       Average involuntary context switches for LGWR      30%                             Eliminate
Storage   Average redo LUN response time                     8 ms (reads), 12 ms (writes)    10 ms (reads), 2 ms (writes)
Storage   Average file response time (redo logs)             8 ms (reads), 12 ms (writes)    10 ms (reads), 2 ms (writes)
Storage   Average redo volume response time                  8 ms (reads), 12 ms (writes)    10 ms (reads), 2 ms (writes)
9
Existing Log Configuration (Instance)
No underscore (_) parameters set.
log_buffer – default (observed as 4MB)
Redo log groups – 3
Size of members – 500M
log_archive_max_processes – default (observed as 2)
Using Veritas (VRTS) ODM and VxFS
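A sketch of how this configuration can be confirmed from the instance (standard dictionary views and SQL*Plus commands, nothing specific to this study):
select l.group#, l.bytes/1024/1024 as size_mb, l.members, l.status, f.member
from   v$log l, v$logfile f
where  f.group# = l.group#
order  by l.group#, f.member;

show parameter log_buffer
show parameter log_archive_max_processes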
10
Existing System/Storage Configuration
Default scheduling class – TS and Default priorities for Oracle.
Thread Affinity set to 150.
Storage Foundation 4.1 MP2 for Oracle Enterprise Edition.
maxphys set to 8M (system and VxVM)
LUN queue depth – 30, with a max of 256 per target.
All LUNs – RAID 1 using 72GB, 15K RPM FC drives. Storage – Hitachi AMS1000
Dual Simultaneous Active (2Gbit) Paths to each lun. Load Balancing via vxdmp.
11
Existing Physical Log Configuration (Filesystem)
[Diagram] Two 72GB, 15K RPM FC drives in a RAID 1 pair present a single 66.4GB LUN, which hosts both redo filesystems:
Filesystem 1 (vxfs) /u05/redo1
  LOG01A.dbf - 500M (Primary of Group 1)
  LOG02A.dbf - 500M (Primary of Group 2)
  LOG03A.dbf - 500M (Primary of Group 3)
Filesystem 2 (vxfs) /u05/redo2
  LOG01B.dbf - 500M (Mirror of Group 1)
  LOG02B.dbf - 500M (Mirror of Group 2)
  LOG03B.dbf - 500M (Mirror of Group 3)
12
Log Waits - Schematic
[Schematic] User sessions generate redo into the shared log buffer and private redo strands in the SGA (wait: log buffer space). LGWR writes the buffer to the online redo log groups (waits: log file parallel write, log file sync, log file switch completion), while ARCH reads completed groups (wait: log file sequential read) and writes the archive log files (wait: log archive IO). Both LGWR and ARCH depend on the CPU and IO subsystems.
13
Analysis of the symptoms
LGWR related
Wait - log file parallel write - “Writing redo records to the redo
log files from the log buffer. The wait time is the time it takes for
the I/Os to complete.”
High average wait time (11.8 ms).
Correlating Stats
High LUN response time for the redo log filesystem.
High redo sync time (0.27 ms/sync write).
High redo write time (0.12 ms/write).
LGWR redo sync writes (72 writes/sec).
Higher buffer allocation retries (3.8 retries/sec).
High degree of involuntary context switches for the LGWR process.
14
Analysis of the symptoms (contd.)
Wait - log file parallel write (contd.)
With a high rate of commits, it is more important to review the average response time per wait than the number of waits. Every commit write wait results in an increment to the log file parallel write event.
The physical layout of the redo log filesystems shows a single LUN used for all the groups. Since redo log members are relatively small, it is common practice for the Storage/System Admin to assign a single LUN which is then used for all the redo filesystems. This invariably leads to poor IO performance and a slower LGWR.
A slower LGWR also results in poor commit performance (sync writes), as evidenced by the correlating stats.
A slower LGWR results in higher buffer allocation retries because LGWR is unable to write the redo entries and flush the buffer to disk fast enough to meet the sessions' requirements.
The overall run queue on the system was low; however, involuntary context switching (~30%) indicated that LGWR was being switched off the CPU before it could complete its task.
The high LUN response time for the redo log filesystems indicated that IO was a bottleneck.
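A sketch of the kind of OS-level check behind the context-switch observation, using Solaris microstate accounting via prstat (the ora_lgwr process name pattern and SID placeholder are illustrative):
# ps -ef | grep ora_lgwr_<SID>      # find the LGWR pid
# prstat -mL -p <LGWR pid> 10 6     # the ICX column reports involuntary context switches per interval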
15
Analysis of the symptoms (contd.)
LGWR related
Wait - log file switch completion – “Waiting for a log switch to
complete”
Wait - log file switch (checkpoint incomplete) – “Waiting for a
log switch because the session cannot wrap into the next log.
Wrapping cannot be performed because the checkpoint for that
log has not completed”
Large number of waits (0.5 waits/sec) with high average
wait time (20.82 ms).
Correlating Stats
redo log space requests – 1 request/sec
redo log space wait time – 0.12ms/entry
16
Analysis of the symptoms (contd.)
Wait - log file switch completion (contd.)
During a log file switch, redo generation is disabled. So this wait
directly impacts session performance.
The log members were only 500M in size and thus causing frequent
log switches (every 1 minute). This will result in higher waits.
The log_buffer is 4M in size and during a log switch, the log buffer is
flushed to disk. If there is an IO bottleneck to the redo log files, then
flushing 4M of log buffer could result in higher response times.
Since the redo log groups were on the same set of filesystems, there could be an IO conflict between the checkpoint and LGWR processes during a log switch, as shown by the wait log file switch (checkpoint incomplete).
However a bigger log file can also cause slower log file switches.
The impact of increasing the log member size needs to be studied
with respect to the event – log file switch completion.
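The switch rate cited above (roughly one per minute) can be confirmed from v$log_history; a minimal sketch, counting switches per hour over the last day:
select to_char(first_time, 'YYYY-MM-DD HH24') as hour,
       count(*)                               as log_switches
from   v$log_history
where  first_time > sysdate - 1
group  by to_char(first_time, 'YYYY-MM-DD HH24')
order  by 1;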
17
Analysis of the symptoms (contd.)
ARCH related
Wait - log file sequential read – “Waiting for the
read from this logfile to return. This is used to read
redo records from the log file – either for recovery or
archival. The wait time is the time it takes to complete
the physical I/O (read)”
High number of waits (10 waits/sec) with
high average wait time (7 ms).
Correlating Stats
High Lun response time for the redo log
filesystem.
Event – log file parallel write (high average
wait time – 11.8 ms)
18
Analysis of the symptoms (contd.)
Wait - log file sequential read (contd.)
Small redo log members cause frequent log switches (1/minute). These logs need to be archived, which indirectly impacts the event log file sequential read.
Members of the redo groups were located on the same filesystems and share the same physical LUNs.
This results in IO contention, because the ARCH process is reading from the previous group while LGWR is writing to the current group.
This in turn impacts LGWR write performance, resulting in increased response times for the events log file parallel write and log file sync.
Poor archival performance can also indirectly impact log switches, as reported in the event log file switch (archival incomplete), and thus session performance.
For 500M log members, the average response time is on the higher side, again indicating IO contention.
Since redo log members are relatively small, it is common practice for the Storage/System Admin to assign a single LUN which is then used for all the redo filesystems.
Because the access pattern is sequential, this problem is multiplied in effect, especially if the LUN is RAID 5.
Increasing log file sizes can also cause this event to report higher wait times.
19
Analysis of the symptoms (contd.)
Session related
Wait - log buffer space – “Waiting for space in
the log buffer because the session is writing data
into the log buffer faster than LGWR can write it
out”
High number of log buffer space waits (3.5
waits/sec) with an average response time of
28.6 ms.
Correlating Stats
Event – log file parallel write (high average
wait time)
redo buffer allocation retries (3.8
retries/sec).
20
Analysis of the symptoms (contd.)
Wait - log buffer space (contd.)
This, along with the high response time for the log file parallel write wait, shows a slow LGWR.
The presence of higher redo buffer allocation retries also corroborates this wait.
It also can mean that the default log buffer
(4MB) is too small for the rate of redo
generation (9.45 MB/sec).
During a log switch, LGWR flushes the
log_buffer to disk. So the impact of increasing
the size of the log_buffer needs to be analyzed
with respect to the event – log file switch
completion.
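A sketch of the retries-per-entry check referenced above; as a rule of thumb the ratio should stay close to zero (the values are cumulative, so snapshot deltas are preferable for a peak-hour view):
select r.value                     as buffer_allocation_retries,
       e.value                     as redo_entries,
       round(r.value / e.value, 6) as retries_per_entry
from   (select value from v$sysstat where name = 'redo buffer allocation retries') r,
       (select value from v$sysstat where name = 'redo entries') e;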
21
Analysis of the symptoms (contd.)
Session related
Wait - log file sync – “When a user session commits,
the session's redo information needs to be flushed to
the redo logfile. The user session will post the LGWR
to write the log buffer to the redo log file. When the
LGWR has finished writing, it will post the user
session. The wait time includes the writing of the log
buffer and the post.”
The average wait time was 26.45 ms.
Correlating Stats
Event – log file parallel write (high average
wait time – 11.8 ms)
High degree of involuntary context switches
for both user session and LGWR.
22
Analysis of the symptoms (contd.)
Wait - log file sync (contd.)
Every commit write wait/immediate will result in an
increment of the wait counter and a redo write (resulting in
an increment to the log file parallel write wait counter).
Rather than the number of waits, the average wait time is
important for this wait event.
Under ideal circumstances, the average wait time for a log
file sync event should be about the same as the
average wait time for the wait – log file parallel write. If
there is a difference, then it probably indicates a CPU
bottleneck for the session.
Higher wait times can be a result of slow LGWR as well
as CPU bottleneck (evidenced by high involuntary context
switches for session processes)
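A sketch of the comparison described above; with cumulative averages from v$system_event, a log file sync average that is much larger than the log file parallel write average points at time spent off the CPU (or in posting) rather than in the redo write itself:
select event, total_waits, average_wait * 10 as avg_wait_ms
from   v$system_event
where  event in ('log file sync', 'log file parallel write');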
23
Initial Conclusions
From the waits and stats, we came to the following conclusions:
LGWR
The underlying IO subsystem for the
redo logs needed to be improved.
The redo log members needed to be
resized from 500M to a suitable size.
Also increase the groups from 3 to 4.
Reduce LGWR involuntary switches by
addressing OS scheduling issues.
24
Initial Conclusions (contd.)
ARCH
Separate the redo log groups onto
dedicated filesystems to prevent
contention between ARCH and LGWR.
Session
Increase log buffer from the default to a
suitable value taking into consideration
impact on the event log file switch
completion.
25
Final Configuration Details
After 30 or so runs, we finally arrived at the optimal configuration below.
Redo Filesystem configuration (to address IO issues)
Striped filesystems on dedicated Raid 1 luns configured for the redo
logs as shown in the next slide.
Filesystem is vxfs with 8k block size.
Stripe Width = 1M
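For illustration only, a sketch of how such a striped volume and an 8k-block vxfs filesystem might be built under Storage Foundation (the disk group, volume name, size and two-column stripe are assumptions, not the exact commands from this exercise; options vary by version):
# vxassist -g redodg make redovol1 10g layout=stripe ncol=2 stripeunit=1m
# mkfs -F vxfs -o bsize=8192 /dev/vx/rdsk/redodg/redovol1
# mount -F vxfs /dev/vx/dsk/redodg/redovol1 /u05/redo1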
LGWR configuration (to address involuntary context
switches)
The FX scheduling class was set for the LGWR process. The CPU time
quantum was increased to 1000 and the priority set to 59.
# priocntl -s -c FX -m 59 -p 59 -t 1000 -i pid <LGWR process>
Thread affinity was set to 150 for the entire system; however, we decided it was best to bind LGWR to a specific CPU by creating a processor set and binding the LGWR process to it.
# psrset -c <CPU>
# psrset -b 1 <LGWR process>
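The result can be verified with standard tools (a sketch; expect class FX, priority 59, and the expected processor set in the output):
# ps -o pid,class,pri,args -p <LGWR pid>
# psrset -q <LGWR pid>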
26
New Filesystem Layout and redo group placement
[Diagram] RAID 1 pairs of 72GB, 15K RPM FC drives, each presenting a single 66.4GB LUN, are dedicated to the redo filesystems. Four striped vxfs filesystems (layout=stripe, stripe width = 1M) host the redo members:
Filesystem 1 (vxfs, stripe, 1M stripe width) /u05/redo1
  LOG01A.dbf - 1500M (Primary of Group 1)
  LOG03A.dbf - 1500M (Primary of Group 3)
Filesystem 2 (vxfs, stripe, 1M stripe width) /u05/redo2
  LOG02B.dbf - 1500M (Mirror of Group 2)
  LOG04B.dbf - 1500M (Mirror of Group 4)
Filesystem 3 (vxfs, stripe, 1M stripe width) /u05/redo3
  LOG02A.dbf - 1500M (Primary of Group 2)
  LOG04A.dbf - 1500M (Primary of Group 4)
Filesystem 4 (vxfs, stripe, 1M stripe width) /u05/redo4
  LOG01B.dbf - 1500M (Mirror of Group 1)
  LOG03B.dbf - 1500M (Mirror of Group 3)
27
Final Configuration Details (contd.)
Redo groups
4 redo groups configured with 2 members each.
The log members were placed on the redo filesystems in such a manner as to eliminate LGWR and ARCH IO contention.
Each member was 8G in size (8G log members would reduce the log switches from 1 switch per minute to 1 switch every 7 minutes).
Reducing log switches improves performance as during a log switch, redo generation is disabled.
8G was an ideal size – log archiving completed within 2 minutes whereas log switches happened every 7 minutes.
Increasing the log member size resulted in higher wait times for the events – log file switch completion and log file sequential read.
However the overall performance gain was well worth it.
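One way the new groups could have been put in place online (a sketch using the member paths from the layout above and the 8G size quoted here; the group number is illustrative, and an old group can only be dropped once it is INACTIVE and archived):
alter database add logfile group 5
  ('/u05/redo1/LOG05A.dbf', '/u05/redo4/LOG05B.dbf') size 8g;
alter system switch logfile;
alter system checkpoint;
alter database drop logfile group 1;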
28
Final Configuration Details (contd.)
Session
The log buffer was set to 72M (after several iterations).
A 72M log buffer along with 8G log file members resulted in a higher response time for the event log file switch completion.
However we completely eliminated the wait event –log buffer space (even when simulating 1.5X load).
72M appears to be an ideal size for a redo generation rate up to 14MB/sec.
The _log_io_size is capped at a maximum of 1M irrespective of the log_buffer size once the log_buffer crosses ~6MB. Also, since we had a storage subsystem quite capable of handling up to 32M in a single write within acceptable response time, we did not downsize _log_io_size.
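log_buffer is a static parameter, so applying the 72M value requires an spfile change and an instance restart; a minimal sketch (72M expressed in bytes):
alter system set log_buffer = 75497472 scope = spfile;
shutdown immediate
startup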
29
Final Configuration Details (contd.)
Session (contd.)
Improving LGWR write performance, however, resulted in redo allocation latch contention.
To reduce the redo allocation latch contention, we increased the parallelism for the shared redo buffer from the default of 2 to 12.
_log_parallelism_max = 12 # Default is 2. Max - Limited to CPU count
_log_parallelism = 4 # Default is 1
By enabling _log_parallelism, the shared log buffer is split into _log_parallelism_max sections, each assigned its own redo allocation latch.
As per Oracle documentation, the redo allocation latch for the shared log buffer is randomly assigned to the requestor and allocation then proceeds round-robin. We noticed that this was not an optimal assignment scheme.
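For completeness, a sketch of how the values above would be applied; underscore parameters are unsupported unless directed by Oracle Support, and a restart is required:
alter system set "_log_parallelism_max" = 12 scope = spfile;
alter system set "_log_parallelism" = 4 scope = spfile;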
30
Final Configuration Details (contd.)
ARCH
The ARCH process reads OS sized blocks as set by
the _log_archive_buffer_size parameter.
The default and maximum value on Solaris with Oracle
10g is 2048 OS blocks (equates to 1MB reads).
So the archive logs filesystem was also created as a
stripe filesystem with 1MB stwidth.
Performance improved as the redo logs filesystems
and the archive filesystems were both stripe
filesystems with 1MB stripe width. Average ARCH
throughput was around 150MB/sec.
However we did notice that the ARCH process reads
from the primary group member only. It does not read
simultaneously from both the members.
We did not change the log_archive_max_processes
from default (2).
31
Final Results
Peak work load showed an improvement of 7x.
The least improvement was 4x.
At 1.5x load, scalability was near linear.
32
The results – Work Load Profile (redo)
Stat                              Before               After
Session commit wait requested     72/sec               520/sec
Session commit wait performed     72/sec               520/sec
Session redo entries generated    13326 entries/sec    14677 entries/sec
Session redo size                 9.45 MB/sec          10.1 MB/sec
LGWR redo synch writes            72 sync writes/sec   520 sync writes/sec
LGWR redo writes                  74 writes/sec        845 writes/sec
LGWR throughput                   9.8 MB/sec           10.75 MB/sec
33
The results – Waits
Event                                      Before: waits/sec (avg wait)   After: waits/sec (avg wait)   Threshold (avg wait)
latch: redo allocation                     Negligible (negligible)        0.002 waits/sec (0.9 ms)      Eliminate
latch: redo writing                        Negligible (negligible)        0 waits/sec (0 ms)            Eliminate
log buffer space                           3.5 waits/sec (28.6 ms)        0 waits/sec (0 ms)            Eliminate
log file parallel write                    74 waits/sec (11.8 ms)         845 waits/sec (0.55 ms)       < 2 ms
log file sequential read                   10 waits/sec (7.02 ms)         10.5 waits/sec (16.62 ms)     ?
log file switch completion                 0.5 waits/sec (20.82 ms)       0.02 waits/sec (31.5 ms)      ?
log file sync                              72 waits/sec (26.45 ms)        519 waits/sec (2.13 ms)       < 5 ms
log file switch (checkpoint incomplete)    0.5 waits/sec (0.38 ms)        0 waits/sec (0 ms)            Eliminate
34
The results – Stats
35
The results – System
Area      Parameter                                          Before                          After                             Threshold
CPU       Average run queue                                  1.2                             1.2                               No change
CPU       Average involuntary context switches for LGWR      30%                             < 0.1%                            Eliminate
Storage   Average redo LUN response time                     8 ms (reads), 12 ms (writes)    16 ms (reads), < 1 ms (writes)    10 ms (reads), 2 ms (writes)
Storage   Average file response time (redo logs)             8 ms (reads), 12 ms (writes)    16 ms (reads), < 1 ms (writes)    10 ms (reads), 2 ms (writes)
Storage   Average redo volume response time                  8 ms (reads), 12 ms (writes)    16 ms (reads), < 1 ms (writes)    10 ms (reads), 2 ms (writes)
36
Final Thoughts
In descending order of impact on performance:
1. IO Subsystem (50%)
2. Redo Groups layout and sizing of log file members (20%)
3. CPU Scheduling (15%)
4. Log Buffer (10%)
5. Log Parallelism (5%)
The LGWR process in 10g is remarkably efficient and requires minimal tuning; however, it would have been ideal if there were a dedicated LGWR for each shared strand.
One can only imagine the performance gain with multiple LGWR processes, each servicing distinct log buffers.