Redo Configuration in a Real World Environment
Redo log configuration for a high-transaction environment.
1
Redo Configuration for Sustained and Scalable Performance in a
Real World Environment
Krishna Manoharan
2
Background
During routine capacity planning of an Operational Data Store, it was noticed that large extracts were not scaling as predicted.
Among other things, the analysis showed elevated log-related waits.
A study was performed to reduce or eliminate these log-related waits.
3
The Environment
Oracle 10.2.0.3 Enterprise Edition 64bit on Solaris 9
Database Size – 4 TB
SunFire E4800 with 12 CPUs, 48GB RAM and 2 IO boards.
Work Load – Numerous small transactions from replication plus large batch processes through ETL. All operations in logging mode.
Commit frequency ranges from every row change (replication) to once per million+ row changes (ETL).
4
Work Load Profile (redo) of the instance (Peak Hours)
Stat                              Value
Session commit wait requested     72/sec
Session commit wait performed     72/sec
Session redo entries generated    13326 entries/sec
Session redo size                 9.45 MB/sec
LGWR redo synch writes            72 sync writes/sec
LGWR redo writes                  74 writes/sec
LGWR throughput                   9.8 MB/sec
5
Objectives of the study
Deliver predicted scalability by
Reducing/eliminating log-related waits.
Improving log-related statistics.
Eliminating LGWR/ARCH as a bottleneck.
Performance as measured by
Improved transaction rates.
Meeting pre-defined thresholds for waits and stats.
Build a standard for an optimal and scalable redo/archive configuration.
This is a performance tuning exercise, not capacity planning.
Limited to configuration changes only; no code changes.
6
Symptoms from Oracle
Top log related wait statistics (Peak Hours)
Event                                      Existing waits/sec   Average wait   Threshold (average wait)
latch: redo allocation                     Negligible           Negligible     Eliminate
latch: redo writing                        Negligible           Negligible     Eliminate
log buffer space                           3.5 waits/sec        28.6 ms        Eliminate
log file parallel write                    74 waits/sec         11.8 ms        < 2 ms
log file sequential read                   10 waits/sec         7.02 ms        < 10 ms
log file switch completion                 0.5 waits/sec        20.82 ms       ?
log file sync                              72 waits/sec         26.45 ms       < 5 ms
log file switch (checkpoint incomplete)    0.5 waits/sec        0.38 ms        Eliminate
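A rough equivalent of this table can be pulled directly from v$system_event (a sketch, not necessarily the method used for this study; the view is cumulative since instance startup, so per-second rates require differencing two snapshots, and 10g reports average_wait in centiseconds):
select event,
       total_waits,
       time_waited * 10  as time_waited_ms,
       average_wait * 10 as avg_wait_ms
from   v$system_event
where  event like 'log%'
   or  event like 'latch: redo%'
order  by time_waited desc;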
7
Symptoms from Oracle – contd.
Top log related instance statistics (During peak hours)
Statistic                                  Existing                                    Threshold
LGWR redo sync response time               0.27 ms/sync write                          < 0.1 ms
LGWR redo write response time              0.12 ms/write                               < 0.1 ms/write
Session redo buffer allocation retries     0.0002 retries/entry, 3.8 retries/sec       Eliminate
Session redo log space requests            0.000081/entry, 1 request/sec               Eliminate
Session redo log space wait time           0.12 ms/wait for space                      < 0.1 ms/wait for space
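The underlying counters live in v$sysstat; a minimal sketch of the kind of query behind these numbers (the view is cumulative, so the per-second and per-entry figures above are derived by differencing two snapshots over the sample interval):
select name, value
from   v$sysstat
where  name in ('redo entries', 'redo size', 'redo writes', 'redo synch writes',
                'redo synch time', 'redo write time',
                'redo buffer allocation retries',
                'redo log space requests', 'redo log space wait time');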
8
Symptoms from the System
Top system related statistics (During peak hours)
Area      Parameter                                          Before                          Threshold
CPU       Average run queue                                  1.2                             No change
CPU       Average involuntary context switches for LGWR      30%                             Eliminate
Storage   Average redo LUN response time                     8 ms (reads), 12 ms (writes)    10 ms (reads), 2 ms (writes)
Storage   Average file response time (redo logs)             8 ms (reads), 12 ms (writes)    10 ms (reads), 2 ms (writes)
Storage   Average redo volume response time                  8 ms (reads), 12 ms (writes)    10 ms (reads), 2 ms (writes)
9
Existing Log Configuration (Instance)
No underscore (_) parameters set.
log_buffer – default (observed as 4MB)
Redo log groups – 3
Size of members – 500M
log_archive_max_processes – default (observed as 2)
Using Veritas (VRTS) ODM and VxFS
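A sketch of how this configuration can be confirmed from the instance (standard dictionary views and SQL*Plus commands, nothing specific to this study):
select l.group#, l.bytes/1024/1024 as size_mb, l.members, l.status, f.member
from   v$log l, v$logfile f
where  f.group# = l.group#
order  by l.group#, f.member;

show parameter log_buffer
show parameter log_archive_max_processes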
10
Existing System/Storage Configuration
Default scheduling class – TS and Default priorities for Oracle.
Thread Affinity set to 150.
Storage Foundation 4.1 MP2 for Oracle Enterprise Edition.
maxphys set to 8M (system and VxVM)
LUN queue depth – 30, with a max of 256 per target.
All LUNs – RAID 1 using 72GB, 15K RPM FC drives. Storage – Hitachi AMS1000
Dual Simultaneous Active (2Gbit) Paths to each lun. Load Balancing via vxdmp.
11
Existing Physical Log Configuration (Filesystem)
[Diagram] Two 72GB, 15K RPM FC drives in a RAID 1 pair present a single 66.4GB LUN, which hosts both redo filesystems:
Filesystem 1 (vxfs) /u05/redo1
  LOG01A.dbf - 500M (Primary of Group 1)
  LOG02A.dbf - 500M (Primary of Group 2)
  LOG03A.dbf - 500M (Primary of Group 3)
Filesystem 2 (vxfs) /u05/redo2
  LOG01B.dbf - 500M (Mirror of Group 1)
  LOG02B.dbf - 500M (Mirror of Group 2)
  LOG03B.dbf - 500M (Mirror of Group 3)
12
Log Waits - Schematic
[Schematic] User sessions generate redo into the shared log buffer and private redo strands in the SGA (wait: log buffer space). LGWR writes the buffer to the online redo log groups (waits: log file parallel write, log file sync, log file switch completion), while ARCH reads completed groups (wait: log file sequential read) and writes the archive log files (wait: log archive IO). Both LGWR and ARCH depend on the CPU and IO subsystems.
13
Analysis of the symptoms
LGWR related
Wait - log file parallel write - “Writing redo records to the redo
log files from the log buffer. The wait time is the time it takes for
the I/Os to complete.”
High average wait time (11.8 ms).
Correlating Stats
High LUN response time for the redo log filesystem.
High redo sync time (0.27 ms/sync write).
High redo write time (0.12 ms/write).
LGWR redo sync writes (72 writes/sec).
Higher buffer allocation retries (3.8 retries/sec).
High degree of involuntary context switches for the LGWR process.
14
Analysis of the symptoms (contd.)
Wait - log file parallel write (contd.)
With a high rate of commits, it is more important to review the average response time per wait than the number of waits. Every commit write wait results in an increment to the log file parallel write event.
The physical layout of the redo log filesystems shows a single LUN used for all the groups. Since redo log members are relatively small, it is common practice for the Storage/System Admin to assign a single LUN which is then used for all the redo filesystems. This invariably leads to poor IO performance and a slower LGWR.
A slower LGWR also results in poor commit performance (sync writes), as evidenced by the correlating stats.
A slower LGWR results in higher buffer allocation retries because LGWR is unable to write the redo entries and flush the buffer to disk fast enough to meet the sessions' requirements.
The overall run queue on the system was low; however, involuntary context switching (~30%) indicated that LGWR was being switched off the CPU before it could complete its task.
The high LUN response time for the redo log filesystems indicated that IO was a bottleneck.
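A sketch of the kind of OS-level check behind the context-switch observation, using Solaris microstate accounting via prstat (the ora_lgwr process name pattern and SID placeholder are illustrative):
# ps -ef | grep ora_lgwr_<SID>      # find the LGWR pid
# prstat -mL -p <LGWR pid> 10 6     # the ICX column reports involuntary context switches per interval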
15
Analysis of the symptoms (contd.)
LGWR related
Wait - log file switch completion – “Waiting for a log switch to
complete”
Wait - log file switch (checkpoint incomplete) – “Waiting for a
log switch because the session cannot wrap into the next log.
Wrapping cannot be performed because the checkpoint for that
log has not completed”
Large number of waits (0.5 waits/sec) with high average
wait time (20.82 ms).
Correlating Stats
redo log space requests – 1 request/sec
redo log space wait time – 0.12ms/entry
16
Analysis of the symptoms (contd.)
Wait - log file switch completion (contd.)
During a log file switch, redo generation is disabled. So this wait
directly impacts session performance.
The log members were only 500M in size and thus causing frequent
log switches (every 1 minute). This will result in higher waits.
The log_buffer is 4M in size and during a log switch, the log buffer is
flushed to disk. If there is an IO bottleneck to the redo log files, then
flushing 4M of log buffer could result in higher response times.
Since the redo log groups were on the same set of filesystems, there could be an IO conflict between the checkpoint and LGWR processes during a log switch, as shown by the wait log file switch (checkpoint incomplete).
However a bigger log file can also cause slower log file switches.
The impact of increasing the log member size needs to be studied
with respect to the event – log file switch completion.
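The switch rate cited above (roughly one per minute) can be confirmed from v$log_history; a minimal sketch, counting switches per hour over the last day:
select to_char(first_time, 'YYYY-MM-DD HH24') as hour,
       count(*)                               as log_switches
from   v$log_history
where  first_time > sysdate - 1
group  by to_char(first_time, 'YYYY-MM-DD HH24')
order  by 1;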
17
Analysis of the symptoms (contd.)
ARCH related
Wait - log file sequential read – “Waiting for the
read from this logfile to return. This is used to read
redo records from the log file – either for recovery or
archival. The wait time is the time it takes to complete
the physical I/O (read)”
High number of waits (10 waits/sec) with
high average wait time (7 ms).
Correlating Stats
High Lun response time for the redo log
filesystem.
Event – log file parallel write (high average
wait time – 11.8 ms)
18
Analysis of the symptoms (contd.)
Wait - log file sequential read (contd.)
Small redo log members cause frequent log switches (1/minute). These logs need to be archived, which indirectly impacts the event log file sequential read.
Members of the redo groups were located on the same filesystems and share the same physical LUNs.
This results in IO contention, because the ARCH process is reading from the previous group while LGWR is writing to the current group.
This in turn impacts LGWR write performance, resulting in increased response times for the events log file parallel write and log file sync.
Poor archival performance can also indirectly impact log switches, as reported in the event log file switch (archival incomplete), and thus session performance.
For 500M log members, the average response time is on the higher side, again indicating IO contention.
Since redo log members are relatively small, it is common practice for the Storage/System Admin to assign a single LUN which is then used for all the redo filesystems.
Because the access pattern is sequential, this problem is multiplied in effect, especially if the LUN is RAID 5.
Increasing log file sizes can also cause this event to report higher wait times.
19
Analysis of the symptoms (contd.)
Session related
Wait - log buffer space – “Waiting for space in
the log buffer because the session is writing data
into the log buffer faster than LGWR can write it
out”
High number of log buffer space waits (3.5
waits/sec) with an average response time of
28.6 ms.
Correlating Stats
Event – log file parallel write (high average
wait time)
redo buffer allocation retries (3.8
retries/sec).
20
Analysis of the symptoms (contd.)
Wait - log buffer space (contd.)
This, along with the high response time for the log file parallel write wait, shows a slow LGWR.
The presence of higher redo buffer allocation retries also corroborates this wait.
It also can mean that the default log buffer
(4MB) is too small for the rate of redo
generation (9.45 MB/sec).
During a log switch, LGWR flushes the
log_buffer to disk. So the impact of increasing
the size of the log_buffer needs to be analyzed
with respect to the event – log file switch
completion.
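A sketch of the retries-per-entry check referenced above; as a rule of thumb the ratio should stay close to zero (the values are cumulative, so snapshot deltas are preferable for a peak-hour view):
select r.value                     as buffer_allocation_retries,
       e.value                     as redo_entries,
       round(r.value / e.value, 6) as retries_per_entry
from   (select value from v$sysstat where name = 'redo buffer allocation retries') r,
       (select value from v$sysstat where name = 'redo entries') e;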
21
Analysis of the symptoms (contd.)
Session related
Wait - log file sync – “When a user session commits,
the session's redo information needs to be flushed to
the redo logfile. The user session will post the LGWR
to write the log buffer to the redo log file. When the
LGWR has finished writing, it will post the user
session. The wait time includes the writing of the log
buffer and the post.”
The average wait time was 26.45 ms.
Correlating Stats
Event – log file parallel write (high average
wait time – 11.8 ms)
High degree of involuntary context switches
for both user session and LGWR.
22
Analysis of the symptoms (contd.)
Wait - log file sync (contd.)
Every commit write wait/immediate will result in an
increment of the wait counter and a redo write (resulting in
an increment to the log file parallel write wait counter).
Rather than the number of waits, the average wait time is
important for this wait event.
Under ideal circumstances, the average wait time for a log
file sync event should be about the same as the
average wait time for the wait – log file parallel write. If
there is a difference, then it probably indicates a CPU
bottleneck for the session.
Higher wait times can be a result of slow LGWR as well
as CPU bottleneck (evidenced by high involuntary context
switches for session processes)
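A sketch of the comparison described above; with cumulative averages from v$system_event, a log file sync average that is much larger than the log file parallel write average points at time spent off the CPU (or in posting) rather than in the redo write itself:
select event, total_waits, average_wait * 10 as avg_wait_ms
from   v$system_event
where  event in ('log file sync', 'log file parallel write');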
23
Initial Conclusions
From the waits and stats, we came to the following conclusions:
LGWR
The underlying IO subsystem for the
redo logs needed to be improved.
The redo log members needed to be
resized from 500M to a suitable size.
Also increase the groups from 3 to 4.
Reduce LGWR involuntary switches by
addressing OS scheduling issues.
24
Initial Conclusions (contd.)
ARCH
Separate the redo log groups onto
dedicated filesystems to prevent
contention between ARCH and LGWR.
Session
Increase log buffer from the default to a
suitable value taking into consideration
impact on the event log file switch
completion.
25
Final Configuration Details
After 30 or so runs, we finally arrived at the optimal configuration below.
Redo Filesystem configuration (to address IO issues)
Striped filesystems on dedicated Raid 1 luns configured for the redo
logs as shown in the next slide.
Filesystem is vxfs with 8k block size.
Stripe Width = 1M
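For illustration only, a sketch of how such a striped volume and an 8k-block vxfs filesystem might be built under Storage Foundation (the disk group, volume name, size and two-column stripe are assumptions, not the exact commands from this exercise; options vary by version):
# vxassist -g redodg make redovol1 10g layout=stripe ncol=2 stripeunit=1m
# mkfs -F vxfs -o bsize=8192 /dev/vx/rdsk/redodg/redovol1
# mount -F vxfs /dev/vx/dsk/redodg/redovol1 /u05/redo1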
LGWR configuration (to address involuntary context
switches)
The FX scheduling class was set for the LGWR process. The CPU time
quantum was increased to 1000 and the priority set to 59.
# priocntl -s -c FX -m 59 -p 59 -t 1000 -i pid <LGWR process>
Thread affinity was set to 150 for the entire system; however, we decided it was best to bind LGWR to a specific CPU by creating a processor set and binding the LGWR process to it.
# psrset -c <CPU>
# psrset -b 1 <LGWR process>
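The result can be verified with standard tools (a sketch; expect class FX, priority 59, and the expected processor set in the output):
# ps -o pid,class,pri,args -p <LGWR pid>
# psrset -q <LGWR pid>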
26
New Filesystem Layout and redo group placement
[Diagram] RAID 1 pairs of 72GB, 15K RPM FC drives, each presenting a single 66.4GB LUN, are dedicated to the redo filesystems. Four striped vxfs filesystems (layout=stripe, stripe width = 1M) host the redo members:
Filesystem 1 (vxfs, stripe, 1M stripe width) /u05/redo1
  LOG01A.dbf - 1500M (Primary of Group 1)
  LOG03A.dbf - 1500M (Primary of Group 3)
Filesystem 2 (vxfs, stripe, 1M stripe width) /u05/redo2
  LOG02B.dbf - 1500M (Mirror of Group 2)
  LOG04B.dbf - 1500M (Mirror of Group 4)
Filesystem 3 (vxfs, stripe, 1M stripe width) /u05/redo3
  LOG02A.dbf - 1500M (Primary of Group 2)
  LOG04A.dbf - 1500M (Primary of Group 4)
Filesystem 4 (vxfs, stripe, 1M stripe width) /u05/redo4
  LOG01B.dbf - 1500M (Mirror of Group 1)
  LOG03B.dbf - 1500M (Mirror of Group 3)
27
Final Configuration Details (contd.)
Redo groups
4 redo groups configured with 2 members each.
The log members were placed on the redo filesystems in such a manner as to eliminate LGWR and ARCH IO contention.
Each member was 8G in size (8G log members would reduce the log switches from 1 switch per minute to 1 switch every 7 minutes).
Reducing log switches improves performance as during a log switch, redo generation is disabled.
8G was an ideal size – log archiving completed within 2 minutes whereas log switches happened every 7 minutes.
Increasing the log member size resulted in higher wait times for the events – log file switch completion and log file sequential read.
However the overall performance gain was well worth it.
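One way the new groups could have been put in place online (a sketch using the member paths from the layout above and the 8G size quoted here; the group number is illustrative, and an old group can only be dropped once it is INACTIVE and archived):
alter database add logfile group 5
  ('/u05/redo1/LOG05A.dbf', '/u05/redo4/LOG05B.dbf') size 8g;
alter system switch logfile;
alter system checkpoint;
alter database drop logfile group 1;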
28
Final Configuration Details (contd.)
Session
The log buffer was set to 72M (after several iterations).
A 72M log buffer along with 8G log file members resulted in a higher response time for the event log file switch completion.
However we completely eliminated the wait event –log buffer space (even when simulating 1.5X load).
72M appears to be an ideal size for a redo generation rate up to 14MB/sec.
The _log_io_size is capped at a maximum of 1M irrespective of the log_buffer size once the log_buffer crosses ~6MB. Also, since we had a storage subsystem quite capable of handling up to 32M in a single write within acceptable response time, we did not downsize _log_io_size.
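log_buffer is a static parameter, so applying the 72M value requires an spfile change and an instance restart; a minimal sketch (72M expressed in bytes):
alter system set log_buffer = 75497472 scope = spfile;
shutdown immediate
startup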
29
Final Configuration Details (contd.)
Session (contd.)
Improving LGWR write performance, however, resulted in redo allocation latch contention.
To reduce the redo allocation latch contention, we increased the parallelism for the shared redo buffer from the default of 2 to 12.
_log_parallelism_max = 12 # Default is 2. Max - Limited to CPU count
_log_parallelism = 4 # Default is 1
By enabling _log_parallelism, the shared log buffer is split into _log_parallelism_max sections, each assigned its own redo allocation latch.
As per Oracle documentation, the redo allocation latch for the shared log buffer is randomly assigned to the requestor and allocation then proceeds round-robin. We noticed that this was not an optimal assignment scheme.
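For completeness, a sketch of how the values above would be applied; underscore parameters are unsupported unless directed by Oracle Support, and a restart is required:
alter system set "_log_parallelism_max" = 12 scope = spfile;
alter system set "_log_parallelism" = 4 scope = spfile;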
30
Final Configuration Details (contd.)
ARCH
The ARCH process reads OS sized blocks as set by
the _log_archive_buffer_size parameter.
The default and maximum value on Solaris with Oracle
10g is 2048 OS blocks (equates to 1MB reads).
So the archive logs filesystem was also created as a
stripe filesystem with 1MB stwidth.
Performance improved as the redo logs filesystems
and the archive filesystems were both stripe
filesystems with 1MB stripe width. Average ARCH
throughput was around 150MB/sec.
However we did notice that the ARCH process reads
from the primary group member only. It does not read
simultaneously from both the members.
We did not change the log_archive_max_processes
from default (2).
31
Final Results
Peak work load showed an improvement of 7x.
The least improvement was 4x.
At 1.5x load, scalability was near linear.
32
The results – Work Load Profile (redo)
Stat                              Before               After
Session commit wait requested     72/sec               520/sec
Session commit wait performed     72/sec               520/sec
Session redo entries generated    13326 entries/sec    14677 entries/sec
Session redo size                 9.45 MB/sec          10.1 MB/sec
LGWR redo synch writes            72 sync writes/sec   520 sync writes/sec
LGWR redo writes                  74 writes/sec        845 writes/sec
LGWR throughput                   9.8 MB/sec           10.75 MB/sec
33
The results – Waits
Event                                      Before: waits/sec (avg wait)   After: waits/sec (avg wait)   Threshold (avg wait)
latch: redo allocation                     Negligible (negligible)        0.002 waits/sec (0.9 ms)      Eliminate
latch: redo writing                        Negligible (negligible)        0 waits/sec (0 ms)            Eliminate
log buffer space                           3.5 waits/sec (28.6 ms)        0 waits/sec (0 ms)            Eliminate
log file parallel write                    74 waits/sec (11.8 ms)         845 waits/sec (0.55 ms)       < 2 ms
log file sequential read                   10 waits/sec (7.02 ms)         10.5 waits/sec (16.62 ms)     ?
log file switch completion                 0.5 waits/sec (20.82 ms)       0.02 waits/sec (31.5 ms)      ?
log file sync                              72 waits/sec (26.45 ms)        519 waits/sec (2.13 ms)       < 5 ms
log file switch (checkpoint incomplete)    0.5 waits/sec (0.38 ms)        0 waits/sec (0 ms)            Eliminate
34
The results – Stats
35
The results – System
Area      Parameter                                          Before                          After                             Threshold
CPU       Average run queue                                  1.2                             1.2                               No change
CPU       Average involuntary context switches for LGWR      30%                             < 0.1%                            Eliminate
Storage   Average redo LUN response time                     8 ms (reads), 12 ms (writes)    16 ms (reads), < 1 ms (writes)    10 ms (reads), 2 ms (writes)
Storage   Average file response time (redo logs)             8 ms (reads), 12 ms (writes)    16 ms (reads), < 1 ms (writes)    10 ms (reads), 2 ms (writes)
Storage   Average redo volume response time                  8 ms (reads), 12 ms (writes)    16 ms (reads), < 1 ms (writes)    10 ms (reads), 2 ms (writes)
36
Final Thoughts
In descending order of impact on performance:
1. IO Subsystem (50%)
2. Redo Groups layout and sizing of log file members (20%)
3. CPU Scheduling (15%)
4. Log Buffer (10%)
5. Log Parallelism (5%)
The LGWR process in 10g is remarkably efficient and requires minimal tuning; however, it would have been ideal if there were a dedicated LGWR for each shared strand.
One can only imagine the performance gain with multiple LGWR processes, each servicing distinct log buffers.