FORMING RANGE-BASED BREAK GROUPS WITH ADVANCED SQL

Author: Brendan Furey
Creation Date: 12 June 2011
Version: 1.4
Last Updated: 25 September 2012



Table of Contents

Introduction
Hardware/Software Summary
Problem Definitions and Examples
    Problem Definitions
        Problem 1: Contiguous Ranges
        Problem 2: Overlapping Ranges
        Problem 3: Bursts of Activity
    Functional Test Data
        Activity_nov, Activity Table
        Indexes
        Test Cases
        Test Data
    Test Data Grouping Diagram
    Performance Testing Strategy
SQL Change for Single Break Group Problems
Problem 1: Contiguous Ranges
    Analytics Solution
        How It Works
        Query Diagram
        SQL
        Inline View Diagram
        Solution Stage Table
    Model Solution
        How It Works
        Query Diagram
        SQL
    Recursive Subquery Factor Solution
        How It Works
        Query Diagram
        SQL
    Performance Analysis
        Test Data Sets
        Output Record Counts
        CPU Times
        Slice Graphs
        Explain Plans (Data Point W256-D1)
        Discussion of Results
Problem 2: Overlapping Ranges
    Analytics Solution
        How It Works
        Query Diagram
        SQL
        Inline View Diagram
        Solution Stage Table
    Model Solution
        How It Works
        Query Diagram
        SQL
    Recursive Subquery Factor Solution
        How It Works
        Query Diagram
        SQL
    Performance Analysis
        Test Data Sets
        Output Record Counts
        CPU Times
        Slice Graphs
        Explain Plans (Data Point W64-D1)
        Discussion of Results
Problem 3: Bursts of Activity
    Analytics Solution (None)
    Model Solution
        How It Works
        Query Diagram
        SQL
    Recursive Subquery Factoring Solution
        How It Works
        Query Diagram
        SQL
    Performance Analysis
        Test Data Sets
        Output Row Counts
        CPU Times
        Slice Graphs
        Explain Plans (Data Point W128-D1)
        Discussion of Results
Analytics Anomaly Analysis
    Analytic Query Variations
        Problem 1: Contiguous Ranges
        Problem 2: Overlapping Ranges
    Performance Analysis
        Problem 1: Contiguous Ranges
        Problem 2: Overlapping Ranges
        CPU Times
Conclusions
References

Change Record

Date         Author  Version  Change Reference
12-Jun-2011  BPF     1.0      Initial covering 2 problems, analytic solutions only, no performance analysis
14-Jun-2011  BPF     1.1      Added test case 5, and tabulated intermediate solutions
19-Jul-2011  BPF     1.2      Restructured, adding third problem, Model and RSF solutions, and performance analysis
02-Aug-2011  BPF     1.3      Analytics anomaly analysis
25-Sep-2012  BPF     1.4      References now hyperlinks


Introduction

Records in a database often include range fields, such as a start and end time for some activity, and it is sometimes desired to group the records by range. There are several possible ways of grouping by range: In one case the records do not overlap, but additional breaking fields may be present; in a second case, records may overlap, but additional breaking fields do not then make sense; in the third case considered ('bursts of activity'), only a single start field is used and break groups consist of all the records whose range start is within a given distance from the starting point. For each problem, we consider two variations that affect the choice of SQL: In the first, we are looking for all break groups, while in the second we want to retrieve only a single one.

This article provides solutions for these problems, using three SQL techniques, namely: Analytic Functions, Model Clause, and Recursive Subquery Factoring. Diagrams are used extensively to depict query structures and help explain the solutions.

Performance analyses are included that compare performance of the three methods (only two for the third problem) on each problem across a two-dimensional domain of size and depth. The analyses follow an approach described in an earlier article (SQL Pivot and Prune Queries – Keeping an Eye on Performance). The results show that the best method depends on the depth of the groups, with Analytic Functions being best for deep groups and Recursive Subquery Factoring best for shallow groups where only a single group is required. The Model Clause performs best where an Analytic Functions solution is not available (the ‘bursts of activity’ problem) and either all groups are required or a single deep group is required. The Model Clause also gives very stable performance across depth range, and is surprisingly simple in structure. The article may be of interest to developers who have yet to learn about some of these techniques.

An important performance glitch was discovered in using the analytic function First_Value with the Ignore Nulls option, and methods for avoiding it are presented.

This document replaces a preliminary version (‘Forming Range-Based Break Groups with SQL Analytic Functions’), which covered only two problems, with analytic solutions only and no performance analysis.

Hardware/Software Summary

Component         Description
Database          Oracle Database 11g Express Edition Release 11.2.0.2.0 - Beta
Diagrammer        Microsoft Visio 2003 (11.3216.5606)
Operating System  Microsoft Windows 7 Home Premium (32 bit)
Computer          Samsung X120, 3GB memory, Intel U4100 @ 1.3GHz x 2


Problem Definitions and Examples

Problem Definitions

In this section, we define the problems generically. Consider the fields in a record set to divide into the following categories:

key - partition by fields

range start, range end - range fields (range end is just viewed as another attribute in problem 3)

break - break fields (where allowed)

other - any other fields

For each problem, we consider two variations that affect the choice of SQL: In the first, we are looking for all break groups, while in the second we want to retrieve only a single one enclosing (or, starting from, for the third problem) a particular value.

Problem 1: Contiguous Ranges

The first problem is to obtain for each record a group start, group end pair that are the range start and range end values for the records that respectively start and end the break group of the current record. The records are to be ordered by range start within the partitioning key, and a new break group starts when, between successive records, either there is a gap between range end and range start fields, or any of the break fields change value. No overlaps are allowed in the ranges within a key.

Problem 2: Overlapping Ranges

The second problem is the same as the first but with no break fields and overlapping is allowed. In other words, groups consist of all records that overlap, counting contiguity as overlapping.
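The overlap rule can also be stated procedurally. The following Python sketch is not one of the article's SQL solutions; it is a small reference implementation of the definition above, written under the assumptions of a single person, non-null end dates, and a hypothetical function name `overlap_groups`. It sorts by range start and extends the running group whenever the next range starts on or before the running group end, so that contiguity counts as overlap.

```python
from datetime import date

def overlap_groups(ranges):
    """Return (start, end, group_start, group_end) for each (start, end) range.

    Ranges that overlap or touch (start <= running group end) share a group.
    Assumes one partition key and non-null end dates.
    """
    ordered = sorted(ranges)
    groups = []        # one [group_start, group_end] pair per break group
    membership = []    # index into groups for each ordered range
    for start, end in ordered:
        if groups and start <= groups[-1][1]:
            # Overlaps, or is contiguous with, the running group: extend it.
            groups[-1][1] = max(groups[-1][1], end)
        else:
            groups.append([start, end])
        membership.append(len(groups) - 1)
    return [(s, e, *groups[g]) for (s, e), g in zip(ordered, membership)]

# Test case T5 from the functional test data (all dates in June 2011).
t5 = [(date(2011, 6, a), date(2011, 6, b))
      for a, b in [(1, 3), (2, 5), (4, 7), (8, 16), (9, 14), (15, 30)]]
result = overlap_groups(t5)
for row in result:
    print(*row)
```

For T5 the last three records fall into one group because the record starting on 15 June begins before the record starting on 8 June has ended, even though it does not overlap its immediate predecessor.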

Problem 3: Bursts of Activity

The third problem is to determine the break groups using distance from the group start point, with overlapping allowed (since the range end is here just another attribute). In other words, once a group starts, all records that start within a fixed distance from the group start are in the group, and the first record after the end of a group defines the next group start.
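As a procedural statement of the burst rule, the Python sketch below walks the start dates in order and opens a new group whenever a record starts more than the limit after the current group start. It is an illustration only, not one of the article's SQL solutions: the function name `burst_starts` is hypothetical, a single person is assumed, and the boundary is taken as inclusive (a record starting exactly at the limit stays in the group), which is how the functional test data with a 3-day limit behave.

```python
from datetime import date, timedelta

def burst_starts(start_dates, limit_days=3):
    """Return (start_date, burst_start) pairs in date order.

    A record joins the current burst if it starts within limit_days of the
    burst start (inclusive); otherwise it starts a new burst.
    """
    limit = timedelta(days=limit_days)
    result = []
    burst_start = None
    for start in sorted(start_dates):
        if burst_start is None or start - burst_start > limit:
            burst_start = start          # first record after the burst ends
        result.append((start, burst_start))
    return result

# Start dates for test case T5 (June 2011), burst size limit of 3 days.
t5_starts = [date(2011, 6, d) for d in (1, 2, 4, 8, 9, 15)]
bursts = burst_starts(t5_starts)
for start, burst in bursts:
    print(start, burst)
```

Note that the group boundary depends only on the group start, not on the previous record, which is why this problem does not reduce to a simple comparison between adjacent rows.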

Functional Test Data

The problem data structure is based on a question posed in Tom Kyte’s Oracle forum, see Activities and breaks, while the test data are my own. We will use it for all three problems, but the first problem will use a separate table of the same structure but with indexes different from those for the others.

Activity_nov, Activity Table

Column         Type
activity_id    Number
person_id      Number
start_date     Date
end_date       Date
activity_name  Char(10)

Indexes

Activity_nov (problem 1, indexes unique)

Index            Columns
ACTIVITY_NOV_U1  person_id, start_date
ACTIVITY_NOV_U2  person_id, end_date


Activity (problems 2 and 3, indexes non-unique)

Index        Columns
ACTIVITY_N1  person_id, start_date, Nvl(end_date, To_Date(' 3000-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
ACTIVITY_N2  person_id, Nvl(end_date, To_Date(' 3000-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')), start_date

Test Cases

There are five test cases, two for the first problem, three for the other two, which can use the same data sets, with a person for each case. The groups for the third problem are defined by a burst size limit of 3 days. Oracle standard dates have 1 second precision, but we’ll take a time component of zero in the test data for simplicity as this causes no loss of generality.

Test Case  Scenario

Test Cases T1 and T2 - Non-Overlapping with Additional Breaks
T1         3 records, gap, 2 records, gap, 1 record
T2         3 records, gap, 2 records (names differ), gap, 1 record null end date

Test Cases T3, T4, T5 - Overlapping without Additional Breaks
T3         3 records (with overlaps), gap, 2 records (second enclosed by first), gap, 1 record
T4         3 records (with overlaps), gap, 3 records, second overlaps first, with null end date
T5         3 records (with overlaps), gap, 2 records (second enclosed by first), gap but not with respect to first, 1 record

Test Data

Per Id  Act Id  Activity Name  Start Date  End Date   Group Start  Group End  Burst Date
1       1       LEAVE          01-Jun-11   02-Jun-11  01-Jun-11    07-Jun-11  01-Jun-11
        2       LEAVE          02-Jun-11   04-Jun-11  01-Jun-11    07-Jun-11  01-Jun-11
        3       LEAVE          04-Jun-11   07-Jun-11  01-Jun-11    07-Jun-11  01-Jun-11
        4       LEAVE          08-Jun-11   09-Jun-11  08-Jun-11    14-Jun-11  08-Jun-11
        5       LEAVE          09-Jun-11   14-Jun-11  08-Jun-11    14-Jun-11  08-Jun-11
        6       LEAVE          20-Jun-11   30-Jun-11  20-Jun-11    30-Jun-11  20-Jun-11
2       7       LEAVE          01-Jun-11   02-Jun-11  01-Jun-11    07-Jun-11  01-Jun-11
        8       LEAVE          02-Jun-11   04-Jun-11  01-Jun-11    07-Jun-11  01-Jun-11
        9       LEAVE          04-Jun-11   07-Jun-11  01-Jun-11    07-Jun-11  01-Jun-11
        10      LEAVE          08-Jun-11   09-Jun-11  08-Jun-11    09-Jun-11  08-Jun-11
        11      TRAINING       09-Jun-11   14-Jun-11  08-Jun-11    14-Jun-11  08-Jun-11
        12      TRAINING       20-Jun-11              20-Jun-11               20-Jun-11
3       13      LEAVE          01-Jun-11   03-Jun-11  01-Jun-11    07-Jun-11  01-Jun-11
        14      LEAVE          02-Jun-11   05-Jun-11  01-Jun-11    07-Jun-11  01-Jun-11
        15      LEAVE          04-Jun-11   07-Jun-11  01-Jun-11    07-Jun-11  01-Jun-11
        16      LEAVE          08-Jun-11   16-Jun-11  08-Jun-11    16-Jun-11  08-Jun-11
        17      TRAINING       09-Jun-11   14-Jun-11  08-Jun-11    16-Jun-11  08-Jun-11
        18      TRAINING       20-Jun-11   30-Jun-11  20-Jun-11    30-Jun-11  20-Jun-11
4       19      LEAVE          01-Jun-11   03-Jun-11  01-Jun-11    07-Jun-11  01-Jun-11
        20      LEAVE          02-Jun-11   05-Jun-11  01-Jun-11    07-Jun-11  01-Jun-11
        21      LEAVE          04-Jun-11   07-Jun-11  01-Jun-11    07-Jun-11  01-Jun-11
        22      LEAVE          08-Jun-11   16-Jun-11  08-Jun-11               08-Jun-11
        23      TRAINING       09-Jun-11              08-Jun-11               08-Jun-11
        24      TRAINING       20-Jun-11   30-Jun-11  08-Jun-11               20-Jun-11
5       25      LEAVE          01-Jun-11   03-Jun-11  01-Jun-11    07-Jun-11  01-Jun-11
        26      LEAVE          02-Jun-11   05-Jun-11  01-Jun-11    07-Jun-11  01-Jun-11
        27      LEAVE          04-Jun-11   07-Jun-11  01-Jun-11    07-Jun-11  01-Jun-11
        28      LEAVE          08-Jun-11   16-Jun-11  08-Jun-11    30-Jun-11  08-Jun-11
        29      TRAINING       09-Jun-11   14-Jun-11  08-Jun-11    30-Jun-11  08-Jun-11
        30      TRAINING       15-Jun-11   30-Jun-11  08-Jun-11    30-Jun-11  15-Jun-11

Test Data Grouping Diagram

The red and yellow boxes in the diagram show the required groupings. The numeric column headers are the days of the month of June.


Performance Testing Strategy

In SQL Pivot and Prune Queries – Keeping an Eye on Performance we applied an approach to performance testing of SQL queries whereby the queries are tested across a 2-dimensional domain, using a testing framework developed for that work. The same approach has been followed here, using the same framework (note that minor changes to the PL/SQL package and tables were made for this article, such as excluding file writing times from the recorded times). Further details can be found in the referenced article, from which the following description is extracted:

In order to provide a realistic scenario, the queries are executed within the context of a simple outbound interface that writes each record to a file as a comma-separated string. A small PL/SQL package has been written to automate the testing. The program loops over width and depth dimensions, and for each data set point makes a call to a separate package to set up the test data and have the CBO statistics gathered; it then loops over a set of queries defined in the same separate package as strings that are executed by the main package.

The execution plan is obtained in each case, using an Oracle API, and is written to the generic log. The query string includes a random number that guarantees a hard-parse and thus recalculation of the execution plans at each data set point.

For this work, width was taken to correspond to the total number of records, while depth was taken to correspond to group size. The definitions of the test data vary by problem and are described separately later.


SQL Change for Single Break Group Problems

One of the solution techniques is only applicable to the form of problem where a single break group is required, and so for consistency that is the form used for all solutions in performance testing. It is worth noting that the other two solution techniques solve this form by obtaining all groups within an inline view, then applying a restriction outside the view. This means the timings for these should be very similar to what would be obtained for finding all groups. The change required looks like this:

SELECT …
  FROM (SQL for all groups minus ORDER BY)
 WHERE To_Date (root_date, 'DD-MON-YYYY HH24:MI:SS') BETWEEN group_start AND group_end
 ORDER BY …


Problem 1: Contiguous Ranges

Analytics Solution

How It Works

The first solution for this problem uses analytic functions (see Oracle® Database SQL Language Reference 11g Release 2 (11.2)), partitioned by person and ordered by start date in a two level query structure.

1. Within an inline view, use Lag and Lead functions with CASE expressions to set group start and group end dates on the respective start and end records of the groups, leaving other values null.

2. Select all the original fields from the inline view, together with the new fields wrapped in Last_Value (for group start) and First_Value (for group end) functions with the IGNORE NULLS option.

3. The output from step 2 obtains all groups, and if necessary, can be used within another inline view to restrict the output to certain groups only (e.g. a 'current' group).

The query diagram, SQL and functional testing use the form for obtaining all groups, while the performance testing uses the form for obtaining a single group, for consistency with the third solution method.
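The two-level structure described in the steps above can be mimicked outside the database. The Python sketch below is an illustration of the idea only, not the article's query: the function name `mark_and_fill` is hypothetical, a single person is assumed, end dates are assumed non-null, and dates are represented as day-of-month integers for June 2011 to keep it short. The first loop plays the role of the inline view (Lag/Lead with CASE), and the two fill loops play the role of Last_Value and First_Value with IGNORE NULLS.

```python
def mark_and_fill(rows):
    """rows: (start, end, name) tuples for one person, with no overlaps.

    Returns (start, end, name, group_start, group_end) tuples in start order.
    """
    rows = sorted(rows)
    n = len(rows)
    # Stage 1 (inline view): set a value only on group boundary records.
    marks = []
    for i, (start, end, name) in enumerate(rows):
        prev = rows[i - 1] if i > 0 else None
        nxt = rows[i + 1] if i < n - 1 else None
        is_start = prev is None or start > prev[1] or name != prev[2]
        is_end = nxt is None or nxt[0] > end or name != nxt[2]
        marks.append((start if is_start else None, end if is_end else None))
    # Stage 2 (outer query): last non-null start looking back,
    # first non-null end looking forward.
    group_start, group_end = [None] * n, [None] * n
    last_start = None
    for i in range(n):
        if marks[i][0] is not None:
            last_start = marks[i][0]
        group_start[i] = last_start
    next_end = None
    for i in reversed(range(n)):
        if marks[i][1] is not None:
            next_end = marks[i][1]
        group_end[i] = next_end
    return [(*row, gs, ge) for row, gs, ge in zip(rows, group_start, group_end)]

# Test case T1: day numbers in June 2011, a single activity name.
t1 = [(1, 2, "LEAVE"), (2, 4, "LEAVE"), (4, 7, "LEAVE"),
      (8, 9, "LEAVE"), (9, 14, "LEAVE"), (20, 30, "LEAVE")]
solved = mark_and_fill(t1)
for row in solved:
    print(row)
```

The design point carried over from the SQL is that stage 1 needs only the adjacent rows, while stage 2 needs no knowledge of the break rule at all.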

Query Diagram

Notes

The diagram notation follows and extends notation developed earlier, including in SQL Pivot and Prune Queries – Keeping an Eye on Performance. The key can be referred to for subsequent diagrams.


SQL

SELECT /* NO_OVERLAP */ person_id, start_date, end_date, activity_name, activity_id id,
       Last_Value (group_start IGNORE NULLS) OVER (PARTITION BY person_id ORDER BY start_date) group_start,
       First_Value (group_end IGNORE NULLS) OVER (PARTITION BY person_id ORDER BY start_date
                    RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) group_end
  FROM (
SELECT person_id, start_date, end_date, activity_name, activity_id,
       CASE WHEN (start_date > Nvl (Lag (end_date) OVER (PARTITION BY person_id ORDER BY start_date), start_date-1)) OR
                 (activity_name != Lag (activity_name) OVER (PARTITION BY person_id ORDER BY start_date))
            THEN start_date END group_start,
       CASE WHEN (Nvl (Lead (start_date) OVER (PARTITION BY person_id ORDER BY start_date), end_date+1) > end_date) OR
                 (activity_name != Lead (activity_name) OVER (PARTITION BY person_id ORDER BY start_date))
            THEN end_date END group_end
  FROM activity_nov)
 ORDER BY person_id, start_date

Inline View Diagram

The diagram below shows how the inline view obtains the group start and end dates. The start points of the red arrows indicate the records that have group start dates set in the inline view (1, 4, 6 for T1; 7, 10, 11, 12 for T2); the end points indicate the records that have group end dates set (3, 5, 6 for T1; 9, 10, 11, 12 for T2). Since all other group values are null, the outer query can set the correct values by looking for the last non-null value from the past, for the group start date, and the first non-null value in the future, for the group end date.

Solution Stage Table

The table below shows how the solution proceeds in stages, through level 1.

                               Record Level           Level 1 View           Solution
Per Id  Act Id  Activity Name  Start Date  End Date   Start Date  End Date   Start Date  End Date
1       1       LEAVE          01-Jun-11   02-Jun-11  01-Jun-11              01-Jun-11   07-Jun-11
        2       LEAVE          02-Jun-11   04-Jun-11                         01-Jun-11   07-Jun-11
        3       LEAVE          04-Jun-11   07-Jun-11              07-Jun-11  01-Jun-11   07-Jun-11
        4       LEAVE          08-Jun-11   09-Jun-11  08-Jun-11              08-Jun-11   14-Jun-11
        5       LEAVE          09-Jun-11   14-Jun-11              14-Jun-11  08-Jun-11   14-Jun-11
        6       LEAVE          20-Jun-11   30-Jun-11  20-Jun-11   30-Jun-11  20-Jun-11   30-Jun-11
2       7       LEAVE          01-Jun-11   02-Jun-11  01-Jun-11              01-Jun-11   07-Jun-11
        8       LEAVE          02-Jun-11   04-Jun-11                         01-Jun-11   07-Jun-11
        9       LEAVE          04-Jun-11   07-Jun-11              07-Jun-11  01-Jun-11   07-Jun-11
        10      LEAVE          08-Jun-11   09-Jun-11  08-Jun-11   09-Jun-11  08-Jun-11   09-Jun-11
        11      TRAINING       09-Jun-11   14-Jun-11  09-Jun-11   14-Jun-11  08-Jun-11   14-Jun-11
        12      TRAINING       20-Jun-11              20-Jun-11              20-Jun-11

Model Solution

How It Works

The key to solving this problem using Oracle’s Model clause (Oracle® Database SQL Language Reference 11g Release 2 (11.2)) is to realise that the solution can be represented as simple inductions, forward for the group start dates, then backward for the group end dates. If a, s, e, S, E are the current activity, start date, end date, group start date and group end date, and (pa, ps, pe, pS, pE) and (na, ns, ne, nS, nE) are the corresponding prior and next values, then (using C-like terminology for brevity):

Initial, S = s; later, S = (a != pa or s > pe) ? s : pS

Final, E = e; earlier, E = nS > S ? e : nE

These inductions can easily be implemented as rules within the model clause:

1. Form the basic Select, with all the table columns required, and append placeholders group_start and group_end

2. Add the Model keyword, partitioning by person, dimensioning by analytic function Row_Number, ordering by start date within person, and with the remaining columns as measures

3. Initialise group start and end to start and end dates in the measures clause

4. Define the first rule to obtain the group start date for all rows after the first as the previous group start date, unless there is a gap or the activity changes, relative to the previous record, in which case take the new start date. This rule will be processed in the default ascending row order.

5. Define the second rule to obtain the group end date for all rows as the next group end date, unless the next group start date is greater than the current one, or there is no next (i.e. at the last row), in which case take the current end date. This rule must be processed in descending row order, and this is specified as it is not the default.

6. The output from the above obtains all groups, but if necessary, can be used within an inline view to restrict the output to certain groups only (e.g. a 'current' group)

The query diagram, SQL and functional testing use the form for obtaining all groups, while the performance testing uses the form for obtaining a single group, for consistency with the third solution method.
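The forward and backward inductions described above can be checked with a few lines of procedural code. The Python sketch below is a transcription of the two induction formulas, not of the Model clause itself: the function name `induct_groups` is hypothetical, a single person is assumed, end dates are assumed non-null, and dates are again day-of-month integers for June 2011.

```python
def induct_groups(rows):
    """rows: (start, end, name) tuples for one person, sorted by start.

    Returns a list of (group_start, group_end) pairs, one per row.
    """
    n = len(rows)
    S = [r[0] for r in rows]    # group start, initialised to the start date
    E = [r[1] for r in rows]    # group end, initialised to the end date
    # Forward induction (ascending): S = (a != pa or s > pe) ? s : pS
    for i in range(1, n):
        s, _, a = rows[i]
        _, pe, pa = rows[i - 1]
        S[i] = s if (a != pa or s > pe) else S[i - 1]
    # Backward induction (descending): E = nS > S ? e : nE
    for i in range(n - 2, -1, -1):
        E[i] = rows[i][1] if S[i + 1] > S[i] else E[i + 1]
    return list(zip(S, E))

# Test case T1: day numbers in June 2011, a single activity name.
t1 = [(1, 2, "LEAVE"), (2, 4, "LEAVE"), (4, 7, "LEAVE"),
      (8, 9, "LEAVE"), (9, 14, "LEAVE"), (20, 30, "LEAVE")]
groups = induct_groups(t1)
print(groups)
```

The backward pass relies on the group starts already being final, which is why the order of the two passes matters here just as the order of the two rules does in the model.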


Query Diagram

Notes

Queries with the Model clause have a structure that is rather different from other queries, and the diagrams attempt to reflect that structure for these problems. The main query feeds its output into an array processing component with a set of rules that specify, in a mostly declarative fashion, how data items (called measures), here the additional group columns, are to be calculated.

The model box above contains 4 specification types:

Partition - processing is to be performed separately by one or more columns; the same meaning as in analytic functions

Dimension - columns by which the array is dimensioned; can include analytic functions, as here

Measures - remaining columns that may be calculated or updated by the rules, possibly including placeholders from the main query

Rules - a set of rules that specify measure calculation; rules are processed sequentially, unless otherwise specified; in the diagram:

o n - the current dimension value, here row number ordered by start

o N - maximum dimension value

o f(n-1,n) - denotes that the value depends on values from previous and current rows (and so on)

o ^ - denotes that the calculation progresses in ascending order by dimension; this is the default so does not have to be coded

o v - denotes that the calculation progresses in descending order by dimension; this is not the default so does have to be coded

SQL

SELECT /* MOD_OVL */ person_id, start_date, end_date, activity_name, activity_id, group_start, group_end
  FROM activity_nov
 MODEL
    PARTITION BY (person_id)
    DIMENSION BY (Row_Number() OVER (PARTITION BY person_id ORDER BY start_date) rn)
    MEASURES (start_date, end_date, activity_name, activity_id, start_date group_start, end_date group_end)
    RULES (
       group_start[rn > 1] = CASE WHEN start_date[cv()] > end_date[cv()-1] OR
                                       activity_name[cv()] != activity_name[cv()-1]
                                  THEN start_date[cv()]
                                  ELSE group_start[cv()-1] END,
       group_end[ANY] ORDER BY rn DESC = PRESENTV (group_start[cv()+1],
                                  CASE WHEN group_start[cv()] < group_start[cv()+1] THEN group_end[cv()]
                                       ELSE group_end[cv()+1] END,
                                  end_date[cv()])
    )
 ORDER BY 1, 2

Recursive Subquery Factor Solution

How It Works

This approach is based on new Oracle SQL functionality available only from Oracle Database v11.2, called Recursive Subquery Factor (RSF) (Oracle® Database SQL Language Reference 11g Release 2 (11.2)).

1. Define a recursive subquery factor.

2. The ‘anchoring’ branch of the RSF selects records defined by the start point. A direction column is defined that here is set to ‘E’ for ‘Either’, meaning extend in either direction in the recursive branch.

3. The recursive branch extends the record set by joining records that link to extreme parent records and that ‘push the envelope’. The direction column is set to ‘F’ or ‘B’ according to the direction of extension (‘Forward’ or ‘Backward’).

4. Select all records from the RSF, applying analytic Min, Max to get the group start and end dates.

The idea here is that for cases where the group is small this will avoid expensive processing of the entire record set. We’ll demonstrate this saving in our performance analysis section. This solution only applies to the form of problem where a single group is required.
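To make the cost argument concrete, the Python sketch below grows a single group outward from the record containing a root date, touching only the records in that group. It is an iterative stand-in for the recursion, not the article's query, and rests on several assumptions: the function name `expand_from_root` is hypothetical, a single person and activity name are assumed, ranges are non-empty (end after start) with non-null ends, and two dictionaries stand in for the unique indexes on start date and end date. Dates are day-of-month integers for June 2011.

```python
def expand_from_root(rows, root):
    """rows: (start, end) tuples for one person and activity, no overlaps.

    Returns (group_rows, group_start, group_end) for the group whose
    anchor record satisfies start <= root < end.
    """
    by_start = {s: (s, e) for s, e in rows}   # stand-in for index on start date
    by_end = {e: (s, e) for s, e in rows}     # stand-in for index on end date
    # Anchor branch: the record enclosing the root date
    # (next() raises StopIteration if no record encloses it).
    anchor = next((s, e) for s, e in rows if s <= root < e)
    group = [anchor]
    # Recursive branch, forward: next record starts where this one ends.
    cur = anchor
    while cur[1] in by_start:
        cur = by_start[cur[1]]
        group.append(cur)
    # Recursive branch, backward: prior record ends where this one starts.
    cur = anchor
    while cur[0] in by_end:
        cur = by_end[cur[0]]
        group.append(cur)
    group.sort()
    return group, group[0][0], group[-1][1]

# Test case T1 with root date 9 June 2011.
t1 = [(1, 2), (2, 4), (4, 7), (8, 9), (9, 14), (20, 30)]
members, grp_start, grp_end = expand_from_root(t1, 9)
print(members, grp_start, grp_end)
```

The number of lookups is proportional to the size of the group rather than to the size of the table, which is the saving the approach aims for when groups are small.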


Query Diagram

Notes

Queries with a recursive subquery factor have a special structure, and the diagrams attempt to reflect that structure for these problems. The recursive factor is a subquery having a Union All structure in which there are two branches:

Anchor Branch - this is a normal query from which the recursion begins

Recursive Branch - this is a query that references the recursive factor itself by alias

Notice the use of subtypes in the diagram: records in the recursive branch can be split into ‘back’ and ‘front’ subtypes.

SQL

WITH rsq (person_id, start_date, end_date, activity_name, activity_id, direction) AS (
SELECT person_id, start_date, end_date, activity_name, activity_id, 'E' direction
  FROM activity_nov
 WHERE start_date <= '&TODAY'
   AND Nvl(end_date, To_Date ('&TODAY', 'DD-MON-YYYY') + 1) > '&TODAY'
 UNION ALL
SELECT act.person_id, act.start_date, act.end_date, act.activity_name, act.activity_id,
       CASE WHEN act.start_date = rsq.end_date THEN 'F' ELSE 'B' END
  FROM rsq
  JOIN activity_nov act
    ON ((act.start_date = rsq.end_date AND direction IN ('E', 'F')) OR
        (act.end_date = rsq.start_date AND direction IN ('E', 'B')))
   AND act.person_id = rsq.person_id
   AND act.activity_name = rsq.activity_name
)
SELECT /* RSQ_NON '&TODAY' */ person_id, start_date, end_date, activity_name, activity_id,
       Min (start_date) OVER (PARTITION BY person_id) grp_start,
       Max (end_date) OVER (PARTITION BY person_id) grp_end
  FROM rsq
 ORDER BY person_id, start_date

Performance Analysis

Test Data Sets

For the performance analysis it is simpler to generate test data using a single activity, with groups determined only by the dates. If w and d are the numeric width and depth points, records are generated for three persons as follows:

Let random(d) be a random integer between 1 and d (generated afresh on each access)

Start date = '01-JAN-1900'

Record limit = 3 * 100 * w

Loop while number of records <= record limit

Add group of records for person 1, with group size = random(d), as follows:

o First start date = last start date + random(d)

o Subsequent start date = previous start date + 1

o End date = start date + 1

o Exit if record limit reached

Repeat for persons 2 and 3

End loop

Store the root date as the mid point of the first group of records generated

This generation process ensures that the size of the record set is proportional to the width point, while the groups are of random sizes but within a scale determined by the depth point. The width and depth points, together with the (randomized) size of the root group, are shown in the next section.

Output Record Counts

The output consists of all the records in the root group, which is defined as the group containing the root date, and has at least one record by definition. Of course, each solution method operates on the same data set, and so the number of records written to file is always the same for all methods (which was checked).

Depth/Width        W1     W2     W4     W8    W16    W32    W64   W128   W256
Total Records>    300    600   1200   2400   4800   9600  19200  38400  76800
D1                  1      1      1      1      1      1      1      1      1
D3                  3      1      2      3      2      1      2      1      3
D9                  5      8      3      5      7      8      8      2      2
D27                 8     16      8     26     10      2     13      2      9
D81                80     11     25     55     26     72     67     41     42
D243                6    219    135    196     49    131     90    132    168
D729              300     93     75    290    134     68    547    627    446
D2187             300    600   1196   1501    972   1300    346    437   1265
D6561             300    600    717   2400   4330   3737   4243   1331   4103


CPU Times

Analytics

Query        W1      W2      W4      W8     W16     W32     W64    W128    W256
D1         0.02    0.05    0.17    0.64    2.42    9.73   38.45  147.35  604.91
D3         0.01    0.03    0.10    0.33    1.27    4.87   19.28    72.8  298.00
D9         0.02    0.01    0.05    0.16    0.50    1.85    7.24   28.41  113.02
D27        0.02    0.03    0.03    0.08    0.22    0.71    2.62   10.18   39.16
D81        0.00    0.01    0.03    0.04    0.12    0.32    1.00    3.81   14.23
D243       0.02    0.04    0.03    0.06    0.09    0.22    0.57    1.55    5.27
D729       0.02    0.03    0.03    0.05    0.10    0.17    0.41    0.99    2.64
D2187      0.03    0.07    0.10    0.18    0.15    0.25    0.34    0.68    1.68
D6561      0.02    0.06    0.08    0.14    0.33    0.31    0.51    0.67    1.42

Notes

The graph generated with Microsoft Excel 2007 may be slightly misleading as the pale blue peak does not appear to reach 605.

Performance for a given width improves dramatically with depth

Model

Query        W1      W2      W4      W8     W16     W32     W64    W128    W256
D1         0.03    0.03    0.06    0.10    0.19    0.38    0.74    1.43    2.98
D3         0.03    0.03    0.06    0.11    0.19    0.37    0.75    1.53    3.01
D9         0.02    0.03    0.07    0.11    0.20    0.37    0.77    1.52    2.99
D27        0.01    0.03    0.06    0.10    0.17    0.37    0.74    1.51    2.99
D81        0.05    0.03    0.06    0.11    0.21    0.39    0.73    1.50    3.00
D243       0.02    0.05    0.07    0.11    0.18    0.38    0.75    1.54    3.06
D729       0.04    0.03    0.04    0.12    0.20    0.39    0.77    1.51    3.02
D2187      0.04    0.08    0.12    0.19    0.22    0.47    0.79    1.50    3.09
D6561      0.03    0.06    0.12    0.19    0.44    0.56    0.95    1.60    3.18


Notes

Performance for a given width is essentially independent of depth

Recursive Subquery Factor

Query        W1      W2      W4      W8     W16     W32     W64    W128    W256
D1         0.02    0.02    0.02    0.01    0.02    0.01    0.01    0.03    0.05
D3         0.01    0.02    0.00    0.01    0.01    0.02    0.02    0.02    0.09
D9         0.01    0.02    0.02    0.01    0.03    0.05    0.06    0.05    0.07
D27        0.02    0.02    0.01    0.03    0.04    0.01    0.08    0.05    0.25
D81        0.03    0.02    0.02    0.08    0.06    0.27    0.44    0.53    1.05
D243       0.01    0.08    0.09    0.19    0.11    0.42    0.60    1.61    4.33
D729       0.11    0.03    0.03    0.29    0.16    0.19    2.82    7.95   12.02
D2187      0.14    0.42    1.45    2.26    1.02    3.01    2.15    4.51   33.49
D6561      0.12    0.40    0.54    5.55    17.8   13.29   27.03   17.83   76.35

Notes

Performance for a given width worsens dramatically with depth


Slice Graphs

Wide Slice

Deep Slice

Explain Plans (Data Point W256-D1)

Analytics

--------------------------------------------------------------------------------------
| Id  | Operation              | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |              |       |       |     6 (100)|          |
|   1 |  SORT ORDER BY         |              |    12 |   828 |     6  (50)| 00:00:01 |
|*  2 |   VIEW                 |              |    12 |   828 |     5  (40)| 00:00:01 |
|   3 |    WINDOW SORT         |              |    12 |   828 |     5  (40)| 00:00:01 |
|   4 |     VIEW               |              |    12 |   828 |     4  (25)| 00:00:01 |
|   5 |      WINDOW SORT       |              |    12 |   336 |     4  (25)| 00:00:01 |
|   6 |       TABLE ACCESS FULL| ACTIVITY_NOV |    12 |   336 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("GROUP_START"<=TO_DATE(' 1900-01-02 12:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "GROUP_END">=TO_DATE(' 1900-01-02 12:00:00', 'syyyy-mm-dd hh24:mi:ss')))

Model

--------------------------------------------------------------------------------------
| Id  | Operation             | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |              |       |       |     5 (100)|          |
|   1 |  SORT ORDER BY        |              |    12 |   828 |     5  (40)| 00:00:01 |
|*  2 |   VIEW                |              |    12 |   828 |     4  (25)| 00:00:01 |
|   3 |    SQL MODEL ORDERED  |              |    12 |   336 |     4  (25)| 00:00:01 |
|   4 |     WINDOW SORT       |              |    12 |   336 |     4  (25)| 00:00:01 |
|   5 |      TABLE ACCESS FULL| ACTIVITY_NOV |    12 |   336 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("GROUP_START"<=TO_DATE(' 1900-01-02 12:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "GROUP_END">=TO_DATE(' 1900-01-02 12:00:00', 'syyyy-mm-dd hh24:mi:ss')))

Recursive Subquery Factor

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation                                 | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                          |                 |       |       |     6 (100)|          |
|   1 |  WINDOW SORT                              |                 |     2 |   102 |     6   (0)| 00:00:01 |
|   2 |   VIEW                                    |                 |     2 |   102 |     6   (0)| 00:00:01 |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST|                 |       |       |            |          |
|*  4 |     TABLE ACCESS BY INDEX ROWID           | ACTIVITY_NOV    |     1 |    28 |     2   (0)| 00:00:01 |
|*  5 |      INDEX SKIP SCAN                      | ACTIVITY_NOV_U1 |     1 |       |     1   (0)| 00:00:01 |
|   6 |     NESTED LOOPS                          |                 |       |       |            |          |
|   7 |      NESTED LOOPS                         |                 |     1 |    69 |     4   (0)| 00:00:01 |
|   8 |       RECURSIVE WITH PUMP                 |                 |       |       |            |          |
|*  9 |       INDEX RANGE SCAN                    | ACTIVITY_NOV_U1 |     6 |       |     1   (0)| 00:00:01 |
|* 10 |      TABLE ACCESS BY INDEX ROWID          | ACTIVITY_NOV    |     1 |    28 |     2   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - filter("END_DATE">TO_DATE(' 1900-01-02 12:00:00', 'syyyy-mm-dd hh24:mi:ss'))
   5 - access("START_DATE"<=TO_DATE(' 1900-01-02 12:00:00', 'syyyy-mm-dd hh24:mi:ss'))
       filter("START_DATE"<=TO_DATE(' 1900-01-02 12:00:00', 'syyyy-mm-dd hh24:mi:ss'))
   9 - access("ACT"."PERSON_ID"="RSQ"."PERSON_ID")
  10 - filter(((("ACT"."START_DATE"="RSQ"."END_DATE" AND INTERNAL_FUNCTION("DIRECTION")) OR
              ("ACT"."END_DATE"="RSQ"."START_DATE" AND "ACT"."END_DATE" IS NOT NULL AND
              INTERNAL_FUNCTION("DIRECTION"))) AND "ACT"."ACTIVITY_NAME"="RSQ"."ACTIVITY_NAME"))

Discussion of Results

The best method for deep data sets is Analytics

The best method for shallow data sets is Recursive Subquery Factor

The Model method is independent of depth and performs in the wide slice at a level between the two other methods, except for one intermediate data point where it is better than both


Problem 2: Overlapping Ranges

Analytics Solution

How It Works

The solution for the second problem is derived from that for the first, but without the additional break checking, and with an extra starting step to obtain a ‘running’ end date that is the largest end date up to the current record, ordered by start date. The running end date then replaces the end date in the next step. The query thus has one more level.

0. Within an inline view, use Max to set a running end date on each record, converting null end dates to a ‘large value’

1. Within an inline view, select all the original fields from the level-0 inline view, and use Lag and Lead functions with CASE expressions to set group start and group end dates on the respective start and running end dates of the break groups, leaving other values null.

2. Select all the original fields from the level-1 inline view, and wrap the new fields in Last_Value and First_Value functions with the IGNORE NULLS option, converting any ‘large value’ back to null.

3. The output from step 2 solves the problem as defined but, if necessary, can be used within another inline view to restrict the output to certain groups only (e.g. a 'current' group).
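The levels above can be traced outside the database. The following is a minimal Python sketch, not the author's code: the function name `overlap_groups` and the `FAR_FUTURE` constant are illustrative assumptions, and it handles one person's records at a time (the SQL does this with PARTITION BY person_id).

```python
from datetime import date

FAR_FUTURE = date(3000, 1, 1)  # assumed stand-in for a null (open-ended) end date


def overlap_groups(records):
    """Group one person's (start, end) ranges; end may be None (open-ended).

    Returns (start, end, group_start, group_end) tuples ordered by start date.
    """
    rows = sorted(records, key=lambda r: r[0])
    n = len(rows)

    # Level 0: running end date = largest end date seen so far.
    running = []
    for _, end in rows:
        e = FAR_FUTURE if end is None else end
        running.append(e if not running else max(e, running[-1]))

    # Level 1: flag group starts (Lag test) and group ends (Lead test).
    g_start = [None] * n
    g_end = [None] * n
    for i, (start, _) in enumerate(rows):
        if i == 0 or start > running[i - 1]:
            g_start[i] = start
        if i == n - 1 or rows[i + 1][0] > running[i]:
            g_end[i] = running[i]

    # Level 2: fill the gaps, like Last_Value / First_Value with IGNORE NULLS.
    for i in range(1, n):
        if g_start[i] is None:
            g_start[i] = g_start[i - 1]
    for i in range(n - 2, -1, -1):
        if g_end[i] is None:
            g_end[i] = g_end[i + 1]

    # Convert the 'large value' back to None on output.
    return [
        (s, e, g_start[i], None if g_end[i] == FAR_FUTURE else g_end[i])
        for i, (s, e) in enumerate(rows)
    ]
```

Feeding it the six records for person 3 should reproduce the three groups shown in the solution stage table; ties on start date are not treated specially here, which is a simplification relative to the RANGE windows in the SQL.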

Query Diagram

SQL

SELECT /* OVERLAP */ person_id, start_date, end_date, activity_name, activity_id id,
       Last_Value (group_start IGNORE NULLS)
           OVER (PARTITION BY person_id ORDER BY start_date) group_start,
       CASE First_Value (group_end IGNORE NULLS)
                OVER (PARTITION BY person_id ORDER BY start_date
                      RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
            WHEN To_Date ('01-JAN-3000', 'DD-MON-YYYY') THEN NULL
            ELSE First_Value (group_end IGNORE NULLS)
                     OVER (PARTITION BY person_id ORDER BY start_date
                           RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
       END group_end
  FROM (
SELECT person_id, start_date, end_date, activity_name, activity_id,
       CASE WHEN (start_date > Nvl (Lag (running_end)
                                        OVER (PARTITION BY person_id ORDER BY start_date),
                                    start_date-1))
            THEN start_date END group_start,
       CASE WHEN (Nvl (Lead (start_date)
                           OVER (PARTITION BY person_id ORDER BY start_date),
                       running_end+1) > running_end)
            THEN running_end END group_end
  FROM (
SELECT person_id, start_date, end_date, activity_name, activity_id,
       Max (Nvl (end_date, '01-JAN-3000'))
           OVER (PARTITION BY person_id ORDER BY start_date) running_end
  FROM activity
 WHERE person_id IN (3, 4)
))
 ORDER BY person_id, start_date

Inline View Diagram

The diagram below shows how the level-0 inline view obtains the running end dates, which are denoted by the end points of the red arrows. The red extension blocks denote records where the running end date is greater than the current end date (17 and 24). Although we still have overlaps, the solution for the first problem now works, because the running end date ensures that the current record always carries the latest end date for the current group. Thus in record 17 below, we won’t wrongly set the group end date to 14 on seeing the gap to record 18; and had record 18 started at 15, we would have correctly assigned it to G2 rather than leaving it in G3 (I have added a test case, T5, for that, but not included it in the diagram).

Solution Stage Table

The table below shows how the solution proceeds in stages from level 0, through level 1.

| Per id | Act id | Activity name | Record start date | Record end date | Running end date (Level 0) | Level 1 view start date | Level 1 view end date | Solution start date | Solution end date |
|--------|--------|---------------|-------------------|-----------------|----------------------------|-------------------------|-----------------------|---------------------|-------------------|
| 3      | 13     | LEAVE         | 01-Jun-11         | 03-Jun-11       | 03-Jun-11                  | 01-Jun-11               |                       | 01-Jun-11           | 07-Jun-11         |
| 3      | 14     | LEAVE         | 02-Jun-11         | 05-Jun-11       | 05-Jun-11                  |                         |                       | 01-Jun-11           | 07-Jun-11         |
| 3      | 15     | LEAVE         | 04-Jun-11         | 07-Jun-11       | 07-Jun-11                  |                         | 07-Jun-11             | 01-Jun-11           | 07-Jun-11         |
| 3      | 16     | LEAVE         | 08-Jun-11         | 16-Jun-11       | 16-Jun-11                  | 08-Jun-11               |                       | 08-Jun-11           | 16-Jun-11         |
| 3      | 17     | TRAINING      | 09-Jun-11         | 14-Jun-11       | 16-Jun-11                  |                         | 16-Jun-11             | 08-Jun-11           | 16-Jun-11         |
| 3      | 18     | TRAINING      | 20-Jun-11         | 30-Jun-11       | 30-Jun-11                  | 20-Jun-11               | 30-Jun-11             | 20-Jun-11           | 30-Jun-11         |
| 4      | 19     | LEAVE         | 01-Jun-11         | 03-Jun-11       | 03-Jun-11                  | 01-Jun-11               |                       | 01-Jun-11           | 07-Jun-11         |
| 4      | 20     | LEAVE         | 02-Jun-11         | 05-Jun-11       | 05-Jun-11                  |                         |                       | 01-Jun-11           | 07-Jun-11         |
| 4      | 21     | LEAVE         | 04-Jun-11         | 07-Jun-11       | 07-Jun-11                  |                         | 07-Jun-11             | 01-Jun-11           | 07-Jun-11         |
| 4      | 22     | LEAVE         | 08-Jun-11         | 16-Jun-11       | 16-Jun-11                  | 08-Jun-11               |                       | 08-Jun-11           |                   |
| 4      | 23     | TRAINING      | 09-Jun-11         |                 | 01-Jan-00                  |                         |                       | 08-Jun-11           |                   |
| 4      | 24     | TRAINING      | 20-Jun-11         | 30-Jun-11       | 01-Jan-00                  |                         | 01-Jan-00             | 08-Jun-11           |                   |
| 5      | 25     | LEAVE         | 01-Jun-11         | 03-Jun-11       | 03-Jun-11                  | 01-Jun-11               |                       | 01-Jun-11           | 07-Jun-11         |
| 5      | 26     | LEAVE         | 02-Jun-11         | 05-Jun-11       | 05-Jun-11                  |                         |                       | 01-Jun-11           | 07-Jun-11         |
| 5      | 27     | LEAVE         | 04-Jun-11         | 07-Jun-11       | 07-Jun-11                  |                         | 07-Jun-11             | 01-Jun-11           | 07-Jun-11         |
| 5      | 28     | LEAVE         | 08-Jun-11         | 16-Jun-11       | 16-Jun-11                  | 08-Jun-11               |                       | 08-Jun-11           | 30-Jun-11         |
| 5      | 29     | TRAINING      | 09-Jun-11         | 14-Jun-11       | 16-Jun-11                  |                         |                       | 08-Jun-11           | 30-Jun-11         |
| 5      | 30     | TRAINING      | 15-Jun-11         | 30-Jun-11       | 30-Jun-11                  |                         | 30-Jun-11             | 08-Jun-11           | 30-Jun-11         |

Model Solution

How It Works

The key to solving this problem using Oracle’s Model clause is to realise that the solution can be represented as three simple inductions. If s, e, S, E are the current start date, end date, group start date and group end date, and (ps, pe, pS, pE) and (ns, ne, nS, nE) are the prior and next values, ordering by start date, then (using C-like terminology for brevity):

Initial, E = e; later, E = (e > pE) ? e : pE -- this gets the running latest end dates

Initial, S = s; later, S = (s > pE) ? s : pS -- this gets group start dates

Final, E unchanged (it already holds the running latest end date); earlier, E = (S < nS) ? E : nE -- this gets group end dates

These inductions can easily be implemented as rules within the model clause:

1. Form the basic Select, with all the table columns required, and append placeholders group_start and group_end

2. Add the Model keyword, partitioning by person, dimensioning by analytic function Row_Number, ordering by start date within person, and with the remaining columns as measures

3. Initialise group start and end dates to start and end dates in the measures clause

4. Define the first rule to obtain a running latest end date for all rows after the first as the previous running end date, unless the current end date is greater than the previous running end date, in which case take the new end date. This rule will be processed in the default ascending row order.

5. Define the second rule to obtain the group start date for all rows after the first as the previous group start date, unless the start date is greater than the previous running latest end date, in which case take the current start date. This rule will be processed in the default ascending row order.

6. Define the third rule to obtain the group end date for all rows before the last as the next group end date, unless the group start date is less than the next group start date, in which case keep the current running latest end date. This rule must be processed in descending row order, and this is specified as it is not the default.

7. The output from the above obtains all groups, but if necessary, can be used within an inline view to restrict the output to certain groups only (e.g. a 'current' group)
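The three inductions can be checked independently of Oracle. Below is a small Python sketch under stated assumptions: the name `model_overlap_groups` and the `FAR_FUTURE` constant are hypothetical, the input is one person's records (the partition), and each loop plays the part of one rule applied to all rows before the next rule starts.

```python
from datetime import date

FAR_FUTURE = date(3000, 1, 1)  # assumed stand-in for a null end date


def model_overlap_groups(records):
    """Apply the three inductions to one person's (start, end) records.

    Returns (start, end, group_start, group_end) tuples ordered by start date.
    """
    rows = sorted(records, key=lambda r: r[0])
    n = len(rows)
    s = [r[0] for r in rows]
    e = [FAR_FUTURE if r[1] is None else r[1] for r in rows]
    S = list(s)  # group start, initialised to the start date
    E = list(e)  # group end, initialised to the end date

    # Rule 1 (ascending): running latest end date.
    for i in range(1, n):
        E[i] = e[i] if e[i] > E[i - 1] else E[i - 1]

    # Rule 2 (ascending): a row starting after the previous running end opens a group.
    for i in range(1, n):
        S[i] = s[i] if s[i] > E[i - 1] else S[i - 1]

    # Rule 3 (descending): within a group, copy the group end back from the next row.
    for i in range(n - 2, -1, -1):
        if not S[i] < S[i + 1]:
            E[i] = E[i + 1]

    return [
        (s[i], rows[i][1], S[i], None if E[i] == FAR_FUTURE else E[i])
        for i in range(n)
    ]
```

With the person 5 records from the stage table, this should give two groups, 01-Jun-11 to 07-Jun-11 and 08-Jun-11 to 30-Jun-11, matching the analytics solution.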


The query diagram, SQL and functional testing use the form for obtaining all break groups, while the performance testing uses the form for obtaining a single break group, for consistency with the third solution method.

Query Diagram

SQL

SELECT /* MOD_OVL */ person_id, start_date,
       CASE end_date WHEN To_Date ('01-JAN-3000', 'DD-MON-YYYY') THEN NULL ELSE end_date END end_date,
       activity_name, activity_id, group_start,
       CASE group_end WHEN To_Date ('01-JAN-3000', 'DD-MON-YYYY') THEN NULL ELSE group_end END group_end
  FROM activity
 MODEL
    PARTITION BY (person_id)
    DIMENSION BY (Row_Number() OVER (PARTITION BY person_id ORDER BY start_date) rn)
    MEASURES (start_date, Nvl (end_date, '01-JAN-3000') end_date, activity_name, activity_id,
              start_date group_start, Nvl (end_date, '01-JAN-3000') group_end)
    RULES (
       group_end[rn > 1] = CASE WHEN end_date[cv()] > group_end[cv()-1] THEN end_date[cv()]
                                ELSE group_end[cv()-1] END,
       group_start[rn > 1] = CASE WHEN start_date[cv()] > group_end[cv()-1] THEN start_date[cv()]
                                  ELSE group_start[cv()-1] END,
       group_end[ANY] ORDER BY rn DESC = PRESENTV (group_start[cv()+1],
                                  CASE WHEN group_start[cv()] < group_start[cv()+1] THEN group_end[cv()]
                                       ELSE group_end[cv()+1] END,
                                  group_end[cv()])
    )
 ORDER BY 1, 2, 3

Recursive Subquery Factor Solution

How It Works

This approach is based on Oracle SQL functionality that is available only from Oracle Database v11.2 onwards, called Recursive Subquery Factoring (RSF).

1. Define a recursive subquery factor.

2. The ‘anchoring’ branch of the RSF selects records defined by the start point. A direction column is defined that here is set to ‘E’ for ‘Either’, meaning extend in either direction in the recursive branch.


3. Add analytic function columns for row number by start date and by end date descending, and for the minimum start date and maximum end dates. These go in both branches.

4. The recursive branch extends the record set by joining records that link to extreme parent records and that ‘push the envelope’. The direction column is set to ‘B’ or ‘F’ according to the direction of extension (‘Backward’ or ‘Forward’).

5. Define a subquery factor for the envelope that simply obtains the minimum start date and maximum end dates from the recursive factor grouped by person

6. Select all records from the envelope factor, joining the activity table for all records within the envelope by person to get all the group records with the group start and end dates being the envelope values.

Note that we need the additional subquery factor because the recursive factor may exclude some records that do not extend the envelope but are contained within it; for example, record 29 in data set T5 above.

The idea here is that for cases where the break group is small this will avoid expensive processing of the entire record set. We’ll demonstrate this saving in our performance analysis section.
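The envelope idea can be illustrated with a short Python sketch. This is a simplification under stated assumptions, not a transcription of the query: it grows the envelope by repeated passes until nothing changes, rather than by a breadth-first recursion with a direction column, and the names `root_group` and `FAR_FUTURE` are hypothetical. It works on one person's records.

```python
from datetime import date

FAR_FUTURE = date(3000, 1, 1)  # assumed stand-in for a null end date


def root_group(records, root):
    """Find the break group containing `root` among (start, end) records.

    Returns (env_start, env_end, members), or None if no record contains root.
    """
    norm = [(s, FAR_FUTURE if e is None else e) for s, e in records]

    # Anchor: records whose range contains the root date.
    anchors = [(s, e) for s, e in norm if s <= root <= e]
    if not anchors:
        return None
    env_start = min(s for s, _ in anchors)
    env_end = max(e for _, e in anchors)

    # Extend: keep pulling in records that overlap the envelope and push it outwards.
    extended = True
    while extended:
        extended = False
        for s, e in norm:
            if s < env_start <= e:  # pushes the envelope backward
                env_start, extended = s, True
            if s <= env_end < e:    # pushes the envelope forward
                env_end, extended = e, True

    # Final join: every record inside the envelope belongs to the group,
    # including those that never extended it.
    members = sorted(
        (rec for rec, (s, e) in zip(records, norm)
         if env_start <= s <= env_end and env_start <= e <= env_end),
        key=lambda r: r[0],
    )
    return env_start, (None if env_end == FAR_FUTURE else env_end), members
```

The final filtering step corresponds to the extra `env` subquery factor: a record such as 29, which lies inside the envelope without extending it, is only picked up there.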


Query Diagram


SQL

WITH rsq (person_id, start_date, end_date, activity_name, activity_id,
          env_start, env_end, rn_asc, rn_dsc, direction) AS (
SELECT person_id, start_date, end_date, activity_name, activity_id,
       Min (start_date) OVER (PARTITION BY person_id) env_start,
       Max (Nvl (end_date, '01-JAN-3000')) OVER (PARTITION BY person_id) env_end,
       Row_Number () OVER (PARTITION BY person_id ORDER BY start_date) rn_asc,
       Row_Number () OVER (PARTITION BY person_id ORDER BY Nvl (end_date, '01-JAN-3000') DESC) rn_dsc,
       'E' direction
  FROM activity
 WHERE '&TODAY' BETWEEN start_date AND Nvl(end_date, '&TODAY')
   AND person_id IN (3, 4, 5)
 UNION ALL
SELECT act.person_id, act.start_date, act.end_date, act.activity_name, act.activity_id,
       Min (act.start_date) OVER (PARTITION BY act.person_id) env_start,
       Max (Nvl (act.end_date, '01-JAN-3000')) OVER (PARTITION BY act.person_id) env_end,
       Row_Number () OVER (PARTITION BY act.person_id ORDER BY act.start_date) rn_asc,
       Row_Number () OVER (PARTITION BY act.person_id ORDER BY Nvl (act.end_date, '01-JAN-3000') DESC) rn_dsc,
       CASE WHEN act.start_date < rsq.env_start THEN 'B' ELSE 'F' END
  FROM rsq
  JOIN activity act
    ON act.person_id = rsq.person_id
   AND ((    act.start_date < rsq.env_start
         AND Nvl (act.end_date, '01-JAN-3000') >= rsq.env_start
         AND rsq.rn_asc = 1
         AND rsq.direction IN ('E', 'B') )
     OR (    Nvl (act.end_date, '01-JAN-3000') > rsq.env_end
         AND act.start_date <= rsq.env_end
         AND rsq.rn_dsc = 1
         AND rsq.direction IN ('E', 'F') ) )
), env AS (
SELECT person_id, Min (env_start) env_start, Max (env_end) env_end
  FROM rsq
 GROUP BY person_id
)
SELECT /* RSQ_OVL '&TODAY' */ act.person_id, act.start_date, act.end_date, act.activity_name, act.activity_id,
       env.env_start,
       CASE WHEN env.env_end = '01-JAN-3000' THEN NULL ELSE env.env_end END env_end
  FROM env
  JOIN activity act
    ON act.person_id = env.person_id
 WHERE act.start_date BETWEEN env.env_start AND env.env_end
   AND Nvl (act.end_date, '01-JAN-3000') BETWEEN env.env_start AND env.env_end
 ORDER BY act.person_id, act.start_date, act.end_date

Performance Analysis

Test Data Sets

If w and d are the numeric width and depth points, records are generated for three persons as follows:

- Let random(x) be a random integer between 1 and x (generated afresh on each access)
- Century start date = '01-JAN-1900'
- Record limit (per person) = 500 * w
- Loop for record limit (per person)
    - Add record for person 1, as follows:
        o Start date = random day in 20th century
        o End date = start date + random (Ceil (sqrt (d))) + 1
    - Repeat for persons 2 and 3
- End loop
- Store the root date as the mid point of the last record generated

This generation process ensures that the size of the record set is proportional to the width point, while the ranges are of random sizes but within a scale determined by the depth point; larger ranges correlate with larger groups. The width and depth points, together with the (randomized) size of the root group, are shown in the next section.
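For readers who want to reproduce a comparable data set outside the database, here is a rough Python sketch of the generation procedure. It is an illustration only: the function name `generate_activity`, the `seed` parameter and the reading of "20th century" as 1900 to 1999 (from the stated century start date) are assumptions, and the original test data were presumably generated in the database itself.

```python
import math
import random
from datetime import date, datetime, time, timedelta

CENTURY_START = date(1900, 1, 1)
# Assumption: "20th century" is read as the 100 years starting at CENTURY_START.
CENTURY_DAYS = (date(2000, 1, 1) - CENTURY_START).days


def generate_activity(w, d, persons=(1, 2, 3), seed=None):
    """Return (records, root), where records are (person_id, start, end) tuples."""
    rng = random.Random(seed)
    max_step = math.ceil(math.sqrt(d))  # upper bound of the random range length
    records = []
    for _ in range(500 * w):  # record limit per person
        for person_id in persons:
            start = CENTURY_START + timedelta(days=rng.randint(1, CENTURY_DAYS) - 1)
            end = start + timedelta(days=rng.randint(1, max_step) + 1)
            records.append((person_id, start, end))

    # Root date: mid point of the last record generated (may fall at midday).
    _, last_start, last_end = records[-1]
    root = datetime.combine(last_start, time.min) + (last_end - last_start) / 2
    return records, root
```

For example, `generate_activity(64, 1, seed=0)` would produce 96,000 records, the same total as the W64 column; the particular rows will of course differ from the author's.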


Output Record Counts

The output consists of all the records in the root group, which is defined as the group containing the root date, and has at least one record by definition. Of course, each solution method operates on the same data set, and so the number of records written to file is always the same for all methods (and this was checked).

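Note that the output record count reached its maximum (the total number of records) in the data points marked * below (shaded in the original).

| Depth/Width   | W1   | W2   | W4   | W8    | W16    | W32    | W64    |
|---------------|------|------|------|-------|--------|--------|--------|
| Total Records | 1500 | 3000 | 6000 | 12000 | 24000  | 48000  | 96000  |
| D1            | 1    | 1    | 1    | 1     | 7      | 3      | 11     |
| D2            | 1    | 1    | 1    | 3     | 4      | 3      | 30     |
| D4            | 1    | 1    | 3    | 1     | 5      | 10     | 62     |
| D8            | 1    | 1    | 3    | 8     | 11     | 45     | 556    |
| D16           | 1    | 3    | 6    | 6     | 47     | 231    | 7814   |
| D32           | 2    | 7    | 20   | 19    | 117    | 6531   | 96000* |
| D64           | 2    | 10   | 20   | 134   | 8893   | 48000* | 96000* |
| D128          | 1    | 11   | 94   | 4778  | 24000* | 48000* | 96000* |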

CPU Times

Analytics

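| Depth/Width | W1   | W2   | W4   | W8    | W16   | W32   | W64    |
|-------------|------|------|------|-------|-------|-------|--------|
| D1          | 0.28 | 1.09 | 3.93 | 13.96 | 46.65 | 126.2 | 396.46 |
| D2          | 0.28 | 1.51 | 3.51 | 12.39 | 40.73 | 96.72 | 160.89 |
| D4          | 0.28 | 0.95 | 3.68 | 11.08 | 31.73 | 62.95 | 63.96  |
| D8          | 0.29 | 0.96 | 3.17 | 9.13  | 20.02 | 24.98 | 12.04  |
| D16         | 0.28 | 0.78 | 2.42 | 5.48  | 8.19  | 4.86  | 2.31   |
| D32         | 0.20 | 0.61 | 1.71 | 2.29  | 1.67  | 1.28  | 7.76   |
| D64         | 0.18 | 0.44 | 0.67 | 0.63  | 1.16  | 3.73  | 7.30   |
| D128        | 0.12 | 0.22 | 0.25 | 0.58  | 2.03  | 3.93  | 7.33   |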

Notes

Performance for a given width improves dramatically with depth

Model

| Depth/Width | W1   | W2   | W4   | W8   | W16  | W32  | W64   |
|-------------|------|------|------|------|------|------|-------|
| D1          | 0.11 | 0.23 | 0.41 | 0.71 | 1.54 | 2.82 | 5.74  |
| D2          | 0.11 | 0.19 | 0.36 | 0.75 | 1.48 | 2.79 | 5.74  |
| D4          | 0.12 | 0.18 | 0.37 | 0.74 | 1.40 | 2.77 | 5.87  |
| D8          | 0.11 | 0.20 | 0.40 | 0.75 | 1.44 | 3.16 | 5.80  |
| D16         | 0.11 | 0.20 | 0.37 | 0.73 | 1.85 | 2.82 | 6.15  |
| D32         | 0.11 | 0.23 | 0.39 | 0.77 | 1.53 | 3.27 | 10.80 |
| D64         | 0.08 | 0.19 | 0.40 | 0.75 | 2.00 | 6.39 | 10.32 |
| D128        | 0.08 | 0.20 | 0.44 | 1.06 | 2.94 | 5.47 | 10.63 |

Notes

Performance for a given width is largely independent of depth, except where it starts to drop off at the maximum depths on the wider data points

Recursive Subquery Factor (No Hint)

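| Depth/Width | W1   | W2   | W4   | W8   | W16   | W32    | W64     |
|-------------|------|------|------|------|-------|--------|---------|
| D1          | 0.03 | 0.01 | 0.03 | 0.05 | 0.09  | 0.10   | 0.38    |
| D2          | 0.01 | 0.03 | 0.02 | 0.05 | 0.04  | 0.09   | 0.71    |
| D4          | 0.02 | 0.02 | 0.03 | 0.01 | 0.06  | 0.16   | 1.30    |
| D8          | 0.01 | 0.03 | 0.03 | 0.05 | 0.14  | 0.57   | 10.03   |
| D16         | 0.04 | 0.00 | 0.04 | 0.05 | 0.26  | 2.25   | 121.22  |
| D32         | 0.02 | 0.05 | 0.07 | 0.06 | 0.58  | 59.10  | 1255.15 |
| D64         | 0.03 | 0.03 | 0.06 | 0.39 | 39.28 | 330.43 | 1240.12 |
| D128        | 0.03 | 0.05 | 0.16 | 9.86 | 90.30 | 325.51 | 1213.75 |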


Notes

Performance for a given width worsens dramatically with depth

Recursive Subquery Factor (Hint)

This query had the following hint added to the anchor branch of the recursive union:

/*+ INDEX (activity ACTIVITY_N1) */

And this to the recursive branch (the first hint means resolve the ‘Or’ into a Union):

/*+ USE_CONCAT INDEX (act ACTIVITY_N1) */

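| Depth/Width | W1   | W2   | W4   | W8   | W16   | W32    | W64    |
|-------------|------|------|------|------|-------|--------|--------|
| D1          | 0.03 | 0.03 | 0.02 | 0.14 | 0.79  | 0.69   | 5.64   |
| D2          | 0.02 | 0.03 | 0.03 | 0.19 | 0.40  | 0.65   | 4.14   |
| D4          | 0.02 | 0.03 | 0.11 | 0.10 | 0.03  | 0.71   | 6.21   |
| D8          | 0.02 | 0.04 | 0.03 | 0.20 | 0.78  | 3.27   | 48.11  |
| D16         | 0.03 | 0.06 | 0.10 | 0.19 | 1.27  | 9.27   | 242.47 |
| D32         | 0.03 | 0.10 | 0.32 | 0.25 | 0.06  | 113.83 | 735.15 |
| D64         | 0.03 | 0.13 | 0.24 | 0.10 | 68.10 | 186.70 | 731.55 |
| D128        | 0.03 | 0.03 | 0.42 | 0.61 | 48.46 | 93.41  | 158.94 |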

Notes

Performance for a given width worsens dramatically with depth, although less so than for the unhinted query


Slice Graphs

[Graphs not reproduced: Wide Slice; Deep Slice]

Explain Plans (Data Point W64-D1)

Analytics

---------------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |          |       |       |       |  4369 (100)|          |
|   1 |  SORT ORDER BY           |          | 96660 | 6513K | 8416K |  4369   (1)| 00:00:53 |
|*  2 |   VIEW                   |          | 96660 | 6513K |       |  2794   (1)| 00:00:34 |
|   3 |    WINDOW SORT           |          | 96660 | 6513K | 8416K |  2794   (1)| 00:00:34 |
|   4 |     VIEW                 |          | 96660 | 6513K |       |  1219   (1)| 00:00:15 |
|   5 |      WINDOW BUFFER       |          | 96660 | 5663K |       |  1219   (1)| 00:00:15 |
|   6 |       VIEW               |          | 96660 | 5663K |       |  1219   (1)| 00:00:15 |
|   7 |        WINDOW SORT       |          | 96660 | 3964K | 5696K |  1219   (1)| 00:00:15 |
|   8 |         TABLE ACCESS FULL| ACTIVITY | 96660 | 3964K |       |   171   (1)| 00:00:03 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("GROUP_START"<=TO_DATE(' 1966-04-03 12:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "GROUP_END">=TO_DATE(' 1966-04-03 12:00:00', 'syyyy-mm-dd hh24:mi:ss')))


Model

------------------------------------------------------------------------------------------
| Id  | Operation             | Name     | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |          |       |       |       |  2794 (100)|          |
|   1 |  SORT ORDER BY        |          | 96660 | 6513K | 8416K |  2794   (1)| 00:00:34 |
|*  2 |   VIEW                |          | 96660 | 6513K |       |  1219   (1)| 00:00:15 |
|   3 |    SQL MODEL ORDERED  |          | 96660 | 3964K |       |  1219   (1)| 00:00:15 |
|   4 |     WINDOW SORT       |          | 96660 | 3964K | 5696K |  1219   (1)| 00:00:15 |
|   5 |      TABLE ACCESS FULL| ACTIVITY | 96660 | 3964K |       |   171   (1)| 00:00:03 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("GROUP_START"<=TO_DATE(' 1966-04-03 12:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "GROUP_END">=TO_DATE(' 1966-04-03 12:00:00', 'syyyy-mm-dd hh24:mi:ss')))

Recursive Subquery Factor

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation                               | Name        | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                        |             |       |       |       |  4673 (100)|          |
|   1 |  SORT ORDER BY                          |             |     1 |    63 |       |  4673  (62)| 00:00:57 |
|   2 |   NESTED LOOPS                          |             |     1 |    63 |       |  4672  (62)| 00:00:57 |
|   3 |    VIEW                                 |             |     3 |    63 |       |  4666  (63)| 00:00:56 |
|   4 |     HASH GROUP BY                       |             |     3 |    63 |       |  4666  (63)| 00:00:56 |
|   5 |      VIEW                               |             | 21964 |  450K |       |  4664  (63)| 00:00:56 |
|   6 |       UNION ALL (RECURSIVE WITH) BREAD F|             |       |       |       |            |          |
|   7 |        WINDOW SORT                      |             | 21616 |  886K | 1280K |   645   (1)| 00:00:08 |
|   8 |         WINDOW SORT                     |             | 21616 |  886K | 1280K |   645   (1)| 00:00:08 |
|*  9 |          TABLE ACCESS FULL              | ACTIVITY    | 21616 |  886K |       |   172   (2)| 00:00:03 |
|  10 |        WINDOW SORT                      |             |   348 | 35496 |       |  4019  (72)| 00:00:49 |
|  11 |         WINDOW SORT                     |             |   348 | 35496 |       |  4019  (72)| 00:00:49 |
|* 12 |          HASH JOIN                      |             |   348 | 35496 | 1520K |  4017  (72)| 00:00:49 |
|  13 |           RECURSIVE WITH PUMP           |             |       |       |       |            |          |
|  14 |           TABLE ACCESS FULL             | ACTIVITY    | 96660 | 3964K |       |   171   (1)| 00:00:03 |
|  15 |    TABLE ACCESS BY INDEX ROWID          | ACTIVITY    |     1 |    42 |       |     2   (0)| 00:00:01 |
|* 16 |     INDEX RANGE SCAN                    | ACTIVITY_N1 |     1 |       |       |     1   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   9 - filter(("START_DATE"<=TO_DATE(' 1966-04-03 12:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              NVL("END_DATE",TO_DATE(' 3000-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))>=
              TO_DATE(' 1966-04-03 12:00:00','syyyy-mm-dd hh24:mi:ss')))
  12 - access("ACT"."PERSON_ID"="RSQ"."PERSON_ID")
       filter((("ACT"."START_DATE"<"RSQ"."ENV_START" AND
              "RSQ"."ENV_START"<=NVL("END_DATE",TO_DATE(' 3000-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')) AND
              "RSQ"."RN_ASC"=1 AND INTERNAL_FUNCTION("RSQ"."DIRECTION")) OR
              ("RSQ"."ENV_END"<NVL("END_DATE", TO_DATE(' 3000-01-01 00:00:00','syyyy-mm-dd hh24:mi:ss')) AND
              "ACT"."START_DATE"<="RSQ"."ENV_END" AND "RSQ"."RN_DSC"=1 AND
              INTERNAL_FUNCTION("RSQ"."DIRECTION"))))
  16 - access("ACT"."PERSON_ID"="ENV"."PERSON_ID" AND "ACT"."START_DATE">="ENV"."ENV_START" AND
              "ENV"."ENV_START"<="ACT"."SYS_NC00006$" AND "ACT"."START_DATE"<="ENV"."ENV_END" AND
              "ENV"."ENV_END">="ACT"."SYS_NC00006$")
       filter(("ENV"."ENV_START"<="ACT"."SYS_NC00006$" AND "ENV"."ENV_END">="ACT"."SYS_NC00006$"))

Recursive Subquery Factor with Hint

------------------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name        | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |             |       |       |       |  306K (100)|          |
|   1 |  SORT ORDER BY                         |             |     1 |    63 |       |  306K   (2)| 01:01:21 |
|   2 |   NESTED LOOPS                         |             |     1 |    63 |       |  306K   (2)| 01:01:21 |
|   3 |    VIEW                                |             |     3 |    63 |       |  306K   (2)| 01:01:21 |
|   4 |     HASH GROUP BY                      |             |     3 |    63 |       |  306K   (2)| 01:01:21 |
|   5 |      VIEW                              |             | 3503K |   70M |       |  306K   (2)| 01:01:19 |
|   6 |       UNION ALL (RECURSIVE WITH) BRE F |             |       |       |       |            |          |
|   7 |        WINDOW SORT                     |             | 21616 |  886K | 1280K | 22328   (1)| 00:04:28 |
|   8 |         WINDOW BUFFER                  |             | 21616 |  886K |       | 22328   (1)| 00:04:28 |
|   9 |          TABLE ACCESS BY INDEX ROWID   | ACTIVITY    | 21616 |  886K |       | 22092   (1)| 00:04:26 |
|* 10 |           INDEX FULL SCAN              | ACTIVITY_N1 | 21616 |       |       |   517   (1)| 00:00:07 |
|  11 |        WINDOW SORT                     |             | 3482K |  338M |       |  284K   (3)| 00:56:51 |
|  12 |         WINDOW SORT                    |             | 3482K |  338M |       |  284K   (3)| 00:56:51 |
|  13 |          CONCATENATION                 |             |       |       |       |            |          |
|  14 |           MERGE JOIN                   |             | 1741K |  169M |       |  121K   (3)| 00:24:23 |
|  15 |            TABLE ACCESS BY INDEX ROWID | ACTIVITY    | 96660 | 3964K |       | 96331   (1)| 00:19:16 |
|  16 |             INDEX FULL SCAN            | ACTIVITY_N1 | 96000 |       |       |   517   (1)| 00:00:07 |
|* 17 |            FILTER                      |             |       |       |       |            |          |
|* 18 |             SORT JOIN                  |             | 21616 | 1266K | 3256K | 22642   (1)| 00:04:32 |
|  19 |              RECURSIVE WITH PUMP       |             |       |       |       |            |          |
|  20 |           MERGE JOIN                   |             | 1741K |  169M |       |  121K   (3)| 00:24:23 |
|  21 |            TABLE ACCESS BY INDEX ROWID | ACTIVITY    | 96660 | 3964K |       | 96331   (1)| 00:19:16 |
|  22 |             INDEX FULL SCAN            | ACTIVITY_N1 | 96000 |       |       |   517   (1)| 00:00:07 |
|* 23 |            FILTER                      |             |       |       |       |            |          |
|* 24 |             SORT JOIN                  |             | 21616 | 1266K | 3256K | 22642   (1)| 00:04:32 |
|  25 |              RECURSIVE WITH PUMP       |             |       |       |       |            |          |
|  26 |    TABLE ACCESS BY INDEX ROWID         | ACTIVITY    |     1 |    42 |       |     2   (0)| 00:00:01 |
|* 27 |     INDEX RANGE SCAN                   | ACTIVITY_N1 |     1 |       |       |     1   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

  10 - access("ACTIVITY"."SYS_NC00006$">=TO_DATE(' 1966-04-03 12:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "START_DATE"<=TO_DATE(' 1966-04-03 12:00:00', 'syyyy-mm-dd hh24:mi:ss'))
       filter(("START_DATE"<=TO_DATE(' 1966-04-03 12:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "ACTIVITY"."SYS_NC00006$">=TO_DATE(' 1966-04-03 12:00:00', 'syyyy-mm-dd hh24:mi:ss')))
  17 - filter(("ACT"."START_DATE"<="RSQ"."ENV_END" AND
              "RSQ"."ENV_END"<NVL("END_DATE", TO_DATE(' 3000-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))))
  18 - access("ACT"."PERSON_ID"="RSQ"."PERSON_ID")
       filter("ACT"."PERSON_ID"="RSQ"."PERSON_ID")
  23 - filter(("ACT"."START_DATE"<"RSQ"."ENV_START" AND
              "RSQ"."ENV_START"<=NVL("END_DATE",TO_DATE(' 3000-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')) AND
              (LNNVL("RSQ"."ENV_END"<NVL("END_DATE", TO_DATE(' 3000-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))) OR
              LNNVL("ACT"."START_DATE"<="RSQ"."ENV_END") OR LNNVL("RSQ"."RN_DSC"=1) OR
              (LNNVL("RSQ"."DIRECTION"='E') AND LNNVL("RSQ"."DIRECTION"='F')))))
  24 - access("ACT"."PERSON_ID"="RSQ"."PERSON_ID")
       filter("ACT"."PERSON_ID"="RSQ"."PERSON_ID")
  27 - access("ACT"."PERSON_ID"="ENV"."PERSON_ID" AND "ACT"."START_DATE">="ENV"."ENV_START" AND
              "ENV"."ENV_START"<="ACT"."SYS_NC00006$" AND "ACT"."START_DATE"<="ENV"."ENV_END" AND
              "ENV"."ENV_END">="ACT"."SYS_NC00006$")
       filter(("ENV"."ENV_START"<="ACT"."SYS_NC00006$" AND "ENV"."ENV_END">="ACT"."SYS_NC00006$"))

Discussion of Results

The best method for deep data sets is Analytics

The best method for shallow data sets is Recursive Subquery Factor. The hinted version levels the performance off at the extremes, but this does not make it a preferred option

The Model method is largely independent of depth and performs in the wide slice at a level between the two other methods, except for one intermediate data point where it is better than both


Problem 3: Bursts of Activity

Analytics Solution (None)

I am unaware of a solution to this problem using analytic functions alone.

Model Solution

How It Works

The key to solving this problem using Oracle’s Model clause is to realise that the solution can be represented as simple inductions, forward for the group start dates, then backward for the group end dates. If D is the distance parameter, s, e, S, E are the current start date, end date, group start date and group end date, and (ps, pe, pS, pE) and (ns, ne, nS, nE) are the prior and next values, then (using C-like terminology for brevity):

Initial, S = s; later, S = (s - pS > D) ? s : pS

Final, E = e; earlier, E = nS > S ? e : nE

These inductions can easily be implemented as rules within the model clause:

1. Form the basic Select, with all the table columns required, and append placeholders group_start and group_end

2. Add the Model keyword, partitioning by person, dimensioning by analytic function Row_Number, ordering by start date within person, and with the remaining columns as measures

3. Initialise group start and end dates to start and end dates in the measures clause

4. Define the first rule to obtain the group start date for all rows after the first as the start date, unless the start date is no more than the distance parameter from the previous group start date, in which case take the previous group start date. This rule will be processed in the default ascending row order.

5. Define the second rule to obtain the group end date for all rows before the last as the next group end date, unless the group start date is less than the next group start date, in which case take the current end date. This rule must be processed in descending row order, and this is specified as it is not the default.

6. The output from the above obtains all groups, but if necessary, can be used within an inline view to restrict the output to certain groups only (e.g. a 'current' group)
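The two inductions for this problem are short enough to check by hand. The sketch below is an illustrative Python rendering, not the author's code: the name `burst_groups` is hypothetical, end dates are assumed to be non-null, and the distance defaults to 3 days to match the literal used in the SQL that follows.

```python
from datetime import date


def burst_groups(records, distance_days=3):
    """Assign burst groups to one person's (start, end) records.

    Returns (start, end, group_start, group_end) tuples ordered by start date.
    """
    rows = sorted(records, key=lambda r: r[0])
    n = len(rows)
    s = [r[0] for r in rows]
    e = [r[1] for r in rows]
    S = list(s)  # group start, initialised to the start date
    E = list(e)  # group end, initialised to the end date

    # Forward induction: a new group starts when the row is more than
    # distance_days after the current group start.
    for i in range(1, n):
        S[i] = s[i] if (s[i] - S[i - 1]).days > distance_days else S[i - 1]

    # Backward induction: the last row of a group keeps its own end date;
    # earlier rows copy the group end back from the next row.
    for i in range(n - 2, -1, -1):
        E[i] = e[i] if S[i] < S[i + 1] else E[i + 1]

    return list(zip(s, e, S, E))
```

Note that, as in the rules above, the group end is the end date of the last record in the group by start date, which is not necessarily the latest end date in the group.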

The query diagram, SQL and functional testing use the form for obtaining all break groups, while the performance testing uses the form for obtaining a single break group, for consistency with the second solution method.


Query Diagram

SQL

SELECT /* MOD */ person_id, start_date, end_date, activity_name, activity_id, group_start, group_end
  FROM activity
 MODEL
    PARTITION BY (person_id)
    DIMENSION BY (Row_Number() OVER (PARTITION BY person_id ORDER BY start_date) rn)
    MEASURES (start_date, end_date, activity_name, activity_id,
              start_date group_start, end_date group_end)
    RULES (
       group_start[rn > 1] = CASE WHEN start_date[cv()] - group_start[cv()-1] > 3 THEN start_date[cv()]
                                  ELSE group_start[cv()-1] END,
       group_end[ANY] ORDER BY rn DESC = PRESENTV (group_start[cv()+1],
                                  CASE WHEN group_start[cv()] < group_start[cv()+1] THEN end_date[cv()]
                                       ELSE group_end[cv()+1] END,
                                  end_date[cv()])
    )
 ORDER BY 1, 2, 3

Recursive Subquery Factoring Solution

How It Works

This approach is based on Oracle SQL functionality that is available only from Oracle Database v11.2 onwards, called Recursive Subquery Factoring (RSF).

1. Define a (non-recursive) subquery factor, act, that selects all records starting on or after a given root date and obtains a row number by person ordered by start date.

2. Define a recursive subquery factor.

3. The ‘anchoring’ branch of the RSF selects the first record from act, with group start as the start date.

4. The recursive branch extends the record set by joining the next record from act if it is within the distance limit from the previous group start, and retaining the group start at its previous value.

5. Select all records from the RSF, and get the group end date using an analytic Max.

The idea here is that for cases where the break group is small this will avoid expensive processing of the entire record set. We’ll demonstrate this saving in our performance analysis section.
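The steps above can be sketched procedurally to show why a small group is cheap to find. This is a minimal Python sketch under stated assumptions: the function name and tuple layout are hypothetical, dates are replaced by integer day numbers, and the default limit of 3 days mirrors the functional test query.

```python
def first_break_group(rows, root, limit=3):
    """rows: (start, end) pairs for one person, as integer day numbers (assumed)."""
    # Subquery factor `act`: keep rows on or after the root date, ordered by start.
    act = sorted(r for r in rows if r[0] >= root)
    if not act:
        return []
    # Anchor branch: the first row, whose start date becomes the group start.
    group_start = act[0][0]
    group = [act[0]]
    # Recursive branch: take the next row while it lies within `limit` days of
    # the group start; the iteration ends at the first row that fails the test.
    for nxt in act[1:]:
        if nxt[0] - group_start > limit:
            break
        group.append(nxt)
    # Final select: the group end is the maximum end date over the rows found.
    group_end = max(e for _, e in group)
    return [(s, e, group_start, group_end) for s, e in group]
```

The initial filter and sort still touch every record, as the window sort in `act` does, but the row-by-row step stops as soon as the group ends, so the number of iterations follows the group size rather than the size of the record set.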


Query Diagram


SQL

WITH act AS (
SELECT person_id, start_date, end_date, activity_name, activity_id,
       Row_Number() OVER (PARTITION BY person_id ORDER BY start_date) rn
  FROM activity
 WHERE start_date >= '&TODAY'
), rsq (person_id, rn, start_date, end_date, activity_name, activity_id, group_start) AS (
SELECT person_id, rn, start_date, end_date, activity_name, activity_id, start_date group_start
  FROM act
 WHERE rn = 1
 UNION ALL
SELECT act.person_id, act.rn, act.start_date, act.end_date, act.activity_name, act.activity_id, rsq.group_start
  FROM act
  JOIN rsq
    ON rsq.rn = act.rn - 1
   AND rsq.person_id = act.person_id
   AND act.start_date - rsq.group_start <= 3
)
SELECT /* RSQ_DST '&TODAY' */ rsq.person_id, rsq.start_date, rsq.end_date, rsq.activity_name, rsq.activity_id, rsq.group_start,
       Max (rsq.end_date) OVER (PARTITION BY rsq.person_id)
  FROM rsq
 ORDER BY 1, 2, 3

Performance Analysis

Test Data Sets

If w and d are the numeric width and depth points, records are generated for three persons as follows:

Let random(x) be a random integer between 1 and x (generated afresh on each access)

Record limit (per person) = 500 * w

Loop for record limit (per person)

    Add record for person 1, as follows:

        o Start date = random day in 20th century

        o End date = start date + random (d) + 1

    Repeat for persons 2 and 3

End loop

Store the root date as the earliest start date generated

This generation process ensures that the size of the record set is proportional to the width point, while the ranges are of random sizes within a scale determined by the depth point. Larger ranges have no effect on group size here, because groups are formed from start dates only; instead, the maximum group range is taken to be the depth parameter value in days. In this way, depth correlates with the group sizes.
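The generation procedure can be sketched as follows. This is an illustrative Python sketch, not the data setup program used for the tests: the function name, the seed parameter and the exact century bounds (1 January 1900 to 31 December 1999) are assumptions.

```python
import random
from datetime import date, timedelta

def generate_activities(w, d, persons=3, seed=None):
    """Return (records, root_date); records are (person_id, start, end) tuples."""
    rng = random.Random(seed)
    century_start = date(1900, 1, 1)                       # assumed lower bound
    century_days = (date(2000, 1, 1) - century_start).days
    records = []
    for _ in range(500 * w):                               # record limit per person
        for person_id in range(1, persons + 1):
            start = century_start + timedelta(days=rng.randrange(century_days))
            # random(d) is an integer between 1 and d, drawn afresh each time
            end = start + timedelta(days=rng.randint(1, d) + 1)
            records.append((person_id, start, end))
    root_date = min(start for _, start, _ in records)      # earliest start date
    return records, root_date
```

With three persons the record count is 1500 * w, which matches the Total Records row in the tables that follow.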

The width and depth points, together with the (randomized) size of the root group, are shown in the next section.

Output Row Counts

The output consists of all the records in the first group, starting at the root date. Of course, each solution method operates on the same data set, and so the number of records written to file is always the same for both (and this was checked).

Depth/Width       W1     W2     W4     W8    W16    W32    W64    W128
Total Records   1500   3000   6000  12000  24000  48000  96000  192000
D1                 3      3      3      3      4      4      7      12
D3                 3      4      3      6      5      7     12      23
D9                 3      3      3      5      9     16     29      71
D27                5      9      7     15     28     31     72     138
D81                4      9     16     32     64    125    218     438
D243              11     21     38     74    150    290    678    1295
D729              31     70    138    229    494    907   1959    3794
D2187             97    164    361    742   1444   2881   5785   11561


CPU Times

Model

Depth/Width       W1     W2     W4     W8    W16    W32    W64    W128
Total Records   1500   3000   6000  12000  24000  48000  96000  192000
D1              0.07   0.16   0.29   0.58   1.19   2.28   4.73    9.39
D3              0.10   0.16   0.30   0.59   1.15   2.34   4.64    9.41
D9              0.10   0.15   0.29   0.59   1.17   2.31   4.67    9.29
D27             0.09   0.14   0.31   0.58   1.19   2.35   4.68    9.41
D81             0.09   0.16   0.31   0.58   1.17   2.36   4.69    9.38
D243            0.09   0.16   0.32   0.59   1.17   2.29   4.71    9.42
D729            0.11   0.15   0.31   0.61   1.20   2.34   4.84    9.57
D2187           0.08   0.18   0.29   0.69   1.23   2.43   5.00   10.03

Notes

Performance for a given width is essentially independent of depth

Recursive Subquery Factor

Depth/Width       W1     W2     W4     W8    W16    W32    W64    W128
Total Records   1500   3000   6000  12000  24000  48000  96000  192000
D1              0.01   0.03   0.05   0.07   0.13   0.22   0.53    1.22
D3              0.03   0.03   0.03   0.06   0.12   0.29   0.61    1.48
D9              0.03   0.02   0.03   0.06   0.16   0.31   0.83    3.05
D27             0.03   0.03   0.05   0.11   0.24   0.47   1.62    4.93
D81             0.03   0.05   0.06   0.14   0.36   1.11   3.74   13.61
D243            0.05   0.05   0.10   0.22   0.71   2.42  10.58   37.66
D729            0.01   0.10   0.21   0.53   2.00   7.00  27.72  107.73
D2187           0.08   0.13   0.46   1.62   5.63  21.45  82.37  317.82


Notes

Performance for a given width worsens dramatically with depth

Slice Graphs

Wide Slice

Deep Slice


Explain Plans (Data Point W128-D1)

Model

-----------------------------------------------------------------------------------------
| Id  | Operation         | Name     | Rows  | Bytes | TempSpc | Cost (%CPU) | Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |       |       |         |  5591 (100) |          |
|   1 | SORT ORDER BY     |          |  193K |   14M |     18M |  5591   (1) | 00:01:08 |
|*  2 | VIEW              |          |  193K |   14M |         |  2073   (1) | 00:00:25 |
|   3 | SQL MODEL ORDERED |          |  193K | 6422K |         |  2073   (1) | 00:00:25 |
|   4 | WINDOW SORT       |          |  193K | 6422K |   9112K |  2073   (1) | 00:00:25 |
|*  5 | TABLE ACCESS FULL | ACTIVITY |  193K | 6422K |         |   310   (1) | 00:00:04 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("GROUP_START"="MIN_START")
   5 - filter("START_DATE">=TO_DATE(' 1900-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Recursive Subquery Factor

-------------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name      | Rows  | Bytes | TempSpc | Cost (%CPU) | Time     |
-------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |           |       |       |         |  4455 (100) |          |
|   1 | TEMP TABLE TRANSFORMATION            |           |       |       |         |             |          |
|   2 | LOAD AS SELECT                       |           |       |       |         |             |          |
|   3 | WINDOW SORT                          |           |  193K | 6422K |   9112K |  2073   (1) | 00:00:25 |
|*  4 | TABLE ACCESS FULL                    | ACTIVITY  |  193K | 6422K |         |   310   (1) | 00:00:04 |
|   5 | WINDOW SORT                          |           | 6429K |  367M |         |  2382  (23) | 00:00:29 |
|   6 | VIEW                                 |           | 6429K |  367M |         |  2382  (23) | 00:00:29 |
|   7 | UNION ALL (RECURSIVE WITH) BREADTH F |           |       |       |         |             |          |
|*  8 | VIEW                                 |           |  193K |   11M |         |   246   (1) | 00:00:03 |
|   9 | TABLE ACCESS FULL                    | SYS_TEMP_ |  193K | 6422K |         |   246   (1) | 00:00:03 |
|* 10 | HASH JOIN                            |           | 6235K |  588M |   8880K |  2136  (25) | 00:00:26 |
|  11 | RECURSIVE WITH PUMP                  |           |       |       |         |             |          |
|  12 | VIEW                                 |           |  193K |   11M |         |   246   (1) | 00:00:03 |
|  13 | TABLE ACCESS FULL                    | SYS_TEMP_ |  193K | 6422K |         |   246   (1) | 00:00:03 |
-------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - filter("START_DATE">=TO_DATE(' 1900-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
   8 - filter("RN"=1)
  10 - access("RSQ"."RN"="ACT"."RN"-1 AND "RSQ"."PERSON_ID"="ACT"."PERSON_ID")
       filter("ACT"."START_DATE"-"RSQ"."GROUP_START"<=1)

[SYS_TEMP_ was SYS_TEMP_0FD9D6648_110EBBB - truncated to fit the Word box]

Discussion of Results

No solution method using Analytics was found

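The best method for shallow data sets is Recursive Subquery Factor (at W128-D1 its CPU time was 1.22 against 9.39 for Model)

The best method for deep data sets is Model, whose performance is also independent of depth (at W128-D2187 its CPU time was 10.03 against 317.82 for Recursive Subquery Factor)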


Analytics Anomaly Analysis

We observed in the performance analysis sections for problems 1 and 2 that the analytics solutions behaved in the opposite manner to recursive subquery factoring: performance improved roughly in proportion to depth for given width. This is surprising, since we might expect performance to remain largely independent of depth, as with the model solutions, given that depth does not affect overall problem size. The behaviour of recursive subquery factoring is consistent with expectation, given the construction of the methods.

After completion of the initial performance analysis (v1.2 of the document) this issue was further analysed. It was determined by experiment that variations on the queries could avoid the deterioration in performance with decreasing depth. The problem seems to be due to a glitch in Oracle’s execution of queries with First_Value and the IGNORE NULLS option, and occurs in both 10g and 11g XE. It seems as though Oracle does a lot of unnecessary recalculation for each row processed when there are few null values.

The first variation involves noting that finding the first value in a list looking forward from the current row is the same as finding the last value looking back from the end to the current row. At first it might seem that the latter would be slower, but reuse of processing for previous rows as one progresses through the row set clearly is important.
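The equivalence behind this variation can be checked outside SQL. In the Python sketch below (function names are hypothetical, and None stands in for a SQL null), the first function looks forward from each row for the first non-null value, while the second walks once from the end and carries the latest non-null value. The naive rescanning in the first function is only an illustration of how repeated work could arise; it is not a claim about how Oracle executes First_Value internally.

```python
def first_value_forward(values):
    """First non-null value at or after each position (looking forward)."""
    out = []
    for i in range(len(values)):
        # Rescans the remaining rows for every position: simple but repetitive.
        out.append(next((v for v in values[i:] if v is not None), None))
    return out

def last_value_backward(values):
    """Same answer from a single pass starting at the end of the list."""
    out = [None] * len(values)
    carried = None
    for i in range(len(values) - 1, -1, -1):
        if values[i] is not None:
            carried = values[i]          # reuse the state built for later rows
        out[i] = carried
    return out
```

Both functions return the same list for any input; the second does so with one unit of work per row because each result reuses what was computed for the row processed just before it.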

The second variation involves removing the First_Value from the existing query, then adding an enclosing query that gets the group end as the maximum for person and group start.
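The enclosing query in this variation amounts to taking a maximum per (person, group start) pair and attaching it to every row of that pair. A minimal Python sketch, assuming a hypothetical tuple layout of (person_id, start, end, group_start) with non-null, comparable end values:

```python
def add_group_end(rows):
    """Append the group end (max end per person and group start) to each row."""
    max_end = {}
    for person_id, _start, end, group_start in rows:
        key = (person_id, group_start)
        if key not in max_end or end > max_end[key]:
            max_end[key] = end
    # Second pass: every row picks up the maximum found for its own group.
    return [row + (max_end[(row[0], row[3])],) for row in rows]
```

No ordering within the group is needed for this step, which is why it can replace a windowed First_Value that depends on row order.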

The performance analysis was repeated for the two variations, plus the original analytic solutions and the model solution on a single wide slice, using the same data setup programs. As there is no RSF method now, we have taken the original forms of the problems where all groups are obtained. Both variations now perform as well for shallow as for deep data sets. Notice that the explain plans suggest that the variations will perform worse, having additional sort operations and higher estimated costs, but the measured CPU times below show the opposite.

Analytic Query Variations

Problem 1: Contiguous Ranges

Query NOF (Replace First_Value with Last_Value Inverted)

The query structure is essentially unchanged.

SQL

SELECT /* NOV_NOF */ person_id, start_date, end_date, activity_name, activity_id id,
       Last_Value (group_start IGNORE NULLS) OVER (PARTITION BY person_id ORDER BY start_date) group_start,
       Last_Value (group_end IGNORE NULLS) OVER (PARTITION BY person_id ORDER BY start_date DESC) group_end
  FROM (
SELECT person_id, start_date, end_date, activity_name, activity_id,
       CASE WHEN (start_date > Nvl (Lag (end_date) OVER (PARTITION BY person_id ORDER BY start_date), start_date-1)) OR
                 (activity_name != Lag (activity_name) OVER (PARTITION BY person_id ORDER BY start_date))
            THEN start_date END group_start,
       CASE WHEN (Nvl (Lead (start_date) OVER (PARTITION BY person_id ORDER BY start_date), end_date+1) > end_date) OR
                 (activity_name != Lead (activity_name) OVER (PARTITION BY person_id ORDER BY start_date))
            THEN end_date END group_end
  FROM activity_nov
)
 ORDER BY person_id, start_date

Explain Plan

---------------------------------------------------------------------------------------------
| Id  | Operation         | Name         | Rows  | Bytes | TempSpc | Cost (%CPU) | Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |              |       |       |         |   522 (100) |          |
|   1 | WINDOW SORT       |              | 19200 | 1293K |   1680K |   522   (1) | 00:00:07 |
|   2 | WINDOW SORT       |              | 19200 | 1293K |   1680K |   522   (1) | 00:00:07 |
|   3 | VIEW              |              | 19200 | 1293K |         |   205   (1) | 00:00:03 |
|   4 | WINDOW SORT       |              | 19200 |  618K |    912K |   205   (1) | 00:00:03 |
|   5 | TABLE ACCESS FULL | ACTIVITY_NOV | 19200 |  618K |         |    30   (0) | 00:00:01 |
---------------------------------------------------------------------------------------------


Query MAX (Remove First_Value, Adding Max in Outer Level)

SQL

SELECT /* NOV_MAX */ person_id, start_date, end_date, activity_name, id, group_start,
       Max (end_date) OVER (PARTITION BY person_id, group_start) group_end
  FROM (
SELECT person_id, start_date, end_date, activity_name, activity_id id,
       Last_Value (group_start IGNORE NULLS) OVER (PARTITION BY person_id ORDER BY start_date) group_start
  FROM (
SELECT person_id, start_date, end_date, activity_name, activity_id,
       CASE WHEN (start_date > Nvl (Lag (end_date) OVER (PARTITION BY person_id ORDER BY start_date), start_date-1)) OR
                 (activity_name != Lag (activity_name) OVER (PARTITION BY person_id ORDER BY start_date))
            THEN start_date END group_start,
       CASE WHEN (Nvl (Lead (start_date) OVER (PARTITION BY person_id ORDER BY start_date), end_date+1) > end_date) OR
                 (activity_name != Lead (activity_name) OVER (PARTITION BY person_id ORDER BY start_date))
            THEN end_date END group_end
  FROM activity_nov
))
 ORDER BY person_id, start_date

Explain Plan

---------------------------------------------------------------------------------------------
| Id  | Operation         | Name         | Rows  | Bytes | TempSpc | Cost (%CPU) | Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |              |       |       |         |  1041 (100) |          |
|   1 | SORT ORDER BY     |              | 19200 | 1125K |   1448K |  1041   (1) | 00:00:13 |
|   2 | WINDOW SORT       |              | 19200 | 1125K |   1448K |  1041   (1) | 00:00:13 |
|   3 | VIEW              |              | 19200 | 1125K |         |   484   (1) | 00:00:06 |
|   4 | WINDOW SORT       |              | 19200 | 1125K |   1448K |   484   (1) | 00:00:06 |
|   5 | VIEW              |              | 19200 | 1125K |         |   205   (1) | 00:00:03 |
|   6 | WINDOW SORT       |              | 19200 |  618K |    912K |   205   (1) | 00:00:03 |
|   7 | TABLE ACCESS FULL | ACTIVITY_NOV | 19200 |  618K |         |    30   (0) | 00:00:01 |
---------------------------------------------------------------------------------------------

Query Analytics (Original)

Explain Plan

---------------------------------------------------------------------------------------------
| Id  | Operation         | Name         | Rows  | Bytes | TempSpc | Cost (%CPU) | Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |              |       |       |         |   205 (100) |          |
|   1 | WINDOW SORT       |              | 19200 | 1293K |         |   205   (1) | 00:00:03 |
|   2 | VIEW              |              | 19200 | 1293K |         |   205   (1) | 00:00:03 |
|   3 | WINDOW SORT       |              | 19200 |  618K |    912K |   205   (1) | 00:00:03 |
|   4 | TABLE ACCESS FULL | ACTIVITY_NOV | 19200 |  618K |         |    30   (0) | 00:00:01 |
---------------------------------------------------------------------------------------------

Query Model (Original)

Explain Plan

---------------------------------------------------------------------------------------------
| Id  | Operation         | Name         | Rows  | Bytes | TempSpc | Cost (%CPU) | Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |              |       |       |         |   379 (100) |          |
|   1 | SORT ORDER BY     |              | 19200 |  618K |    912K |   379   (1) | 00:00:05 |
|   2 | SQL MODEL ORDERED |              | 19200 |  618K |         |   379   (1) | 00:00:05 |
|   3 | WINDOW SORT       |              | 19200 |  618K |    912K |   379   (1) | 00:00:05 |
|   4 | TABLE ACCESS FULL | ACTIVITY_NOV | 19200 |  618K |         |    30   (0) | 00:00:01 |
---------------------------------------------------------------------------------------------

Problem 2: Overlapping Ranges

Query NOF (Replace First_Value with Last_Value Inverted)

The query structure is essentially unchanged.

SQL

SELECT /* ANA_NOF */ person_id, start_date, end_date, activity_name, activity_id id,
       Last_Value (group_start IGNORE NULLS) OVER (PARTITION BY person_id ORDER BY start_date) group_start,
       CASE Last_Value (group_end IGNORE NULLS) OVER (PARTITION BY person_id ORDER BY start_date DESC)
            WHEN To_Date('01-JAN-3000', 'DD-MON-YY') THEN NULL
            ELSE Last_Value (group_end IGNORE NULLS) OVER (PARTITION BY person_id ORDER BY start_date DESC) END group_end
  FROM (
SELECT person_id, start_date, end_date, activity_name, activity_id,
       CASE WHEN (start_date > Nvl (Lag (running_end) OVER (PARTITION BY person_id ORDER BY start_date), start_date-1))
            THEN start_date END group_start,
       CASE WHEN (Nvl (Lead (start_date) OVER (PARTITION BY person_id ORDER BY start_date), running_end+1) > running_end)
            THEN running_end END group_end
  FROM (
SELECT person_id, start_date, end_date, activity_name, activity_id,
       Max (Nvl(end_date, '01-JAN-3000')) OVER (PARTITION BY person_id ORDER BY start_date) running_end
  FROM activity
))
 ORDER BY person_id, start_date
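The grouping logic of the inner views in this query can be sketched procedurally: a running maximum of the end dates is kept, and a row starts a new group when its start date lies beyond that running end. This Python sketch is an illustration under stated assumptions only: the function name and tuple layout are hypothetical, dates are integer day numbers, and an infinite value stands in for the 01-JAN-3000 sentinel used for null end dates.

```python
FAR_FUTURE = float("inf")   # stand-in for the sentinel applied to null end dates

def overlap_group_starts(rows):
    """rows: (start, end) for one person; end may be None (open-ended).

    Returns (start, end, group_start) tuples in start order.
    """
    rows = sorted(rows, key=lambda r: r[0])
    out = []
    running_end = None       # running maximum of end dates seen so far
    group_start = None
    for start, end in rows:
        # New group when there is no earlier row or the row starts after
        # everything seen so far has ended.
        if running_end is None or start > running_end:
            group_start = start
        effective_end = FAR_FUTURE if end is None else end
        running_end = effective_end if running_end is None else max(running_end, effective_end)
        out.append((start, end, group_start))
    return out
```

Once an open-ended row has been seen, the running end is the sentinel and every later row joins the same group, which mirrors the effect of the Nvl on end_date in the innermost view.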

Explain Plan

-----------------------------------------------------------------------------------------
| Id  | Operation         | Name     | Rows  | Bytes | TempSpc | Cost (%CPU) | Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |       |       |         |  2745 (100) |          |
|   1 | WINDOW SORT       |          | 94880 | 6393K |   8264K |  2745   (1) | 00:00:33 |
|   2 | WINDOW SORT       |          | 94880 | 6393K |   8264K |  2745   (1) | 00:00:33 |
|   3 | VIEW              |          | 94880 | 6393K |         |  1199   (1) | 00:00:15 |
|   4 | WINDOW BUFFER     |          | 94880 | 5559K |         |  1199   (1) | 00:00:15 |
|   5 | VIEW              |          | 94880 | 5559K |         |  1199   (1) | 00:00:15 |
|   6 | WINDOW SORT       |          | 94880 | 3891K |   5592K |  1199   (1) | 00:00:15 |
|   7 | TABLE ACCESS FULL | ACTIVITY | 94880 | 3891K |         |   171   (1) | 00:00:03 |
-----------------------------------------------------------------------------------------


Query MAX (Remove First_Value, Adding Max in Outer Level)

SQL

SELECT /* ANA_MAX */ person_id, start_date, end_date, activity_name, id, group_start,
       CASE Max (Nvl(end_date, To_Date('01-JAN-3000', 'DD-MON-YY'))) OVER (PARTITION BY person_id, group_start)
            WHEN To_Date('01-JAN-3000', 'DD-MON-YY') THEN NULL
            ELSE Max (end_date) OVER (PARTITION BY person_id, group_start) END group_end
  FROM (
SELECT /* ANA_OVL */ person_id, start_date, end_date, activity_name, activity_id id,
       Last_Value (group_start IGNORE NULLS) OVER (PARTITION BY person_id ORDER BY start_date) group_start
  FROM (
SELECT person_id, start_date, end_date, activity_name, activity_id,
       CASE WHEN (start_date > Nvl (Lag (running_end) OVER (PARTITION BY person_id ORDER BY start_date), start_date-1))
            THEN start_date END group_start,
       CASE WHEN (Nvl (Lead (start_date) OVER (PARTITION BY person_id ORDER BY start_date), running_end+1) > running_end)
            THEN running_end END group_end
  FROM (
SELECT person_id, start_date, end_date, activity_name, activity_id,
       Max (Nvl(end_date, '01-JAN-3000')) OVER (PARTITION BY person_id ORDER BY start_date) running_end
  FROM activity
)))
 ORDER BY person_id, start_date

Explain Plan

-----------------------------------------------------------------------------------------
| Id  | Operation         | Name     | Rows  | Bytes | TempSpc | Cost (%CPU) | Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |       |       |         |  2745 (100) |          |
|   1 | WINDOW SORT       |          | 94880 | 6393K |   8264K |  2745   (1) | 00:00:33 |
|   2 | WINDOW SORT       |          | 94880 | 6393K |   8264K |  2745   (1) | 00:00:33 |
|   3 | VIEW              |          | 94880 | 6393K |         |  1199   (1) | 00:00:15 |
|   4 | WINDOW BUFFER     |          | 94880 | 5559K |         |  1199   (1) | 00:00:15 |
|   5 | VIEW              |          | 94880 | 5559K |         |  1199   (1) | 00:00:15 |
|   6 | WINDOW SORT       |          | 94880 | 3891K |   5592K |  1199   (1) | 00:00:15 |
|   7 | TABLE ACCESS FULL | ACTIVITY | 94880 | 3891K |         |   171   (1) | 00:00:03 |
-----------------------------------------------------------------------------------------

Query Analytics (Original)

Explain Plan

-----------------------------------------------------------------------------------------
| Id  | Operation         | Name     | Rows  | Bytes | TempSpc | Cost (%CPU) | Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |       |       |         |  1199 (100) |          |
|   1 | WINDOW SORT       |          | 94880 | 6393K |         |  1199   (1) | 00:00:15 |
|   2 | VIEW              |          | 94880 | 6393K |         |  1199   (1) | 00:00:15 |
|   3 | WINDOW BUFFER     |          | 94880 | 5559K |         |  1199   (1) | 00:00:15 |
|   4 | VIEW              |          | 94880 | 5559K |         |  1199   (1) | 00:00:15 |
|   5 | WINDOW SORT       |          | 94880 | 3891K |   5592K |  1199   (1) | 00:00:15 |
|   6 | TABLE ACCESS FULL | ACTIVITY | 94880 | 3891K |         |   171   (1) | 00:00:03 |
-----------------------------------------------------------------------------------------

Query Model (Original)

Explain Plan

-----------------------------------------------------------------------------------------
| Id  | Operation         | Name     | Rows  | Bytes | TempSpc | Cost (%CPU) | Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |       |       |         |  2226 (100) |          |
|   1 | SORT ORDER BY     |          | 94880 | 3891K |   5592K |  2226   (1) | 00:00:27 |
|   2 | SQL MODEL ORDERED |          | 94880 | 3891K |         |  2226   (1) | 00:00:27 |
|   3 | WINDOW SORT       |          | 94880 | 3891K |   5592K |  2226   (1) | 00:00:27 |
|   4 | TABLE ACCESS FULL | ACTIVITY | 94880 | 3891K |         |   171   (1) | 00:00:03 |
-----------------------------------------------------------------------------------------

Performance Analysis

Problem 1: Contiguous Ranges

Group Sizes by Depth

The output consists of all the records (76,000) and the table below gives the average group sizes, which are written to the log by a query in the data setup program.

Depth   Group Size
D1               1
D3               2
D9               5
D27             14
D81             41
D243           125
D729           356
D2187          985

CPU Times

Depth ->          D1      D3      D9     D27     D81    D243    D729   D2187
Group Size ->      1       2       5      14      41     125     356     985
ANA           694.71  340.58  130.57   50.47   21.28   10.87    8.63    7.74
NOF             7.05    6.63    6.88    7.24    6.53    6.92    7.12    7.27
MAX             5.78    5.91    5.46    4.99    5.32    5.60    5.84    5.42
MOD             7.78    7.48    7.62    7.55    7.78    7.53    8.08    7.45


Slice Graph (Wide Slice)

Problem 2: Overlapping Ranges

Group Sizes by Depth

The output consists of all the records (96,000) and the table below gives the average group sizes, which are written to the log by a query in the data setup program.

Depth   Group Size
D1               4
D3               6
D9               9
D27             33
D81            120
D243          2602
D729         24615
D2187        32000

CPU Times

Depth ->          D1      D3      D9     D27     D81    D243    D729   D2187
Group Size ->      4       6       9      33     120    2602   24615   32000
ANA           278.09  180.97  120.87   36.26    16.6    9.13    9.37    8.40
NOF             9.18    8.70    8.89    9.53    8.76    8.83    9.15    8.42
MAX             9.19    8.95    8.95    8.83    8.63    8.69    8.95    8.60
MOD            11.76   12.12   12.33   11.37   11.61   11.64   11.97   11.47


Slice Graph (Wide Slice)


Conclusions

Solution methods have been presented for a number of range-based SQL grouping problems, including relatively new techniques from Oracle Database 10.1 and 11.2. It has been shown that the best method depends not just on the size of the data set, but also on its shape. A few summary points may be made in relation to these problems:

The Model clause tends to produce relatively simple SQL that performs consistently across data sets

The new Recursive Subquery Factor feature can be extremely efficient in cases where the records in the solution set are much fewer than the total, but only works for a single group

Solutions using analytic functions are slightly more efficient than model solutions where available, but an important performance glitch in certain cases has been identified and needs to be worked around

Explain plan costings should be treated with caution

SQL developers interested in performance need to be proficient in all three techniques (most are familiar only with the older analytic functions technique, available since Oracle v8)

Performance testing can be more effective when executed by automated methods across multi-dimensional domains


References

REF-1 Activities and breaks. Question by Jayadev on Tom Kyte’s Oracle database forum

REF-2 SQL Pivot and Prune Queries – Keeping an Eye on Performance. BP Furey, June 2011

REF-3 Oracle® Database SQL Language Reference 11g Release 2 (11.2). http://www.oracle.com/pls/db112
