SAUSAG 69 – 20 Feb 2014
Smarter Sorts
Jerry Le Breton (Softscape Solutions) & Doug Lean (DHS)
Beyond the Obvious
Sorting – The Obvious First
Why Sort?
“Data and information is almost always presented in a sorted or structured way”
Sorting – The Obvious First
    proc sort data=claims;
      by claim client;
It's important to know your data:
• How many variables
• How many distinct data values for each
Sort puts your records in order BY the values of the variables you list.
Sorting – Do You Need To?
    proc sort data=claims;    /* an unnecessary SORT */
      by claim;
    proc tabulate ...;
      class claim;
      ...
Some PROCs do their own sorting:
• TABULATE
• MEANS
• REPORT
• SQL (which can run out of memory for really big data sets)
Sorting – Do You Need To?
Only use PROC SORT before REPORT, TABULATE or MEANS if the sorted order is needed for another reason later.
For PROC MEANS, replace BY with CLASS, e.g.
    PROC MEANS NWAY; CLASS x y z;
gives similar results to
    PROC SORT; BY x y z;
    PROC MEANS; BY x y z;
and saves significant time by avoiding the SORT.
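The CLASS versus BY comparison above can be tried out with a small sketch like the one below. It assumes the SASHELP.CLASS sample table that ships with SAS; the output table names (stats_class, class_sorted, stats_by) and the choice of variables are illustrative only, not taken from the original slides.

```sas
/* Assumption: SASHELP.CLASS stands in for a real table. */

/* One step: CLASS groups the data without a prior sort. */
proc means data=sashelp.class nway noprint;
  class sex age;
  var height;
  output out=work.stats_class mean=mean_height;
run;

/* Two steps: BY needs the input sorted first. */
proc sort data=sashelp.class out=work.class_sorted;
  by sex age;
run;

proc means data=work.class_sorted noprint;
  by sex age;
  var height;
  output out=work.stats_by mean=mean_height;
run;
```

Both output tables should hold one row per combination of sex and age; only the second approach pays for a sort.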
Sort Only What You Need
Sort just the rows you want…
    proc sort data=claims out=Sorted_claims;
      where client =: 'A';
      by claim;
… and just the columns you want…
    proc sort data=claims(keep = c:) out=Sorted_claims;
      by claim;
Leaving out unwanted rows and columns can produce dramatic performance improvements.
Sorting – Proc Sort vs Proc SQL
    /* SORT Procedure */
    proc sort data=claims;
      by client claim;
    run;

    /* SQL Procedure (ORDER BY columns are separated by commas) */
    proc sql;
      create table claims as
        select * from claims
        order by client, claim;
    quit;
Both will sort your data, with no significant performance difference. Choose according to clarity, functional requirement and efficiency. Make it as clear and simple as possible!
Sorted Status of a Data Set
    proc sort data=claims;
      by claim client;
Sort Information
    Sortedby        CLAIM CLIENT
    Validated       YES
    Character Set   ANSI
Sort status is saved as part of a SAS data set.
So SAS won’t waste time re-sorting if it’s already in the required order.
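One way to see the saved sort status is PROC CONTENTS, whose output includes a Sort Information section like the one shown on this slide. The sketch below assumes a tiny made-up CLAIMS table; the data values are hypothetical.

```sas
/* Hypothetical sample data standing in for the CLAIMS table. */
data work.claims;
  input claim client $;
  datalines;
3 B
1 A
2 A
;
run;

proc sort data=work.claims;
  by claim client;
run;

/* The Sort Information section lists Sortedby and Validated. */
proc contents data=work.claims;
run;
```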
Setting Sorted Status of a Data Set
    data client_claims (sortedby = client);
      merge clients claims;
      by client;
Sort Information
    Sortedby        CLIENT
    Validated       NO
    Character Set   ANSI
If you know a data set is sorted, say so with the SORTEDBY= option!
So SAS won’t waste time re-sorting later.
Presorted or Notsorted
    proc sort data=claims out=sorted presorted;
      by claim;
Use the PRESORTED option when the data is probably already sorted! SAS will check and only sort if necessary.
    proc print data=grouped_claims;
      by claim NOTSORTED;
There is no need to sort if the data is already grouped by the required variable. It doesn't matter that it's not sorted (you just have to say so with NOTSORTED).
Sorting and Maintaining Order
proc sort data=claims; by claim ;
By default, SAS maintains the original order of records within a BY group.
proc sort data=claims noequals; by claim ;
Using the NOEQUALS option means SAS won’t necessarily retain the original ordering.
This is more efficient, but it directly affects which record NODUPKEY keeps.
Sorting Duplicates
    proc sort data=claims out=no_duplicates nodupkey;
      by claim;

    proc sort data=claims out=no_duplicates dupout=dups nodupkey;
      by claim;
NODUPKEY effectively keeps the first record of any duplicates.
DUPOUT= writes the duplicates to a separate table.
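To see how NODUPKEY and DUPOUT= split the records, here is a minimal sketch with a made-up five-row CLAIMS table (the data values are hypothetical). It relies on the default behaviour of keeping the original order within a BY group.

```sas
/* Hypothetical sample data: claims 101 and 102 each appear twice. */
data work.claims;
  input claim client $;
  datalines;
101 A
102 B
101 C
103 D
102 E
;
run;

/* First record of each claim goes to OUT=, the rest to DUPOUT=. */
proc sort data=work.claims out=work.no_duplicates
          dupout=work.dups nodupkey;
  by claim;
run;
```

With this input, no_duplicates should hold clients A, B and D (one row per claim), and dups should hold clients C and E.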
Separating Unique & Duplicate Rows
    proc sort data=claims out=sorted;
      by claim;
    run;

    data unique_claims dup_claims;
      set sorted;
      by claim;
      if first.claim and last.claim then output unique_claims;
      else output dup_claims;
    run;
It works, but needs an extra pass of the data.
Separating Unique & Duplicate Rows – the smarter way
    proc sort data=claims out=duplicates uniqueout=uniques nouniquekey;
      by claim;
    run;
NOUNIQUEKEY ensures no records with a unique key are written to the OUT= table…
…and the UNIQUEOUT= option directs the unique records to a separate table.
Sorting – Case Insensitive
    proc sort data=names out=simply_sorted;
      by name;

    data names2;
      set names;
      upcase_name = upcase(name);
    proc sort data=names2 out=upcase_sorted(keep=name);
      by upcase_name;
Upper case letters sort before lower case in the ASCII collating sequence.
Creating an upper (or lower) case copy of the variable is the old solution.
Sorting – Case Insensitive - Smarter
    proc sort data=names out=linguistic_sorted sortseq=linguistic;
      by name;
The SORTSEQ= option either specifies the collating sequence (ASCII, EBCDIC or other languages) or, with LINGUISTIC, modifies the current collating sequence.
The effect is to make the sort case insensitive.
Sorting – Case Insensitive – with SQL
    proc sql;
      create table sql_sorted as
        select * from names
        order by upcase(name);
PROC SQL allows the use of functions in the ORDER BY (and other) clauses.
The result is different from PROC SORT using SORTSEQ=LINGUISTIC.
Sorting Out Spaces
    proc sort data=names out=simply_sorted;
      by name;

    data names_temp;
      set names;
      temp_name = upcase(compress(name));
    run;
    proc sort data=names_temp out=temp_sorted(keep=name);
      by temp_name;
A standard sort is obviously no use.
Creating another variable for sorting, without spaces, is the old solution.
Sorting Out Spaces
Proc SQL can do it too.
    proc sql;
      create table sql_sorted as
        select * from names
        order by upcase(compress(name));
Proc SORT can too! The ALTERNATE_HANDLING=SHIFTED sub-option of the LINGUISTIC SORTSEQ option effectively ignores spaces as well as being case-insensitive.
    proc sort data=names out=alt_handling_sorted
              sortseq = linguistic(alternate_handling = shifted);
      by name;
Sorting by Numbers
    proc sort data=students out=simply_sorted;
      by student;
Sorting text with numeric prefixes e.g. student id and name …
… results in nothing useful!
Sorting by Numbers
An extra data step can create a numeric variable to sort with (as can SQL of course)
    data students_temp;
      set students;
      student_num = input(scan(student,1), 2.);
    run;
    proc sort data=students_temp out=temp_sorted(keep=student);
      by student_num;

    proc sql;
      create table sql_sorted as
        select * from students
        order by input(scan(student,1), 2.);
Sorting by Numbers
The NUMERIC_COLLATION sub-option of the LINGUISTIC SORTSEQ option sorts by the numeric values that prefix the variable values.
    proc sort data=students out=num_collation_sorted
              sortseq = linguistic(numeric_collation=on);
      by student;
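The difference is easiest to see side by side. This sketch assumes a three-row STUDENTS table with made-up values; only the table and variable names come from the slides.

```sas
/* Hypothetical sample data: student id followed by name. */
data work.students;
  input student $20.;
  datalines;
10 Smith
2 Jones
1 Brown
;
run;

/* Plain character sort: compares text one character at a time. */
proc sort data=work.students out=work.simply_sorted;
  by student;
run;

/* Linguistic sort that treats leading digits as numbers. */
proc sort data=work.students out=work.num_collation_sorted
          sortseq=linguistic(numeric_collation=on);
  by student;
run;
```

The plain sort should place "10 Smith" ahead of "2 Jones", because it compares character by character, while the NUMERIC_COLLATION sort should give 1, 2, 10.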
Questions? Did you learn something new from this presentation?