kirk lafker

Upload: sowajnya-sowji

Post on 05-Apr-2018

229 views

Category:

Documents


0 download

TRANSCRIPT

  • 7/31/2019 Kirk Lafker

    1/44

    1

    a presentation by

    Kirk Paul Lafler

    SAS

    Consultant, Author, and TrainerE-mail: [email protected]

  • 7/31/2019 Kirk Lafker

    2/44

    Copyright 1992-2010 by Kirk Paul Lafler 2

    Copyright Kirk Paul Lafler, 1992-2010.

    All rights reserved.

    SAS is the registered trademark of SAS Institute Inc., Cary, NC, USA.

    SAS Certified Professional is the trademark of SAS Institute Inc., Cary,NC, USA.

    All other company and product names mentioned are used foridentification purposes only and may be trademarks of theirrespective owners.

  • 7/31/2019 Kirk Lafker

    3/44

    Copyright 1992-2010 by Kirk Paul Lafler 3

    Objectives

    What is data federation? Characteristics associated with data federation

    What is a join?

    Why join?

    SASDATA step merge versus a Join

    Cartesian product joins

    Two table joins

    Table aliases to reference tables

    Three table joins

    Left and Right outer joins

    What happens during a join?

    Available join algorithms

  • 7/31/2019 Kirk Lafker

    4/44

    Copyright 1992-2010 by Kirk Paul Lafler 4

    Tables Used in Examples

    Movies

    Actors

  • 7/31/2019 Kirk Lafker

    5/44

    Copyright 1992-2010 by Kirk Paul Lafler

    What is Data Federation?

    Process of integrating data from many different sources Leaves data in place without using resources to copy data

    Makes access to data sources as easy as possible

    Provides a degree of reusability

    A data federation approach often replaces a data warehouse

  • 7/31/2019 Kirk Lafker

    6/44

    Copyright 1992-2010 by Kirk Paul Lafler

    Data Federation Characteristics

    It represents an integration approach Ability to aggregate data from many different sources

    Data sources can be in any location

    It can be implemented within any lifecycle methodology

    Provides flexibility It is user-centric

    Federated data does not copy and store data like a data

    warehouse

    It contains metadata (information about the actual data andits location)

    Is NOT always designed with optimality in mind

  • 7/31/2019 Kirk Lafker

    7/44Copyright 1992-2010 by Kirk Paul Lafler 7

    What is a Join?

    Table One Table Two

    Visually, it would look something like this:

    Process of combining tables side-by-side (horizontally) Consists of a matching process between rows in tables

    Some or all of the tables contents are brought together

    Gather and manipulate data from across tables

    . . .

  • 7/31/2019 Kirk Lafker

    8/44Copyright 1992-2010 by Kirk Paul Lafler 8

    Why Join?

    Data in a database is often stored in separate tables Joins allow data to be combined as if it were stored in one

    huge file

    Provide exciting insights into data relationships

    Types of joins: Inner joins a maximum of 256 tables can be joined

    Outer joins a maximum of 2 tables can be joined

  • 7/31/2019 Kirk Lafker

    9/44Copyright 1992-2010 by Kirk Paul Lafler 9

    DATA Step Merge versus a Join

    Merges process data differently than a standard join The merge process overlays the duplicate by-column

    Joins adhere to ANSI guidelines

    Joins do not automatically overlay the duplicate matching

    column

  • 7/31/2019 Kirk Lafker

    10/44Copyright 1992-2010 by Kirk Paul Lafler 10

    DATA Step Merge Process

    DATA merged;MERGE customers (IN=c)

    movies (IN=m);

    BY cust_no;IF c AND m;

    RUN;

    Customers Movies

    Cust_no Name

    3 Ryan

    5 Anna-liese

    10 Ronnie

    Cust_no Category

    3 Adventure

    5 Comedy

    7 Suspense

    Merged

    Cust_no Name Category

    3 Ryan Adventure5 Anna-liese Comedy

    ==X

  • 7/31/2019 Kirk Lafker

    11/44Copyright 1992-2010 by Kirk Paul Lafler 11

    Join Process

    PROC SQL;SELECT *FROM customers,

    movies

    WHEREcustomers.cust_no =movies.cust_no;

    QUIT;

    Customers Movies

    Cust_no Name

    3 Ryan

    5 Anna-liese

    10 Ronnie

    Cust_no Name Cust_no Category3 Ryan 3 Adventure

    5 Anna-liese 5 Comedy

    ==X

    Cust_no Category

    3 Adventure

    5 Comedy

    7 Suspense

  • 7/31/2019 Kirk Lafker

    12/44Copyright 1992-2010 by Kirk Paul Lafler 12

    Merge versus Join Results

    Cust_no Name Cust_no Category

    3 Ryan 3 Adventure

    5 Anna-liese 5 Comedy

    Merge

    Cust_no Name Category

    3 Ryan Adventure

    5 Anna-liese Comedy

    Features

    1. Data must be sorted using by-value.

    2. Requires variable names to be same.

    3. Duplicate matching column is overlaid.

    4. Results are not automatically printed.

    Features

    1. Data does not have to be sorted using by-value.

    2. Does not require variable names to be same.

    3. Duplicate matching column is not overlaid.

    4. Results are automatically printed unless

    NOPRINT option is specified.

    Versus

    Join

  • 7/31/2019 Kirk Lafker

    13/44Copyright 1992-2010 by Kirk Paul Lafler 13

    Cartesian Product Join (Cross Join)

    PROC SQL;SELECT *

    FROM customers,movies;QUIT;

    Cust_no Name Cust_no Movie_no Category

    3 Ryan 3 1011 Adventure

    3 Ryan 5 3090 Comedy

    3 Ryan 7 4456 Suspense

    5 Anna-liese 3 1011 Adventure

    5 Anna-liese 5 3090 Comedy5 Anna-liese 7 4456 Suspense

    10 Ronnie 3 1011 Adventure

    10 Ronnie 5 3090 Comedy

    10 Ronnie 7 4456 Suspense

    Result represents allpossible combinationsof rows and columns

    Absence of WHEREclause produces aCartesian product

    No WHEREor Key Used

    Customers Movies

    Cust_no Name

    3 Ryan

    5 Anna-liese

    10 Ronnie

    Cust_no Category

    3 Adventure

    5 Comedy

    7 Suspense

  • 7/31/2019 Kirk Lafker

    14/44Copyright 1992-2010 by Kirk Paul Lafler 14

    ExampleCartesian Product

    A Cartesian product join, sometimes referred to as across join, can be very large because it represents allthe possible combinations of rows and columns.

    PROC SQL;

    SELECT *

    FROM MOVIES,

    ACTORS;

    QUIT;

  • 7/31/2019 Kirk Lafker

    15/44Copyright 1992-2010 by Kirk Paul Lafler 15

    The result of anEqui-joinis illustrated by the shadedarea (AB) in the Venn diagram.

    A BAB

    Equi-Join with Two Tables

  • 7/31/2019 Kirk Lafker

    16/44Copyright 1992-2010 by Kirk Paul Lafler 16

    Equi-Join with Two Tables

    =

    EquiJoin

    PROC SQL;SELECT *FROM customers,

    moviesWHEREcustomers.cust_no =movies.cust_no;

    QUIT;

    Cust_no Name Cust_no Movie_no Category

    3 Ryan 3 1011 Adventure5 Anna-liese 5 3090 Comedy

    Customers

    Cust_no Name

    3 Ryan

    5 Anna-liese

    10 Ronnie

    Movies

    Cust_no Category

    3 Adventure

    5 Comedy

    7 Suspense

  • 7/31/2019 Kirk Lafker

    17/44Copyright 1992-2010 by Kirk Paul Lafler 17

    ExampleWHERE-clause

    The most reliable way to join two tables together,and to avoid creating a Cartesian product, is to useaWHEREclause with common columns or keys.

  • 7/31/2019 Kirk Lafker

    18/44Copyright 1992-2010 by Kirk Paul Lafler 18

    Table Aliases

    PROC SQL;SELECT *FROM customers c,

    movies mWHERE c.cust_no =

    m.cust_no;QUIT;

    Cust_no Name Cust_no Movie_no Category

    3 Ryan 3 1011 Adventure

    5 Anna-liese 5 3090 Comedy

    Aliases

    =

    EquiJoin

    Customers

    Cust_no Name

    3 Ryan

    5 Anna-liese

    10 Ronnie

    Movies

    Cust_no Category

    3 Adventure

    5 Comedy

    7 Suspense

  • 7/31/2019 Kirk Lafker

    19/44Copyright 1992-2010 by Kirk Paul Lafler 19

    ExampleTable Alias

    Assigning atable aliasis a not only a useful way toreference a table, but can reduce the number ofkeystrokes typed.

  • 7/31/2019 Kirk Lafker

    20/44Copyright 1992-2010 by Kirk Paul Lafler 20

    Joining Three Tables

    Customers Movies

    Cust_no

    Name

    Cust_no

    Movie_no

    Category

    =

    EquiJoin

    PROC SQL;SELECT c.cust_no, c.name,m.movie_no, c.category,a.lead_actor

    FROM customers c,

    movies m,actors aWHERE

    c.cust_no = m.cust_no ANDm.movie_no = a.movie_no;

    QUIT;

    Cust_no Name Movie_no Category Lead_actor

    3 Ryan 1011 Adventure Mel Gibson

    Actors

    Movie_no

    Lead_actor=

    EquiJoin

  • 7/31/2019 Kirk Lafker

    21/44Copyright 1992-2010 by Kirk Paul Lafler 21

    The result of aLeft Outer joinis illustrated by theshaded areas (A and AB) in the Venn diagram.

    A BAB

    Left Outer Joins

  • 7/31/2019 Kirk Lafker

    22/44

    Copyright 1992-2010 by Kirk Paul Lafler 22

    ExampleLeft Outer Join

    The result of aLeft Outer joinproduces bothmatched rows from both tables plus any unmatchedrows from the left table.

  • 7/31/2019 Kirk Lafker

    23/44

    Copyright 1992-2010 by Kirk Paul Lafler 23

    The results of aRight Outer joinis illustrated by theshaded areas (B and AB) in the Venn diagram.

    A BAB

    Right Outer Joins

  • 7/31/2019 Kirk Lafker

    24/44

    Copyright 1992-2010 by Kirk Paul Lafler 24

    ExampleRight Outer Join

    The result of aRight Outer joinproduces matchedrows from both tables while preserving unmatchedrows from the right table.

  • 7/31/2019 Kirk Lafker

    25/44

    Copyright 1992-2010 by Kirk Paul Lafler 25

    What Happens during a Join?

    When joining two tables:An intermediate Cartesian product is built from the two tables

    Rows are selected that match the WHERE clause, if present

    When joining more than two tables: SQL query optimizer evaluates the available methods for retrieving

    the data and attempts to use the most efficient method

    The join is reconstructed into several two-way joins

    Removes unwanted rows and columns from the intermediate tables

    Determines the order of processing to reduce the size of the

    intermediate Cartesian product

  • 7/31/2019 Kirk Lafker

    26/44

    Copyright 1992-2010 by Kirk Paul Lafler 26

    Join Algorithms

    Users supply the list of tables for joining along with the joinconditions, and the PROC SQL optimizer determines whichjoin algorithm to use for performing the join. The algorithmsinclude:

  • 7/31/2019 Kirk Lafker

    27/44

    Copyright 1992-2010 by Kirk Paul Lafler 27

    Join Algorithms

    Users supply the list of tables for joining along with the joinconditions, and the PROC SQL optimizer determines which ofthe join algorithms to use for performing the join. Thealgorithms include:

    Nested Loop Join (brute-force join) When an equalitycondition is not specified, a read of the complete contents of theright table is processed for each row in the left table.

  • 7/31/2019 Kirk Lafker

    28/44

    Copyright 1992-2010 by Kirk Paul Lafler 28

    Nested Loop Join - Features

    Used with join relations of two tables One or both of the tables is relatively small

    I/O intensive

    This join generally performs fairly well with smaller tables,

    but generally performs poorly with larger join relations

  • 7/31/2019 Kirk Lafker

    29/44

    Copyright 1992-2010 by Kirk Paul Lafler 29

    Join Algorithms

    Users supply the list of tables for joining along with the joinconditions, and the PROC SQL optimizer determines which ofthe join algorithms to use for performing the join. Thealgorithms include:

    Nested Loop Join When an equality condition is not specified, a

    read of the complete contents of the right table is processed foreach row in the left table.

    Sort-Merge Join When the specified tables are already in thedesired sort order, resources are not expended for resorting.

  • 7/31/2019 Kirk Lafker

    30/44

    Copyright 1992-2010 by Kirk Paul Lafler 30

    Sort-Merge Join - Features

    Used with joins of two tables Works best when one or both of the join relations are in the

    desired order

    One or both of the tables are of moderate size

    If the optimizer determines a sort is not needed no sort will

    be performed, otherwise sort resources are expended:

    - using an explicit sort operation

    - by taking advantage of pre-existing ordering

    Generally performs well, particularly when the majority of

    the rows are being joined

  • 7/31/2019 Kirk Lafker

    31/44

    Copyright 1992-2010 by Kirk Paul Lafler 31

    Users supply the list of tables for joining along with the joinconditions, and the PROC SQL optimizer determines which ofthe join algorithms to use for performing the join. Thealgorithms include:

    Nested Loop Join When an equality condition is not specified, a

    read of the complete contents of the right table is processed foreach row in the left table.

    Sort-Merge Join When the specified tables are already in thedesired sort order, resources are not expended for resorting.

    Indexed Join When an index exists on >=1 variable(s) torepresent a key, matching rows may be accessed using the index.

    Join Algorithms

  • 7/31/2019 Kirk Lafker

    32/44

    Copyright 1992-2010 by Kirk Paul Lafler 32

    Indexed Join - Features

    Used with joins of two tables An index must be defined that produces a small subset of

    the total number of rows in a table

    Matching rows are accessed directly using the index

    One or both of the tables are of moderate to large size

    Generally performs well, particularly when a small number of

    rows are being joined

  • 7/31/2019 Kirk Lafker

    33/44

    Copyright 1992-2010 by Kirk Paul Lafler 33

    Join Algorithms

    Users supply the list of tables for joining along with the joinconditions, and the PROC SQL optimizer determines which ofthe join algorithms to use for performing the join. Thealgorithms include:

    Nested Loop Join When an equality condition is not specified, a

    read of the complete contents of the right table is processed foreach row in the left table.

    Sort-Merge Join When the specified tables are already in thedesired sort order, resources are not expended for resorting.

    Indexed Join When an index exists on >=1 variable(s) torepresent a key, matching rows may be accessed using the index.

    Hash Join When an equality relationship exists and thesmaller table is able to fit in memory, set-matching operationsgenerally perform well.

  • 7/31/2019 Kirk Lafker

    34/44

    Copyright 1992-2010 by Kirk Paul Lafler 34

    Hash Join - Features

    Used with joins of two tables The SQL optimizer attempts to estimate the amount memory

    required to build a hash table in memory

    A hash table data structure associates keys with values

    The optimizer tries to use the smaller of the tables as the

    hash table

    Its purpose is to perform more efficient lookups

    Requires an equi-join predicate

    This algorithm generally performs well with small to medium

    join relations

  • 7/31/2019 Kirk Lafker

    35/44

    Copyright 1992-2010 by Kirk Paul Lafler 35

    Options MSGLEVEL=IBy specifying the MSGLEVEL=I option, helpfulnotes describing index usage, sort utilities, andmerge processing are displayed on the SAS Log.

    OPTIONS MSGLEVEL=I;

    PROC SQL;SELECT MOVIES.TITLE, RATING, LENGTH, ACTOR_LEADING

    FROM MOVIES,

    ACTORS

    WHERE MOVIES.TITLE = ACTORS.TITLE AND

    RATING = PG;QUIT;

  • 7/31/2019 Kirk Lafker

    36/44

    Copyright 1992-2010 by Kirk Paul Lafler 36

    SAS Log ResultsOPTIONS MSGLEVEL=I;PROC SQL;

    SELECT MOVIES.TITLE, RATING, LENGTH, ACTOR_LEADING

    FROM MOVIES,

    ACTORS

    WHERE MOVIES.TITLE = ACTORS.TITLE ANDRATING = 'PG';

    INFO: Index Rating selected for WHERE clause optimization.

    QUIT;

    MSGLEVEL=I Log

  • 7/31/2019 Kirk Lafker

    37/44

    Copyright 1992-2010 by Kirk Paul Lafler 37

    _METHOD Option

    A_METHOD option can be specified on the PROCSQL statement to display the hierarchy of processingthat takes place. Results are displayed on the Log.

    Codes Description

    sqxcrta Create table as Selectsqxslct Select

    sqxjsl Step loop join (Cartesian)

    sqxjm Merge join

    sqxjndx Index join

    sqxjhsh Hash join

    sqxsort Sort

    sqxsrc Source rows from table

    sqxfil Filter rows

    sqxsumg Summary statistics with GROUP BY

    sqxsumn Summary statistics with no GROUP BY

  • 7/31/2019 Kirk Lafker

    38/44

    Copyright 1992-2010 by Kirk Paul Lafler 38

    Program Example using _METHOD

    SAS Log Results

    NOTE: SQL execution methods chosen are:

    sqxslctsqxjhsh

    sqxsrc( MOVIES )

    sqxsrc( ACTORS )

  • 7/31/2019 Kirk Lafker

    39/44

    Copyright 1992-2010 by Kirk Paul Lafler 39

    Conclusion

    Data federation characteristics A merge and join do not process data the same way

    A join combines tables side-by-side (horizontally)

    Joins adhere to ANSI guidelines

    Cartesian product represents all possiblecombinations of rows from the underlying tables

    256 tables can be joined using an inner join construct

    2 tables can be joined using an outer join construct

    Four types of join algorithms: Nested loop join

    Sort-Merge join

    Indexed join

    Hash join

  • 7/31/2019 Kirk Lafker

    40/44

    Copyright 1992-2010 by Kirk Paul Lafler 40

    PROC SQL

    Beyond the BasicsUsing SAS

    Kirk Paul Lafler

    PROC SQL

    Examples

    Book

    Coming Winter 2004!Available at www.sas.com!

  • 7/31/2019 Kirk Lafker

    41/44

    Copyright 1992-2010 by Kirk Paul Lafler 41

    Coming Winter 2004!Coming in September 2010!

    A Book of

    Data Transfer

    Methods withExamples

    SASand Excel

    Transferring Data betweenSAS and Excel Using SAS

    Kirk Paul Lafler

    William E. Benjamin, Jr.

  • 7/31/2019 Kirk Lafker

    42/44

    Questions ?Joining

    Worlds of Datais exciting and fun!

    Kirk Paul Lafler

    SAS Consultant, Author, and Trainer

    Software Intelligence Corporation

    E-mail: [email protected]

  • 7/31/2019 Kirk Lafker

    43/44

    We kindly thank the WUSS Leadership and the following organizations for their

    assistance and support for this symposium:

    BASAS

    http://www.tmcsoftware.com/index.htmlhttp://www.genomichealth.com/Index.aspxhttp://www.meta-x.com/index.htmlhttp://www.advancedclinical.com/default.aspxhttp://www.sas.com/
  • 7/31/2019 Kirk Lafker

    44/44

    Mark yourcalendars for

    WUSS 2010!Three days ofclasses, workshops andpresentations introducing the latest in

    SAS technology and applications.

    November 3-5

    Hyatt Regency Mission Bay

    San Diego, CA

    More information will be