Introduction to Data Warehousing
Randy Grenier
Rev. 11 November 2014

Page 1: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Introduction to Data Warehousing

Randy Grenier
Rev. 11 November 2014

Page 2: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

First of all, what is data warehousing? The methods and architectures used to collect, integrate, transform, and store operational data so that it can be used for analysis and reporting.

[Diagram: banking, manufacturing, and healthcare data are each integrated and transformed into a data warehouse that supports analysis.]

Page 3: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

• History

• Paradigm Shift

• Architecture

• Emerging Technologies

• Questions

Contents

Page 4: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

A brief history of the following will provide a better understanding of how data warehousing came about:

• How data is stored

• How data is accessed

• Transaction vs. analytical processing

History

Page 5: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Once upon a time…

• Data and programming were stored on Hollerith cards.

• A card contained one record of data or one line of programming code.

• Maximum length of the record or line of code was 80 characters.
  » For data, there could be multiple record types.
  » For programs, statements > 80 characters had to be split.

• Final deck of cards contained programming, data, and job control instructions.

History: Hollerith Cards

Page 6: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

• Because of the 80-character limitation of cards, multiple record types were often necessary.

History: Hollerith Cards

RecType 01: Patient Name
RecType  PatientID  LastName  FirstName
01       100001     Doe       Jane

RecType 02: Patient Demographics
RecType  PatientID  DOB        Sex  Race
02       100001     4/22/1975  F    B

RecType 03: Patient Insurance
RecType  PatientID  Employer    Insurance
03       100001     Acme, Inc.  BCBSMA

Page 7: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

• Job Control Language (JCL) provided job-specific instructions to the computer.

History: Hollerith Cards

[Diagram: a card deck read from bottom to top: a JCL job card (Program ID 122249, Programmer 22488, Department 44), a JCL card naming the compiler language (COBOL) and marking the program start, the program cards, a JCL card marking the end of the program and the start of the data, the data cards, and a final JCL card marking data end, compile, and run.]

Page 8: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

History: Hollerith Cards

1. Punch information onto cards using keypunch machines.

2. Load the cards with a card reader.

3. Process the data in computer memory.

4. Create reports.

Page 9: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

• Program statements acted on single records—not on data sets.
  › Loop through records; for each record:
    » Read data elements into memory variables.
    » Add to counter variables.
    » Add to sum variables.
    » Apply conditional logic (IF… THEN… ELSE…).
  › End loop.
  › Format output.
  › Print the report one line at a time.

• Transactions (changes to data) were implemented by simply adding, removing or replacing cards.

History: Hollerith Cards

Page 10: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

History: Hollerith Cards

A section of a COBOL program

Page 11: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Wikipedia Factoids

• Card type: IBM 80-column punched card

• A/K/A: “Punched Card”, “IBM Card”

• Size: 7 3⁄8 by 3 1⁄4 inches

• Thickness: .007 inches (143 cards per inch)

• Capacity: 80 columns with 12 punch locations each

History: Hollerith Cards

Page 12: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

• Hollerith cards were eventually replaced by magnetic tape.

• Tapes made data storage more efficient and more reliable.

• Records were stored sequentially, so access could be very slow.

• Data processing was similar to that of cards—one record at a time.

History: Magnetic Tape

Page 13: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

History: Magnetic Tape

• Transaction processing (changes to data) became more complicated.

[Diagram: create a transactions file, then apply the transactions to the old file to produce a new file.]

Page 14: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

• The arrival of disk storage revolutionized data storage and access.

• Instead of having to load data for each process, data was always available online.

• Data was available to multiple users at any given time.

• This “home base” for data became known as a database.

• Direct access replaced sequential access so data could be accessed more quickly.

History: Disk Storage

Flowchart symbol for disk storage

Page 15: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

• Online storage required new methods for processing transactions. These became known as Online Transaction Processing (OLTP).

• Reporting from online data became known as Online Analytical Processing (OLAP).

• Programming reports was the same as before—one record at a time.

History: Disk Storage

Flowchart symbol for disk storage

Page 16: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

History: Disk Storage

• Data was now online in a database.

• Online databases became available to multiple users connected to a mainframe computer.

• Computer terminals became the most common user interface. They had no CPU or memory—the mainframe did all processing.

• Disk storage required new methods to modify data online (OLTP).

Page 17: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Wikipedia Factoids

• Storage Device: IBM 350 disk storage unit

• Released: 1956

• Capacity: 5 million 6-bit characters (3.75 megabytes)

• Disk spin speed: 1200 RPM

• Data transfer rate: 8,800 characters per second.

History: Disk Storage

Page 18: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

• In the late 1960s, E.F. Codd developed the relational model, which he published in 1970.

• Relational modeling was based on a branch of mathematics called set theory and added rigor to the organization and management of data.

• The relational model also introduced primary keys, foreign keys, referential integrity, constraints, relational algebra, selection, joins, unions, difference, intersection, and a number of other concepts used in modern database systems.

• Subsequent development of relational database theory was done by Raymond Boyce and C.J. Date.

• C.J. Date’s book An Introduction to Database Systems (ISBN 0-321-19784-4) is used by colleges and universities to teach relational database theory.

History: Relational Model

Page 19: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

History: Relational Model

Non-relational (“denormalized”)

RxID PatientID NDC StartDate DaysSupply DOB Gender

999111 1000001 00005312131 4/1/2013 30 4/1/1990 M

999122 1000001 00006057062 4/1/2013 30 4/1/1990 M

999133 1000003 10106364001 7/2/2014 90 2/15/1982 F

999144 1000003 23490574303 7/2/2014 90 2/15/1982 F

999145 1000003 42549055390 7/2/2014 30 2/15/1982 F

Rx (prescription dispensings)

PxID PatientID CPT PxDate PhysicianID DOB Gender

999111 1000001 64632 1/1/2013 18222 4/1/1990 M

999122 1000001 64633 9/15/2013 94024 4/1/1990 M

999133 1000003 29800 5/4/2014 33445 2/15/1982 F

999144 1000003 64635 5/4/2014 33445 2/15/1982 F

999145 1000003 28515 5/18/2014 72488 2/15/1982 F

Px (procedures)

Page 20: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

History: Relational Model

PatientID DOB Gender

1000001 4/1/1990 M

1000002 7/22/1975 F

1000003 2/15/1982 F

Relational: (“normalized”)

Patient

RxID PatientID NDC StartDate DaysSupply

999111 1000001 00005312131 4/1/2013 30

999122 1000001 00006057062 4/1/2013 30

999133 1000003 10106364001 7/2/2014 90

999144 1000003 23490574303 7/2/2014 90

999145 1000003 42549055390 7/2/2014 30

Rx

Px

PxID PatientID CPT PxDate PhysicianID

999111 1000001 64632 1/1/2013 18222

999122 1000001 64633 9/15/2013 94024

999133 1000003 29800 5/4/2014 33445

999144 1000003 64635 5/4/2014 33445

999145 1000003 28515 5/18/2014 72488

• DOB and Gender are attributes of Patient—not of Rx or Px.

• Edits to patient attributes can now be done in one place—OLTP is simplified.
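To make the normalization point concrete, here is a minimal SQL sketch (a hedged illustration using the table and column names from the example above, not code from the original slides): the denormalized Rx view can be rebuilt with a join, and a patient attribute is corrected in exactly one place.

-- Rebuild the denormalized Rx view by joining the normalized tables
SELECT r.RxID, r.PatientID, r.NDC, r.StartDate, r.DaysSupply,
       p.DOB, p.Gender
FROM   Rx r
       JOIN Patient p ON p.PatientID = r.PatientID;

-- Correcting a DOB is now a single-row update (date literal syntax varies by RDBMS)
UPDATE Patient
SET    DOB = DATE '1990-04-02'
WHERE  PatientID = 1000001;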

Page 21: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

• Relational modeling facilitated OLTP by making these processes more efficient and reducing data anomalies.

• The relational model was not always optimal for OLAP.

• Data was stored as relational, and non-relational extracts were created to support OLAP.

History: Relational Model

[Diagram: a relational OLTP source feeds a non-relational OLAP extract, which feeds a report.]

Page 22: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Wikipedia Factoids

• E.F. Codd got his undergraduate degree in mathematics from Oxford. He received his doctorate in computer science from University of Michigan.

• Angered by McCarthyism in the U.S. during the 1950’s, Codd moved to Canada for several years.

• Although E.F. Codd was employed by IBM when he created the relational model, IBM did not commercialize relational databases because it would have competed with another of their database products.

• The first commercial implementation of relational database and SQL was from Relational Software, Inc. which is now Oracle Corporation.

History: Relational Model

Page 23: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

• Structured Query Language (SQL) was the first language created to support relational database operations for both OLAP and OLTP.

• SQL could operate on sets of records instead of just one record at a time.

• IBM originally called the language SEQUEL, but it was renamed SQL because the name SEQUEL was already a trademark held by another company.

• SQL has been standardized by the American National Standards Institute (ANSI) and the International Organization for Standardization (ISO).
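As a small illustration of the set-at-a-time point above (a sketch using the Rx table from the earlier example, not code from the slides), one SQL statement does work that card- and tape-era programs performed with an explicit record-by-record loop:

-- Count dispensings per patient across the entire table in one statement
SELECT PatientID, COUNT(*) AS RxCount
FROM   Rx
GROUP  BY PatientID;

-- Remove every dispensing older than a cutoff date in one set-based statement
DELETE FROM Rx
WHERE  StartDate < DATE '2010-01-01';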

History: SQL

Page 24: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

• In early relational databases, data was extracted from OLTP systems into denormalized extracts for reporting.

History: Extracts

[Diagram: an OLTP source feeds an OLAP extract, which feeds a report.]

Page 25: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

• And more extracts...

History: Extracts

[Diagram: multiple OLTP sources, each feeding its own OLAP extract, with extracts feeding reports and further extracts.]

Page 26: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

• And more extracts...

History: Extracts

Page 27: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

• And more extracts...

History: Extracts

Page 28: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

• Naturally evolving systems began to emerge.

History: Extracts

Page 29: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

• Naturally evolving systems resulted in:
  › Poor organization of data
  › Extremely complicated processing requirements
  › Inconsistencies in extract refresh status
  › Inconsistent report results

• This created a need for architected systems for analysis and reporting.

• Instead of multiple extract files, a single source of truth was needed for each data source.

History: Extracts

Page 30: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

• Developers began to design architected systems for OLAP data.

• Over time methods and techniques for architected systems began to evolve, and best practices began to emerge.

• In the 1980s, organizations began to integrate data from all of their databases (e.g., accounts receivable, accounts payable, HR, inventory, etc.). These integrated OLAP databases became known as Enterprise Data Warehouses (EDWs).

• The term data warehousing came to be used for the methods and architectures used to build architected OLAP databases.

History: Architected Systems

Page 31: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Example of an Architected Data Warehouse

History: Architected Systems

[Diagram: example architected data warehouse. OLTP sources feed a staging area and an ODS; staging feeds history, reference, and metadata tables; these feed data marts (DMs) and data sets that support reports and queries.]

Page 32: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Compare: Naturally evolving system

History: Architected Systems

Page 33: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Compare: Architected Data Warehouse

History: Architected Systems

[Diagram: the architected data warehouse shown earlier—OLTP sources, staging, ODS, history/reference/metadata, data marts, reports, and queries.]

Page 34: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

• In the early 1990s, W.H. Inmon published Building the Data Warehouse (ISBN-10: 0471141615).

• Inmon put together the quickly accumulating knowledge of data warehousing and popularized most of the terminology we use today:
  › Extract, Transform, and Load (ETL)
  › Transformation and Integration (T&I)
  › Operational data
  › History data
  › Snapshot
  › Source of Truth
  › Data Mart
  › Heuristic development

• W.H. Inmon created the first and most commonly accepted definition of a data warehouse: a subject-oriented, nonvolatile, integrated, time-variant collection of data in support of management's decisions.

• Inmon has subsequently published a more recent architecture called DW 2.0, but it is not yet as widely accepted as his earlier ideas.

History: Inmon

Page 35: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Wikipedia Factoids

• W.H. Inmon coined the term data warehouse.

• W.H. Inmon is recognized by many as the father of data warehousing.

• Other firsts of W.H. Inmon:
  › Wrote the first book on data warehousing
  › Wrote the first magazine column on data warehousing
  › Taught the first classes on data warehousing

History: Inmon

Page 36: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

• Also in the 1990’s, Ralph Kimball published The Data Warehouse Toolkit (ISBN-10: 0471153370) which popularized dimensional modeling.

• Dimensional modeling is based on the cube concept, which is a multi-dimensional view of data.

• The cube metaphor can only illustrate three dimensions; a dimensional model can have any number of dimensions.

History: Kimball

A cube used to represent multi-dimensional data

Page 37: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

• Kimball implemented cubes as star schemas which support querying data in multiple dimensions.

• A star schema consists of a fact table surrounded by dimension tables (like a star).

History: Kimball

[Diagram: a Sales fact table at the center, surrounded by Product, Location, Time, and Sales Person dimension tables.]

Page 38: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

History: Kimball

[Diagram: example star schema. Fact table Sales (Sale Date, NDC, Store Number, Sales Person ID, Unit Price) surrounded by dimensions Product (NDC, Product Name, Strength, etc.), Location (Store Number, Store Name, City, etc.), Time (Date, Quarter, FY, etc.), and Sales Person (ID, Last Name, First Name, etc.).]

• Kimball implemented cubes as star schemas which support querying data in multiple dimensions.

• A star schema consists of a fact table surrounded by dimension tables (like a star).

Page 39: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

History: Kimball

• The star schema structure simplified writing SQL.
• SQL code could easily be generated from GUI user interfaces.

Simple Star Join
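As a hedged sketch of the "simple star join" referred to above, using the example Sales schema from the surrounding slides (column names adapted slightly to be valid SQL identifiers): the fact table is joined to its dimension tables and the measures are aggregated.

-- Units sold and revenue by product and store for one fiscal year
SELECT p.ProductName,
       l.StoreName,
       COUNT(*)         AS UnitsSold,
       SUM(s.UnitPrice) AS Revenue
FROM   Sales s
       JOIN Product  p ON p.NDC          = s.NDC
       JOIN Location l ON l.StoreNumber  = s.StoreNumber
       JOIN TimeDim  t ON t.CalendarDate = s.SaleDate
WHERE  t.FY = 2014
GROUP  BY p.ProductName, l.StoreName;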

Page 40: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

• Kimball does not discuss the relational model in depth, but his dimensional model can be explained in relational terms (i.e. facts are 3NF while dimensions are 2NF).

• A star schema makes it easy to slice and dice data on multiple dimensions. Slice and dice examples:
  › Units sold by store
  › Units sold by date
  › Units sold by store by date
  › Units sold by date by store
  › Units sold by product by date by store
  › Etc.

• Slice and dice operations include (see the example queries after this list):
  › Drill down: access more detail or more granular data.
  › Roll up: summarize data at a less granular level.
  › Pivot: cross-tabulate data.

• Star schemas are frequently misunderstood and improperly implemented. Incorrectly designed star schemas result in skewed reporting.

• Most commercial GUI products for analyzing data (e.g. BI tools) utilize star schemas.

• The terms OLAP and CUBE are frequently misused in marketing materials to refer to products that utilize star schemas.
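To illustrate roll up and drill down against the same example star schema (names as in the earlier sketch; an illustration, not code from the slides), the only difference between the two queries is the set of grouping columns—which is why GUI tools can generate this SQL so easily:

-- Roll up: units sold by store
SELECT l.StoreName, COUNT(*) AS UnitsSold
FROM   Sales s
       JOIN Location l ON l.StoreNumber = s.StoreNumber
GROUP  BY l.StoreName;

-- Drill down: units sold by store by date
SELECT l.StoreName, s.SaleDate, COUNT(*) AS UnitsSold
FROM   Sales s
       JOIN Location l ON l.StoreNumber = s.StoreNumber
GROUP  BY l.StoreName, s.SaleDate;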

History: Kimball

Page 41: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Wikipedia Factoids

• Ralph Kimball had a Ph.D. in electrical engineering from Stanford University.

• Kimball worked at the Xerox Palo Alto Research Center (PARC). PARC is where laser printing, Ethernet, object-oriented programming, and graphic user interfaces (GUIs) were invented.

• Kimball was a principal designer of the Xerox Star Workstation which was the first personal computer to use a GUI, windows, icons, and mice.

History: Kimball

Page 42: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

• Rapidly increasing amounts of data in the 21st Century are surpassing the capabilities of relational databases.

• New methods of data storage and retrieval are rapidly emerging.

• Unstructured databases, sometimes referred to as NoSQL databases, support vast amounts of text data and extremely fast text searches.

• Unstructured databases utilize massively parallel processing (MPP) and extensive text indexing.

• Open source software such as Hadoop from Apache is widely used to manage extremely large unstructured databases.

• Unstructured databases are generally not useful for complicated transaction processing (OLTP) or complex informatics (OLAP). However, these databases are rapidly evolving to incorporate additional relational capabilities.

• Oracle, Microsoft, and other RDBMS vendors sell hybrid database systems that combine unstructured data with relational database systems.

History: Big Data

Page 43: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Wikipedia Factoids

• Big data became an issue as early as 1880 with the U.S. Census, which took several years to tabulate with then-existing methods.

• The term information explosion was first used in the Lawton Constitution, a small-town Oklahoma newspaper, in 1941.

• The first known use of the term big data was by NASA researchers Michael Cox and David Ellsworth, in a 1997 paper discussing the inability of existing systems to handle increasing amounts of data.

• 1 exabyte = 1000^6 bytes = 10^18 bytes = 1,000 petabytes = 1 billion gigabytes.

History: Big Data

Page 44: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

• OLTP vs. OLAP

• Paradigm Shift for Management

• Paradigm Shift for Database Administrators

• Paradigm Shift for Architects and Developers

• Paradigm Shift for Analysts and Data Users

Paradigm Shift

Page 45: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Paradigm Shift: OLTP vs. OLAP

OLTP (i.e., operational data) vs. OLAP (e.g., a data warehouse):

• OLTP: Data is modeled specifically for the application.
  OLAP: Data is taken from some other application.
• OLTP: All data elements for the application are present.
  OLAP: Desired data elements may not be present.
• OLTP: Each record is validated when submitted.
  OLAP: Entire data files must be validated upon receipt.
• OLTP: Data is almost always normalized (3NF).
  OLAP: Data may be normalized, denormalized, dimensional, cross-tabulated, or use other models.
• OLTP: Data is constantly updated.
  OLAP: Historic data does not change; new date ranges are added.
• OLTP: Typical operations are on small sets of records (e.g., add a record, update a record).
  OLAP: Typical operations are on large numbers of records (e.g., load a large data file, groupings and aggregations).
• OLTP: All transactions are logged.
  OLAP: Inserts may not be logged at the record level; there are normally no updates or deletes.
• OLTP: B-tree indexes are used for performance.
  OLAP: Partitioning and bitmap indexes are used for performance.
• OLTP: Traditional development life cycle.
  OLAP: Heuristic and agile development.
• OLTP: Date range is limited; old records are archived.
  OLAP: Date range can be many years.
• OLTP: Development and production are in separate databases.
  OLAP: All data is production; code is version-controlled.
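To make the contrast concrete, here is a hedged sketch of a typical statement on each side (table and column names are invented for the illustration): the OLTP statement touches one record through its key inside a logged transaction, while the OLAP statement scans and aggregates a large range of history.

-- OLTP: update a single record located by its primary key
UPDATE account
SET    balance = balance - 100
WHERE  account_id = 1000001;

-- OLAP: aggregate a year of history (potentially millions of rows)
SELECT store_number, SUM(unit_price) AS revenue
FROM   sales_history
WHERE  sale_date >= DATE '2013-01-01'
  AND  sale_date <  DATE '2014-01-01'
GROUP  BY store_number;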

Page 46: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Paradigm Shift for Management

• Traditional development life cycle doesn’t work well when building a data warehouse. There is a discovery process. Agile development works better.

• Data warehouses are created from data that was designed for some other purpose. It is important to evaluate data content before planning applications.

• Integrating data from multiple sources can be hampered by inconsistencies:
  › Different code values
  › Different columns
  › Different meanings of column names
  › Variations in how well columns are populated
  › Other inconsistencies

• OLAP data tends to be much larger, requiring more resources.

• Storage, storage, storage…

Paradigm Shift

Page 47: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Paradigm Shift for Database Administrators

• Different system configurations (in Oracle, different initialization parameters)

• Backup frequency may be based on ETL scheduling rather than transaction volume.

• Transaction log archiving may not be necessary since there are no transactions—just processes on large amounts of data. Methods for recovery from failure may be different.

• Different tuning requirements:
  › Selects are high cardinality (large percentage of rows)
  › Massive sorting, grouping and aggregation
  › DML operations can involve thousands or millions of records

• Need much more temporary space for caching aggregations, sorts and temporary tables.

• May be required to add new partitions and archive old partitions for rolling windows of history.

• Storage, storage, storage…

Paradigm Shift

Page 48: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Paradigm Shift for Architects and Developers

• Different logical modeling and schema design.

• Extensive use of partitioning for history and other large tables

• Use indexes differently (e.g. bitmap rather than b-tree)

• Different tuning requirements:
  › Selects are high cardinality (large percentage of rows)
  › Lots of sorting, grouping and aggregation
  › DML operations can involve thousands or millions of records

• ETL processes are different than typical DML processes:
  › Use different coding techniques
  › Use packages, functions, and stored procedures but rarely use triggers or constraints
  › Many steps to a process
  › Integrate data from multiple sources

• Iterative and incremental development process (agile development)

Paradigm Shift

Page 49: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

For Analysts and Data Users—All Good News

• A custom schema (data mart) can be created for each application per the user requirements.

• Data marts can be permanent, temporary, generalized or project-specific.

• New data marts can be created quickly—typically in days instead of weeks or months.

• Data marts can easily be refreshed when new data is added to the data warehouse. Data mart refreshes can be scheduled or on demand.

• There may be additional query tools and dashboards available (e.g. Business Intelligence, Self-Service BI, data visualization, etc.).

• Several years of history can be maintained in a data warehouse—bigger samples.

• There is a consistent single source of truth for any given data set.

Paradigm Shift

Page 50: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Components of a Data Warehouse

Architecture: Main Components

[Diagram: main components. OLTP sources (operational data) feed the data warehouse through ETL/T&I; the warehouse contains staging, history, reference, metadata, and an ODS; a second ETL/T&I layer builds data marts and data sets (analytic data) that serve reports and queries.]

Page 51: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Staging and ODS

Architecture: Staging and ODS

[Diagram: the data warehouse architecture with the staging area and ODS highlighted; OLTP sources feed staging via ETL, and staging feeds the rest of the warehouse.]

• Staging is the area where operational data is initially loaded.

• Snapshots of operational data at a given point in time are loaded into staging.

• Data sources may include complete replacement data files, but are usually new records only.

• Validation reports should be run on staging data if it originated from a source external to the organization.

• An Operational Data Store (ODS) is an optional component that is used for Near-Real-Time reporting.

• Limited transformation and integration of data
• Less history (typically only days)
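A minimal sketch of the kind of validation report mentioned above, run against a hypothetical staging table stg_rx before the data moves on to history; the three checks (control count, missing keys, duplicates) are illustrative, not a complete validation suite.

-- Row count, for comparison against the source file's control totals
SELECT COUNT(*) AS row_count FROM stg_rx;

-- Records missing a required key
SELECT COUNT(*) AS missing_patient_id
FROM   stg_rx
WHERE  patient_id IS NULL;

-- Duplicate prescription IDs in the load
SELECT rx_id, COUNT(*) AS copies
FROM   stg_rx
GROUP  BY rx_id
HAVING COUNT(*) > 1;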


Page 52: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

History and Reference Data

Architecture: History

[Diagram: the data warehouse architecture with the history, reference, and metadata tables highlighted.]

• History includes all source data—no exclusions or integrity constraints.

• Data from multiple sources is integrated into the history tables.

• A data source column can be added to each table.

• Partitioning is used to:
  › manage extremely large tables
  › improve performance of queries
  › facilitate a "rolling window" of history

• Denormalization can be used to reduce the number of joins when selecting data from history.

• No surrogate keys—maintain all original code values in history.


Page 53: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

History and Reference Data

Architecture: History

[Diagram: the data warehouse architecture with the reference and metadata tables highlighted.]

• Reference data should also have history (e.g., codes that change over time).

• Metadata is used to "map" data into common fields when integrating from multiple sources.
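A small sketch of the metadata "mapping" idea above: a hypothetical reference table translates each source system's codes into a common value while data is integrated into history. The table and column names are invented for illustration.

-- gender_code_map: one row per source-system code value
--   source_system | source_code | common_code
--   'CLINIC_A'    | '1'         | 'M'
--   'CLINIC_A'    | '2'         | 'F'
--   'CLINIC_B'    | 'MALE'      | 'M'

SELECT s.patient_id,
       m.common_code AS gender
FROM   stg_patient s
       JOIN gender_code_map m
         ON  m.source_system = s.data_source
         AND m.source_code   = s.gender_code;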


Page 54: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Data Marts

Architecture: Data Marts

[Diagram: the data warehouse architecture with the data marts and data sets highlighted.]

• Data marts are per requirements of users and applications.

• Selection criteria (conditions in WHERE clause) are applied when creating data marts.

• Logical data modeling is applied here (e.g. denormalized, star schema, cross-tabulated, derived columns, etc.).

• Any surrogate keys can be applied at data mart level (e.g. patient IDs).

• Data marts can be on different platforms (e.g. Oracle, SQL Server, text files, SAS data sets, etc.)

• Data marts can be permanent for ongoing applications or temporary for one-time applications.

• Data mart refreshes can be scheduled or on demand.
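A hedged sketch of building a data mart from history with the selection criteria in the WHERE clause; CREATE TABLE ... AS SELECT is Oracle/ANSI-style syntax, and the study criteria, table names, and reference table are invented for illustration.

-- Project-specific data mart: one drug class over the study period,
-- joined to the patient attributes the analysts requested
CREATE TABLE dm_statin_study AS
SELECT h.rx_id, h.patient_id, h.ndc, h.start_date, h.days_supply,
       p.dob, p.gender
FROM   rx_history h
       JOIN patient_history p ON p.patient_id = h.patient_id
WHERE  h.start_date >= DATE '2013-01-01'
  AND  h.ndc IN (SELECT ndc FROM ref_ndc WHERE drug_class = 'STATIN');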


Page 55: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Emerging technologies that are having an impact on data warehousing

• Massively Parallel Processing (MPP)

• In-Memory Databases (IMDB)

• Column-Oriented Databases

• Database Appliances

• Advanced Access Tools

• Cloud Database Services

• Relational/Unstructured Hybrid Systems

Emerging Technologies

Page 56: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Massively Parallel Processing (MPP)

• Data is split up, or sharded, over many (up to thousands of) server nodes.

• A controller node manages query execution.

• A query is passed to all nodes simultaneously.

• Data is retrieved from all nodes and assembled to produce query results.

• MPP systems will automatically shard and distribute data using their own algorithms. Developers and architects need only be concerned with conventional data modeling and DML operations.

• MPP systems make sense for OLAP and data warehousing where queries are high cardinality (on very large numbers of records).

Emerging Technologies

Page 57: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Emerging Technologies

[Diagram: a query is sent to server nodes 1 through n (map); each node processes its shard of the data, and the partial results are combined (reduce) into the final result.]

Massively Parallel Processing (MPP)

Page 58: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

In-Memory Databases (IMDB)

• Data is stored in random access memory (RAM) rather than on disk or SSD.

• Memory is accessed much more quickly than disk.

• Although traditional RDBMS software utilizes memory caching, it is still optimized for storing and accessing data on disk.

• IMDB software uses algorithms optimized for reading data from memory.

• Database replication with failover is typically required because of the volatility of computer memory.

• Rapidly declining cost of RAM is making IMDB systems more feasible.

• Microsoft SQL Server has a feature called In-Memory OLTP; tables must be defined as memory-optimized to use this feature.

• Oracle supports in-memory computing with their Oracle Database In-Memory product.
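Hedged sketches of the vendor features mentioned above. The SQL Server example assumes a database that already has a MEMORY_OPTIMIZED_DATA filegroup (SQL Server 2014 or later); the Oracle statement assumes the Database In-Memory option is enabled. Table names are invented.

-- SQL Server: a memory-optimized table (requires at least one index)
CREATE TABLE dbo.SessionCache (
    SessionID INT           NOT NULL PRIMARY KEY NONCLUSTERED,
    UserName  NVARCHAR(100) NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);

-- Oracle: mark an existing table for the in-memory column store
ALTER TABLE sales INMEMORY;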

Emerging Technologies

Page 59: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Column-Oriented Databases

• Data in a typical relational database is organized by row. The row paradigm is used for physical storage as well as the logical organization of data.

• Column-oriented databases physically organize data by column while still being able to present data as rows.

• Since most queries select a subset of columns (rather than entire rows), column-oriented databases tend to perform much better for analytical processing.

• Both Microsoft SQL Server and Oracle 12c have support for column-based data storage.

• See http://nms.csail.mit.edu/~stavros/pubs/tutorial2009-column_stores.pdf.

Emerging Technologies

Page 60: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Row-Oriented Storage

Emerging Technologies

Column-Oriented Storage

SELECT study_id, COUNT(*)
FROM form_demog_data
WHERE dob > '01/01/1960'
GROUP BY study_id;

• In a row-oriented database, the entire row would be accessed.
• In a column-oriented database, only STUDY_ID and DOB would have to be accessed.

Column-Oriented Databases

(The example table has hundreds of columns.)

Page 61: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Database Appliances

• A database appliance is an integrated, preconfigured package of RDBMS software and hardware.

• The most common type of database appliance is a data warehouse appliance.

• Most major database vendors including Oracle and Microsoft and their hardware partners package and sell database appliances for data warehousing.

• Data warehouse appliances utilize massively parallel processing (MPP).

• Database appliances don’t always scale well outside of the purchased configuration. You generally don’t add storage to a database appliance.

• A database appliance removes much of the burden of performance tuning; on the other hand, database administrators have less flexibility.

• A database appliance can be a cost-effective solution for data warehousing in many situations.

Emerging Technologies

Page 62: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Advanced Access Tools

• Business Intelligence (BI) tools allow users to view and access data, create aggregations and summaries, create reports, and view dashboards with current data.

• BI tools are usually very good for slice and dice operations on star schemas.

• BI tools typically sit on top of data marts created by the architects and developers. Data marts that support BI are typically star schema.

• Newer Self-Service BI tools add additional capabilities such as allowing users to integrate multiple data sources and do further analysis on result data sets from previous analyses.

• Data visualization tools allow users to view data in various graphs.

• Newer tools allow users to access and analyze data from multiple form factors including smart phones and tablets.

• BI and data visualization tools do not always provide the capability to perform complex analyses or fulfill specific requirements of complex reports (e.g. complex statistical analyses or epidemiologic studies). Programming skills are frequently still required.

Emerging Technologies

Page 63: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Cloud Database Services

• A cloud database exists on remote servers and is accessed securely over the Internet.

• Oracle, Microsoft, and other database vendors offer cloud database services.

• A cloud database platform provided by a vendor is called a Platform as a Service (PaaS).

• The cloud database service provider performs all database administrative tasks:
  › Replicate data on multiple servers
  › Make backups
  › Scale growing databases
  › Performance monitoring and tuning

• Cloud services can be useful for prototyping and heuristic development. A large commitment to hardware purchases and administrative staff can be postponed for later assessment.

• Cloud services could result in considerable cost savings for some organizations.

• A cloud hybrid database is one that has database components both on the cloud and on local servers.

• Cloud services may limit administrative options and flexibility vs. having your own DBAs and system administrators.

• Cloud services may not meet regulatory requirements for security and storage for some applications (e.g. HIPAA, FDA regulations, etc.).

Emerging Technologies

Page 64: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Relational/Unstructured Hybrid Systems

• Oracle, Microsoft, and other RDBMS vendors sell hybrid database systems that combine unstructured data with relational database systems.

• Both Oracle and Microsoft incorporate Hadoop unstructured databases with their proprietary products.

• Oracle product Big Data SQL allows standard Oracle SQL to be used against Hadoop data.› External tables are used on unstructured data so that Oracle can “see” the data as a relational table.

› Data from Hadoop can be integrated with relational data (e.g. join operations).

› Oracle's Exadata technology is required to get high performance.

› Oracle security can be applied to the unstructured data.

• Other Oracle products include Oracle Big Data Appliance, Oracle NoSQL Database, and Oracle Big Data Connectors.

• Microsoft product HDInsight allows integration of Hadoop data with SQL Server.

• Microsoft Azure with HDInsight integrates Hadoop and SQL Server data in the cloud.

• Both Oracle and Microsoft market relational/unstructured hybrid database appliances.
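A sketch of the integration point above: assuming an external table web_logs_ext has already been defined over Hadoop data (for example through Oracle Big Data SQL), it can be joined to an ordinary relational table with standard SQL. The table and column names here are hypothetical.

-- Join clickstream data stored in Hadoop to a relational customer table
SELECT c.customer_id,
       c.customer_name,
       COUNT(*) AS page_views
FROM   web_logs_ext w     -- external table over HDFS/Hive data (assumed defined)
       JOIN customer c ON c.customer_id = w.customer_id
GROUP  BY c.customer_id, c.customer_name;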

Emerging Technologies

Page 65: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Questions?


Page 66: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Slide #5

Image 1: "FortranCardPROJ039.agr" by Arnold Reinhold - I took this picture of an artifact in my possession. The card was created in the late 1960s or early 1970s and has no copyright notice.. Licensed under Creative Commons Attribution-Share Alike 2.5 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:FortranCardPROJ039.agr.jpg#mediaviewer/File:FortranCardPROJ039.agr.jpg

Slide #8

Image 1: "IBM Keypunch Machines in use" by born1945 - Flickr: IBM Keypunch Machines. Licensed under Creative Commons Attribution-Share Alike 2.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:IBM_Keypunch_Machines_in_use.jpg#mediaviewer/File:IBM_Keypunch_Machines_in_use.jpg

Image 2: “us__en_us__ibm100__punched_card__hand_cards__620x350.jpg” from “IBM 100” web page http://www-03.ibm.com/ibm/history/ibm100/us/en/icons/punchcard/breakthroughs/

Image 3: "IBM26" by Ben Franske - Own work. Licensed under Creative Commons Attribution-Share Alike 3.0-2.5-2.0-1.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:IBM26.jpg#mediaviewer/File:IBM26.jpg

Image 4: "IBM 1403 Printer opened" by Erik Pitti - originally posted to Flickr as IBM 1403 Printer. Licensed under Creative Commons Attribution 2.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:IBM_1403_Printer_opened.jpg#mediaviewer/File:IBM_1403_Printer_opened.jpg

Slide #9

Image 1: “us__en_us__ibm100__punched_card__hand_cards__620x350.jpg” from “IBM 100” web page http://www-03.ibm.com/ibm/history/ibm100/us/en/icons/punchcard/breakthroughs/

Slide #11

Image 1: "FortranCardPROJ039.agr" by Arnold Reinhold - I took this picture of an artifact in my possession. The card was created in the late 1960s or early 1970s and has no copyright notice.. Licensed under Creative Commons Attribution-Share Alike 2.5 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:FortranCardPROJ039.agr.jpg#mediaviewer/File:FortranCardPROJ039.agr.jpg

Image Author Attributions

Page 67: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Slide #12

Image 1: "Magtape1" by Daniel P. B. Smith..Original uploader was Dpbsmith at en.wikipedia.Later version(s) were uploaded by Boojit at en.wikipedia. - Image by Daniel P. B. Smith.;Transferred from en.wikipedia. Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Magtape1.jpg#mediaviewer/File:Magtape1.jpg

Image 2: "Camp Smith, Hawaii. PFC Patricia Barbeau operates a tape-drive on the IBM 729 at Camp Smith. - NARA - 532417" by Unknown or not provided. Licensed under Public domain via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Camp_Smith,_Hawaii._PFC_Patricia_Barbeau_operates_a_tape-drive_on_the_IBM_729_at_Camp_Smith._-_NARA_-_532417.tif#mediaviewer/File:Camp_Smith,_Hawaii._PFC_Patricia_Barbeau_operates_a_tape-drive_on_the_IBM_729_at_Camp_Smith._-_NARA_-_532417.tif

Slide #14 and 15

Image 1: "IBM 2311 memory unit" by Deep silence (Mikaël Restoux) - 25 years of computers, La défense (Paris). Licensed under Creative Commons Attribution 2.5 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:IBM_2311_memory_unit.JPG#mediaviewer/File:IBM_2311_memory_unit.JPG

Slide #16

Image 1: "IBM 2311 memory unit" by Deep silence (Mikaël Restoux) - 25 years of computers, La défense (Paris). Licensed under Creative Commons Attribution 2.5 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:IBM_2311_memory_unit.JPG#mediaviewer/File:IBM_2311_memory_unit.JPG

Image 2: "DEC VT100 terminal" by Jason Scott - Flickr: IMG_9976. Licensed under Creative Commons Attribution 2.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:DEC_VT100_terminal.jpg#mediaviewer/File:DEC_VT100_terminal.jpg

Image 3: "IBM360-65-1.corestore" by Original uploader was ArnoldReinhold at en.wikipedia - Originally from en.wikipedia; description page is/was here.. Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:IBM360-65-1.corestore.jpg#mediaviewer/File:IBM360-65-1.corestore.jpg

Slide #17

Image 1: "IBM 2311 memory unit" by Deep silence (Mikaël Restoux) - 25 years of computers, La défense (Paris). Licensed under Creative Commons Attribution 2.5 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:IBM_2311_memory_unit.JPG#mediaviewer/File:IBM_2311_memory_unit.JPG

Image Author Attributions

Page 68: Introduction to Data Warehousing Randy Grenier Rev. 11 November 2014

Slide #22

Image 1: "Edgar F Codd". Via Wikipedia - http://en.wikipedia.org/wiki/File:Edgar_F_Codd.jpg#mediaviewer/File:Edgar_F_Codd.jpg

Slide #35

Image 1: “inmon.gif” from “Bill Inmon: Date Warehouses and Decision Support Systems” http://www.dssresources.com/interviews/inmon/inmon05122005.html

Slide #41

Image 1: "Ralph kimball" by Ralphfan99 - Own work. Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Ralph_kimball.jpg#mediaviewer/File:Ralph_kimball.jpg

Slide #43

Image 1: "Big data cartoon t gregorius" by Thierry Gregorius - Cartoon: Big Data. Licensed under Creative Commons Attribution 2.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Big_data_cartoon_t_gregorius.jpg#mediaviewer/File:Big_data_cartoon_t_gregorius.jpg

All other images Copyright © 2014 Randy Grenier

Image Author Attributions