Intro to Hybrid Data Warehouse

Intro to Hybrid Data Warehouse. Presented by: Jonathan Bloom, Senior BI Consultant, Agile Bay, Inc.

Upload: jonathan-bloom

Posted on 09-May-2015


Category: Technology



DESCRIPTION

The Hybrid Data Warehouse combines a traditional Enterprise Data Warehouse (EDW) with Hadoop to create a complete data ecosystem. Learn the basics in this slide deck.

TRANSCRIPT

1. Presented by: Jonathan Bloom, Senior BI Consultant, Agile Bay, Inc.

2. Jonathan Bloom. Current position: Senior BI Consultant. Customers & Partners. Blog: http://www.BloomConsultingBI.com. Twitter: @SQLJon. LinkedIn: http://www.linkedin.com/BloomConsultingBI. Email: [email protected]
3. www.agilebay.com
4. Agenda: EDW, Hybrid Data Warehouse, Hadoop, Q&A.
5. Why EDW?
6. Convert Data to Information: Accumulating Data, Manage the Business, OLTP != Reporting, Apply Business Rules, Clean Data, Analytics, Proven Framework.
7. EDW Role: Reporting Lifecycle, Domain Knowledge, Interact with the Business, Gather Specs, Estimate Time, Knowledge of Databases, SQL Skills, Change Management.
8. EDW Architecture: Source System -> Staging (Raw) -> Master Data Services -> Enterprise Data Warehouse -> Analysis Services Cubes.
9. Data Modeling: Kimball Methodology. Star Schema (the pattern forms a graphical star); Snowflake Schema (branches).
10. Tables: Dimension Tables describe the data. Fact Tables hold measures (sums, counts, max, min, etc.) and contain surrogate keys that link back to the Dim Tables.
11. Slowly Changing Dimensions: Type 0 is passive; values remain as they were when the dimension record was first inserted. Type 1 overwrites old data with new and does not track history. Type 2 tracks historical data by creating multiple records.
12. Date Dimensions: Create Scripts, Fiscal Year, Custom Start / End Dates. Key example: 20140226.
13. Dim Tables.
14. Fact Tables.
15. Fact Table Keys.
16. Analysis Server.
17. Cubes (Measure Groups).
18. Cubes (Dimensions).
19. SSAS: Create Connections, Add Data Sources, Create Relationships, Add Dimensions, Create Measure Groups / Measures, Create Calculated Measures, Create Hierarchies.
20. Integrate Hadoop with EDW.
21. Hadoop: Open Source Community, Distributed Parallel Processing, Commodity Hardware, Large Data Sets, Semi-Structured and Unstructured Data.
22. Data Gosinta (goes into): When thinking about Hadoop, we think of data: how to get data into HDFS and how to get data out of HDFS. Luckily, Hadoop has some popular processes to accomplish this.
23. SQOOP: SQOOP was created to move data back and forth easily between an external database or flat file and HDFS or HIVE. There are standard commands for importing and exporting data. When data is moved to HDFS, it creates files in the HDFS folder system, and those folders can be partitioned in a variety of ways. Data can be appended to the files through SQOOP jobs, and you can add a WHERE clause to pull just certain data; for example, bring in only yesterday's data and run the SQOOP job daily to populate Hadoop.
24. Hive: Once data is moved into Hadoop HDFS, you can add a layer of HIVE on top, which structures the data into relational format. Once applied, the data can be queried with HIVE SQL. When creating a table in the HIVE database schema, you can create an External table, which is basically a metadata pass-through layer that points to the actual data; if you drop the External table, the underlying data remains intact (see the sketch below).
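To make the External table idea concrete, here is a minimal HiveQL sketch. It assumes SQOOP has already landed comma-delimited order files in a hypothetical HDFS folder /data/sales/orders; the table name, columns, and path are illustrative and not from the deck. Because the table is EXTERNAL, dropping it removes only the Hive metadata, while the files in the folder stay intact.

  -- Hypothetical external table over files that SQOOP landed in HDFS.
  -- Dropping this table removes only the metadata; the files remain.
  CREATE EXTERNAL TABLE orders_raw (
    order_id     INT,
    customer_id  INT,
    order_date   STRING,
    amount       DECIMAL(10,2)
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION '/data/sales/orders';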
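Once the external table is defined, it can be queried like any relational table. The query below is a small illustrative example against the assumed orders_raw table above, filtering to a single day's load the way a daily SQOOP job might populate it; it is also the kind of HIVE SQL that reporting tools issue through the HIVE ODBC driver covered on the next slide.

  -- Illustrative HiveQL query against the assumed orders_raw table:
  -- daily order counts and totals, restricted to one day's load.
  SELECT order_date,
         COUNT(*)    AS order_cnt,
         SUM(amount) AS total_amount
  FROM   orders_raw
  WHERE  order_date = '2014-02-26'
  GROUP BY order_date;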
25. ODBC: From HIVE SQL, the tables are exposed through ODBC so the data can be accessed by reports, databases, ETL, and so on. As the description above shows, you can move data back and forth easily between Hadoop and your relational database (or flat files).
26. PIG: In addition, you can use a Hadoop language called PIG (not making this up) to massage the data through a structured series of steps, a form of ETL if you will.
27. Hybrid Data Warehouse: You can keep the data up to date by using SQOOP, then add data from a variety of systems to build a Hybrid Data Warehouse. Data Warehousing is a concept, a documented framework to follow with guidelines and rules, and storing the data across both Hadoop and relational databases is typically known as a Hybrid Data Warehouse.
28. Connect to the Data: Once data is stored in the HDW, it can be consumed by users via HIVE ODBC with Microsoft Power BI, Tableau, QlikView, SAP HANA, or a variety of other tools sitting on top of the data layer, including self-service tools.
29. Machine Learning: In addition, you could apply MAHOUT machine learning algorithms to your Hadoop cluster for clustering, classification, and collaborative filtering. And you can run statistical analysis with the R language, using Revolution Analytics' R for Hadoop.
30. Streaming: And you can receive streaming data.
31. Monitor: There's ZooKeeper, which is a centralized coordination service for keeping track of the cluster.
32. Graph: And Giraph, which gives Hadoop the ability to process graph connections between nodes.
33. In Memory: And Spark, which allows faster processing by bypassing MapReduce and running in memory.
34. Cloud: You can run your Hybrid Data Warehouse in the cloud with Microsoft Azure Blob Storage and HDInsight, or with Amazon Web Services.
35. On Premise: You can run on premises with IBM InfoSphere BigInsights, Cloudera, Hortonworks, and MapR.
36. Hadoop 2.0: With the latest Hadoop 2.0 comes YARN, a new layer that sits between HDFS2 and the application layer. Although MapReduce was originally designed as the sole batch-oriented approach to getting data out of HDFS, it is no longer the only way. HIVE SQL has been sped up through Impala, which bypasses MapReduce completely, and through the Stinger initiative, which sits atop Tez. Tez has the ability to compress data with column stores, which allows the interaction to be sped up.
37. New Features: With Hadoop 2.0, you can now monitor your clusters with Ambari, which has an API layer for 3rd-party tools to hook into. A well-known limitation of Hadoop has been security, which has now been addressed as well.
38. HBase: HBase is a separate database that allows random read/write access to the HDFS data, and, surprisingly, it too sits on the HDFS cluster. Data can be ingested into HBase and interpreted on read, which relational databases do not offer.
39. HCatalog: Sometimes when developing, users don't know where data is stored, and the data can be stored in a variety of formats because HIVE, PIG, and MapReduce can have separate data model types. HCatalog was created to alleviate some of that frustration: it is a table abstraction layer, a metadata service, and a shared schema for Pig, Hive, and MapReduce, and it exposes information about the data to applications.
40. Hadoop.
41. Future: OLTP? Artificial Intelligence, Neural Networks, Robots.
42. Summary: EDW is a concept / framework. Ingest data, ETL, output / reports / analytics. Stay current; never stop learning!
43. Blog: www.BloomConsultingBI.com. Twitter: @SQLJon. LinkedIn: www.linkedin.com/BloomConsultingBI. Email: [email protected]