Syncsort, Tableau, & Cloudera Present: Break the Barriers to Big Data Insight
Post on 15-Jul-2015
2014 Cloudera, Syncsort, Tableau Inc. All rights reserved.
Agenda
Data Warehouse Vision & Reality
What is legacy data & why an Enterprise Data Hub
Offloading legacy data and workloads to Hadoop
Transform all types of data into self-service analytics
Live Demonstration
Customer case study
Q&A
What is this?
So what is this?
If you are thinking it's a 3.5-inch floppy disk and it stored 1.44 MB of your data, you were born before 1998.
In 1998 the iMac was launched, and it was the first home computer not to have one of these as standard, just a CD drive.
And to anyone born after then, this is the save button in most applications; you have no idea why it's the save button and certainly would not call it a floppy.
So over Christmas my mum was sitting with my 4-year-old nephew using his iPad, and there's clearly some sort of confusion. I see the two of them sitting there trying to figure out which slot on the iPad mum can insert a floppy disk with her Christmas pudding recipe into.
So I can tell you that getting data from a floppy disk onto an iPad is not fun at all, and my mum is not sure this whole computer thing is really working out for her son or grandchild, because we were largely useless.
So what's funny is that I guess the equivalent of a floppy disk today is a memory stick. It can store a lot more data, but if I personally want to get a large file from one machine to another, like a Mac, I use Dropbox or Box, and it happens instantly and it's constantly kept in sync.
Technology evolution has completely changed our approach to solving a problem, and that's an important theme.
The Data Warehouse Vision - 1998
Data Integration & ETL Tools would enable a Single, Consistent Version of the Truth
[Diagram labels: Data Mart, Data Mart]
Steve + Paul
Back when I started my career in data warehousing in the 90s, this is what the business was promised. An enterprise data warehouse (EDW) would bring together data from every different source system across an organization to create a single trusted source of information. Data would be extracted, transformed, and loaded into the warehouse using ETL tools. These would be used instead of hand-coding SQL, COBOL, or other scripts because they would provide:
a graphical user interface that allowed anyone, even a graduate who had just joined your team, to develop flows, with no rocket scientists required
scalability to handle the growing data volumes
metadata to enable re-use, sharing, and governance
transparent connectivity to the different sources and targets, including mainframe
ETL would then be used to move data from the EDW to marts, and the data would be delivered to reporting tools.
Data Warehouse Reality - 2014
Data Integration & ETL Tools would enable a Single, Consistent Version of the Truth
[Diagram labels: Data Mart, Data Mart, Dormant Data, Staging / ELT]
Steve + Paul
This is the reality of most data warehouses today. A spaghetti-like architecture has evolved because the market-leading ETL tools couldn't cope with the data volumes on core operations like sort, join, merge, and aggregation, so that workload was pushed into the only place that could handle it: the databases, with their optimizers. But that meant ELT: hand-coded or generated SQL that became impossible to maintain. A customer told me they called this the onion effect, because their staging had become layers of SQL that nobody wanted to touch, so they just added another layer on top. But if you ever really had to take the onion apart, it would make everyone cry. TDWI estimates it takes upwards of 8 weeks to add a column to a table, and in my experience that's low; most times you have to wait a couple of months before they get to your request and start making the change, because of the backlog.
Today the average cost of an integration project runs between $250K and $1M, according to Gartner.
The Data Warehouse Vision vs Reality
Vision: Fresher data; Longer history data; Faster analytics; More data sources; Lower costs
Reality: Longer ELT batch windows; Shorter data retention; Slower queries; Weeks/months just to add new data fields; Growing costs
So there's a massive disconnect between the original vision of the warehouse and the reality. It's important to note that business users are getting great information from warehouses, but they still want fresher data, longer history data, faster analytics, and more sources, all at a lower cost. Meanwhile they are seeing longer batch windows; many companies have people sitting around drinking coffee in the morning until the warehouse is available. They have a small subset of a customer's lifecycle.
Mainframes | A Critical Source of Big Data
Top 25 World Banks
9 of World's Top Insurers
23 of Top 25 US Retailers
71% Fortune 500
30 Billion Bus. Transactions / day
So the first thing we all need to recognize is that mainframes today play a very important role in many organizations. Top telcos, retailers, insurance, healthcare, and financial organizations of the world still rely on mainframes for their most critical applications. When talking to these organizations, it's not unusual to hear that up to 80% of their corporate data originates in the mainframe. Now, that is some serious Big Data, and organizations cannot afford to neglect it.
But can you afford to analyze it? Well, mainframes today cost an average of $16M a year for the typical $10B organization!
That's why many of these organizations are now looking at Hadoop and making mainframes a core piece of their Big Data strategy. Just imagine for a second the kind of insights that you could get by combining detailed transactional data from mainframes with clickstream data, web logs, and sentiment analysis.
Suits & Hoodies Working Together
Integration Gaps: Connectivity; Data conversion (EBCDIC vs ASCII)
Expertise Gaps: COBOL appeared in 1959, Hadoop in 2005; Mainframe & Hadoop skills shortage
Security Gaps: Hosts mission critical sensitive data; Very difficult to install new software on MF
Costs Gaps: Mainframe data is (expensive) Big Data; Even FTP costs CPU cycles (MIPS)
Suits & Hoodies idea: Merv Adrian, Gartner Research.
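To make the "data conversion (EBCDIC vs ASCII)" gap concrete, here is a minimal Python sketch of what converting mainframe text involves. It is not from the presentation and is not how any of the presenters' products work. It assumes the data is plain text in IBM code page 037 (one common EBCDIC variant, available as the `cp037` codec in Python's standard library); the helper names `ebcdic_to_text` and `text_to_ebcdic` are hypothetical. Real mainframe records often also contain packed-decimal and binary fields, which a plain character conversion like this would corrupt.

```python
# Minimal sketch: converting mainframe text between EBCDIC and ASCII.
# Assumption: the data is plain text in IBM code page 037 (US/Canada EBCDIC).
# Real mainframe files may use another code page and often mix in
# packed-decimal or binary fields, which this sketch does not handle.

EBCDIC_CODEC = "cp037"  # assumed code page; part of Python's standard codecs


def ebcdic_to_text(raw: bytes) -> str:
    """Decode EBCDIC-encoded bytes into a Python string."""
    return raw.decode(EBCDIC_CODEC)


def text_to_ebcdic(text: str) -> bytes:
    """Encode a Python string as EBCDIC bytes."""
    return text.encode(EBCDIC_CODEC)


if __name__ == "__main__":
    # The letters H, E, L, L, O as they are stored in EBCDIC (code page 037).
    record = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6])
    text = ebcdic_to_text(record)
    print(text)
    # The same letters use different byte values once written out as ASCII.
    print(text.encode("ascii").hex())
```

The point of the sketch is only that the same characters have different byte values on each side, so data copied off the mainframe without conversion is unreadable on the Hadoop side.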
Expanding Data Requires A New Approach
1980s: Bring Data to Compute
Process-centric businesses use: structured data mainly; internal data only; important data only
Now: Bring Compute to Data
Information-centric businesses use all data: multi-structured, internal & external data of all types
[Diagram: relative size & complexity of Data and Compute]
Today we're in the middle of a shift in how businesses use information. In the past, you'd define a set of business processes, build applications around each of them, and then go about gathering, conforming, and merging the necessary data sets to support those applications. From an infrastructure perspective, you'd be bringing the data over to the compute, often in relational databases. But you'd be leaving quite a lot on the table.
The modern realities of business demand a new approach. Today companies need, more than ever, to become information-driven, but given the amount and diversity of information available, and the rate of change in business, it's simply unsustainable to keep moving around and transforming huge volumes of data.
From Apache Hadoop to an enterprise data hub
Open Source; Scalable; Flexible; Cost-Effective; Managed; Open Architecture; Secure and Governed
BATCH PROCESSING: MAPREDUCE
STORAGE FOR ANY TYPE OF DATA (UNIFIED, ELASTIC, RESILIENT, SECURE): FILESYSTEM, HDFS
Core Apache Hadoop is great, but:
1) Hard to use and manage.
2) Only supports batch processing.
3) Not comprehensively secure.
The foundational platform that's addressing this wide range of problems today is Apache Hadoop, an open source platform for scalable, fault-tolerant data storage and processing that runs on a cluster of industry-standard servers. But Hadoop, in the beginning, wasn't capable of solving these problems. Originally, Hadoop was just a scalable distributed system for storing and processing large amounts of data. You could bring workloads to an effectively limitless amount and variety of data, provided the only kind of work you wanted to do was batch processing by writing Java code, and provided you liked hiring highly-skilled computer scientists to operate it.
From Apache Hadoop to an enterprise data hub
Open Source; Scalable; Flexible; Cost-Effective; Managed; Open Architecture; Secure and Governed
BATCH PROCESSING
STORAGE FOR ANY TYPE OF DATA (UNIFIED, ELASTIC, RESILIENT, SECURE)
Cloudera solved the latter problem with Cloudera Manager, the leading system management application for Apache Hadoop. Customers love Cloudera Manager because it makes the complex simple. Hadoop is more than a dozen services running across many machines, with limitless configuration permutations. With Cloudera Manager, customers can centrally manage and monitor their clusters from