bdd data lake demo
TRANSCRIPT
NASDAQ: EDGW
Business Analytics Solutions Start Here
Integrated EPM, BI, and Big Data Solutions
Why is Microsoft Excel the most commonly used BI tool in the world?
Everyone's an “expert”
Industry standard for spreadsheets
750 million users worldwide
Over 30 years old
How many Excel “experts” does your organization have?
Excel is Familiar
Ultimately, Excel puts Analysts in Control
“Show me the data and I’ll know it when I see it”
...Not just about data consumption, but data consumption and contribution
Analysts need to develop their own “personal” data modification techniques and mashups
Business Analysts don’t know how to provide reporting requirements until they get their hands on the data
Despite Excel’s utility for analysts, three primary issues exist...
But Problems Arise
No Data Variety...
No Data Volume...
No Data Governance...
By the time you count to 60...
This data will be structured, semi-structured, and completely unstructured
Excel Doesn’t Accommodate Variety
More than 204 million emails will be sent
Billions of new sensor data points will be detected
Over 2 million Google search queries will be performed
684,000 bits of content shared on Facebook
More than 100,000 tweets will be sent
Most companies in the US have at least 100,000 GBs of data stored
Excel Doesn’t Accommodate Volume
...Meanwhile, Excel is limited to just over 1 million rows (1,048,576 per worksheet)…
43 trillion GBs will be created by 2020
Enterprise data will grow 650% in the next five years
The world’s info now doubles every year and a half...
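To make the volume point concrete, here is a minimal Python sketch of how a file can be summarized by streaming it row by row, so its size is not bounded by Excel's worksheet limit of 1,048,576 rows. The column names, sample values, and function names are hypothetical, chosen only for illustration.

```python
import csv
import io

EXCEL_MAX_ROWS = 1_048_576  # row limit of a single Excel worksheet


def summarize_rows(lines):
    """Stream CSV rows and return (row_count, total of the 'amount' column)."""
    reader = csv.DictReader(lines)
    count = 0
    total = 0.0
    for row in reader:  # only one row is held in memory at a time
        count += 1
        total += float(row["amount"])
    return count, total


def fits_in_excel(row_count):
    """True if the data rows plus one header row fit in a single worksheet."""
    return row_count + 1 <= EXCEL_MAX_ROWS


# Tiny in-memory stand-in for a large export (made-up values).
sample = io.StringIO("booking_id,amount\n1,100.0\n2,250.5\n3,75.25\n")
row_count, total_amount = summarize_rows(sample)
```

The same loop works on a real file handle opened with `open(path, newline="")`; the point of the sketch is only that streaming keeps memory use flat regardless of row count.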
Excel Doesn’t Allow for Governance
Spreadsheets: Give analysts control of the data, but security and integrity are lost as multiple “versions” of data are created
Data Warehouse: Designed to provide a single version of truth for analysts and facilitate governance
IT wants governance … Business wants control
IT Analysts
While a traditional warehouse may be able to handle expected volumes, it can’t...
Is Your Current Warehouse the Solution?
Data Warehouse
CRM
ERP
etc.
ETL
Support rapid data development, ad hoc analysis
Answer unknown questions
Quickly integrate new or unstructured data sources
Reporting
A New Approach Is Required...
To give analysts control and access to data
To accommodate increased data variety
To scale your analytical capabilities
To complement the existing solutions
To create a centralized governed repository
Enable Ad-hoc Analysis for the Business
Questions You’re Not Asking
Questions You’re Asking
Things you don’t know
Things you know
Ad-Hoc Analysis
● Heterogeneous Data
● Massive Compute
● Ad-Hoc Analysis
● Centralized Repository
● Advanced Transform
...What your business needs
Traditional Reporting
● Trusted KPIs
● Historic Data
● Scheduled Reports
● Homogeneous Data
● Pixel Perfect
What your business has...
Enable Discovery Before Reporting
Data Lake
Data Warehouse
CRM
ERP
Conform
Archive
Ad-Hoc Analysis Reporting
New Data Sources
Existing Data Sources
Copy/Ingest
Load all types of existing data into the lake “as is”
Step 1 - Fill the Lake
Data Variety
Incorporate New Data Sources
• Social Media
• Transactions
• Unstructured
• Sensor Data
• “As-is” Data
Centralized Repository
One Centralized Repository
• Eliminates Data Silos
• Improves Data Integration
• Promotes Data Governance
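As a rough sketch of what loading data “as is” can mean in practice, the Python below copies a file into a landing area unchanged, grouped by the source it came from. A local temporary directory stands in for the lake here, and the folder layout (`raw/<source>`), file names, and function name are assumptions made for illustration, not part of any specific product.

```python
import shutil
import tempfile
from pathlib import Path


def ingest_as_is(source_file: Path, lake_root: Path, source_name: str) -> Path:
    """Copy a file into the lake unchanged, under a folder named for its source."""
    target_dir = lake_root / "raw" / source_name
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source_file.name
    shutil.copy2(source_file, target)  # byte-for-byte copy, no parsing or conversion
    return target


# Hypothetical demo: a local temp directory stands in for the data lake.
workdir = Path(tempfile.mkdtemp())
lake = workdir / "lake"
tweet_file = workdir / "tweets.json"
tweet_file.write_text('{"text": "Love the beach!"}\n', encoding="utf-8")

landed = ingest_as_is(tweet_file, lake, "twitter")
```

No schema is applied at ingest time; structure is imposed later, when an analyst reads the data during discovery.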
Step 2 - Add a Discovery Layer
Analyst Control
Give analysts control and access to the data
• Total autonomy
• Ad-hoc analysis
• Personalized mash-ups
• Single version of the “truth”
Software Agnostic
Select a Data Discovery tool that is right for your business
• Oracle Big Data Discovery
• Datameer
• Platfora
• Open Source
Read the fine print: Be wary of tools that promise ad-hoc analysis, but only enable data consumption or visualization
Step 3 - Graduate to the Warehouse
Augment Existing Solutions
Lake + Warehouse: quicker time-to-value, more data, more capability
Migrate crucial insights to the warehouse
Leverage existing reports/create new ones
Archive back into the data lake
Identify data quality issues quickly
Build transforms at massive scale
The Bigger Picture
Scalable Storage and Compute
Introduce a repository that can house all your organization’s data, at scale, with no risk of data loss
Tech Replacement
Over time, reduce the size and cost of your warehouse by re-platforming some reporting onto the data lake
Massive Transform Capabilities
Deliver powerful, performant transforms leveraging the massive compute power of the data lake
New Advanced Analytics
Lay the foundation for new “untapped” analytical capabilities like predictive, machine learning, search, and real-time alerting
Scenario:
Flipflops Resort is located in the heart of the Caribbean and is a popular tourist destination
Their marketing team would like to better understand the impact of social sentiment on sales
How might this play out in the “real world”?
Currently, Flipflops uses database file dumps in Excel format to gather any insights...
This can be very time consuming and does not promote the inclusion of new data sources
Current Strategy
“Does our resort’s weather impact social media sentiment?”
Discovery Starts With a Question
Need to ingest data from sources and formats that may not be structured in a spreadsheet-friendly way
Limitations of Current Practices
Obtaining this data can be a labor-intensive process
Semi-Structured Data Example
It's clear that Excel does not handle semi-structured data well, and doesn’t support unstructured data at all
Drawing insights from this data, or joining additional data sources to find correlations, would be difficult at best
This is where we can utilize discovery and the data lake to answer our question
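To illustrate why semi-structured data resists a spreadsheet grid, here is a small Python sketch that flattens one nested JSON record into a single flat row. The tweet shape shown is a simplified, hypothetical example and not the actual Twitter API schema; the `flatten` helper is likewise an assumption for illustration.

```python
import json

# Simplified, made-up tweet record with nested and repeated fields.
raw_tweet = """
{
  "id": 1,
  "text": "Sunny day at the resort!",
  "user": {"name": "traveler42", "location": "Chicago"},
  "hashtags": ["beach", "vacation"]
}
"""


def flatten(record, prefix=""):
    """Flatten nested dicts into dotted column names; lists become ';'-joined strings."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            flat[name] = ";".join(str(item) for item in value)
        else:
            flat[name] = value
    return flat


row = flatten(json.loads(raw_tweet))
```

Even this toy record needs decisions that a spreadsheet cannot make on its own, such as how to name nested columns and what to do with a list of hashtags.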
Outgrowing Excel
Piping Outside Data to the Lake
We’re focused on social media sentiment, so let’s grab some tweets and weather data, and put them into our lake
New Data Sources
Data Lake
Data Warehouse
Ad-Hoc Analysis
Reporting
Piping More Data to the Lake
Data Lake
Data Warehouse
Ad-Hoc Analysis Reporting
Additionally, let’s leverage existing marketing and booking data to help answer our question
Existing Data Sources
View of our Data Lake through a web interface called Hue
Note the variety of file types that can be stored
Hue Lake View
Data Lake
Analysis on top of Lake
We are now ready to start our discovery phase and will use an analytical tool on top of our lake to visualize any insights
Data Lake
Data Warehouse
Ad-Hoc Analysis
Reporting
Existing Data Sources
New Data Sources
Diving into the Lake
With a variety of both open source and proprietary tools
available, we can quickly view our data and gather potential
insights
Options for Discovery
Ad-Hoc Analysis
There are many different ways to analyze the data in the lake
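One of those ways, sketched in plain Python: join daily sentiment scores with daily weather on their shared dates, then compute a Pearson correlation to see whether the two move together. All values below are made up for illustration and are not real resort data; the field names and helper functions are assumptions, and a discovery tool would normally do this work interactively.

```python
from math import sqrt

# Made-up illustrative values, not real resort data.
daily_sentiment = {
    "2015-06-01": 0.2,
    "2015-06-02": 0.6,
    "2015-06-03": 0.8,
    "2015-06-04": 0.4,
}
daily_high_temp_f = {
    "2015-06-01": 78,
    "2015-06-02": 84,
    "2015-06-03": 88,
    "2015-06-04": 80,
    "2015-06-05": 90,  # no sentiment for this date, so the join drops it
}


def join_on_date(left, right):
    """Inner join two {date: value} mappings on the dates they share."""
    shared = sorted(left.keys() & right.keys())
    return [(left[d], right[d]) for d in shared]


def pearson(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)


pairs = join_on_date(daily_sentiment, daily_high_temp_f)
r = pearson(pairs)
```

A coefficient near 1 or -1 would suggest a relationship worth a closer look; correlation alone does not show that weather causes the change in sentiment.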
Data Lake Demo
Incorporating New Insights
Any insights we discover could be included in a traditional data
warehouse and integrated into regular reporting
Data Warehouse Reporting
New Data Fields/Sources
Data Lake
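As a hedged sketch of that hand-off, the Python below loads a small, conformed result from discovery into a warehouse table that a regular report can query. An in-memory SQLite database stands in for the warehouse, and the table name, columns, and values are hypothetical.

```python
import sqlite3

# Hypothetical insight produced during discovery (illustrative values only).
insight_rows = [
    ("2015-06-01", 78, 0.2),
    ("2015-06-02", 84, 0.6),
]

conn = sqlite3.connect(":memory:")  # in-memory SQLite stands in for the warehouse
conn.execute(
    """
    CREATE TABLE daily_sentiment_fact (
        report_date   TEXT PRIMARY KEY,
        high_temp_f   INTEGER NOT NULL,
        avg_sentiment REAL NOT NULL
    )
    """
)
# Parameterized insert of the conformed rows.
conn.executemany("INSERT INTO daily_sentiment_fact VALUES (?, ?, ?)", insight_rows)
conn.commit()

# A scheduled report would then read from the governed table.
report = conn.execute(
    "SELECT report_date, avg_sentiment FROM daily_sentiment_fact ORDER BY report_date"
).fetchall()
```

Only the fields that proved useful in discovery are promoted; the raw tweets and weather files stay in the lake.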
Demo Recap
A comparison of what we accomplished using a data lake:
Data Lake
• Centralized access to heterogeneous data
• Powerful data transformations
• Easily join data sets together
• Ability to visualize fields within moments of upload
• Garner insights into data without significant time investment
• Maintain data integrity
Microsoft Excel
• Local access to homogeneous data
• Slow data transformations, data loaded onto local machine
• Tedious joining of data sets
• Visualizations must be built and configured for new data sets
• Gathering data insights may involve a notable amount of staff time
• Loss of data governance and integrity
Next Steps
So What Now?
1. Let Ranzal help your organization understand how to best move forward with an “Analytics Roadmap”
2. Start small with your data lake. Let Ranzal implement the first solution to deliver real ROI. This is often Infrastructure Replacement, Active Archive, and/or ETL Offload
Contact Information
Edgewater Ranzal
108 Corporate Park Drive, Suite 105
White Plains, NY 10604
Tel (914) 253-6600
Email: [email protected]
45 Beech Street, Suite 109
London EC2Y 8AD
United Kingdom
Tel +44 (0) 2033 717 174
130 S. Jefferson St., Suite 101
Chicago, IL 60661
Tel (847) 269-3524
200 Harvard Mill Square, Suite 210
Wakefield, MA 01880
Tel (781) 246-3343