Big Data Visualization
with Tableau
Avirup Chakraborty (MDS201908)
Debangshu Bhattacharya (MDS201910)
Ipsita Ghosh (MDS201913)
Swaraj Bose (MDS201936)
Sreya K.K. (MDS201804)
What is big data?
Extremely large data sets that may be analyzed
computationally to reveal patterns, trends, and
associations, especially relating to human behaviour
and interactions.
Velocity
Variety
Volume
Veracity
Why is data visualization
important?
communicates relationships of the data with images
allows trends and patterns to be more easily seen
gives meaning to complicated datasets so that their
message is clear and concise
outlier detection becomes easier
results from complex algorithms are much easier to
understand in a visual format
summary of data
Challenges in big data visualization
(4 V’s yet again!)
Traditional visualization tools (e.g. MS Excel, Minitab) are not
capable of handling large datasets
Providing low latency in visualization
Parallelization is required
Dimensions of the data have to be carefully chosen
Most current visualization tools have low performance
w.r.t. scalability, functionality and response time
Steps for big data visualization
1. Data acquisition
2. Parsing and filtering
3. Mining hidden patterns
4. Data visualization
5. Refinement
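The five steps above can be sketched end to end on a toy data set. This is only an illustration in plain Python, not how Tableau implements any of the steps; the CSV content, the function names and the text bar chart are all assumptions made for the example.

```python
import csv
import io
from collections import Counter

# Hypothetical raw input standing in for an acquired data set.
RAW_CSV = """region,sales
East,120
West,not_a_number
East,80
North,200
West,50
"""

def acquire(text):
    """Data acquisition: read raw records from a source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(text)))

def parse_and_filter(records):
    """Parsing and filtering: convert types and drop rows that fail to parse."""
    clean = []
    for row in records:
        try:
            clean.append({"region": row["region"], "sales": int(row["sales"])})
        except ValueError:
            continue  # skip malformed rows
    return clean

def mine(rows):
    """Mining hidden patterns: aggregate sales per region."""
    totals = Counter()
    for row in rows:
        totals[row["region"]] += row["sales"]
    return dict(totals)

def visualize(totals, width=20):
    """Data visualization: render a text bar chart scaled to the largest total."""
    peak = max(totals.values())
    lines = []
    for region, value in sorted(totals.items(), key=lambda kv: -kv[1]):
        bar = "#" * max(1, round(width * value / peak))
        lines.append(f"{region:<6}{bar} {value}")
    return "\n".join(lines)

totals = mine(parse_and_filter(acquire(RAW_CSV)))
chart = visualize(totals)               # first pass
refined = visualize(totals, width=40)   # refinement: adjust the view and redraw
print(refined)
```

Refinement is shown here as simply redrawing with a different width; in practice it is an iterative loop back through any of the earlier steps.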
What is Tableau?
a powerful and fast-growing data visualization tool
used in the Business Intelligence industry.
connects easily to nearly any data source.
allows for instantaneous insight on data by
transforming it into interactive visualizations
called dashboards.
Why is Tableau helpful?
Handles large volumes of data
No scripts or code required; provides a user interface
Filters multiple datasets simultaneously
Creates interactive and shareable dashboards depicting trends and variations
Incorporates other programming languages to do complex calculations
And many more…
Trivia
Founded: January 2003, California
Founders: Christian Chabot, Chris Stolte (Stanford
University), Pat Hanrahan
Headquarters: Seattle, Washington
Website: https://www.tableau.com/
Built using C++
Latest version: 2020.1
Tableau Desktop:
A data visualization tool designed to create data visualizations, reports and
dashboards in a fast and intelligent way.
Users can connect to multiple data sources, carry out multi-dimensional
data analysis, create dashboards or reports, modify metadata and publish a
complete workbook to Tableau Server if needed.
Adapt your content for any screen size and any device (i.e. desktop,
laptop, tablet or even a smartphone!).
Tableau Desktop: Personal Edition vs. Professional Edition
Personal Edition
Connects to a limited set of data sources: Microsoft Access, Microsoft Excel,
Microsoft Azure, Tableau Data Extract, text files (CSVs).
Cannot connect to Tableau Server but allows users to create package files for
Tableau Reader.
Costs $999 per user.
Professional Edition
Connects to a wider variety of data sources: Amazon Redshift, Google Analytics,
Google BigQuery, Hortonworks Hadoop, OLAP databases, Salesforce.
Enables connection to Tableau Server and creating package files for Tableau
Reader.
Costs $1999 per user.
Tableau Server:
Tableau Server is essentially an online hosting platform to hold all your Tableau workbooks, data sources and more.
It works like any other server: you can store things here and they will be safe from fires and pesky hackers.
So, what are the advantages of Tableau Server??
1. Firstly…. COLLABORATION!
Being a Tableau product, Tableau Server lets us use the
functionality of Tableau without always having to
download and open workbooks.
Users need not install Tableau Desktop on their machines, and
they can still interact with dashboards shared with them.
2. CLOUD SUPPORT
Tableau Server can be deployed on-premises as well as in public clouds like
Azure, AWS, IBM Cloud, Google Cloud Platform etc.
It also enables an administrator to track and manage the content, licenses,
performance, and permissions for data sources with ease.
3. COMPATIBILITY
Tableau Server supports a variety of Android apps, iPhone apps and Web browsers
like Internet Explorer, Mozilla Firefox, Google Chrome and Safari.
4. LIMITED ACCESS DESIGN
On Tableau Server, we can set permissions to different bits
of work, to allow us as an organization to determine who
can access and interact with what.
Let us illustrate this using a really simple example >>
Consider this ‘imaginary’ company consisting of
Tony Stark, Dr. Bruce Banner and Nick Fury:
❖ Tony Stark has access on server to upload and edit work in a project containing test documents.
❖ Dr. Banner can interact with only the production quality documents.
❖ And…Nick Fury can access but not edit the final presentation documents.
❖ Of course…Loki cannot even have a look at the documents!
(at least we can hope so)
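The permission scheme in the example above could be modelled roughly as follows. This is a minimal sketch of per-project access control in plain Python; it does not use Tableau Server's actual permission API, and the project names, capability names and the `can` helper are hypothetical.

```python
# Hypothetical capability names; Tableau Server defines its own set.
VIEW, INTERACT, EDIT, UPLOAD = "view", "interact", "edit", "upload"

# Hypothetical projects and grants mirroring the example above.
PERMISSIONS = {
    "Test": {"Tony Stark": {VIEW, INTERACT, EDIT, UPLOAD}},
    "Production": {"Dr. Bruce Banner": {VIEW, INTERACT}},
    "Final Presentation": {"Nick Fury": {VIEW}},
}

def can(user, project, capability):
    """Return True only if the user was explicitly granted the capability."""
    granted = PERMISSIONS.get(project, {}).get(user, set())
    return capability in granted

print(can("Tony Stark", "Test", EDIT))               # allowed to edit test work
print(can("Nick Fury", "Final Presentation", EDIT))  # view only, so not allowed
print(can("Loki", "Production", VIEW))               # no grant at all
```

Access is denied by default: a user who does not appear under a project gets an empty capability set, which is why Loki sees nothing.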
Tableau Public:
Tableau Public is a FREE tool that anyone can use to connect to
data, create interactive data visualizations and publish them on
the web.
Once these visualizations are on Tableau Public, one can share them on social media
or even embed them in webpages.
Since everyone has access to published data, users should be careful not to
put proprietary data on Tableau Public.
Limitations to Tableau Public:
Row limitation:
Limited to 15,000,000 rows of data per workbook.
Limited storage:
Limited to ten gigabytes (10 GB) of storage space for your workbooks.
No workbook privacy:
Tableau Public does not allow workbooks to be saved locally. They have to be saved publicly, which means that everyone can see the data since it is saved on the cloud.
No security:
As visualizations are public, anyone can access the data and make changes by downloading the workbook.
Tableau Online:
Tableau Online is a hosted version of Tableau Server. It is the business analytics
platform where people can share dashboards, interact with reports and gain insights.
It is hosted in the cloud so that there is no hardware, no set-up time needed.
“Want the sharing and collaboration of
Server, but without having to actually
manage a server? Then you want Tableau
Online. Secure. Scalable. And Look Ma—No
hardware to maintain!”
- https://www.tableau.com/products
Roughly, Tableau Online can also be thought of as a private (and, obviously, paid) version of
Tableau Public.
Key Features:
Fully hosted in the cloud. Servers are managed by the Tableau
team.
Supports live data connections to Amazon Redshift, Google
BigQuery, as well as to SQL-based sources hosted on cloud
platforms.
Ideal for a small number of users who need to be able to
interact with the data and visualizations in a secure way.
Easily accessible from a browser or Tableau Mobile App.
Authenticate users through TableauID (email address and
password). No guest access allowed.
Subscription rate is $500 per user for one year (half the
price of individual Tableau Server Licenses)
Tableau Reader
Tableau Reader is a FREE desktop application
Allows interaction with data visualizations,
created with Tableau Desktop.
Users can filter, drill-down and view the details
of the data as long as the author allows.
Tableau Start Page
Data grid - Displays the first 1,000 rows of the data contained in the Tableau data source.
Left pane - Displays the connected data source and other details about your data.
Canvas - Displays information about how the data source is set up and options for combining the data.
Metadata grid - Displays the fields in your data source as rows.
Tableau Worksheet
The Dashboard Workspace
DEMO
Philosophy of Tableau working with Big Data
• Democratisation of Data: Knowledge workers of all skill levels
should be able to access and analyze data wherever it resides.
• Partnerships within the Big Data Ecosystem
Overview of how Tableau works with big data
Data access and connectivity
To enable analysis of data of any size and format, Tableau supports
broad access to data wherever it lives.
o SQL and NoSQL based connections: Tableau uses SQL to
interface with Hadoop, NoSQL databases and Spark.
o Open Database Connectivity (ODBC): By using ODBC, one
can access any data source that supports the SQL standard and
implements the ODBC API. For Hadoop, this includes interfaces such
as Hive Query Language (HiveQL), Impala SQL, BigSQL and Spark
SQL.
o Web Data Connector: With the Tableau Web Data Connector
SDK, users can build connections to data not covered by the
existing connectors, that is, any data accessible over HTTP,
including internal web services, JSON data, and REST APIs.
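Whatever the connector, data fetched over HTTP usually arrives as nested JSON and has to be turned into rows and columns before it can be visualized. The sketch below shows only that flattening step in plain Python. It is not the Web Data Connector SDK (which is a JavaScript library); the payload, the `results` key and the function names are assumptions for illustration.

```python
import json

# Hypothetical JSON payload, standing in for the body of an HTTP response.
PAYLOAD = """
{"results": [
  {"id": 1, "user": {"name": "Asha", "country": "IN"}, "views": 42},
  {"id": 2, "user": {"name": "Ben", "country": "US"}, "views": 17}
]}
"""

def flatten(record, prefix=""):
    """Flatten nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

def to_table(payload_text, records_key="results"):
    """Turn a JSON payload into (column names, list of row values)."""
    records = json.loads(payload_text)[records_key]
    rows = [flatten(r) for r in records]
    columns = sorted({col for row in rows for col in row})
    # Missing fields become None so every row has the same width.
    return columns, [[row.get(col) for col in columns] for row in rows]

columns, rows = to_table(PAYLOAD)
print(columns)
for row in rows:
    print(row)
```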
Fast Interaction with all data at scale
1. Hyper data engine
• Hyper is a high-performance in-memory data engine technology
that helps customers analyze large or complex data sets faster.
• It uses dynamic code generation and cutting-edge parallelism
techniques to achieve high query speed.
• Hyper can also augment and accelerate slower data sources by
creating an extract of the data and bringing it in-memory.
2. Hybrid data architecture
• Tableau can connect live to data sources or bring data (or a
subset) in-memory.
• Users can go back and forth between these modes to suit their
needs.
• This hybrid approach brings a lot of flexibility and helps in query
optimization.
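The difference between the two modes can be illustrated with two SQLite databases from the Python standard library: one stands in for the remote source, the other for a local in-memory extract. This is a sketch of the idea only; Tableau's extracts use the Hyper engine, not SQLite, and the table and function names here are hypothetical.

```python
import sqlite3

SCHEMA = "CREATE TABLE orders (region TEXT, year INTEGER, amount REAL)"
QUERY = "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"

def make_source():
    """Stand-in for a remote database that a live connection would query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [("East", 2019, 100.0), ("East", 2020, 150.0),
         ("West", 2019, 70.0), ("West", 2020, 90.0)],
    )
    conn.commit()
    return conn

def query_live(source):
    """Live mode: every query goes to the source, so results are always current."""
    return source.execute(QUERY).fetchall()

def build_extract(source, year):
    """Extract mode: copy a subset into a local in-memory store and query that."""
    extract = sqlite3.connect(":memory:")
    extract.execute(SCHEMA)
    rows = source.execute(
        "SELECT region, year, amount FROM orders WHERE year = ?", (year,)
    ).fetchall()
    extract.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    extract.commit()
    return extract

source = make_source()
extract = build_extract(source, 2020)

# New data arrives at the source after the extract was taken.
source.execute("INSERT INTO orders VALUES ('East', 2020, 25.0)")

live_result = query_live(source)                    # sees the new row
extract_result = extract.execute(QUERY).fetchall()  # snapshot of the 2020 subset
print(live_result)
print(extract_result)
```

The live query reflects the late-arriving row, while the extract stays a fixed snapshot until it is refreshed; that trade-off between freshness and local speed is what the hybrid approach lets users choose between.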
3. VizQL™
• With a traditional analysis tool, users analyze data in rows and columns,
choose a subset of the data to present, organize that data into a
table, then create a chart from that table.
• VizQL creates a visual representation of the data right away,
giving visual feedback as the user analyzes.
• VizQL provides an intuitive user experience that lets people
answer questions as fast as they can think of them.
• In this cycle of visual analysis, users learn as they go, add more
data if needed, and ultimately get deeper insights.
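VizQL itself is proprietary, so the following is only a rough illustration of the general idea behind it: a declarative description of a view is translated into a database query, so the user never writes the query by hand. The `VisualSpec` class, its fields and the `to_sql` function are hypothetical and are not part of any Tableau API.

```python
from dataclasses import dataclass

ALLOWED_AGGREGATIONS = {"SUM", "AVG", "MIN", "MAX", "COUNT"}

@dataclass(frozen=True)
class VisualSpec:
    """Hypothetical description of a simple chart: one dimension, one measure."""
    table: str
    dimension: str            # categorical field that defines the marks
    measure: str              # numeric field to aggregate
    aggregation: str = "SUM"

def to_sql(spec):
    """Translate the visual description into a grouped SQL query."""
    agg = spec.aggregation.upper()
    if agg not in ALLOWED_AGGREGATIONS:
        raise ValueError(f"unsupported aggregation: {spec.aggregation}")
    return (
        f'SELECT "{spec.dimension}", {agg}("{spec.measure}") '
        f'FROM "{spec.table}" GROUP BY "{spec.dimension}"'
    )

spec = VisualSpec(table="orders", dimension="region", measure="amount")
print(to_sql(spec))

# Changing the view only changes the description; the query is regenerated.
print(to_sql(VisualSpec("orders", "year", "amount", aggregation="avg")))
```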
Tableau and Big data analytics ecosystem
Tableau fits nicely in the big data paradigm because it prioritizes flexibility—the ability to
move data across platforms, adjust infrastructure on demand, take advantage of new data
types, and enable new users and use cases.
Cloud infrastructure
• Organizations are increasingly moving business processes and infrastructure to the cloud.
• Cloud based infrastructure and data services have removed some of the major hurdles
faced with on-premises Hadoop data lakes.
• Cloud-based big data analytics solutions are easier to implement and manage than ever
before.
• Tableau delivers key integrations with cloud-based technologies that organizations already
use, including Amazon Web Services, Google Cloud Platform and Microsoft Azure.
Ingest and prep
• In modern ingest-and-load design patterns, the destination
for raw data of any size or shape is often a data lake.
• Stream data is generated continuously by connected devices
and apps located everywhere, such as social networks, smart
meters, home automation, video games, and IoT sensors.
• Often, this data is collected via pipelines of semi-structured
data.
• While real-time analytics and predictive algorithms can be
applied to streams, stream data is typically routed using a
lambda architecture and stored in raw formats in a
data lake, such as Hadoop, for analytics usage.
• Lambda architecture is a data processing architecture designed to handle
massive quantities of data by taking advantage of both batch and stream
processing methods.
• The design balances latency, throughput, and fault tolerance challenges.
• A variety of options exist today for streaming data including Amazon Kinesis,
Storm, Flume, Kafka, and Informatica Vibe Data Stream.
• Once data has landed in a data lake, it needs to be ingested and prepared
for analysis.
• Tableau has partners like Informatica, Alteryx, Trifacta, and Datameer that
help with this process and work fluidly with Tableau.
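The lambda architecture mentioned above can be sketched with a toy event counter: a batch layer periodically recomputes a view from the full event log, a speed layer keeps incremental counts for events that arrived since the last batch run, and queries merge the two. This is a single-process illustration only; the class and method names are hypothetical, and real deployments use distributed systems such as the streaming tools listed above.

```python
from collections import Counter

class LambdaCounter:
    """Toy lambda architecture: batch view + speed view, merged at query time."""

    def __init__(self):
        self.master = []              # append-only log of every event
        self.batch_view = Counter()   # recomputed from the full log
        self.speed_view = Counter()   # incremental counts since the last batch run

    def ingest(self, event_key):
        """New events go to the master log and to the low-latency speed layer."""
        self.master.append(event_key)
        self.speed_view[event_key] += 1

    def run_batch(self):
        """Batch layer: rebuild the view from all data, then reset the speed layer."""
        self.batch_view = Counter(self.master)
        self.speed_view.clear()

    def query(self, event_key):
        """Serving layer: combine the batch result with the recent increments."""
        return self.batch_view[event_key] + self.speed_view[event_key]

pipeline = LambdaCounter()
for key in ["play", "pause", "play"]:
    pipeline.ingest(key)

pipeline.run_batch()      # batch view now covers the first three events
pipeline.ingest("play")   # arrives after the batch run, held by the speed layer

print(pipeline.query("play"))
print(pipeline.query("pause"))
```

Recomputing the batch view from the full log is what gives the design its fault tolerance: an error in the speed layer is corrected by the next batch run.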
Storage
1. Hadoop Data Lake
• Hadoop has been used for data lakes due to its resilience and low cost, scale-out data storage,
parallel processing, and clustered workload management.
• It provides massive storage for any kind of data, massive processing power, and the ability to
handle extreme volumes of concurrent tasks or jobs.
• Tableau provides direct connectivity to all the major Hadoop distributions: Cloudera via
Impala, Hortonworks via Hive, and MapR via Apache Drill.
2. Databases and Data warehouses
• Even companies who adopt other technologies typically retain relational databases as a part of
their data source mixture. Snowflake is one example of a cloud-native SQL-based enterprise
data warehouse with a native Tableau connector.
3. Cloud
• Object stores, such as Amazon Web Services Simple Storage Service (S3).
• Tableau supports Amazon’s Athena data service to connect to Amazon S3.
4. NoSQL Databases
• NoSQL databases with flexible schemas can also be used as data lakes.
• Tableau has various tools that enable connectivity to NoSQL databases directly.
• Examples of NoSQL databases that are often used with Tableau include, but
are not limited to, MongoDB, Datastax, and MarkLogic.
Processing
• The data science and engineering platform, Databricks,
offers data processing on Spark.
• Spark is a popular engine for both batch-oriented and
interactive, scale-out data processing.
• Through a native connector to Spark, one can visualize the
results of complex machine learning models from
Databricks in Tableau.
Query acceleration
o Faster databases leveraging in-memory and massively parallel
processing (MPP) technology like Exasol and MemSQL
o Hadoop-based stores like Kudu
o Technologies that enable faster queries with preprocessing like
Vertica.
o Query Accelerators
❖SQL-on-Hadoop engines like Apache Impala, Hive LLAP, etc.
❖Online Analytical Processing(OLAP)-on-Hadoop technologies like
AtScale, etc.
Data Catalog
• Enterprise data catalogs essentially serve as a
business glossary of data sources and common data
definitions, allowing users to more easily find the
right data for decision making from governed and
approved data sources.
• Data catalogs exist within visual analytics solutions
and are also available as standalone offerings
designed for seamless integration with Tableau.
Informatica is an example of a data catalog partner
of Tableau.
Major Cloud Provider examples
USE CASE
QUIZ
Which is the largest streaming platform for TV
shows and movies?
NETFLIX!
• Grown to support more than 1/3rd of all internet
traffic
• Need arose to expand capabilities to do this
• Extensive platform built on Tableau and AWS acts as
blueprint for many organisations looking to build
scalable and flexible business intelligence on the
cloud
NETFLIX
Features
Data platform is complex, but elegant
Built on events and operational data fed into Amazon S3
Data is sent to the appropriate processors (NoSQL, Amazon
Redshift, etc.) and then aggregated into Tableau Data
Extracts
Data lake/warehouse strategy allows storage of massive
amounts of data
Provides a high level view of data to analyze and explore
All data connections and extracts end up on Tableau
Server, hosted on EC2.
Benefits to NETFLIX of using Tableau Server
Reuse its data sources and govern them across a wide
range of users. E.g. dashboards can be developed that
show usage and watch patterns within individual
countries.
Helps country managers easily manage programming for
their audiences.
Dozens of people can view a dashboard while only one data
source feeds it.
Permissions can be set so that right people have access to
information which is relevant to them
So what did we learn?
References:
Big Data Analytics for Data Visualization: Review of Techniques - Geetika Chawla,
Savita Bamal, Rekha Khatana
Visualizing Big Data - Ekaterina Olshannikova, Aleksandr Ometov, Yevgeni
Koucheryavy, Thomas Olsson
Big Data and Tableau - Sofia Machairidou
https://www.tableau.com/learn/whitepapers/tableau-big-data-overview
https://www.tableau.com/products
https://www.thedataschool.co.uk/tom-pilgrem/earth-tableau-server/
https://en.wikipedia.org/wiki/Tableau_Software
THANK YOU