Visualizing Big Data David Schmittdiel CSC 9010-003 9/16/2014

Page 1: Visualizing Big Data

Visualizing Big Data

David Schmittdiel

CSC 9010-003

9/16/2014

Page 2: Visualizing Big Data

Outline

• Me
• Big Data review and background
• Problem statement
• Case study: StubHub

Page 3: Visualizing Big Data

Intro

I don’t have a Computer Science background (but I really, really regret it)

MATLAB PHP MySQL Oracle

Manager of Business Intelligence Development at StubHub
• Bringing actionable data to the masses
• Self-service, on-demand, exploratory BI
• Data discovery through visualization
• Automation

Page 4: Visualizing Big Data

Big Data, Big Ruse?

Stephen Few: “What the hell is Big Data anyway?”

BI vendor-driven responses:
• Increased data volume AND velocity
• New data sources (unstructured)

Fundamental question: Do you really need Big Data?

“Until you’ve figured out how to use the data that you already have, collecting more will only distract you from the real task. Time spent collecting more data is time that could be better spent weaving it into something meaningful.”

Stephen Few, Perceptual Edge - July/August/September 2012, “Big Data, Big Ruse”
http://www.perceptualedge.com/articles/visual_business_intelligence/big_data_big_ruse.pdf

Page 5: Visualizing Big Data

The Real Task

Transforming raw data into meaningful, useful, actionable information
→ Leveraging the past to guide future endeavors
→ Finding the signals amidst the noise

Driving forces:
• Scientific research
• Business (ecommerce)
• Government

Stephen Few: “The success of BI … [is] measured in our increased ability to understand data and then make better decisions based on that understanding.”

Page 6: Visualizing Big Data

Visualizing Small Data

MS Excel
• Ease of use for tasks involving smaller data sets, limited interactivity
• Stephen Few: “building applications on top of Excel can be arduous and painful”

Stephen Few, Perceptual Edge – September/October 2009, “Fundamental Differences in Analytical Tools”
http://www.perceptualedge.com/articles/visual_business_intelligence/differences_in_analytical_tools.pdf

Page 7: Visualizing Big Data

Visualizing Small Data

Static dashboards: “custom analytics”
• Time-consuming to build but relatively easy to maintain
• “Remove … functionality that isn’t relevant to the analytical objective of its users”

Page 8: Visualizing Big Data

Unique Challenges

Juliana Freire: “Visualization: Big Data Considerations”
• Interactivity is key, but challenging for Big Data
• Need better integration between data management and visualization components

Phil Simon describing Netflix’s data mindset:
• Data should be accessible, easy to discover, and easy to process for everyone
• The longer you take to find the data, the less valuable it becomes
• Whether a dataset is large or small, being able to visualize it makes it easier to explain

Juliana Freire, DIMACS 2013, “Big Data Analysis and Integration”
http://dimacs.rutgers.edu/Workshops/BigData/Slides/2013-dimacs.pdf

Phil Simon, HBR Webinar, “The Visual Organization: Data Visualization, Big Data, and the Quest for Better Decisions”
http://www.scribd.com/doc/232032215/HBR-Webinar-Summary-The-Visual-Organization

Page 9: Visualizing Big Data

Case Study: StubHub

Using SAP Business Objects (BO) since at least 2008 on top of Oracle 11g DW

Included in the “Leaders” quadrant of 2014 Gartner report
• BO “delivers a broad range of BI and analytic capabilities through a semantic layer best suited for large IT-managed deployments that require robust governance and administrative capabilities”
• Customers use “primarily for reporting; the number that use it for interactive discovery or visualization was well below the average”

Gartner, Magic Quadrant for Business Intelligence and Analytics Platforms
www.gartner.com/technology/reprints.do?id=1-1QLGACN&ct=140210&st=sb

Page 10: Visualizing Big Data

Case Study: StubHub

Feedback from business users was universally poor
• Hard to use
• Limited number of (inadequate) visualizations available
• Not interactive
• Supported by Tech org only

Reporting Team within Analytics org formed in January 2013
• Innovative
• Responsive
• Promote self-service
• Objective vs subjective use of data

Page 11: Visualizing Big Data

Case Study: StubHub

General concept: aggregate any metrics by any breakdown, over any time period, filtered for anything

Supports “exploratory analytics”: pursue each question as it arises

Settle instead for a collection of dashboards categorized by business use case

Page 12: Visualizing Big Data

Case Study: StubHub

First iteration: Dynamic SQL
• Complicated rules for commenting based on front-end selections

select
    -- DATE: sp.src_created_dttm_sale
    g.genre_cat_final as "GCF", -- DISPLAY CATEGORY: GCF
    g.genre_descr as "Genre", -- DISPLAY CATEGORY: Genre
    sum(sp.ticket_cost) as "GTS", -- DATA METRIC: GTS
    count(distinct transaction_id) as "# Orders", -- DATA METRIC: # Orders

from owbruntarget_dw.dw_sales_pipeline_fact sp
join owbruntarget_dw.dw_genre_dim g on sp.genre_dw_id = g.genre_dw_id -- DISPLAY CATEGORY or FILTER: GCF, Genre

where 1=1
    -- FILTER: g.genre_cat_final for GCF
    -- FILTER: g.genre_descr for Genre
    AND trunc(src_created_dttm_sale) between :startdate and :enddate

group by
    g.genre_cat_final, -- DISPLAY CATEGORY: GCF
    g.genre_descr, -- DISPLAY CATEGORY: Genre
    -- DATEG: sp.src_created_dttm_sale
    ''

Proved unworkable because of long query execution times, even after incorporating bind variables
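The commenting rules in this first iteration amount to switching select-list entries, the join, and filters on or off according to what the user picked in the front end. A minimal Python sketch of the same idea, assembling the statement from whitelisted fragments rather than toggling comments, might look like the following. The names `CATEGORIES`, `METRICS`, and `build_query`, the generated bind names `f0`, `f1`, ..., and the sample filter value are hypothetical; only the table, column, and `:startdate` / `:enddate` names come from the slide's query, and this is not the code StubHub ran.

```python
# Whitelists mapping front-end labels to SQL expressions (from the slide's query).
CATEGORIES = {
    "GCF": "g.genre_cat_final",
    "Genre": "g.genre_descr",
}
METRICS = {
    "GTS": "sum(sp.ticket_cost)",
    "# Orders": "count(distinct transaction_id)",
}


def build_query(categories, metrics, filters=None):
    """Return (sql, binds) for the selected categories, metrics, and filters.

    `filters` maps a category label to the value it must equal. Filter values
    travel as bind variables and are never interpolated into the SQL text.
    The caller still supplies :startdate and :enddate at execution time.
    """
    filters = filters or {}
    select_items = [f'{CATEGORIES[c]} as "{c}"' for c in categories]
    select_items += [f'{METRICS[m]} as "{m}"' for m in metrics]

    sql = ["select " + ", ".join(select_items),
           "from owbruntarget_dw.dw_sales_pipeline_fact sp"]
    # Join the genre dimension only when a category or filter needs it.
    if categories or filters:
        sql.append("join owbruntarget_dw.dw_genre_dim g"
                   " on sp.genre_dw_id = g.genre_dw_id")

    sql.append("where trunc(src_created_dttm_sale) between :startdate and :enddate")
    binds = {}
    for i, (label, value) in enumerate(filters.items()):
        name = f"f{i}"  # hypothetical bind-variable naming scheme
        sql.append(f"and {CATEGORIES[label]} = :{name}")
        binds[name] = value

    if categories:
        sql.append("group by " + ", ".join(CATEGORIES[c] for c in categories))
    return "\n".join(sql), binds


# Hypothetical request: break down by GCF, two metrics, one genre filter.
sql, binds = build_query(["GCF"], ["GTS", "# Orders"], {"Genre": "Rock Concert"})
print(sql)
```

Building the text this way avoids the trailing-comma and commenting bookkeeping, but it does nothing for the underlying problem the slide reports: every request still runs a fresh query against the warehouse.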

Page 13: Visualizing Big Data

Case Study: StubHub

Next iteration: “pandas” dataframes
• Open source Python library for data manipulation and analysis
• Fast and efficient DataFrame object for data manipulation with integrated indexing
• Tools for reading and writing data between in-memory data structures and different formats (e.g. CSV)

For each dashboard, one static query
• Tuning + Oracle query optimizer
• Retrieve comprehensive data set needed to power the dashboard
• Store data in CSV files on network
• “Jukebox” functionality: only files needed are loaded into memory for processing

Pandas: http://pandas.pydata.org/pandas-docs/stable/index.html
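One way to read the “jukebox” idea is as a lazy, cached loader: a CSV extract is read into a DataFrame the first time a dashboard asks for it and reused afterwards, so files nobody requests never occupy memory. The sketch below assumes one CSV file per data set; the directory `extracts`, the function name `load_dataset`, and the cache size are hypothetical, and the slides do not say how the real implementation decided what to keep loaded.

```python
from functools import lru_cache
from pathlib import Path

import pandas as pd

# Hypothetical location of the per-dashboard CSV extracts on the network share.
DATA_DIR = Path("extracts")


@lru_cache(maxsize=8)  # assumed limit on how many extracts stay in memory
def load_dataset(name: str, data_dir: Path = DATA_DIR) -> pd.DataFrame:
    """Read one CSV extract on first request and reuse it on later requests."""
    return pd.read_csv(data_dir / f"{name}.csv")
```

Because the cached DataFrame is shared between callers, request handlers should treat it as read-only and filter or aggregate into new objects rather than modifying it in place.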

Page 14: Visualizing Big Data

Case Study: StubHub

Results:
• Huge decrease in dashboard run times
• Corresponding increase in adoption rate

Page 15: Visualizing Big Data

Case Study: StubHub

Where does the interactivity necessary for data discovery come from?

Template-based front end built with PHP + HTML + CSS + jQuery
• Provide different levels of granularity
• Decreases amount of time needed to create a new dashboard (vs. Tableau)

Menus control requests for:
• Categories → group by
• Metrics → aggregate functions
• Filters → where clause
• Date range
• Chart types, date aggregation
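The menu-to-query mapping on this slide carries over directly to an in-memory DataFrame: filters become a boolean mask, categories become `groupby` keys, and metrics become aggregate functions. The following is a rough sketch of that mapping using pandas named aggregation. The function `run_request`, the column names, and the sample rows are all made up for illustration, and it assumes at least one category is selected.

```python
import pandas as pd


def run_request(df, categories, metrics, filters, start, end, date_col="sale_date"):
    """Apply one dashboard request to an in-memory DataFrame.

    categories -> group-by columns (at least one)
    metrics    -> {output name: (source column, aggregate function)}
    filters    -> {column: list of allowed values}
    """
    mask = df[date_col].between(start, end)      # date range (inclusive)
    for col, allowed in filters.items():         # where clause
        mask &= df[col].isin(allowed)
    grouped = df[mask].groupby(categories)       # group by
    return grouped.agg(**metrics).reset_index()  # aggregate functions


# Made-up sample data standing in for one dashboard's extract.
sales = pd.DataFrame({
    "sale_date": pd.to_datetime(
        ["2014-09-01", "2014-09-02", "2014-09-02", "2014-10-01"]),
    "genre": ["Sports", "Sports", "Concerts", "Sports"],
    "ticket_cost": [100.0, 50.0, 80.0, 70.0],
    "transaction_id": [1, 2, 3, 4],
})

result = run_request(
    sales,
    categories=["genre"],
    metrics={"GTS": ("ticket_cost", "sum"),
             "# Orders": ("transaction_id", "nunique")},
    filters={"genre": ["Sports", "Concerts"]},
    start=pd.Timestamp("2014-09-01"),
    end=pd.Timestamp("2014-09-30"),
)
print(result)
```

Date aggregation (for example, rolling daily rows up to weeks or months) and chart type are left out of this sketch; they would be further parameters on the same request.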

Page 16: Visualizing Big Data

Case Study: StubHub

How to provide integration between back-end data management and front-end visualization components?

Solution is Data-Driven Documents (D3.js)
• JavaScript library to drive the creation and control of dynamic and interactive graphical forms which run in web browsers
• W3C-compliant, making use of the widely implemented Scalable Vector Graphics (SVG), JavaScript, HTML5, and Cascading Style Sheets (CSS3) standards
• Large data sets can be easily bound to SVG objects using JSON and simple D3 functions to generate charts and diagrams

D3: http://d3js.org/
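On the back-end side, the hand-off to D3 can be as small as serializing the aggregated rows to JSON. The slides do not show the payload format; the sketch below assumes an array of objects, one per row, which is a shape D3 can bind to SVG elements through `selection.data()`. The function name `to_d3_records` and the sample values are hypothetical.

```python
import json


def to_d3_records(columns, rows):
    """Serialize tabular results as a JSON array of objects, one per row."""
    return json.dumps([dict(zip(columns, row)) for row in rows])


# Made-up aggregated result: one row per genre.
payload = to_d3_records(["genre", "gts"], [("Sports", 150.0), ("Concerts", 80.0)])
print(payload)
# prints [{"genre": "Sports", "gts": 150.0}, {"genre": "Concerts", "gts": 80.0}]
```

Keeping the column names as object keys lets the browser-side chart code refer to fields by name instead of by position.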

Page 17: Visualizing Big Data

Case Study: StubHub

Summary of approach

Create a collection of BI dashboards that are:
• Fast
• Customizable
• Interactive
• Highly visual
• On-demand
• Scalable
• Consistent

Custom build EVERYTHING as needed

Leverage open source technologies whenever possible

Data source agnostic to accommodate new data stores as they become available
• Output from MapReduce jobs in CSV format