Big Data, Easy BI PRESENTED BY Tim Hsu January 28, 2014



Tim Hsu

• Senior Data Engineer at Yahoo!

• Data modeling, BI application design

• To provide an integrated, easy-to-use BI system for Yahoo! EC users

Neal Lee

• Senior Data Engineer

• Aims to build an easy-to-use self-service BI platform connecting to Hadoop

Johnny Nien

• Senior Data Engineer

• Software developer specializing in large-scale data processing infrastructure and applications

Agenda

Background Introduction

Solution

Summary

Background Introduction

Who are we?

Challenges

What do we need?

APAC is the best region where Yahoo! runs EC business

Major EC properties

› 2001 Auction

› 2004 Shopping Mall

› 2008 Store Market

Yahoo! is the leading e-commerce company in Taiwan

Who Are We?

[Figure: 2011 Taiwan Retail Revenue, in MM USD (axis from 0 to 5,000). Retailers listed: EHS; National 3C Chains; Fubon momo TV shopping; FarEastern Dept store; TK 3C; PC Home; SOGO Dept Store; Y!EC; RT Mart (hyper mart); FamiMart; PxMart (hyper mart); Carrefour; ShinKwan Mitsukoshi; 7-Eleven.]

Types of End Users in Yahoo! Taiwan


GM, BU Heads

Business Analysts

Marketers, Data Analysts

Category Managers

Suppliers, Sellers

BI Needs for Different Types of Users

[Figure: the five user types (GM, BU Heads; Business Analysts; Marketers, Data Analysts; Category Managers; Suppliers, Sellers) plotted against two axes, Data Scale and Analytics & Interactivity, with scale labels Sophisticated/Summarized and Low/High.]

Challenges

[Figure: labels shown. Data: ERP Transactions; Web Logs (Browsing, Purchase); DW/DM. Reports: Ad hoc Reports; Performance Reports; Management Reports; Traffic Reports. Tools: PHP, ASP.NET; MicroStrategy; Hyperion; SQL, Stored Procedure, Pig, HiveQL; PHP, Web Services API.]

Yahoo! Taiwan Needs …


One unified data platform for retrieving information in an easy and efficient way.

Where We Are Going …


Business Intelligence Application

Business Intelligence Platform

Data Storage

Data Process

Data Source

Solution

Architecture

Enhance by Using Shark/Spark

Demonstration

Architecture

[Figure: components shown. Sites: Auction, Shopping, Store (each with Instrumentation). Back ends: Auction Backend, Shopping ERP, Store ERP. ETL into Oracle RAC (Listing, Member, Revenue, Seller, Sales, Supplier). F ETL. Yahoo! Grid (Page View, Click Event, Session). ETL. Beacon Servers. Data Highway. Users. Hive. Shark. MicroStrategy SQL Engine.]

Unified BI Platform


Oracle RAC

In Memory Caches

Hive Performance Test

Use case: Visitor distribution by demographic and device preference

Source data: 293 TB of web logs over 60 days

Transformed cube: 2.3 GB, 60.5M rows

Test environment

› MicroStrategy Server: 8 cores at 2.5 GHz, 16 GB RAM, v9.2.1

› Hive Server: 4 cores at 2.5 GHz, 4 GB RAM, v0.9

› Hadoop clusters: 300+ nodes, v0.23

Hive Test Cases

› Case C1: Cross tab with date slice

› Case C2: Dynamic prompt on date

› Case C3: Dynamic data grouping (Browser)

› Case C4: 80/20 Analysis

› Case C5: Data grouping & charting

Hive Performance Test

Average response time is less than 20 seconds under the stress of 50 concurrent users against 60 days of data.

Avg. response time (sec) by data volume in cube and number of concurrent users (CU)

            20 Days   40 Days   60 Days
10 CU          1.8       3.1       4.7
25 CU          3.5       6.8       9.6
50 CU          6.1      12.1      19.2
100 CU        11.9      24.5      36.1
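The response-time figures reported for this test can be checked for how they scale. The following is a minimal sketch that only re-uses the numbers from the slide; the function and variable names are illustrative and not part of the original test harness.

```python
# Average response times (seconds) copied from the slide.
# Keys: concurrent users (CU); values: times for 20, 40 and 60 days of data.
response_sec = {
    10: [1.8, 3.1, 4.7],
    25: [3.5, 6.8, 9.6],
    50: [6.1, 12.1, 19.2],
    100: [11.9, 24.5, 36.1],
}
days = [20, 40, 60]


def growth_with_volume(times):
    """Ratio of the 60-day response time to the 20-day response time."""
    return times[-1] / times[0]


def growth_with_users(col, low_cu=10, high_cu=100):
    """Ratio of the response time at high_cu to the one at low_cu."""
    return response_sec[high_cu][col] / response_sec[low_cu][col]


for cu, times in response_sec.items():
    print(f"{cu:>3} CU: 60-day / 20-day = {growth_with_volume(times):.2f}x")

for col, d in enumerate(days):
    print(f"{d} days: 100 CU / 10 CU = {growth_with_users(col):.2f}x")
```

On these numbers, tripling the data volume raises the response time by roughly 2.6x to 3.1x, while a tenfold increase in concurrent users raises it by roughly 6.6x to 7.9x.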

Enhance by Using Spark/Shark

Spark is a fast and expressive cluster computing system interoperable with Apache Hadoop

[Figure: iterative jobs compared. Map/Reduce: Input, then a file system read and a file system write in each iteration (iter. 1, iter. 2, …). Spark: Input, one file system read, then memory writes and memory reads across iterations (iter. 1, iter. 2, …).]
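The difference between re-reading input from the file system on every iteration and keeping it in memory can be illustrated without a cluster. The sketch below is a plain-Python analogy, not Spark or MapReduce code: `CountingStore`, `step`, `run_reloading` and `run_cached` are hypothetical names, and only the read side of the pattern is modeled.

```python
import os
import tempfile


class CountingStore:
    """Toy stand-in for a file system that counts how often it is read."""

    def __init__(self, values):
        fd, self.path = tempfile.mkstemp(suffix=".txt")
        with os.fdopen(fd, "w") as f:
            f.write("\n".join(str(v) for v in values))
        self.reads = 0

    def read(self):
        self.reads += 1
        with open(self.path) as f:
            return [float(line) for line in f]

    def close(self):
        os.remove(self.path)


def step(data, estimate):
    """One iteration: move the estimate halfway toward the mean of the data."""
    mean = sum(data) / len(data)
    return estimate + 0.5 * (mean - estimate)


def run_reloading(store, iterations):
    """Reload the input from the store in every iteration."""
    estimate = 0.0
    for _ in range(iterations):
        estimate = step(store.read(), estimate)
    return estimate


def run_cached(store, iterations):
    """Read the input once, then iterate over the in-memory copy."""
    data = store.read()
    estimate = 0.0
    for _ in range(iterations):
        estimate = step(data, estimate)
    return estimate


store = CountingStore([1, 2, 3, 4])
reloaded = run_reloading(store, iterations=5)
reads_when_reloading = store.reads
store.reads = 0
cached = run_cached(store, iterations=5)
reads_when_cached = store.reads
store.close()
print(reads_when_reloading, reads_when_cached, reloaded == cached)
```

Both variants compute the same estimate; the cached variant touches the store once instead of once per iteration.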

Enhance by Using Spark/Shark

Shark is an analytic query engine built on top of Spark

› 100% compatible with Hive

› Can be up to 100x faster than Hive

[Figure: two architectures side by side. With MapReduce: Client (CLI, JDBC); Driver (SQL Parser, Query Optimizer, Physical Plan, Execution); Metastore; MapReduce; HDFS. With Spark: Client (CLI, JDBC); Driver (SQL Parser, Query Optimizer, Physical Plan, Execution, Cache Mgr.); Metastore; Spark; HDFS.]

Item-based recommendation system by collaborative filtering

Modules implemented

› Viewed-also-viewed (Shopping)

› Bought-also-bought (Shopping)

› Bought-after-viewed (Auction)

Implemented by Pig script
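A "viewed-also-viewed" module can be approximated by counting how often two items appear in the same session. The sketch below is a simplified co-occurrence version in Python, not the Pig script used in the talk; `also_viewed` and the sample sessions are made up for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def also_viewed(sessions, top_n=3):
    """For each item, rank other items by how many sessions contain both."""
    pair_counts = defaultdict(Counter)
    for items in sessions:
        # Use each item once per session, then count every unordered pair.
        for a, b in combinations(sorted(set(items)), 2):
            pair_counts[a][b] += 1
            pair_counts[b][a] += 1
    return {
        item: [other for other, _ in counts.most_common(top_n)]
        for item, counts in pair_counts.items()
    }


# Hypothetical browsing sessions, each a list of viewed item ids.
sessions = [
    ["camera", "tripod", "sd_card"],
    ["camera", "sd_card"],
    ["tripod", "camera"],
    ["phone", "sd_card"],
]

recs = also_viewed(sessions)
for item, others in sorted(recs.items()):
    print(item, "->", others)
```

A production system would typically normalize these counts (for example with a similarity measure) so that very popular items do not dominate every list.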


Spark Performance Test: Recommendation by CF

Pig vs. Spark: CF Performance Test

Environment                Nodes   CPU        RAM     HD       Run time
Yahoo! Grid (production)   3,616   16 Cores   48 GB   16 TB    106 mins
Virtual machines           10      2 Cores    4 GB    100 GB   14 mins (7.5x faster!)
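The speedup quoted on this slide follows from the two run times. The sketch below assumes that the 106-minute run is the Pig job on the Yahoo! Grid and the 14-minute run is the Spark job on the 10 virtual machines, which is how the slide appears to lay them out; the variable names are illustrative.

```python
# Run times from the slide (assumed mapping: Pig = 106 min, Spark = 14 min).
pig_minutes = 106
spark_minutes = 14


def total_cores(nodes, cores_per_node):
    """Total CPU cores across a cluster of identical nodes."""
    return nodes * cores_per_node


speedup = pig_minutes / spark_minutes
grid_cores = total_cores(3616, 16)
vm_cores = total_cores(10, 2)

print(f"speedup: {speedup:.2f}x")
print(f"cores:   {grid_cores} on the grid vs. {vm_cores} on the VMs")
```

The ratio is about 7.57, which the slide reports as 7.5x.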

Put Shark into The Scene

[Figure: components shown. Web Clickstream; EC Backend; ERP; ETL; Yahoo! TW Grid; Hive; Shark; MicroStrategy SQL Engine; Users.]

Demonstration

Dynamic OLAP analysis using in memory cubes

Self-service Business Intelligence

Summary

Lessons Learned

Benefits to Yahoo! Taiwan


Lessons Learned


Data modeling

› Join operation is extremely expensive in Hive/Shark

Denormalize as much as possible

Modeling in snowflake schema

Data processing (ETL)

› Use partition to minimize data loading time

› Hive handles partitions well, but Shark does not

Keep partitioned tables for daily refresh

Create and cache non-partitioned tables for MicroStrategy

Shark is not the silver bullet

› Aggregation is still needed for best performance

Aggregation tables for ad-hoc query

Intelligent Cubes for dashboards

Lessons Learned – expect the unexpected

A

select day_id, count(distinct buyer_id) as buyer_cnt
from fact_table
group by day_id;

B

select day_id, count(buyer_id) as buyer_cnt
from (
  select day_id, buyer_id
  from fact_table
  group by day_id, buyer_id
) tmp
group by day_id;

1. Rewrite SQL (A to B)

2. Performance improved significantly in Shark 0.8
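Queries A and B should return the same buyer counts, which can be checked on a small table. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Hive/Shark, with made-up rows; an `order by` is added to both queries only so the results compare deterministically.

```python
import sqlite3

QUERY_A = """
select day_id, count(distinct buyer_id) as buyer_cnt
from fact_table
group by day_id
order by day_id
"""

QUERY_B = """
select day_id, count(buyer_id) as buyer_cnt
from (
    select day_id, buyer_id
    from fact_table
    group by day_id, buyer_id
) tmp
group by day_id
order by day_id
"""


def run_both(rows):
    """Load rows into an in-memory table and run both query forms."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table fact_table (day_id text, buyer_id integer)")
    conn.executemany("insert into fact_table values (?, ?)", rows)
    result_a = conn.execute(QUERY_A).fetchall()
    result_b = conn.execute(QUERY_B).fetchall()
    conn.close()
    return result_a, result_b


# Hypothetical data: buyer 1 purchases twice on the first day.
rows = [
    ("2013-12-01", 1),
    ("2013-12-01", 1),
    ("2013-12-01", 2),
    ("2013-12-02", 2),
    ("2013-12-02", 2),
]

result_a, result_b = run_both(rows)
print(result_a)
print(result_b)
```

The check only covers correctness. The performance difference reported in the talk depends on how the engine plans the distinct aggregation and is not reproduced here.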

Lessons Learned – expect the unexpected

A

select day_id, sum(order_amt) as revenue
from fact_table
where day_id between date_add('2013-12-01', -10)
                 and date_add('2013-12-01', 0)
  and cate_id in (1, 2, 3)
group by day_id;

B

select day_id, sum(order_amt) as revenue
from fact_table
where cate_id in (1, 2, 3)
  and day_id between date_add('2013-12-01', -10)
                 and date_add('2013-12-01', 0)
group by day_id;

1. Change sequence of filters

2. Write a patch to evaluate and replace date_add(): date_add('2013-12-01', -10) becomes '2013-11-21' and date_add('2013-12-01', 0) becomes '2013-12-01'
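The idea of evaluating date_add() ahead of time can be sketched as a text rewrite applied to the query before it is submitted. This is a client-side illustration in Python, not the patch described in the talk; `fold_date_add` is a hypothetical helper that handles only date literals with integer day offsets.

```python
import re
from datetime import date, timedelta

# Matches date_add('<yyyy-mm-dd>', <integer>) with optional whitespace.
DATE_ADD = re.compile(
    r"date_add\(\s*'(\d{4})-(\d{2})-(\d{2})'\s*,\s*(-?\d+)\s*\)"
)


def fold_date_add(sql):
    """Replace date_add() calls on literal dates with the resulting literal."""

    def evaluate(match):
        year, month, day, offset = (int(g) for g in match.groups())
        result = date(year, month, day) + timedelta(days=offset)
        return f"'{result.isoformat()}'"

    return DATE_ADD.sub(evaluate, sql)


query = (
    "select day_id, sum(order_amt) as revenue from fact_table "
    "where cate_id in (1, 2, 3) "
    "and day_id between date_add('2013-12-01', -10) "
    "and date_add('2013-12-01', 0) "
    "group by day_id"
)

print(fold_date_add(query))
```

After folding, the filter compares day_id against the constants '2013-11-21' and '2013-12-01', matching the literals shown on the slide.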

Benefits to Yahoo! Taiwan


One unified data platform for all EC properties, i.e., the EC Source of Truth

› Access transaction and web traffic data simultaneously and transparently.

Self-service BI reporting

› Users can now create their own reports at the “speed of thought”.

Sophisticated dashboards

› Consolidate different information into one single screen.

Low latency

› Daily report average response time improved by 83%, from 43.6 seconds to 7.4 seconds.
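The 83% figure is the relative reduction in response time and can be verified from the two reported values. A minimal sketch, with illustrative names:

```python
def reduction_pct(before_sec, after_sec):
    """Relative reduction in response time, as a percentage of the original."""
    return (before_sec - after_sec) / before_sec * 100


# Values reported on the slide.
before_sec, after_sec = 43.6, 7.4

print(f"reduction: {reduction_pct(before_sec, after_sec):.1f}%")
print(f"speedup:   {before_sec / after_sec:.1f}x")
```

Expressed as a speedup, the same change is a little under 6x.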
