Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

April 10-12, Chicago, IL Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together Dianne Cantwell and Denny Lee

Upload: denny-lee

Post on 15-Apr-2017


Page 1: Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

April 10-12, Chicago, IL

Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together
Dianne Cantwell and Denny Lee

Page 2: Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

Please silence cell phones

Page 3: Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

Agenda
• Yahoo! Business Case for Hadoop and BI
• Big Data, Fast Queries

Big Data / BI Themes
• Get the Hardware Balance Right
• Partitioning, Partitioning, Partitioning
• Keep it Simple
• It is the order of things

Page 4: Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

Yahoo! TAO Business Challenge

Yahoo! manages a powerful, scalable advertising exchange that includes publishers and advertisers

Page 5: Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

Yahoo! TAO Business Challenge

Advertisers want to get the best bang for their buck by reaching their targeted audiences effectively and efficiently

Page 6: Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

Yahoo! TAO Business Challenge

Yahoo! needs visibility into how consumers are responding to ads along many dimensions: web sites, creatives, time of day, user segments (e.g. gender, age, location) to make the exchange work as efficiently and effectively as possible

Dianne
I would add "User segments" to this list
Denny Lee
does my update make sense?
Page 7: Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

Yahoo! TAO Technical Requirements

Visitors to Yahoo! Branded sites: 680,000,000
Ad Impressions: 3,500,000,000 (per day)
Rows Loaded: 464,000,000,000 (per qtr)
Refresh Frequency: Hourly
Average Query Time: <10 seconds
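
To put the load requirement in perspective, here is a back-of-the-envelope calculation of the sustained rate an hourly refresh has to keep up with. It is only a rough sketch: the 91-day quarter and the assumption that rows arrive evenly around the clock are not stated on the slide.

```python
# Rough sizing of the hourly refresh from the "Rows Loaded" figure.
ROWS_PER_QUARTER = 464_000_000_000  # from the slide
DAYS_PER_QUARTER = 91               # assumption: a 13-week quarter
HOURS_PER_DAY = 24
SECONDS_PER_HOUR = 3600

# Assumption: the load is spread evenly over the whole quarter.
rows_per_day = ROWS_PER_QUARTER / DAYS_PER_QUARTER
rows_per_hour = rows_per_day / HOURS_PER_DAY
rows_per_second = rows_per_hour / SECONDS_PER_HOUR

print(f"{rows_per_day:,.0f} rows/day")
print(f"{rows_per_hour:,.0f} rows/hour")
print(f"{rows_per_second:,.0f} rows/second")
```

Under those assumptions each hourly refresh carries a little over 200 million rows, which works out to a sustained rate of roughly 59,000 rows per second.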

Page 8: Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

Yahoo! TAO Platform Architecture
How did we load so much so quickly?

[Architecture diagram: File 1, File 2, … File N; Partition 1, Partition 2, … Partition N; Partition 1, Partition 2, … Partition N]

Data Aggregation & ETL: Hadoop (2PB cluster)
Data Archive & Staging: Oracle 11G RAC
BI Server: SQL Server Analysis Services 2008 R2 (24TB cube/qtr)
Data volumes shown: 1.2TB/day; 135GB/day compressed
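
The two daily volume figures can be related with a quick calculation. This is a sketch under assumptions the slide does not confirm: that 1.2TB/day and 135GB/day describe the same daily feed before and after compression, that units are decimal (1TB = 1000GB), and that a quarter is 91 days.

```python
# Relating the daily volume figures from the architecture slide.
RAW_TB_PER_DAY = 1.2          # from the slide
COMPRESSED_GB_PER_DAY = 135   # from the slide
GB_PER_TB = 1000              # assumption: decimal units
DAYS_PER_QUARTER = 91         # assumption: a 13-week quarter

# Assumption: both figures describe the same feed, raw vs. compressed.
compression_ratio = RAW_TB_PER_DAY * GB_PER_TB / COMPRESSED_GB_PER_DAY
compressed_tb_per_quarter = COMPRESSED_GB_PER_DAY * DAYS_PER_QUARTER / GB_PER_TB

print(f"compression ratio: about {compression_ratio:.1f}x")
print(f"compressed volume per quarter: about {compressed_tb_per_quarter:.1f}TB")
```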

Dianne
I don't know if all these numbers are exactly correct still, but I don't think anything significant has changed since then so we should probably just go with this. If you want me to dig into it and get these more accurate let me know.
Denny Lee
Nah - these numbers should be okay, eh?!
Page 9: Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

Yahoo! TAO Platform Architecture
Queries at the “speed of thought”

BI Query Servers: SQL Server Analysis Services 2008 R2
• 24TB cube/qtr
• 464B rows of event-level data/qtr
• Dimensions: 42
• Attributes: 296
• Measures: 278

Adhoc Query/Visualization: Tableau Desktop 7
Optimization Application: Custom J2EE App

Avg Query Time: 2 secs
Avg Query Time: 5 secs

Page 10: Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

Yahoo! TAO Return on Investment

For campaigns optimized using TAO, advertisers spent more with Yahoo! than before

For campaigns optimized using TAO, more eCPMs (revenue)!

Dianne
I've requested data points here but don't know if I'll get anything back.
Denny Lee
No need!
Page 11: Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

Yahoo! TAO Return on Investment

Yahoo! TAO exposed customer segment performance to campaign managers and advertisers for the first time! No longer “flying audience blind”

Page 12: Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

Yahoo! TAO Future Direction

Increase Segments by 3x
• Increase data size and cartesian

No longer doing distinct count
• Built frequency reports and sampling to deliver this due to the inherent complexity!

Current Challenge
• Hadoop to SSAS cube (more later)
• External access to cubes
• More disk due to need for more IO

Page 13: Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

Big Data Analytics Challenges

CubeF

Page 14: Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

Get the data out!

Page 15: Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

Extracting the data

File Generation
• Hadoop jobs create many files that are exported / dumped to disk in tabular format

File Staging
• Files are propped to a staging folder for relational DB access

Oracle External Tables
• Generate external tables that point to the staged files
• No need to import the data
• Processing is slow
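
The "generate external tables" step can be sketched as a small generator that emits Oracle external-table DDL over whatever files are currently staged. This is a minimal illustration, not the DDL used at Yahoo!: the table name, column list, directory object, file names, and comma delimiter are all hypothetical.

```python
def external_table_ddl(table, columns, directory, files, delimiter=","):
    """Build Oracle external-table DDL that points at staged flat files.

    `directory` is the name of an Oracle DIRECTORY object that maps to the
    staging folder. All names passed in by the caller are illustrative.
    """
    col_defs = ",\n    ".join(f"{name} {sql_type}" for name, sql_type in columns)
    locations = ", ".join(f"'{f}'" for f in files)
    return (
        f"CREATE TABLE {table} (\n    {col_defs}\n)\n"
        "ORGANIZATION EXTERNAL (\n"
        "  TYPE ORACLE_LOADER\n"
        f"  DEFAULT DIRECTORY {directory}\n"
        "  ACCESS PARAMETERS (\n"
        "    RECORDS DELIMITED BY NEWLINE\n"
        f"    FIELDS TERMINATED BY '{delimiter}'\n"
        "  )\n"
        f"  LOCATION ({locations})\n"
        ")\n"
        "REJECT LIMIT UNLIMITED"
    )


# Hypothetical staged files and columns, for illustration only.
ddl = external_table_ddl(
    table="tao_fact_ext",
    columns=[("event_hour", "VARCHAR2(13)"), ("ad_id", "NUMBER"), ("impressions", "NUMBER")],
    directory="staging_dir",
    files=["part-00000", "part-00001"],
)
print(ddl)
```

Because the table only points at the files, nothing is imported; every query re-reads and re-parses the flat files, which is consistent with the "processing is slow" bullet.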

Page 16: Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

AS on Oracle Case

[Diagram: two load paths, Oracle to Analysis Services, and Oracle to SQL to Analysis Services]

• Oracle OLEDB: 10K rows/sec
• SSIS Connector: 20K rows/sec
• 100K rows/sec
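
To see what these throughput figures mean in practice, the sketch below converts each rate into the time needed to move a fixed number of rows over a single stream. The 1 billion row volume is an illustrative round number, not a figure from the slide, and the label for the 100K rows/sec rate is a placeholder.

```python
# Throughput figures from the slide, in rows per second.
RATES_ROWS_PER_SEC = {
    "Oracle OLEDB": 10_000,
    "SSIS Connector": 20_000,
    "100K rows/sec path": 100_000,  # placeholder label for the third figure
}

ROWS_TO_LOAD = 1_000_000_000  # assumption: illustrative volume only


def hours_to_load(rows, rows_per_sec):
    """Hours needed to move `rows` at a constant single-stream rate."""
    return rows / rows_per_sec / 3600


load_hours = {
    name: hours_to_load(ROWS_TO_LOAD, rate)
    for name, rate in RATES_ROWS_PER_SEC.items()
}
for name, hours in load_hours.items():
    print(f"{name}: {hours:.1f} hours per billion rows")
```

At 10K rows/sec a billion rows takes more than a day on one stream, while at 100K rows/sec it takes under three hours, which is why the choice of data path matters so much at this scale.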

Dianne
We need to chat about the positioning and arguments around this stuff. It's great to walk through it with these guys.
Denny Lee
Absolutely!
Page 17: Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

Passthrough Query to Linked Server

http://msdn.microsoft.com/en-us/library/jj710329.aspx
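
In SQL Server, one common way to express a passthrough query against a linked server is OPENQUERY, which hands the statement text to the remote database to execute. The sketch below only builds such a statement as a string; the linked server name ORA_TAO and the Oracle table and column names are hypothetical, and the slide does not say which exact syntax was used.

```python
def openquery(linked_server, remote_sql):
    """Wrap remote_sql in a T-SQL OPENQUERY passthrough call.

    The remote statement is embedded as a T-SQL string literal, so any
    single quotes inside it must be doubled.
    """
    escaped = remote_sql.replace("'", "''")
    return f"SELECT * FROM OPENQUERY({linked_server}, '{escaped}')"


# Hypothetical Oracle-side query for one day's partition.
remote = "SELECT * FROM tao_fact WHERE event_day = DATE '2010-10-10' ORDER BY ad_id"
statement = openquery("ORA_TAO", remote)
print(statement)
```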

Page 18: Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

Partitioning, Partitioning, Partitioning

Page 19: Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

Partitions

• Data is streamed in to Oracle to files
• To get max processing, 30 threads are fired because all T (temp) partitions are processed concurrently
• Super fast data loads
• Problem is that it requires constant merging of partitions

[Diagram: Files are streamed in as they become available (10/10/10 T360772, 10/10/10 T360773, … 10/10/10 T361645). Each file has a matching Oracle 10g partition and a matching SSAS partition (10/10/10 T360772, 10/10/10 T360773, … 10/10/10 T361645), and the SSAS T partitions are merged into the 10/10/10 partition.]
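
The streaming model above amounts to processing many temp partitions concurrently and then merging them into the day's partition. Here is a minimal sketch of that pattern using a thread pool capped at 30 workers. The processing function is a stand-in that returns a fake row count; real processing and merging would go through Analysis Services, which this sketch does not attempt to model.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_THREADS = 30  # from the slide


def process_temp_partition(name):
    """Stand-in for processing one T (temp) partition; returns a fake row count."""
    return name, 1_000


def load_day(day, temp_ids):
    """Process all T partitions for a day concurrently, then merge them."""
    names = [f"{day} T{i}" for i in temp_ids]
    with ThreadPoolExecutor(max_workers=MAX_THREADS) as pool:
        processed = list(pool.map(process_temp_partition, names))
    # Merge step: fold every T partition into the single daily partition.
    merged_rows = sum(rows for _, rows in processed)
    return {"partition": day, "rows": merged_rows, "merged_from": len(processed)}


# Five hypothetical temp partitions, numbered like those in the diagram.
result = load_day("10/10/10", range(360772, 360777))
print(result)
```

The merge step is the cost the slide calls out: the more T partitions stream in, the more often they have to be folded back into the daily partition.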

Page 20: Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

Partitions – Directly Merging

• New model allows for set hourly partitions
• No more streaming data but with hourly partitions, cannot have as many threads for fast data loads, unless…
• Process multiple cubes or measure groups in parallel

[Diagram: Oracle 10g hourly partitions (10/10/10 00:00, 10/10/10 01:00, … 10/10/10 23:00) alongside matching hourly SSAS partitions (10/10/10 00:00, 10/10/10 01:00, … 10/10/10 23:00) for three measure groups: Segments, Activities, Uniques]
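
With fixed hourly partitions there is only one new partition per measure group each hour, so the parallelism has to come from working across measure groups. The sketch below enumerates one day's work items across the three measure groups named in the diagram. The mm/dd/yy reading of the date and the idea of dispatching these tuples to workers are assumptions for illustration.

```python
from datetime import datetime, timedelta
from itertools import product

MEASURE_GROUPS = ["Segments", "Activities", "Uniques"]  # from the slide


def hourly_partitions(day):
    """Return the 24 hourly partition labels for a day given as mm/dd/yy."""
    start = datetime.strptime(day, "%m/%d/%y")
    return [
        (start + timedelta(hours=h)).strftime("%m/%d/%y %H:%M")
        for h in range(24)
    ]


# One work item per (measure group, hourly partition) pair.
work_items = list(product(MEASURE_GROUPS, hourly_partitions("10/10/10")))
print(len(work_items), "work items")
print(work_items[0], work_items[-1])
```

Each hour yields three independent items (one per measure group) that can be processed side by side, instead of dozens of temp partitions within a single measure group.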

Page 21: Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

It is the order of things

Page 22: Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

It is the order of things

“I am a Jem'Hadar. He is a Vorta. It is the order of things.”
“Do you really want to give up your life for the 'order of things'?”
“It is not my life to give up, Captain – and it never was.”

Rocks and Shoals, Deep Space Nine
Written by Ronald D. Moore

Page 23: Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

Segments and the importance of sort order

Data File                   Sorted   Not Sorted   % Diff
fact.data              195,708,592  344,502,968   43.19%
agg.rigid.data         106,825,677  106,825,677    0.00%
dim1.dim2.fact.map      17,332,729   32,989,946   47.46%
dim1.dim3.fact.map      16,923,276   32,222,813   47.48%
dim1.dim4.fact.map       6,079,396   12,286,978   50.52%
dim5.dim6.fact.map       2,630,888    6,057,334   56.57%
dim1.dim7.fact.map       1,809,725    3,904,004   53.64%
dim8.dim9.fact.map       1,592,886    3,793,452   58.01%
dim1.dim10.fact.map      1,419,255    3,108,248   54.34%
dim8.dim11.fact.map      1,301,221    3,042,638   57.23%
dim1.dim12.fact.map      2,949,432    2,949,432    0.00%
dim1.dim13.fact.map      2,934,836    2,934,836    0.00%
dimA.dimA.fact.map       1,101,552    2,716,289   59.45%
dim8.dimB.fact.map         961,332    2,451,956   60.79%
dim1.dimC.fact.map       1,027,305    2,323,906   55.79%
dim8.dim8.fact.map       1,592,886    2,308,232   30.99%
dimA.dimD.fact.map         851,095    2,170,962   60.80%

[Chart: Not Sorted vs. Sorted]
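
The % Diff column appears to be the reduction relative to the unsorted figure, that is (Not Sorted - Sorted) / Not Sorted. The slide does not state the formula, so this is an inference; the short check below applies it to three rows of the table.

```python
def pct_diff(sorted_size, not_sorted_size):
    """Reduction from sorting, as a percentage of the unsorted figure."""
    return (not_sorted_size - sorted_size) / not_sorted_size * 100


# (Sorted, Not Sorted) pairs copied from the table.
rows = {
    "fact.data": (195_708_592, 344_502_968),
    "agg.rigid.data": (106_825_677, 106_825_677),
    "dim1.dim2.fact.map": (17_332_729, 32_989_946),
}

for name, (sorted_size, not_sorted_size) in rows.items():
    print(f"{name}: {pct_diff(sorted_size, not_sorted_size):.2f}%")
```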

Dianne
We can comment that sorting is needed BOTH in Oracle AND in the cube partition queries.
Denny Lee
Please do!
Page 24: Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

Across the Eighth Dimension!

How do you associate dimensions with Star Trek Into Darkness?

Cube

Page 25: Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

Page 26: Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

Back to cube dimensions

Running ProcessUpdate
• Takes a long time to run because all of the fact partitions are re-indexed!

Minimize likelihood by building SCD-2 dimensions
• Composite Key based on lowest level unique values to represent row
• Sometimes identity can be just as effective though hashing requires mapping or lookup tables

Create SK to allow for SCD-2 dimensions
• Key is that we keep the memory space of the SK small
• Composite(Composite) or Hash(Composite) is good for dimensions loaded from fact BUT do not expect Type-2 for fact-based dimensions
• Important to call out restatement based on current data (high cost associated with keeping versioned history of dimension tables)
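
The Composite and Hash(Composite) ideas can be illustrated with a short sketch. It is one possible reading of the slide, not the scheme used in TAO: the separator character, the choice of SHA-256, and the truncation to 64 bits are all assumptions made to keep the key small and fixed-width.

```python
import hashlib

# Unit separator keeps ("ab", "c") and ("a", "bc") from colliding when joined.
SEP = "\x1f"


def composite_key(*parts):
    """Composite key: the lowest-level unique values joined into one string."""
    return SEP.join(str(p) for p in parts)


def hash_key(*parts):
    """Hash(Composite): a fixed-width 64-bit integer derived from the composite key."""
    digest = hashlib.sha256(composite_key(*parts).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


# Hypothetical dimension attributes, for illustration only.
key = hash_key("US", "female", "25-34")
print(key)
```

The composite string grows with the number and length of its attributes, while the hashed form always fits in 8 bytes, which is one way to keep the memory footprint of the surrogate key small. Truncating a hash does leave a small chance of collisions, so a real design would need to decide how to detect or tolerate them.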

Page 27: Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

Let’s aggregate it up

Dianne
One major reason we hand-create aggs is because we're so concerned with data size. We can't afford to just create lots of aggs without questioning the need of each one.
Denny Lee
Yup - figured you could chime in on this one since it's a screenshot of your Excel spreadsheet :)
Page 28: Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

Thank you!

Diamond Sponsor