yahoo!, big data, and microsoft bi: bigger and better together

April 10-12, Chicago, IL

Yahoo!, Big Data, and Microsoft BI: Bigger and Better Together

Dianne Cantwell and Denny Lee



Agenda

• Yahoo! Business Case for Hadoop and BI
• Big Data, Fast Queries

Big Data / BI Themes

• Get the Hardware Balance Right
• Partitioning, Partitioning, Partitioning
• Keep it Simple
• It is the order of things


Yahoo! TAO Business Challenge

Yahoo! manages a powerful, scalable advertising exchange that includes publishers and advertisers


Advertisers want to get the best bang for their buck by reaching their targeted audiences effectively and efficiently



Yahoo! needs visibility into how consumers are responding to ads along many dimensions (web sites, creatives, time of day, and user segments such as gender, age, and location) to make the exchange work as efficiently and effectively as possible


Dianne

I would add "User segments" to this list

Denny Lee

does my update make sense?


Yahoo! TAO Technical Requirements

• Visitors to Yahoo! Branded sites: 680,000,000
• Ad Impressions: 3,500,000,000 (per day)
• Rows Loaded: 464,000,000,000 (per qtr)
• Refresh Frequency: Hourly
• Average Query Time: <10 seconds
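To put the load requirement in perspective, here is a rough back-of-the-envelope calculation of the sustained row rate implied by the figures above. It is only a sketch: the 90-day quarter and the assumption of an evenly spread load are ours, not from the slide.

```python
# Figure taken from the slide.
ROWS_PER_QUARTER = 464_000_000_000

# Assumption (not from the slide): a quarter is about 90 days
# and rows arrive evenly across every hour of every day.
DAYS_PER_QUARTER = 90

rows_per_day = ROWS_PER_QUARTER / DAYS_PER_QUARTER
rows_per_hour = rows_per_day / 24          # one hourly refresh
rows_per_second = rows_per_hour / 3600     # sustained rate needed to keep up

print(f"rows per day:    {rows_per_day:,.0f}")
print(f"rows per hour:   {rows_per_hour:,.0f}")
print(f"rows per second: {rows_per_second:,.0f}")
```

Under these assumptions, each hourly refresh has to absorb roughly 215 million rows, or about 60 thousand rows every second without pause.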


Yahoo! TAO Platform Architecture: How did we load so much so quickly?

• Data Aggregation & ETL: Hadoop (2PB cluster), producing File 1, File 2, … File N
• Data Archive & Staging: Oracle 11G RAC, holding Partition 1, Partition 2, … Partition N
• BI Server: SQL Server Analysis Services 2008 R2, holding Partition 1, Partition 2, … Partition N (24TB Cube/qtr)
• Data volumes: 1.2TB/day, 135GB/day compressed

Dianne

I don't know if all these numbers are exactly correct still, but I don't think anything significant has changed since then so we should probably just go with this. If you want me to dig into it and get these more accurate let me know.

Denny Lee

Nah - these numbers should be okay, eh?!


Yahoo! TAO Platform Architecture: Queries at the “speed of thought”

• BI Query Servers: SQL Server Analysis Services 2008 R2 (24TB Cube/qtr)
• 464B rows of event level data/qtr
• Dimensions: 42
• Attributes: 296
• Measures: 278
• Adhoc Query/Visualization: Tableau Desktop 7
• Optimization Application: Custom J2EE App
• Avg Query Time: 2 secs
• Avg Query Time: 5 secs


Yahoo! TAO Return on Investment

For campaigns optimized using TAO, advertisers spent more with Yahoo! than before

For campaigns optimized using TAO, eCPMs (revenue) were higher!

Dianne

I've requested data points here but don't know if I'll get anything back.

Denny Lee

No need!


Yahoo! TAO Return on Investment

Yahoo! TAO exposed customer segment performance to campaign managers and advertisers for the first time! No longer “flying audience blind”


Yahoo! TAO Future Direction

• Increase Segments by 3x
  • Increase data size and cartesian
• No longer doing distinct count
  • Built frequency reports and sampling to deliver this due to the inherent complexity!
• Current Challenge
  • Hadoop to SSAS cube (more later)
  • External access to cubes
  • More disk due to need for more IO


Big Data Analytics Challenges

CubeF


Get the data out!


Extracting the data

• File Generation
  • Hadoop jobs create many files that are exported / dumped to disk in tabular format
• File Staging
  • Files are propped to a staging folder for relational DB access
• Oracle External Tables
  • Generate external tables that point to the staged files
  • No need to import the data
  • Processing is slow


AS on Oracle Case

• Oracle OLEDB: 10K rows/sec
• 100K rows/sec
• SSIS Connector: 20K rows/sec
• Data paths: Oracle → Analysis Services; Oracle → SQL → Analysis Services
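The gap between these throughput figures is easier to appreciate as wall-clock time. The sketch below converts each rate into minutes needed to load a batch over a single connection. The batch size of 100 million rows is a hypothetical number chosen for illustration, and the third rate is left unlabeled because the slide does not name its path.

```python
# Throughput figures from the slide, in rows per second.
RATES = {
    "Oracle OLEDB": 10_000,
    "SSIS Connector": 20_000,
    "100K rows/sec path": 100_000,  # the slide gives this rate without a label
}

# Hypothetical batch size, for illustration only.
ROWS_TO_LOAD = 100_000_000


def load_minutes(rows: int, rows_per_second: int) -> float:
    """Minutes needed to move `rows` at a constant single-connection rate."""
    return rows / rows_per_second / 60


minutes = {name: load_minutes(ROWS_TO_LOAD, rate) for name, rate in RATES.items()}

for name, value in minutes.items():
    print(f"{name:20s} {value:7.1f} min")
```

At 10K rows/sec this batch would take close to three hours, which cannot fit inside an hourly refresh window without heavy parallelism. At 100K rows/sec it takes under 17 minutes.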

Dianne

We need to chat about the positioning and arguments around this stuff. It's great to walk through it with these guys.

Denny Lee

Absolutely!


Passthrough Query to Linked Server

http://msdn.microsoft.com/en-us/library/jj710329.aspx


Partitioning, Partitioning, Partitioning


Partitions

• Data is streamed in to Oracle to files
• To get max processing, 30 threads are fired because all T (temp) partitions are processed concurrently
• Super fast data loads
• Problem is that it requires constant merging of partitions

Files are streamed in as they become available: 10/10/10 T360772, 10/10/10 T360773, …, 10/10/10 T361645

Oracle 10g partitions: 10/10/10 T360772, 10/10/10 T360773, …, 10/10/10 T361645

SSAS partitions: 10/10/10 T360772, 10/10/10 T360773, …, 10/10/10 T361645, merged into a single 10/10/10 partition


Partitions – Directly Merging

• New model allows for set hourly partitions
• No more streaming data but with hourly partitions, cannot have as many threads for fast data loads, unless…
• Process multiple cubes or measure groups in parallel

Oracle 10g partitions: 10/10/10 00:00, 10/10/10 01:00, …, 10/10/10 23:00

SSAS partitions, with the same hourly set for each of Segments, Activities, and Uniques: 10/10/10 00:00, 10/10/10 01:00, …, 10/10/10 23:00
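The hourly model above can be sketched as follows: build the 24 fixed partition labels for a day, then recover parallelism by working on Segments, Activities, and Uniques at the same time. This is a minimal illustration only. `process_partition` is a hypothetical stand-in for whatever command actually processes an SSAS partition, and the thread-per-group layout is our assumption rather than the presenters' implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta

# Group names as they appear on the slide.
GROUPS = ["Segments", "Activities", "Uniques"]


def hourly_partitions(day: datetime) -> list:
    """Return the 24 hourly partition labels for a day, e.g. '10/10/10 00:00'."""
    return [
        (day + timedelta(hours=hour)).strftime("%m/%d/%y %H:%M")
        for hour in range(24)
    ]


def process_partition(group: str, partition: str) -> str:
    """Hypothetical placeholder: a real job would issue a processing command here."""
    return f"{group}:{partition}"


def process_group(group: str, day: datetime) -> list:
    """Process one group's hourly partitions in chronological order."""
    return [process_partition(group, label) for label in hourly_partitions(day)]


def process_day(day: datetime) -> dict:
    """Run one worker per group so the groups are processed in parallel."""
    with ThreadPoolExecutor(max_workers=len(GROUPS)) as pool:
        results = pool.map(lambda group: process_group(group, day), GROUPS)
        return dict(zip(GROUPS, results))


processed = process_day(datetime(2010, 10, 10))
print(processed["Segments"][:2])
```

Because the partitions are already hourly, no merge step is needed afterwards. The trade-off is that the degree of parallelism is bounded by the number of groups rather than by the number of temp partitions.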


It is the order of things


It is the order of things

“I am a Jem'Hadar. He is a Vorta. It is the order of things.”
“Do you really want to give up your life for the 'order of things'?”
“It is not my life to give up, Captain – and it never was.”

Rocks and Shoals, Deep Space Nine
Written by Ronald D. Moore


Segments and the importance of sort order

Data File              Sorted         Not Sorted     % Diff
fact.data              195,708,592    344,502,968    43.19%
agg.rigid.data         106,825,677    106,825,677     0.00%
dim1.dim2.fact.map      17,332,729     32,989,946    47.46%
dim1.dim3.fact.map      16,923,276     32,222,813    47.48%
dim1.dim4.fact.map       6,079,396     12,286,978    50.52%
dim5.dim6.fact.map       2,630,888      6,057,334    56.57%
dim1.dim7.fact.map       1,809,725      3,904,004    53.64%
dim8.dim9.fact.map       1,592,886      3,793,452    58.01%
dim1.dim10.fact.map      1,419,255      3,108,248    54.34%
dim8.dim11.fact.map      1,301,221      3,042,638    57.23%
dim1.dim12.fact.map      2,949,432      2,949,432     0.00%
dim1.dim13.fact.map      2,934,836      2,934,836     0.00%
dimA.dimA.fact.map       1,101,552      2,716,289    59.45%
dim8.dimB.fact.map         961,332      2,451,956    60.79%
dim1.dimC.fact.map       1,027,305      2,323,906    55.79%
dim8.dim8.fact.map       1,592,886      2,308,232    30.99%
dimA.dimD.fact.map         851,095      2,170,962    60.80%
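The % Diff column appears to be the reduction relative to the unsorted size, (Not Sorted − Sorted) / Not Sorted. The sketch below checks that against the fact.data row, then shows the general effect on synthetic data: the same keys compress far better once sorted. Note the assumptions: `zlib` is only a stand-in for the engine's own storage compression, and the random keys are made up for illustration.

```python
import random
import struct
import zlib


def pct_diff(sorted_size: int, unsorted_size: int) -> float:
    """Size reduction from sorting, as a percentage of the unsorted size."""
    return round(100 * (unsorted_size - sorted_size) / unsorted_size, 2)


# fact.data row from the table.
fact_diff = pct_diff(195_708_592, 344_502_968)
print(f"fact.data reduction: {fact_diff}%")


def compressed_size(values: list) -> int:
    """Pack values as 32-bit unsigned ints and return the zlib-compressed length."""
    raw = struct.pack(f"<{len(values)}I", *values)
    return len(zlib.compress(raw))


# Synthetic low-cardinality keys (assumption: 1,000 distinct values).
random.seed(0)
keys = [random.randrange(1000) for _ in range(100_000)]

unsorted_bytes = compressed_size(keys)
sorted_bytes = compressed_size(sorted(keys))
print(f"unsorted: {unsorted_bytes:,} bytes, sorted: {sorted_bytes:,} bytes")
```

Sorting places equal values next to each other, so a compressor sees long runs instead of noise. The rows with 0.00% in the table are consistent with files whose content is unaffected by the sort order.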

Dianne

We can comment that sorting is needed BOTH in Oracle AND in the cube partition queries.

Denny Lee

Please do!


Across the Eighth Dimension!

How do you associate dimensions with Star Trek Into Darkness?

Cube


Back to cube dimensions

• Running ProcessUpdate
  • Takes a long time to run because all of the fact partitions are re-indexed!
• Minimize likelihood by building SCD-2 dimensions
  • Composite Key based on lowest level unique values to represent row
  • Sometimes identity can be just as effective though hashing requires mapping or lookup tables
  • Create SK to allow for SCD-2 dimensions
  • Key is that we keep the memory space of the SK small
  • Composite(Composite) or Hash(Composite) is good for dimensions loaded from fact BUT do not expect Type-2 for fact-based dimensions
  • Important to call out restatement based on current data (high cost associated with keeping versioned history of dimension tables)
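The key options above can be illustrated with a small sketch: a composite key built from the lowest-level values, a fixed-width hash of that composite, and a compact surrogate key (SK) handed out by a lookup table so that a new version of the same member gets its own key. All names here are hypothetical, and the BLAKE2 hash and in-memory dictionary are our choices for illustration, not the presenters' implementation.

```python
import hashlib


def composite_key(*parts) -> str:
    """Join the lowest-level attribute values into one composite key."""
    return "|".join(str(part) for part in parts)


def hash_key(composite: str) -> int:
    """Fixed-width 64-bit hash of the composite key (needs no lookup to compute)."""
    digest = hashlib.blake2b(composite.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class SurrogateKeyMap:
    """Lookup table that assigns small sequential SKs per (composite, version)."""

    def __init__(self) -> None:
        self._keys = {}
        self._next_sk = 1

    def key_for(self, composite: str, version: int = 1) -> int:
        member = (composite, version)
        if member not in self._keys:
            self._keys[member] = self._next_sk
            self._next_sk += 1
        return self._keys[member]


segment = composite_key("US", "female", "25-34")
sk_map = SurrogateKeyMap()

first_sk = sk_map.key_for(segment)               # first version of the member
same_sk = sk_map.key_for(segment)                # repeated lookup returns the same SK
next_sk = sk_map.key_for(segment, version=2)     # Type-2 change gets a new SK

print(segment, hash_key(segment), first_sk, same_sk, next_sk)
```

The sequential SK stays small but depends on the lookup table, while the hash can be computed anywhere from the fact row alone. That matches the slide's point that hashed or composite keys suit dimensions loaded from fact data, where Type-2 history is not expected.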


Let’s aggregate it up

Dianne

One major reason we hand-create aggs is because we're so concerned with data size. We can't afford to just create lots of aggs without questioning the need of each one.

Denny Lee

Yup - figured you could chime in on this one since it's a screenshot of your Excel spreadsheet :)


Thank you!