Yahoo: Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters PRESENTED BY Sagi Zelnick Principal Architect @ Yahoo and Ledion Bitincka Principal Architect @ Splunk Hadoop Summit June 2014 San Jose, CA

Upload: brett-sheppard

Post on 11-Aug-2014


DESCRIPTION

Yahoo presentation at Hadoop Summit San Jose, CA in June 2014.

TRANSCRIPT

Page 1: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

Presented by Sagi Zelnick, Principal Architect @ Yahoo, and Ledion Bitincka, Principal Architect @ Splunk

Hadoop Summit, June 2014, San Jose, CA

Page 2: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

Overview

2 Yahoo Proprietary

•  Hadoop @ Yahoo: 8+ years of innovation
•  Hunk @ Yahoo: organization-wide investment for the next 3+ years
•  Yahoo provides Hunk as a self-service tool to explore, analyze & visualize data in HDFS

›  Hunk allows visual browsing of very complex tables (250+ fields)

›  Rapid prototyping of new jobs with almost instant search results, without having to wait for the entire job/query to finish

›  Shortens development cycles through faster interaction with results

›  Built-in graphs/charts make for a powerful solution in many situations

Page 3: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

About your speakers


Sagi Zelnick, Principal Architect, Yahoo

Ledion Bitincka, Principal Architect, Splunk

Page 4: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

Hunk + Hadoop @ Yahoo


Page 5: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters


History of Hadoop innovation @ Yahoo

Page 6: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

Over 600PB of Hadoop storage (over half an Exabyte)


•  Very large clusters used by many groups across the enterprise.
•  More than 35,000 individual datanodes.
•  Hadoop is provided as a service.
•  Multiple cluster types such as research, dev, sandbox and production.
•  Services such as HBase, Hive, Oozie, etc.
•  Users are free to run jobs, but have resource constraints.
•  Maintained by the Grid Operations Group.

Page 7: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

Improving operational visibility with Hunk

•  We pointed Hunk at many operational logs and event data we already had on the grid.
•  This includes system metrics, HDFS ops, JVM stats and YARN metrics.
•  Created instrumentation to measure usage per user and job.
•  Analyzed terabytes of NameNode audit logs.
•  Job history leveraged for visualizing usage/growth and historical views.
•  Custom events for HBase statistics.


Page 8: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

Tracking Hadoop performance and metrics in Hunk

Use Case | Customer | Benefits
System metrics from 35k nodes | Grid Ops / Grid Customers | Identify slow tasks/nodes when debugging
Historical insights of resources | All Grid Customers | Track organic growth
Job performance | All Grid Customers | Improved job SLAs
HBase metrics | All Grid Customers | Track region/RS/table metrics…
Job logs in near real-time | All Grid Customers / Ops | Search for errors directly from the YARN logs
NameNode operational data | Research, Dev | Improved performance and stability


Page 9: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

Measuring NameNode performance pre & post upgrades


•  Historical visualizations of all operations.
•  Search data in Hunk from billions of NameNode events.
•  Measure JVM and memory usage.
•  Insights into operational performance.

Page 10: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

Visualization using Hunk

Search: index="simon_blue_new_all" this_cluster="dilithiumblue*" (log_subtype="DFS" #hdfs=hdfs) | timechart span=1h avg(number*) as num_*

Time range: Last 7 days

10,086 events (5/15/14 1:00:00.000 AM to 5/22/14 1:36:34.000 AM)

[Chart: hourly averages of the num_* series over _time, Fri May 16 to Tue May 20, 2014; y-axis ticks at 200,000,000, 400,000,000 and 600,000,000]

First rows of the results table (values cut off in the screenshot are marked with …):

_time | num_BlockReports | num_CopyBlockOperations | num_HeartBeats | num_ReadBlockOperations | num_ReadMetadataOperations | num_ReplaceBlockOperations | num_WriteBlockOperations | num_blockChecksumOp
2014-05-15 01:00 | 1124437.735902 | 46721126.819672 | 514957.384098 | 12930433.077869 | 0.000000 | 94210832.786885 | 63512425.967213 | 13975.306557
2014-05-15 02:00 | 1115496.290492 | 53597000.262295 | 298717.637049 | 10402176.717213 | 0.000000 | 94109944.655738 | 93916552.393443 | 35459.288689
2014-05-15 03:00 | 1110372.4173… | 56566721.704918 | 428494.9449… | 13296385.590164 | 0.000000 | 94141430.295082 | 97353478.229508 | 20307.549344

Page 11: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

Sample troubleshooting in Hunk of 750 million events

Search: index="simon_blue_new_all" this_cluster="dilithiumblue*" (log_subtype="DFS" #hdfs=hdfs) | timechart span=5m avg(number*) as num_*

Time range: Last 2 days

2,753 events (5/20/14 1:14:21.000 AM to 5/22/14 1:14:21.000 AM)

[Chart: 5-minute averages of the num_* series over _time, Tue May 20 to Wed May 21, 2014; y-axis ticks at 250,000,000, 500,000,000, 750,000,000 and 1,000,000,000]

First rows of the results table (values cut off in the screenshot are marked with …):

_time | num_BlockReports | num_CopyBlockOperations | num_HeartBeats | num_ReadBlockOperations | num_ReadMetadataOperations | num_ReplaceBlockOperations | num_WriteBlockOperations | num_blockChecksumOp
2014-05-20 01:15:00 | 1056047.024000 | 34677652.000000 | 124121.264000 | 26242490.800000 | 0.000000 | 88112292.800000 | 126478486.400000 | 51405.346000
2014-05-20 01:20:00 | 1055517.924000 | 30920700.800000 | 1065390.086000 | 22756041.800000 | 0.000000 | 87745422.400000 | 92323387.200000 | 32070.482000
2014-05-20 01:25:00 | 1055457.2000… | 33068504.400000 | 27622.56200… | 11396610.700000 | 0.000000 | 88569211.200000 | 94593716.800000 | 28873.618000

Page 12: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

Big picture plus granular details

Search: index="simon_blue_new_all" this_cluster="dilithiumblue*" (log_subtype="JVM" ProcessName="NameNode") | timechart span=5m avg(Threads*) as threads_*

Time range: Last 2 days

8,463 events (5/20/14 12:00:00.000 AM to 5/22/14 12:00:00.000 AM)

[Chart: 5-minute averages of the threads_* series over _time, Tue May 20 to Wed May 21, 2014; y-axis ticks at 200 and 400]

First rows of the results table:

_time | threads_Blocked | threads_New | threads_Runnable | threads_Terminated | threads_TimedWaiting | threads_Waiting
2014-05-20 00:00:00 | 72.360000 | 10.638333 | 5.485833 | 0.000000 | 21.208333 | 78.555000
2014-05-20 00:05:00 | 70.177333 | 10.554667 | 5.277333 | 0.000000 | 20.744667 | 76.578000
2014-05-20 00:10:00 | 70.211333 | 9.998667 | 5.022000 | 0.000000 | 19.333333 | 73.766667
2014-05-20 00:15:00 | 70.300667 | 10.268000 | 5.156667 | 0.000000 | 17.488667 | 70.122000
2014-05-20 00:20:00 | 70.422667 | 10.376000 | 5.188000 | 0.000000 | 15.700000 | 66.611333
2014-05-20 00:25:00 | 70.444000 | 10.288000 | 5.144000 | 0.000000 | 14.089333 | 63.400667

Page 13: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

Analyzing NameNode RPC calls (troubleshooting)


•  Who is making which RPC calls (open, listStatus, create, etc.).
•  How often they are making these RPC calls.
•  Which IP/host the calls are coming from.
•  Search and visualize historical data from billions of events.
•  Prevent NameNode abuse/misuse.
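The per-caller breakdown listed on this slide can be sketched outside Hunk as a simple count over audit lines. The following is a minimal Python sketch, not the presenters' implementation: it assumes the common key=value layout of HDFS NameNode audit lines (`ugi=`, `ip=`, `cmd=`), and the function name `count_rpc_calls` and all sample lines are invented for illustration.

```python
import re
from collections import Counter

# Assumption: audit lines carry tab-separated key=value pairs such as
# ugi=<user>, ip=/<address>, cmd=<operation>. The slides do not show the raw format.
AUDIT_RE = re.compile(r"ugi=(?P<user>\S+).*?ip=/(?P<ip>\S+).*?cmd=(?P<cmd>\S+)")


def count_rpc_calls(lines):
    """Count audit events per (user, cmd, ip); lines that do not match are skipped."""
    counts = Counter()
    for line in lines:
        match = AUDIT_RE.search(line)
        if match:
            counts[(match.group("user"), match.group("cmd"), match.group("ip"))] += 1
    return counts


# Invented sample lines, for illustration only.
SAMPLE_LINES = [
    "2014-05-20 01:15:02,123 INFO FSNamesystem.audit: allowed=true\tugi=alice (auth:SIMPLE)\tip=/10.0.0.5\tcmd=open\tsrc=/data/a\tdst=null\tperm=null",
    "2014-05-20 01:15:03,456 INFO FSNamesystem.audit: allowed=true\tugi=alice (auth:SIMPLE)\tip=/10.0.0.5\tcmd=open\tsrc=/data/b\tdst=null\tperm=null",
    "2014-05-20 01:15:04,789 INFO FSNamesystem.audit: allowed=true\tugi=bob (auth:SIMPLE)\tip=/10.0.0.9\tcmd=listStatus\tsrc=/data\tdst=null\tperm=null",
    "2014-05-20 01:15:05,000 INFO an unrelated log line without audit fields",
]

if __name__ == "__main__":
    # Print the heaviest (user, cmd, ip) combinations first.
    for key, n in count_rpc_calls(SAMPLE_LINES).most_common():
        print(key, n)
```

Sorting the counts with `most_common()` surfaces the heaviest callers first, which is the kind of view that helps spot NameNode misuse.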

Page 14: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters


Visualizing 834 million discrete events …

Page 15: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters


… continued

Page 16: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

Queue insights (capacity & provisioning)

•  Each Hadoop job runs in a specific queue.
•  We track every aspect of the YARN framework.
•  Immediate queue performance and configuration profiling via the job history server.
•  Historical views and trends that enable better capacity management.
•  Improved queue utilization and allocation management.


Page 17: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

Visualizing queues

Search: index="jobsummary_logs_all_red" cluster="dilithium*" | eval total_slot_seconds=(mapSlotSeconds + reduceSlotSeconds) | eval gb_hours=((total_slot_seconds * 0.5) / 3600) | eval gb_hours=round(gb_hours) | timechart span=6h sum(gb_hours) as gb_hours by queue

Time range: Last 7 days

1,175,726 events (5/20/14 8:00:00.000 PM to 5/27/14 8:26:26.000 PM)

[Chart: gb_hours per queue in 6-hour buckets over _time, Wed May 21 to Mon May 26, 2014; y-axis ticks at 200,000, 400,000 and 600,000]

First rows of the results table (gb_hours per queue):

_time | OTHER | apg_dailyhigh_p3 | apg_dailymedium_p5 | apg_hourlyhigh_p1 | apg_hourlylow_p4 | apg_hourlymedium_p2 | apg_p7 | curveball_large | curveball_med | slingshot | slingstone
2014-05-20 18:00 | 4154 | 45512 | 7071 | 25643 | 12111 | 29664 | 3473 | 26547 | 14192 | 60875 | 45376
2014-05-21 00:00 | 19341 | 92661 | 18005 | 41008 | 22944 | 88115 | 10896 | 38648 | 8693 | 48186 | 87670
2014-05-21 06:00 | 21160 | 108137 | 38398 | 35627 | 14934 | 101925 | 24458 | 29269 | 14066 | 24344 | 47831
2014-05-21 12:00 | 24238 | 74849 | 22695 | 47431 | 17731 | 53673 | 17332 | 37079 | 14479 | 44873 | 96909
2014-05-21 18:00 | 5792 | 95449 | 2737 | 44214 | 20325 | 48339 | 10222 | 34390 | 4605 | 168593 | 24298
2014-05-22 00:00 | 10177 | 68048 | 12853 | 36921 | 23248 | 57740 | 16005 | 44138 | 9142 | 88121 | 34544
2014-05-22 06:00 | 12720 | 85048 | 21977 | 35870 | 15503 | 100364 | 7823 | 35179 | 8086 | 33973 | 19802
2014-05-22 12:00 | 5459 | 76489 | 13154 | 34703 | 11204 | 34877 | 20178 | 22631 | 40567 | 98 | 24250
2014-05-22 18:00 | 8169 | 38394 | 2211 | 49840 | 19977 | 52438 | 4050 | 38066 | 27973 | 49333 | 31312
2014-05-23 00:00 | 12898 | 117518 | 7354 | 36422 | 16426 | 52918 | 8179 | 28202 | 21798 | 79808 | 37078
2014-05-23 06:00 | 6572 | 105431 | 26941 | 48614 | 29159 | 120424 | 14317 | 26011 | 12433 | 16745 | 35928
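The search on this slide turns slot seconds into GB-hours and sums them per queue in 6-hour buckets. The arithmetic can be sketched in Python as below. This is a rough illustration under stated assumptions: the 0.5 factor is taken from the search as-is (the slides do not explain it; it presumably reflects GB per slot), the `epoch` field and the queue names are hypothetical, and Python's `round` may differ from the search language's `round` on exact .5 values.

```python
from collections import defaultdict

SPAN_SECONDS = 6 * 3600  # mirrors span=6h in the search


def gb_hours(map_slot_seconds, reduce_slot_seconds, gb_per_slot=0.5):
    """(mapSlotSeconds + reduceSlotSeconds) * 0.5 / 3600, as in the eval steps."""
    return (map_slot_seconds + reduce_slot_seconds) * gb_per_slot / 3600


def gb_hours_by_queue(jobs):
    """Sum rounded GB-hours per (bucket start, queue).

    Each job is a dict with the hypothetical keys: epoch (seconds),
    queue, mapSlotSeconds, reduceSlotSeconds.
    """
    totals = defaultdict(float)
    for job in jobs:
        bucket = job["epoch"] - job["epoch"] % SPAN_SECONDS  # floor to the 6h bucket
        totals[(bucket, job["queue"])] += round(
            gb_hours(job["mapSlotSeconds"], job["reduceSlotSeconds"])
        )
    return dict(totals)


# Invented sample jobs, for illustration only.
JOBS = [
    {"epoch": 0, "queue": "queue_a", "mapSlotSeconds": 36000, "reduceSlotSeconds": 36000},
    {"epoch": 3600, "queue": "queue_a", "mapSlotSeconds": 7200, "reduceSlotSeconds": 0},
    {"epoch": 21600, "queue": "queue_b", "mapSlotSeconds": 0, "reduceSlotSeconds": 14400},
]

if __name__ == "__main__":
    for key, total in sorted(gb_hours_by_queue(JOBS).items()):
        print(key, total)
```

The first two sample jobs fall into the same 6-hour bucket and queue, so their GB-hours are added together; the third lands in the next bucket.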

Page 18: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

Self-service job reports


•  Each job is unique and so are the map and reduce elements.
•  How to start analyzing jobs?
•  Historical job performance and profiling enables in-depth performance tuning.
•  Long-term historical views and trending of growth.

Page 19: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

Search: index="jobsummary_logs_all_blue" cluster="*" user="gmon" | eval total_slot_seconds=(mapSlotSeconds + reduceSlotSeconds) | eval gb_hours=((total_slot_seconds * 0.5) / 3600) | eval gb_hours=round(gb_hours,2) | eval runtime=(finishTime-submitTime)/1000 | stats sum(gb_hours) as gb-hours avg(runtime) as run_mins by cluster user queue jobName jobId status | eval run_mins=round(run_mins/60,2) | sort -gb-hours

Time range: Yesterday

4,871 events (5/26/14 12:00:00.000 AM to 5/27/14 12:00:00.000 AM)

Statistics (4,871); the last row is cut off in the screenshot:

cluster | user | queue | jobName | jobId | status | gb-hours | run_mins
cobalt | gmon | grideng | PigLatin:findRemoteHDFSFromAudits.pig | job_1398982765383_315271 | SUCCEEDED | 108.00 | 33.07
cobalt | gmon | grideng | PigLatin:findRemoteHDFSFromAudits.pig | job_1398982765383_312700 | SUCCEEDED | 104.00 | 37.37
cobalt | gmon | grideng | PigLatin:findRemoteHDFSFromAudits.pig | job_1398982765383_309715 | SUCCEEDED | 88.00 | 29.83
cobalt | gmon | gridops | distcp: | job_1398982765383_309921 | SUCCEEDED | 36.00 | 68.49
cobalt | gmon | gridops | SPLK_spbl103n01.blue.ygrid.yahoo.com_1401125953.2076_0 | job_1398982765383_313570 | SUCCEEDED | 25.00 | 14.26
cobalt | gmon | gridops | nnaudit_DR_2014_05_25 | job_1398982765383_308938 | SUCCEEDED | 25.00 | 15.43
cob… | g… | grid… | nnaudit_DB_2014_05_25 | job_1398982765… | SUCCE… | 24.00 | 18.07
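The per-job report from this search (GB-hours summed and runtime averaged per job, then sorted by GB-hours) can be approximated in Python as follows. This is a sketch under assumptions rather than the presenters' code: field names follow the search, `finishTime` and `submitTime` are assumed to be in milliseconds (as the division by 1000 suggests), and the sample events are invented.

```python
from collections import defaultdict

GROUP_FIELDS = ("cluster", "user", "queue", "jobName", "jobId", "status")


def job_report(events):
    """Return rows (group fields..., gb_hours, run_mins), largest gb_hours first."""
    groups = defaultdict(lambda: {"gb_hours": 0.0, "runtimes": []})
    for event in events:
        key = tuple(event[field] for field in GROUP_FIELDS)
        slot_seconds = event["mapSlotSeconds"] + event["reduceSlotSeconds"]
        # eval gb_hours=round((total_slot_seconds * 0.5) / 3600, 2)
        groups[key]["gb_hours"] += round(slot_seconds * 0.5 / 3600, 2)
        # eval runtime=(finishTime-submitTime)/1000  (milliseconds to seconds)
        groups[key]["runtimes"].append((event["finishTime"] - event["submitTime"]) / 1000)
    rows = []
    for key, agg in groups.items():
        avg_runtime = sum(agg["runtimes"]) / len(agg["runtimes"])
        # eval run_mins=round(run_mins/60, 2)  (seconds to minutes)
        rows.append(key + (agg["gb_hours"], round(avg_runtime / 60, 2)))
    return sorted(rows, key=lambda row: row[6], reverse=True)  # sort -gb-hours


# Invented sample events, for illustration only.
EVENTS = [
    {"cluster": "c1", "user": "u1", "queue": "q1", "jobName": "job_a", "jobId": "id_1",
     "status": "SUCCEEDED", "mapSlotSeconds": 3600, "reduceSlotSeconds": 3600,
     "submitTime": 0, "finishTime": 120000},
    {"cluster": "c1", "user": "u1", "queue": "q2", "jobName": "job_b", "jobId": "id_2",
     "status": "SUCCEEDED", "mapSlotSeconds": 36000, "reduceSlotSeconds": 0,
     "submitTime": 0, "finishTime": 600000},
]

if __name__ == "__main__":
    for row in job_report(EVENTS):
        print(row)
```

Grouping on the same six fields as the search keeps one row per job, so the report can be read the same way as the statistics table on the slide.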

Page 20: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters


Page 21: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters


Page 22: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters


Page 23: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

More data to tap into with the metastore / Hive sources


•  Using the metastore, we can set up virtual indexes to any table(s) in Hive, without the need to define the schema up front
•  Visualize very complex tables (250+ fields)
•  Rapid prototyping of new jobs with almost instant search results, without having to wait for the entire job/query to finish
•  Built-in aggregates and graphs/charts
•  Accelerates the development workflow by providing faster interaction with data

... it’s not just logs we’re looking at

Page 24: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters


Page 25: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

Meet Hunk

Page 26: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

Integrated Analytics Platform for Diverse Data Stores

Full-featured, Integrated Product

Fast Insights for Everyone

Works with What You Have Today

Explore, Visualize, Dashboards, Share, Analyze

Hadoop Clusters; NoSQL and Other Data Stores

Hadoop Client Libraries; Streaming Resource Libraries

Page 27: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

Fast Deployment and Configuration

Just point at Hadoop

•  Certified integrations to all major Hadoop distributions
•  Choose 1st-gen MapReduce or YARN
•  Create Virtual Indexes across one or more clusters
•  From download to searching data in < 60 minutes

Connect to one or multiple Hadoop clusters

YARN certified

Page 28: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

Interactive Search and Results Preview

Rapidly interact with data

•  Powerful Search Processing Language (SPL™)
•  Ad hoc exploratory analytics across massive datasets
•  Preview results
•  No fixed schema
•  No requirement to “understand” data upfront

Search interface

Preview results

Drill down to raw data

Pause or stop MapReduce jobs

Page 29: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

Powerful Dashboards for Self-Service Analytics

Interactive Dashboards and Charts

•  Easy-to-use dashboard editor
•  Chart overlay
•  Pan and zoom
•  In-dashboard drill down
•  Embed charts and dashboards in 3rd party apps
•  Reuse skills with Splunk Enterprise 6.1 and Hunk 6.1

Page 30: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

Automate Access for Rapid Exploration

Supported File Formats

•  Text files
•  Sequence files
•  RCFile
•  ORC files
•  Parquet

Page 31: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

Role-based Security for Shared Clusters

Pass-through Authentication

•  Provide role-based security for Hadoop clusters
•  Access Hadoop resources under security and compliance
•  Integrates with Kerberos for Hadoop security

Business Analyst: Business Analyst Queue: Biz Analytics

Marketing Analyst: Marketing Analyst Queue: Marketing

Sys Admin: Sys Admin Queue: Prod

Page 32: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

Powerful Developer Environment

Build Analytics-Rich Big Data Apps

•  Use a standards-based web framework and REST API
•  Customize dashboards and UIs with Simple XML, JavaScript or Django
•  Choose among SDKs
•  One integration for both Splunk Enterprise and Hunk

Page 33: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

Full-Featured, Integrated Analytics Platform

FULL-FEATURED ANALYTICS: Explore, analyze and visualize data in one integrated platform

FAST TO DEPLOY AND DRIVE VALUE: Point Hunk at your storage clusters and explore data immediately

INTERACTIVE SEARCH: Preview results as MapReduce jobs run and accelerate reports with no fixed schemas

RICH DEVELOPER ENVIRONMENT: Build big data apps using standard web languages and frameworks

Page 34: Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters

Questions/Comments?

Sagi Zelnick – Principal Architect, Email: [email protected]

Ledion Bitincka – Principal Architect, Email: lbit[email protected]