accumulo design

APACHE ACCUMULO From a design perspective

Post on 03-Dec-2014

DESCRIPTION

Learn the fundamentals of Accumulo with this presentation by Koverse CTO Aaron Cordova (@aaroncordova)

TRANSCRIPT

  • 1. APACHE ACCUMULO From a design perspective

  • 2. SCALABLE KEY-VALUE STORE BASED ON GOOGLE'S BIGTABLE
  • 3. BIGTABLE FEATURES
      - Distributes data across many commodity servers
      - Sorts data by key for fast lookup of values by key
      - Scans across multiple key-value pairs
      - Highly consistent writes to a single row
      - Support for MapReduce jobs
  • 4. DATA MODEL
      - Key: Row ID, Column (Family, Qualifier), Timestamp
      - Value
  • 5. Example table
      Row ID | Col Fam   | Col Qual   | Timestamp | Value
      Bob    | Email     | id0023     | 20120301  | Hey joe, can you send ...
      Bob    | Email     | id0024     | 20120302  | Re: next Thursday ...
      Bob    | UserPrefs | Background | 20130101  | Grey
      Fred   | Email     | id0001     | 20080302  | Welcome to gmail ...
      Sarah  | Email     | id0004     | 20130201  | Hi again ...
      Sara   | Videos    | ytid009    | 20100303  | nsu736:)jdudjd k$:)378;'$$)
  • 6. [Diagram: Tablet servers; HDFS DataNodes; Commit Layer; Replication Layer]
  • 7. SINCE 2006
      - Several BigTable implementations
      - Apache HBase
      - Apache Cassandra
      - Apache Accumulo
      - others
  • 8. BIGTABLE IS BIGTABLE, RIGHT?
  • 9. HBASE
  • 10. HBASE
      - Open source Apache project started by developers at Powerset, bought by Microsoft
      - Now used at Facebook, StumbleUpon, other big web sites
      - Fast reads
      - Row-oriented API
      - Each column family has its own set of files
  • 11. CASSANDRA
  • 12. CASSANDRA
      - Apache project started at Facebook
      - Combines elements of BigTable and Amazon's Dynamo into one system
      - Used at Netflix, other web sites
      - Fast writes
      - Tunable consistency
  • 13. [Diagram: Tablet servers; Commit and Replication Layer]
  • 14. CONSISTENCY
      - Highly consistent means: writes in one place
      - Eventually consistent: writes in more than one place
      - Writes in more than one place: network partition tolerance
      - Partition tolerance: geographically distributed servers
      - *Google uses Spanner to synchronize multiple dbs
  • 15. [Diagram: Tablet servers; Data Center A; Data Center B]
  • 16. [Diagram: Data Center A; Data Center B; Tablet servers]
  • 17. OVERVIEW
      - Both highly scalable
      - Used to build web applications that can serve millions of users at once
      - Serve as a low-latency persistence layer for real-time service of requests
      - Available in single data center or cross data center options
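
The sorted data model on slides 3 to 5 can be illustrated with a small in-memory sketch. This is not Accumulo code: the names `entries`, `table`, `keys`, and `scan_row` are hypothetical, and a plain sorted Python list stands in for the distributed, sorted tablets. It only shows why sorting by key makes a lookup of one row a cheap range scan.

```python
import bisect

# Entries from slide 5 as ((row, family, qualifier, timestamp), value),
# deliberately listed out of order.
entries = [
    (("Fred", "Email", "id0001", 20080302), "Welcome to gmail ..."),
    (("Bob", "UserPrefs", "Background", 20130101), "Grey"),
    (("Bob", "Email", "id0024", 20120302), "Re: next Thursday ..."),
    (("Sarah", "Email", "id0004", 20130201), "Hi again ..."),
    (("Bob", "Email", "id0023", 20120301), "Hey joe, can you send ..."),
]

# Keep everything sorted by key, as a BigTable-style store does.
# Simplification: Accumulo sorts timestamps newest first; this sketch
# sorts them ascending.
table = sorted(entries)
keys = [key for key, _ in table]


def scan_row(row_id):
    """Return all entries of one row by binary search over the sorted keys."""
    # ("Bob",) sorts before every key that starts with "Bob".
    start = bisect.bisect_left(keys, (row_id,))
    # ("Bob\0",) sorts after every key whose row is exactly "Bob".
    end = bisect.bisect_left(keys, (row_id + "\0",))
    return table[start:end]


if __name__ == "__main__":
    for key, value in scan_row("Bob"):
        print(key, value)
```

Because the keys are sorted, all of Bob's columns sit next to each other, so the scan touches only that contiguous slice rather than the whole table.
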
  • 18. USE CASE
      - Most data comes from users
      - Schema defined by the application
      - Data builds up over time
  • 19. [Diagram: Many Users; Db; Web application]
  • 20. ACCUMULO
  • 21. ACCUMULO
      - Can support the web application use case
      - But what are those other extra features for?
  • 22. ACCUMULO EXTRAS
      - Dynamic Column Families
      - ColumnVisibility
      - Key-value oriented API
      - Iterators
      - Batch Scanners
  • 23. BIG ORGANIZATIONS
      - Missions other than internet services
      - Various disparate operational systems that generate data
      - Desire to look across and analyze that data
      - Desire to deliver results to their own population
  • 24. USE CASE IS DISCOVERING AND ANALYZING ALL DATA
  • 25. ISSUES
      - Scale
      - Unknown or multiple schemas
      - Support for analysis without data movement
      - Varying levels of sensitivity in the same system
      - Support a high number of low-latency user requests
  • 26. [Diagram: Many Users; Analyze; Db; Data sets]
  • 27. SCALE?
  • 28. CHECK (IT'S BIGTABLE)
  • 29. NO CONTROL OVER SCHEMA, OR MANY DIFFERENT SCHEMAS?
  • 30. MAP EXISTING FIELDS TO COLUMNS DYNAMICALLY
  • 31. INCLUDING COLUMN FAMILIES
  • 32. VARYING LEVELS OF DATA SENSITIVITY?
  • 33. COLUMNVISIBILITY
  • 34. DATA MODEL
      - Key: Row ID, Column (Family, Qualifier, Visibility), Timestamp
      - Value
  • 35. Example table with visibility
      Row ID | Col Fam   | Col Qual   | Col Vis        | Timestamp | Value
      Bob    | Email     | id0023     | personal comms | 20120301  | Hey joe, can you send ...
      Bob    | Email     | id0024     | personal comms | 20120302  | Re: next Thursday ...
      Bob    | UserPrefs | Background | prefs          | 20130101  | Grey
      Fred   | Email     | id0001     | personal comms | 20080302  | Welcome to gmail ...
      Sarah  | Email     | id0004     | personal comms | 20130201  | Hi again ...
      Sara   | Videos    | ytid009    | public post    | 20100303  | nsu736:)jdu djdk $:)378;'$$)
  • 36. DATA OF VARYING SENSITIVITY LEVELS CAN BE PHYSICALLY CO-LOCATED
  • 37. FRAMEWORKS LIKE HADOOP MAPREDUCE LOVE IT WHEN DATA IS ALL TOGETHER
  • 38. LOOK ACROSS DATASETS?
  • 39. SECONDARY INDICES
  • 40. SECONDARY INDICES
      - Application-created data: known
      - Pre-existing data? unknown
  • 41. DATA DISCOVERY!
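
The per-entry visibility shown on slides 33 to 36 can be sketched as a filter applied at scan time. This is a simplified illustration, not the Accumulo API: `ENTRIES` and `scan` are hypothetical names, and each entry carries a single label, whereas Accumulo visibility expressions can combine labels with `&` and `|`.

```python
# Entries from slide 35:
# (row, family, qualifier, visibility, timestamp, value).
ENTRIES = [
    ("Bob", "Email", "id0023", "personal comms", 20120301, "Hey joe, can you send ..."),
    ("Bob", "Email", "id0024", "personal comms", 20120302, "Re: next Thursday ..."),
    ("Bob", "UserPrefs", "Background", "prefs", 20130101, "Grey"),
    ("Fred", "Email", "id0001", "personal comms", 20080302, "Welcome to gmail ..."),
    ("Sarah", "Email", "id0004", "personal comms", 20130201, "Hi again ..."),
    ("Sara", "Videos", "ytid009", "public post", 20100303, "nsu736:)jdu djdk $:)378;'$$)"),
]


def scan(entries, authorizations):
    """Yield only the entries whose visibility label the reader holds.

    Assumption: one plain label per entry; boolean label expressions
    are not handled in this sketch.
    """
    allowed = frozenset(authorizations)
    for entry in entries:
        visibility = entry[3]
        if visibility in allowed:
            yield entry


if __name__ == "__main__":
    # A reader holding only the "prefs" label sees one entry.
    for entry in scan(ENTRIES, {"prefs"}):
        print(entry)
```

All six entries live in the same table, which is the point of slide 36: the filter decides what each reader sees, so data of different sensitivity does not need to be split across systems.
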
  • 42. SECONDARY INDICES
      Primary table:
      RowID    | Col Qual | Value
      RID00001 | age      | 54
      RID00001 | name     | bob
      RID00002 | name     | fred
      RID00003 | age      | 43
      RID00003 | height   | 59
      RID00003 | name     | harry
      RID00004 | name     | carl
      RID00005 | name     | evan
      Index table:
      RowID | Col Fam | Col Qual
      43    | age     | RID00003
      54    | age     | RID00001
      59    | height  | RID00003
      bob   | name    | RID00001
      carl  | name    | RID00004
      evan  | name    | RID00005
      fred  | name    | RID00002
      harry | name    | RID00003
  • 43. PARTIAL ROW SCANS
  • 44. BATCH SCANNERS
  • 45. Batch Scanner reading the primary table
      RowID    | Col Qual | Value
      RID00001 | age      | 54
      RID00001 | name     | bob
      RID00002 | name     | fred
      RID00003 | age      | 43
      RID00003 | height   | 59
      RID00003 | name     | harry
      RID00004 | name     | carl
      RID00005 | name     | evan
  • 46. COLUMNVISIBILITY APPLIES TO INDEXES TOO
  • 47. ANALYSIS?
  • 48. MAPREDUCE: CHECK
  • 49. SHUFFLE-SORTED?
      - Between Map and Reduce phases is shuffle-sort
      - Sorting by key is necessary so all the values for a given key end up next to each other
      - BigTable also sorts keys
  • 50. ITERATORS
  • 51. Value combine(Iterator values)
  • 52. PRE-COMPUTATION
  • 53. [Diagram: Many Users; Analyze; Db; Data sets]
  • 54. ACCUMULO
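
The secondary index on slides 42 to 45 can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not Accumulo client code: `build_index`, `lookup_row_ids`, and `batch_scan` are hypothetical helpers, and `batch_scan` only mimics the idea of a Batch Scanner fetching many rows in one request.

```python
import bisect

# Primary table from slide 42: (row_id, column_qualifier, value).
PRIMARY = [
    ("RID00001", "age", "54"),
    ("RID00001", "name", "bob"),
    ("RID00002", "name", "fred"),
    ("RID00003", "age", "43"),
    ("RID00003", "height", "59"),
    ("RID00003", "name", "harry"),
    ("RID00004", "name", "carl"),
    ("RID00005", "name", "evan"),
]


def build_index(primary):
    """Invert each entry to (value, field, row_id) and keep the result sorted."""
    return sorted((value, field, row_id) for row_id, field, value in primary)


def lookup_row_ids(index, value):
    """Scan the index rows equal to `value` and return the record IDs they hold."""
    start = bisect.bisect_left(index, (value,))
    end = bisect.bisect_left(index, (value + "\0",))
    return [row_id for _, _, row_id in index[start:end]]


def batch_scan(primary, row_ids):
    """Fetch every entry of the requested rows in one pass (Batch Scanner stand-in)."""
    wanted = set(row_ids)
    return [entry for entry in primary if entry[0] in wanted]


if __name__ == "__main__":
    index = build_index(PRIMARY)
    hits = lookup_row_ids(index, "harry")
    print(batch_scan(PRIMARY, hits))
```

The lookup takes two steps: the sorted index turns a value into record IDs, and the batch fetch turns those IDs back into full records. The sorted output of `build_index` matches the index table on slide 42.
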
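
The `combine` signature on slide 51 suggests a function that reduces all values stored under one key to a single value. The following is a minimal sketch of that idea, assuming a summing combiner. The names `apply_combiner` and `counts` and the sample data are hypothetical, and the real Accumulo iterator interface differs.

```python
from itertools import groupby
from operator import itemgetter


def combine(values):
    """Reduce all values stored under one key to a single value (a sum here)."""
    return sum(values)


def apply_combiner(sorted_entries):
    """Collapse adjacent entries that share a key.

    This relies on the entries already being sorted by key, which is the
    property slide 49 points out that BigTable shares with shuffle-sort.
    """
    for key, group in groupby(sorted_entries, key=itemgetter(0)):
        yield key, combine(value for _, value in group)


if __name__ == "__main__":
    # Hypothetical sample data: one (key, count) pair per observed event.
    counts = sorted([
        ("hbase", 1), ("accumulo", 1), ("hbase", 1),
        ("bigtable", 1), ("accumulo", 1), ("hbase", 1),
    ])
    print(list(apply_combiner(counts)))
```

Since values for the same key are already adjacent, the combiner can run in a single pass, which is what makes the pre-computation on slide 52 practical.
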