sqrrl real time_big_data_20130411
DESCRIPTION
Sqrrl CTO Adam Fuchs discusses Sqrrl and Accumulo at the April 2013 Boston Hadoop User Group
TRANSCRIPT
![Page 1: Sqrrl real time_big_data_20130411](https://reader034.vdocuments.net/reader034/viewer/2022051608/5458be1eb1af9fc15d8b4bd6/html5/thumbnails/1.jpg)
sqrrl Secure. Scale. Adapt.
Sqrrl Data, Inc. All Rights Reserved
Adam Fuchs, CTO 11 April, 2013
![Page 2: Sqrrl real time_big_data_20130411](https://reader034.vdocuments.net/reader034/viewer/2022051608/5458be1eb1af9fc15d8b4bd6/html5/thumbnails/2.jpg)
Who We Are
Management
• Mark Terenzoni: sqrrl CEO, F5
• Ely Kahn: sqrrl VP BizDev, White House
• Adam Fuchs: sqrrl CTO, NSA
Investors
20+ years of combined Apache Accumulo engineering expertise
• Founded July 2012
• Funded August 2012
• Team includes former Tech Director of Accumulo at NSA and 6 committers/contributors
![Page 3: Sqrrl real time_big_data_20130411](https://reader034.vdocuments.net/reader034/viewer/2022051608/5458be1eb1af9fc15d8b4bd6/html5/thumbnails/3.jpg)
Our Mission
Security
Adaptivity
Scalability
![Page 4: Sqrrl real time_big_data_20130411](https://reader034.vdocuments.net/reader034/viewer/2022051608/5458be1eb1af9fc15d8b4bd6/html5/thumbnails/4.jpg)
Apache Accumulo
• Sorted, Distributed Key/Value Store
• Based on Google’s Bigtable Design
• Built on Top of Apache Hadoop and Apache ZooKeeper
• Augments and Integrates With the Hadoop Ecosystem
• Originally developed at the National Security Agency, now an Apache Software Foundation project
![Page 5: Sqrrl real time_big_data_20130411](https://reader034.vdocuments.net/reader034/viewer/2022051608/5458be1eb1af9fc15d8b4bd6/html5/thumbnails/5.jpg)
Sqrrl Enterprise Architecture
Applications
Analytics APIs: Search, Statistics, Graph, Lucene, SQL, Custom Extensions
Security & Access Controls: IAM, Encryption, DAM, Secure Code
Data Integration: ETL, Hadoop
Accumulo
![Page 6: Sqrrl real time_big_data_20130411](https://reader034.vdocuments.net/reader034/viewer/2022051608/5458be1eb1af9fc15d8b4bd6/html5/thumbnails/6.jpg)
Big Data Lessons Learned
• Start small, but design for scalability
  – One application first, then grow to hundreds
  – One gigabyte first, then grow to petabytes
• Iterative schema refinement
  – Initially, let the data define the schema
  – Refine the schema in bulk as you better understand the data
  – Middle ground between flat files and complete ontologies
• Discovery analytics as application building blocks
  – Universal search: structured and unstructured data, across data sets, low latency
  – Basic statistics: aggregations of query results, parallelized, low latency, to support big-picture analysis
  – Graphs: scalable graph analytics for analyzing how everything is connected
• Data-centric security
  – Separate modeling of security and analysis
  – Simplifies multi-tenancy and application accreditation
![Page 7: Sqrrl real time_big_data_20130411](https://reader034.vdocuments.net/reader034/viewer/2022051608/5458be1eb1af9fc15d8b4bd6/html5/thumbnails/7.jpg)
Schema Discovery
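The slide itself is a screenshot, so the following is only a rough sketch of the idea from the lessons-learned slide ("initially, let the data define the schema"). The function name `discover_schema` and the sample records are hypothetical and are not part of Sqrrl's product.

```python
from collections import Counter, defaultdict


def discover_schema(records):
    """Tally, for every field seen in the data, which value types occur and how often."""
    schema = defaultdict(Counter)
    for record in records:
        for field, value in record.items():
            schema[field][type(value).__name__] += 1
    # Convert to plain dicts so the result is easy to print or compare.
    return {field: dict(counts) for field, counts in schema.items()}


# Hypothetical input: the second record stores "age" as a string.
sample_records = [
    {"name": "a", "age": 3},
    {"name": "b", "age": "4"},
]
observed = discover_schema(sample_records)
```

A mixed tally such as the one for `age` is the kind of signal that would prompt a bulk schema refinement later.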
![Page 8: Sqrrl real time_big_data_20130411](https://reader034.vdocuments.net/reader034/viewer/2022051608/5458be1eb1af9fc15d8b4bd6/html5/thumbnails/8.jpg)
Lightweight Apps
The future of Big Data innovation is Apps, built on:
• Universal Search
• Schema-less Statistics
• Graphs
• Intuitive Languages
• Secure, Scalable, and Adaptable platforms
![Page 9: Sqrrl real time_big_data_20130411](https://reader034.vdocuments.net/reader034/viewer/2022051608/5458be1eb1af9fc15d8b4bd6/html5/thumbnails/9.jpg)
Targeted Analysis
![Page 10: Sqrrl real time_big_data_20130411](https://reader034.vdocuments.net/reader034/viewer/2022051608/5458be1eb1af9fc15d8b4bd6/html5/thumbnails/10.jpg)
Big-Picture Analytics
![Page 11: Sqrrl real time_big_data_20130411](https://reader034.vdocuments.net/reader034/viewer/2022051608/5458be1eb1af9fc15d8b4bd6/html5/thumbnails/11.jpg)
Data-Centric Security
Definition: A form of security in which data carries with it the elements of provenance that are required to make policy decisions on its releasability.
• Separate data modeling for Security and Analysis
• Reusability of applications across security domains
• Distributed development of ingest and query applications
• Supported by Accumulo’s cell-level security
![Page 12: Sqrrl real time_big_data_20130411](https://reader034.vdocuments.net/reader034/viewer/2022051608/5458be1eb1af9fc15d8b4bd6/html5/thumbnails/12.jpg)
Cell-Level Security
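As a minimal sketch of what cell-level security means in practice: each cell carries a visibility label, and a reader sees the cell only if their authorizations satisfy that label. The evaluator below is a simplified stand-in, not Accumulo's implementation. It assumes labels are built from terms joined by `|` (or) and `&` (and) with optional parentheses, and that an empty label is visible to everyone. The function names are hypothetical, and the labels mirror the ones used in the deck's key/value example.

```python
import re

# A term is a run of label characters; operators and parentheses are single tokens.
_TOKEN = re.compile(r"\s*([A-Za-z0-9_.:/-]+|[&|()])")


def _tokenize(expr):
    tokens, pos = [], 0
    expr = expr.strip()
    while pos < len(expr):
        match = _TOKEN.match(expr, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos} in {expr!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(label, authorizations):
    """Return True if a reader holding `authorizations` may see a cell with `label`."""
    if not label.strip():
        return True  # assumption: an unlabeled cell is visible to everyone
    tokens = _tokenize(label)
    pos = 0

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("label ends unexpectedly")
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if token in ("&", "|", ")"):
            raise ValueError(f"unexpected {token!r}")
        return token in authorizations

    def parse_expr():
        nonlocal pos
        value = parse_term()
        operator = None
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            if operator is None:
                operator = tokens[pos]
            elif tokens[pos] != operator:
                raise ValueError("mixing & and | requires parentheses")
            pos += 1
            right = parse_term()
            value = (value and right) if operator == "&" else (value or right)
        return value

    result = parse_expr()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


# A reader holding only PCP_JD sees the cholesterol cell but not the mental health cell.
pcp_auths = {"PCP_JD"}
can_read_cholesterol = is_visible("JD|PCP_JD", pcp_auths)
can_read_mental_health = is_visible("JD|PSYCH_JD", pcp_auths)
```

Because the check depends only on the label stored with the data and the reader's authorizations, the same application code can serve readers with different access.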
![Page 13: Sqrrl real time_big_data_20130411](https://reader034.vdocuments.net/reader034/viewer/2022051608/5458be1eb1af9fc15d8b4bd6/html5/thumbnails/13.jpg)
Scalable Data-Centric Security
Data Labeler Accumulo Apps
User Attributes
Audits
Policies
HDFS, Zookeeper
End Users
Auth. Service
Policy Engine
![Page 14: Sqrrl real time_big_data_20130411](https://reader034.vdocuments.net/reader034/viewer/2022051608/5458be1eb1af9fc15d8b4bd6/html5/thumbnails/14.jpg)
Accumulo’s Strengths
• Security
  – Cell-level security reduces the cost of application development in the presence of complex legal or policy restrictions on data use
  – IAM and encryption ties into enterprise security standards
• Scalability
  – Proven reliability and performance at the multi-petabyte scale
  – High-performance parallel I/O library
• Adaptivity
  – Flexible schema support to quickly ingest new data sources
  – Sorted key/value paradigm supports a multitude of search and analysis applications
  – Server-side programming framework “iterator trees” support best-in-class aggregation, filtering, and complex query semantics
![Page 15: Sqrrl real time_big_data_20130411](https://reader034.vdocuments.net/reader034/viewer/2022051608/5458be1eb1af9fc15d8b4bd6/html5/thumbnails/15.jpg)
Accumulo Key Structure
An Accumulo key is a 5-tuple, consisting of:
• Row: Controls Atomicity
• Column Family: Controls Locality
• Column Qualifier: Controls Uniqueness
• Visibility Label: Controls Access
• Timestamp: Controls Versioning

Accumulo Key/Value Example
Row       Col. Fam.     Col. Qual.     Visibility   Timestamp  Value
John Doe  Notes         PCP            PCP_JD       20120912   Patient suffers from an acute …
John Doe  Test Results  Cholesterol    JD|PCP_JD    20120912   183
John Doe  Test Results  Mental Health  JD|PSYCH_JD  20120801   Pass
John Doe  Test Results  X-Ray          JD|PHYS_JD   20120513   1010110110100…
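To make the 5-tuple concrete, here is a small sketch that models keys as Python tuples and sorts them the way Accumulo orders entries: ascending by row, column family, column qualifier, and visibility, then newest timestamp first. Python string comparison stands in for Accumulo's byte-wise ordering. The older cholesterol reading ("190") is a hypothetical row added only to show versioning, and `sort_key` and `latest_versions` are illustrative names, not Accumulo API calls.

```python
from collections import namedtuple

Key = namedtuple("Key", "row family qualifier visibility timestamp")


def sort_key(key):
    """Ascending on the first four elements, descending on timestamp."""
    return (key.row, key.family, key.qualifier, key.visibility, -key.timestamp)


def latest_versions(sorted_entries):
    """Keep only the first (newest) entry for each row/family/qualifier/visibility."""
    seen, result = set(), []
    for key, value in sorted_entries:
        cell = key[:4]
        if cell not in seen:
            seen.add(cell)
            result.append((key, value))
    return result


entries = [
    (Key("John Doe", "Test Results", "Cholesterol", "JD|PCP_JD", 20120912), "183"),
    (Key("John Doe", "Test Results", "X-Ray", "JD|PHYS_JD", 20120513), "1010110110100…"),
    (Key("John Doe", "Notes", "PCP", "PCP_JD", 20120912), "Patient suffers from an acute …"),
    # Hypothetical older reading, not on the slide, to illustrate versioning.
    (Key("John Doe", "Test Results", "Cholesterol", "JD|PCP_JD", 20120801), "190"),
    (Key("John Doe", "Test Results", "Mental Health", "JD|PSYCH_JD", 20120801), "Pass"),
]
entries.sort(key=lambda entry: sort_key(entry[0]))
current = latest_versions(entries)
```

After sorting, all of one row's cells are adjacent and the two cholesterol versions sit next to each other with the newer one first, which is why versioning can be handled by a single pass over the sorted stream.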
![Page 16: Sqrrl real time_big_data_20130411](https://reader034.vdocuments.net/reader034/viewer/2022051608/5458be1eb1af9fc15d8b4bd6/html5/thumbnails/16.jpg)
Accumulo Architecture
Components: Application (x3); Tablet Server with Tablet (x3); Master; Zookeeper (x3); HDFS
Connections: Read/Write; Store/Replicate; Assign/Balance; Delegate Authority (x2)
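As a rough illustration of how an application could find the tablet server responsible for a row: a table is split into tablets that each cover a contiguous range of sorted rows, and each tablet is assigned to one tablet server. The class below is a toy model under those assumptions. `TabletLocator`, the split points, and the server names are hypothetical, and the round-robin assignment merely stands in for the master's assign/balance role.

```python
import bisect


class TabletLocator:
    """Toy model: tablet i covers rows in (split[i-1], split[i]]; the last tablet is open-ended."""

    def __init__(self, split_points, servers):
        self.splits = sorted(split_points)
        tablet_count = len(self.splits) + 1
        # Round-robin placement as a stand-in for the master's assignment decisions.
        self.assignment = {t: servers[t % len(servers)] for t in range(tablet_count)}

    def tablet_for(self, row):
        # bisect_left keeps a row equal to a split point in the tablet that ends there.
        return bisect.bisect_left(self.splits, row)

    def server_for(self, row):
        return self.assignment[self.tablet_for(row)]


locator = TabletLocator(["g", "p"], ["tserver-1", "tserver-2", "tserver-3"])
```

For example, `locator.server_for("apple")` resolves to the first tablet's server, after which reads and writes for that row go directly to that tablet server.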
![Page 17: Sqrrl real time_big_data_20130411](https://reader034.vdocuments.net/reader034/viewer/2022051608/5458be1eb1af9fc15d8b4bd6/html5/thumbnails/17.jpg)
Tablet Data Flow
Components: Tablet; In-Memory Map; Write Ahead Log (For Recovery); Sorted, Indexed File (x3); Iterator Tree (x3)
Operations: Writes; Reads; Scan; Minor Compaction; Merging / Major Compaction
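One way to read the labels on this slide is as a log-structured data flow: writes go to a write-ahead log and an in-memory map, a minor compaction flushes the map to a sorted file, a major compaction merges files, and a scan merges the map with all files. The sketch below models that flow with plain Python containers. It is a simplification: keys are plain strings rather than 5-tuples, "newest source wins" replaces timestamp-based versioning, and the `Tablet` class and `memory_limit` parameter are hypothetical.

```python
import heapq


class Tablet:
    def __init__(self, memory_limit=2):
        self.write_ahead_log = []   # replayed for recovery if the server fails
        self.memory_map = {}        # recent writes, not yet flushed
        self.files = []             # sorted lists of (key, value), oldest first
        self.memory_limit = memory_limit

    def write(self, key, value):
        self.write_ahead_log.append((key, value))
        self.memory_map[key] = value
        if len(self.memory_map) >= self.memory_limit:
            self.minor_compact()

    def minor_compact(self):
        """Flush the in-memory map to a new sorted file."""
        if self.memory_map:
            self.files.append(sorted(self.memory_map.items()))
            self.memory_map = {}
            self.write_ahead_log = []  # flushed entries no longer need the log

    def major_compact(self):
        """Merge all sorted files into one."""
        if self.files:
            self.files = [list(self._merge(self.files))]

    @staticmethod
    def _merge(sources):
        # Tag entries with the negated source age so the newest source sorts first per key.
        tagged = [[(k, -age, v) for k, v in src] for age, src in enumerate(sources)]
        previous = object()
        for key, _, value in heapq.merge(*tagged):
            if key != previous:
                yield key, value
                previous = key

    def scan(self):
        """Return a sorted view over the files plus the in-memory map."""
        return list(self._merge(self.files + [sorted(self.memory_map.items())]))


tablet = Tablet(memory_limit=2)
tablet.write("b", "1")
tablet.write("a", "2")      # second write triggers a minor compaction
tablet.write("b", "3")      # newer value for "b" lives in the in-memory map
scan_before = tablet.scan()
tablet.write("c", "4")      # triggers a second minor compaction
tablet.major_compact()
scan_after = tablet.scan()
```

In this model the merge step is the natural place to hang an iterator tree, since every entry passes through it on scan and on compaction.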
![Page 18: Sqrrl real time_big_data_20130411](https://reader034.vdocuments.net/reader034/viewer/2022051608/5458be1eb1af9fc15d8b4bd6/html5/thumbnails/18.jpg)
Iterator Framework
Iterator Operations:
• File Reads
• Block Caching
• Merging
• Deletion
• Isolation
• Locality Groups
• Range Selection
• Column Selection
• Cell-level Security
• Versioning
• Filtering
• Aggregation
• Partitioned Joins
[email protected] | @sqrrl_inc | 617.520.4375 sqrrl data, INC., All Rights Reserved
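The operations listed on this slide compose because each iterator consumes a sorted stream of entries and produces another sorted stream. The sketch below shows that stacking idea with Python generators for three of the operations (merging, filtering, aggregation). It is not Accumulo's iterator API: the function names are hypothetical, and keys are simplified to plain strings.

```python
import heapq
from itertools import groupby


def merging_iterator(*sources):
    """Merge several sorted (key, value) streams into one sorted stream."""
    return heapq.merge(*sources, key=lambda entry: entry[0])


def filtering_iterator(source, predicate):
    """Pass through only the entries for which predicate(key, value) is true."""
    return (entry for entry in source if predicate(*entry))


def summing_iterator(source):
    """Aggregate consecutive entries that share a key by summing their values."""
    for key, group in groupby(source, key=lambda entry: entry[0]):
        yield key, sum(value for _, value in group)


# Two sorted inputs, e.g. two files holding partial counts per term.
file_a = [("apple", 2), ("cherry", 1)]
file_b = [("apple", 3), ("banana", -1), ("cherry", 4)]

# Build the tree bottom-up: merge, then filter, then aggregate.
tree = summing_iterator(
    filtering_iterator(merging_iterator(file_a, file_b), lambda key, value: value > 0)
)
totals = list(tree)
```

Because every stage is lazy, entries flow through the whole tree one at a time, so the aggregation never needs the full data set in memory.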
![Page 19: Sqrrl real time_big_data_20130411](https://reader034.vdocuments.net/reader034/viewer/2022051608/5458be1eb1af9fc15d8b4bd6/html5/thumbnails/19.jpg)
Table Design
• No built-in secondary indices
• Sort Order ⇔ Index
• Balance between ingest and query
• Avoid introducing bottlenecks
• Preserve cell-level security and scalability

Table:             Forward Index  Inverted Index
Row:               <UUID>         <Term>
Column Family:     <Type>         <Type> + <Field>
Column Qualifier:  <Field>        <UUID>
Value:             <Term>         <Digest of Event>
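The two-table layout can be sketched as follows: each event produces one forward-index entry per field (keyed by UUID) and one inverted-index entry per term (keyed by the term), so a term lookup becomes a range scan over sorted rows. Several details here are assumptions for illustration only: the `:` separator between type and field, the SHA-1 hash standing in for the "digest of event", and the names `index_entries` and `lookup`.

```python
import bisect
import hashlib
import json
import uuid


def index_entries(event_type, fields, event_id=None):
    """Return (forward, inverted) entries as (row, family, qualifier, value) tuples."""
    event_id = event_id or str(uuid.uuid4())
    # Assumption: a hash of the serialized event stands in for the digest.
    digest = hashlib.sha1(json.dumps(fields, sort_keys=True).encode("utf-8")).hexdigest()
    forward, inverted = [], []
    for field, term in sorted(fields.items()):
        term = str(term)
        forward.append((event_id, event_type, field, term))
        inverted.append((term, f"{event_type}:{field}", event_id, digest))
    return forward, inverted


def lookup(inverted_index, term):
    """Range-scan a sorted inverted index and return the UUIDs of events containing term."""
    start = bisect.bisect_left(inverted_index, (term,))
    matches = []
    for row, _, event_id, _ in inverted_index[start:]:
        if row != term:
            break
        matches.append(event_id)
    return matches


# Hypothetical events with fixed IDs so the output is reproducible.
forward_1, inverted_1 = index_entries("email", {"from": "alice", "to": "bob"}, event_id="e1")
forward_2, inverted_2 = index_entries("chat", {"user": "alice"}, event_id="e2")
inverted_index = sorted(inverted_1 + inverted_2)
alice_events = lookup(inverted_index, "alice")
```

Each write to the forward index costs one extra write to the inverted index, which is the ingest/query balance the slide refers to.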
![Page 20: Sqrrl real time_big_data_20130411](https://reader034.vdocuments.net/reader034/viewer/2022051608/5458be1eb1af9fc15d8b4bd6/html5/thumbnails/20.jpg)
Ecosystem Architecture
Apache HDFS
Apache Accumulo
Sqrrl Enterprise
Custom Ingester Web Server Custom Analytic Map/Reduce Task
Sqrrl API over Apache Thrift RPC: Hierarchical Documents + Graphs, Lucene + SQL + more
Accumulo RPC: Sorted Key/Value I/O
Hadoop RPC: File I/O
![Page 21: Sqrrl real time_big_data_20130411](https://reader034.vdocuments.net/reader034/viewer/2022051608/5458be1eb1af9fc15d8b4bd6/html5/thumbnails/21.jpg)
Contact
sqrrl data, inc.
275 Third St.
Cambridge, MA 02142
617-902-0784
www.sqrrl.com
@sqrrl_inc