Accumulo Summit 2014: Accumulo with Distributed SQL Queries

Copyright © 2014 by Argyle Data Inc. All Rights Reserved. 1 Real-Time Risk Analytics at Network Speed and Hadoop Scale When Minutes Means Millions

Post on 27-Jan-2015

Category:

Technology

DESCRIPTION

SQL queries are often the #1 requested feature of key/value stores. Argyle will present our integration of Accumulo with Facebook’s PrestoDB distributed query engine. We will discuss:

· Data locality between PrestoDB and Accumulo
· Predicate pushdown for row keys
· Leveraging a secondary index for column-based queries

The talk will include a live demonstration of big data benchmark queries.

TRANSCRIPT

1. Copyright 2014 by Argyle Data Inc. All Rights Reserved. Real-Time Risk Analytics at Network Speed and Hadoop Scale. When Minutes Means Millions.

2. Agenda
· About Argyle
· Use Cases we are Focusing on
· Case Study
· Architecture
· Deep Packet Inspection
· SQL on Accumulo

3. Argyle Data
· History: Founded 2009; Venture backed; 25+ employees; Headquartered in San Mateo, CA
· Vertical Markets: Mobile Communications, Finance Services, eCommerce, Federal
· Alliance program: vertical market ISV app providers

4. Argyle Data: Our Story
· Every Enterprise App Will be re-written in a better Data Driven way
· Data Driven Apps Will be Real-Time, Network Speed and Hadoop Scale
· Proven Stack for Data Driven apps

5. Pattern for Real-Time Risk Applications: Minutes Means Millions. Risk App: Same Common Pattern. Customer.
· Real-Time Call-Data: Non-Invasive Network Packet Ingestion; Call Data; Millions of Mixed Inserts/Reads/Second; Real-Time Analytics; Fast and Fresh
· Real-Time SMS-Data: Non-Invasive Network Packet Ingestion; SMS Data; Millions of Mixed Inserts/Reads/Second; Real-Time Analytics; Fast and Fresh
· Real-Time Operational Data: Non-Invasive Packet/Log File Ingestion - Text; Millions of Mixed Inserts/Reads/Second; Real-Time Analytics; Fast and Fresh

6. Real-Time Fraud Detection
· Situation: Wangiri Fraud (Missed Call); Multi-Billion Dollar Fraud; Next Day Call Data Record Analysis
· Solution: Real-Time Network DPI; Real-Time Analytics and Detection
· Scale: Ingest All Live Call Data for Whole Country; Non-Intrusive Tap 10Gb/s to 100Gb/s
· Benefit: Detect IRSF Callback Fraud in Minutes; Data Packet Lake for Multiple Apps

7. Stack Shift: Old world architecture, New world architecture
· 24 Hour ETL/DB Process; In-Memory Analytics; Patchwork Quilt Systems; App Transaction, Log Files; Application Data Silos; Complex Rules; Complex App Dev
· Real-Time Petabyte Scale Analytics; Single Hadoop Stack; Network Packet Ingestion; Network Packet Data Lake; Machine Learning at Scale; As Simple as Splunk
· 62% Moving to Hadoop Infrastructure - Gartner

8. ArgyleDB: Enabling Data Driven Risk Apps at Network Speed and Hadoop Scale
· Ingestion; Network Packet Ingestion; Deep Packet Inspection; Storage; Optimization; Universal Schema; Query; Distributed SQL; Optimization; Machine Learning; Machine Learning; Query; Search; Graph; Ingest

9. Deep Packet Inspection: A Sea of Protocols

10. Presto + Hive Architecture

11. Presto + Accumulo: From K/V to SQL

12. Parallel Architecture / Data Locality
· Collocate Presto-Accumulo Workers and Accumulo Nodes

13. Accumulo KV to Presto data model mapping: Schema-less to Schema-full
· Accumulo is schema-less
· Presto expects a predefined schema for tables
· Table definitions in ZooKeeper
· Each Presto table mapped to an Accumulo table
· Each Presto column mapped to an Accumulo colfam+colqualifier
· Use column definition to detect data type and deserialize from byte[]

14. Secondary Index: Or how to make it columnar
· Presto works well with Columnar storage
· Presto fetches individual columns, not rows
· We considered Accumulo Locality Groups
· But we decided to use a separate index table

15. Secondary Index Table
· Diagram labels: Presto Worker; Table1_index; Table1

16. Secondary Index Table
· Prefixed with a byte for sharding data (to prevent burning kindle)

Table1_index
  key     Value
  Joe     123
  Smith   123

Table1
  Key     Column      Value
  123     Firstname   Joe
  123     Lastname    Smith

17. REAL-TIME RISK ANALYTICS AT NETWORK SPEED AND HADOOP SCALE. When Minutes Means Millions.
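
The column mapping on slide 13 and the index table on slide 16 can be illustrated with a small in-memory sketch. This is not Argyle's connector code and uses no Accumulo or Presto API: plain Python dicts stand in for the two tables. The names `ColumnMapping`, `decode`, `index_key`, `build_index`, `select_where`, the `person` column family, the shard count, the CRC-based shard byte, and the 8-byte big-endian `bigint` encoding are all assumptions made for illustration. The sketch also assumes the shard byte prefixes the index key, which the slide does not state explicitly.

```python
import struct
import zlib
from dataclasses import dataclass

NUM_SHARDS = 4  # assumption: the slides do not give a shard count


@dataclass(frozen=True)
class ColumnMapping:
    """Hypothetical table-definition entry: one SQL column -> family + qualifier."""
    name: str
    family: str
    qualifier: str
    sql_type: str  # only "varchar" and "bigint" are handled in this sketch


def decode(mapping: ColumnMapping, raw: bytes):
    """Use the declared column type to deserialize the stored bytes."""
    if mapping.sql_type == "varchar":
        return raw.decode("utf-8")
    if mapping.sql_type == "bigint":
        return struct.unpack(">q", raw)[0]  # assumed encoding: 8-byte big-endian
    raise ValueError(f"unsupported type: {mapping.sql_type}")


def index_key(mapping: ColumnMapping, raw: bytes):
    """Index key = shard byte + cell value, qualified by the column it came from."""
    shard = bytes([zlib.crc32(raw) % NUM_SHARDS])  # assumed sharding scheme
    return (shard + raw, mapping.family, mapping.qualifier)


def build_index(table: dict, mappings: list) -> dict:
    """Derive the index table (value -> row keys) from the main table."""
    by_column = {(m.family, m.qualifier): m for m in mappings}
    index = {}
    for (row, family, qualifier), raw in table.items():
        mapping = by_column.get((family, qualifier))
        if mapping is not None:
            index.setdefault(index_key(mapping, raw), set()).add(row)
    return index


def select_where(table, index, mappings, column, value):
    """Roughly SELECT * ... WHERE column = value, for varchar columns only."""
    target = {m.name: m for m in mappings}[column]
    row_ids = index.get(index_key(target, value.encode("utf-8")), set())
    results = []
    for row in sorted(row_ids):
        record = {"key": row}
        for m in mappings:
            raw = table.get((row, m.family, m.qualifier))
            record[m.name] = None if raw is None else decode(m, raw)
        results.append(record)
    return results


# Data from slide 16; the "person" family is a made-up placeholder.
MAPPINGS = [
    ColumnMapping("firstname", "person", "Firstname", "varchar"),
    ColumnMapping("lastname", "person", "Lastname", "varchar"),
]
TABLE1 = {
    ("123", "person", "Firstname"): b"Joe",
    ("123", "person", "Lastname"): b"Smith",
}
TABLE1_INDEX = build_index(TABLE1, MAPPINGS)
```

Calling `select_where(TABLE1, TABLE1_INDEX, MAPPINGS, "lastname", "Smith")` looks up `Smith` in the index, finds row key `123`, and then reads only that row from the main table. Because the shard byte here is derived from the value, an equality lookup can compute it directly; a range query would instead have to scan every shard prefix.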