SQL on Accumulo

SQL-on-Accumulo: A talk about how we may be able to do it one day. Donald Miner, May 5th, 2014

Upload: donald-miner

Post on 15-Jan-2015


DESCRIPTION

7:30 SQL-on-Accumulo - Don Miner, ClearEdge IT

Running SQL queries over data in Accumulo is easier said than done and has several nuanced design challenges that don't have clear answers. This talk will give an outline of the current state of the art in SQL-on-Accumulo technologies, while giving a realistic view of what is and is not doable today.

TRANSCRIPT

Page 1: SQL on Accumulo

SQL-on-Accumulo

A talk about how we may be able to do it one day

DONALD MINER

May 5th, 2014

Page 2: SQL on Accumulo

A brief history of Hadoop and SQL

[Timeline figure, time axis from 1980 to 2000. Label: SQL]

Page 3: SQL on Accumulo

A brief history of Hadoop and SQL

[Timeline figure, time axis from 1980 to 2000. Label: Hadoop HDFS & MapReduce]

Page 4: SQL on Accumulo

BIG DATA

SQL

Page 5: SQL on Accumulo

A brief history of Hadoop and SQL

[Timeline figure, time axis from 1980 to 2000. Label: SQL-on-Hadoop]

Page 6: SQL on Accumulo

SQL-on-Accumulo would be nice

Problem: Accumulo is just a data store. We’ll have to do the querying somewhere else.

Page 7: SQL on Accumulo

WWHBD? (What Would HBase Do?)

Page 8: SQL on Accumulo

WWHBD? - Hive

• Hive
  • Runs in MapReduce
  • Maps column families and column qualifiers to columns
  • Maintained by the Hive community

• Impala and Shark inherit functionality from Hive
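The column mapping idea on this slide can be sketched in a few lines of Python. This is only an illustration of the concept, not Hive's storage handler: the `COLUMN_MAPPING` table, the `cells_to_rows` helper, and the sample cells are all hypothetical names and data made up for this example.

```python
from collections import defaultdict

# Hypothetical mapping from SQL column name to (column family, column qualifier).
COLUMN_MAPPING = {
    "name": ("info", "name"),
    "city": ("info", "city"),
    "visits": ("stats", "visits"),
}


def cells_to_rows(cells, mapping):
    """Pivot (row_id, family, qualifier, value) cells into one dict per row ID."""
    lookup = {fam_qual: col for col, fam_qual in mapping.items()}
    rows = defaultdict(dict)
    for row_id, family, qualifier, value in cells:
        column = lookup.get((family, qualifier))
        if column is not None:  # cells outside the mapping are ignored
            rows[row_id][column] = value
    # Mapped columns with no cell come back as None, much like SQL NULL.
    return [
        {"row_id": row_id, **{col: found.get(col) for col in mapping}}
        for row_id, found in sorted(rows.items())
    ]


# Made-up sample data: a sparse key-value layout with one unmapped cell.
cells = [
    ("user1", "info", "name", "Ada"),
    ("user1", "info", "city", "London"),
    ("user1", "stats", "visits", "3"),
    ("user2", "info", "name", "Grace"),
    ("user2", "debug", "trace", "x"),
]
table = cells_to_rows(cells, COLUMN_MAPPING)
```

Each entry in `table` now looks like a relational row, which is roughly what a SQL engine needs before it can run a SELECT over a column-family store.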

Page 9: SQL on Accumulo

WWHBD? - next level

Problem: Hive, Impala, and Shark don’t know how HBase works … and don’t care

• Apache Phoenix
  • Specifically SQL-on-HBase
  • Currently an Apache incubator project
  • Client-embedded JDBC driver
  • Uses a series of scans and coprocessors

• Pivotal’s HAWQ and PXF
  • PXF is external table functionality in HAWQ
  • Native support for HAWQ: uses push down filters, range scans, etc. to efficiently slurp data into HAWQ
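To make the push down filter and range scan bullets concrete, here is a minimal Python sketch. It assumes a toy sorted in-memory store; `SortedStore`, its `scan` method, and the `entries_read` counter are hypothetical and are not part of HBase, Phoenix, HAWQ, or PXF.

```python
import bisect


class SortedStore:
    """Toy stand-in for a sorted key-value store (hypothetical, not a real API)."""

    def __init__(self, items):
        self._items = sorted(items)               # (key, value) pairs sorted by key
        self._keys = [key for key, _ in self._items]
        self.entries_read = 0                     # entries touched by scans so far

    def scan(self, start, stop, predicate=None):
        """Yield entries with start <= key < stop that pass the predicate."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key, value in self._items[lo:hi]:
            self.entries_read += 1
            if predicate is None or predicate(value):
                yield key, value


def make_store():
    return SortedStore((f"row{i:03d}", i) for i in range(1000))


# Pushed down: the key range and the value filter are both applied inside the store.
store = make_store()
pushed = list(store.scan("row100", "row200", lambda v: v % 50 == 0))
pushed_reads = store.entries_read

# Naive: read everything, then filter in the query engine.
naive_store = make_store()
naive = [
    (k, v)
    for k, v in naive_store.scan("", "~")
    if "row100" <= k < "row200" and v % 50 == 0
]
naive_reads = naive_store.entries_read
```

Both approaches return the same two rows, but the pushed-down version touches 100 entries while the naive version touches all 1000. That gap is the reason an engine that understands the store's sort order can do far less work than one that treats it as an opaque table.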

Page 10: SQL on Accumulo


ACCUMULO-143

people. technology. integrity.

Page 11: SQL on Accumulo

SQL-on-Accumulo Status

Hive (and somewhat Impala and Shark)

• GitHub project by Brian Femiano [1]
  • Doesn’t work on new versions
  • Hasn’t been touched in 9 months
  • Wasn’t committed into trunk

• Some rumors that some orgs have done it themselves (but no public information)

[1] https://github.com/bfemiano/accumulo-hive-storage-manager (google for “accumulo hive”)

Page 12: SQL on Accumulo

SQL-on-Accumulo Status

Phoenix

• Discussion on mailing list last week
• Some differences between iterators and coprocessors make this interesting

Pivotal’s HAWQ and PXF

• In development
• Will support visibility labels
• Pushdown and optimizations with iterators

Page 13: SQL on Accumulo

Visibility Design Problems

These problems are unique to Accumulo

• SELECT and visibility labels
  • Assume two cells whose only difference is the visibility label… Which do I pick in a SELECT?
  • Timestamps have this problem, but have a logical assumption (most recent)

• Authorizations in SQL
  • How do you tell the execution engine which authorizations to use?
    • Table definition? (hard to change)
    • SQL statement? (extend SQL language?)
    • Based on login? (how do you downgrade?)
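The SELECT ambiguity above can be shown with a small Python sketch. It makes a loud simplification: a visibility is modeled as a plain set of labels that must all be held, whereas real Accumulo visibility expressions also allow OR and parentheses. The `Cell` type, the helper functions, and the sample values are hypothetical.

```python
from typing import FrozenSet, NamedTuple


class Cell(NamedTuple):
    row: str
    column: str
    visibility: FrozenSet[str]  # simplification: every label is required (AND only)
    value: str


def visible_cells(cells, authorizations):
    """Keep only the cells whose required labels are all in the authorizations."""
    auths = frozenset(authorizations)
    return [cell for cell in cells if cell.visibility <= auths]


def select_value(cells, row, column, authorizations):
    """Return every candidate value a SELECT could see for one (row, column)."""
    return [
        cell.value
        for cell in visible_cells(cells, authorizations)
        if cell.row == row and cell.column == column
    ]


# Two made-up cells that are identical except for their visibility label.
cells = [
    Cell("r1", "location", frozenset({"public"}), "Maryland"),
    Cell("r1", "location", frozenset({"secret"}), "Columbia, MD"),
]

public_only = select_value(cells, "r1", "location", {"public"})
both = select_value(cells, "r1", "location", {"public", "secret"})
```

With only the `public` authorization the answer is a single value, but with both authorizations `both` holds two values for what SQL expects to be one field. The sketch also takes the authorizations as a function argument, which sidesteps the second question on the slide: where a SQL engine should get them from.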

Page 14: SQL on Accumulo


What are the next steps?

I guess that’s up to the community

Page 15: SQL on Accumulo

QUIZ: What is this definition trying to say?

Big Data:
• Volume
• Variety
• Velocity
• Veracity

A warning about SQL-on-Accumulo

Page 16: SQL on Accumulo

QUIZ: What is this definition trying to say?

Big Data:
• Volume
• Variety
• Velocity
• Veracity

Answer: RDBMS/SQL suck at all these things

A warning about SQL-on-Accumulo

Page 17: SQL on Accumulo

QUIZ: What is this definition trying to say?

Big Data:
• Volume
• Variety
• Velocity
• Veracity

Answer: RDBMS/SQL suck at all these things

A warning about SQL-on-Accumulo

What does SQL-on-Accumulo still suck at?

*Added context for my internet viewers, since this could be controversial if taken literally and I’m not here to talk through my slides: I’m trying to say that SQL-on-X can’t solve all of the world’s problems, but it can solve a good number of them very well. It also tees up the conversation that SQL is not the end-all-be-all… there are ways it could be made better to adapt to “the big data use case”. Don’t take this the wrong way: SQL-on-Hadoop and SQL-on-Accumulo would be incredibly useful, but they don’t solve 100% of the problems.

Page 18: SQL on Accumulo

SQL-on-Accumulo

DONALD MINER


[email protected] @donaldpminer

Questions?