SQL-on-Accumulo with Pivotal HAWQ and PXF
Agenda
• HAWQ & PXF Overview
• Accumulo Connector - Usage
• Accumulo Connector - Advanced Features
• PXF API
• Demo
HAWQ is…
A parallel SQL query engine on Hadoop
[Diagram: four PHD nodes]
PXF is...
A fast, extensible framework connecting HAWQ to a data store of choice that exposes a parallel API
[Diagram: two panels, each showing PHD and PXF; one labeled "direct analytics", the other "indirect analytics"]
Usage
CREATE EXTERNAL TABLE <table>(<col list>)
LOCATION ('pxf://rest_host:port/<data source>?<plugin options>')
FORMAT '<type>'(<params>)
[SEGMENT REJECT LIMIT <n> [ROWS|PERCENT] LOG ERRORS INTO <err_t>];

-- direct analytics (external)
SELECT <…> FROM <table> WHERE <…>;

-- indirect analytics (internal)
INSERT INTO <hawq table> SELECT <…> FROM <table> WHERE <…>;

Any SQL operation (joins, aggregates, sorting, etc.) can be executed on the external table
Accumulo Connector - Usage
CREATE EXTERNAL TABLE <table>(<col list>)
LOCATION ('pxf://…/<accumulo table name>?profile=accumulo')
FORMAT 'custom'(formatter='pxfwritable_import');

CREATE EXTERNAL TABLE t(recordkey text, "cf1:date" date, "cf1:price" double precision)
LOCATION ('pxf://…/instance:sales?profile=accumulo')
FORMAT 'custom'(formatter='pxfwritable_import');

-- Example of a simple query
SELECT "cf1:date", max("cf1:price")
FROM t
GROUP BY "cf1:date";
Accumulo Connector - Advanced Features
Smart filtering with predicate pushdown
Excludes irrelevant tablets and filters values at the source, according to the WHERE clause of the HAWQ query.
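To make the idea of filtering at the source concrete, here is a minimal Python model of it. It is only an illustration, not the connector's code: the function name `scan_with_pushdown`, the `(column, operator, constant)` filter tuples, and the sample rows are all assumptions made for this sketch.

```python
import operator

# Comparison operators, mapped to Python functions.
OPS = {
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "==": operator.eq, "!=": operator.ne,
}


def scan_with_pushdown(rows, filters):
    """Yield only the rows that satisfy every (column, op, constant) filter.

    Hypothetical stand-in for source-side filtering: a row that fails any
    filter is dropped before it would be shipped to the query engine.
    """
    for row in rows:
        if all(col in row and OPS[op](row[col], const)
               for col, op, const in filters):
            yield row


# Sample data (made up for this sketch).
rows = [
    {"recordkey": "r1", "cf1:price": 10.0},
    {"recordkey": "r2", "cf1:price": 25.0},
    {"recordkey": "r3", "cf1:price": 40.0},
]

# Models: WHERE "cf1:price" > 20 AND "cf1:price" != 40
filters = [("cf1:price", ">", 20.0), ("cf1:price", "!=", 40.0)]
result = list(scan_with_pushdown(rows, filters))
print([r["recordkey"] for r in result])  # ['r2']
```

Only one of the three rows leaves the "source" here, which is the point of pushdown: less data crosses the wire.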
Error tables for logging badly formatted data without aborting the query
Specify the desired error threshold. After the operation, query the error table to see the rejected data and the related errors.
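The behavior of an error threshold can be sketched as follows. This is a simplified Python model, not HAWQ's implementation: the row layout, the names `load` and `RejectLimitReached`, and the choice to abort once the count of bad rows reaches the limit are assumptions made for illustration.

```python
from datetime import date


class RejectLimitReached(Exception):
    """Raised when the number of rejected rows reaches the limit."""


def parse_row(raw):
    # Assumed layout for this sketch: recordkey,date,price
    key, day, price = raw.split(",")
    return key, date.fromisoformat(day), float(price)


def load(raw_rows, reject_limit):
    """Parse rows, logging bad ones instead of failing on the first error."""
    good, errors = [], []
    for raw in raw_rows:
        try:
            good.append(parse_row(raw))
        except ValueError as exc:
            # Keep the rejected data together with the related error.
            errors.append((raw, str(exc)))
            if len(errors) >= reject_limit:
                raise RejectLimitReached(f"{len(errors)} rows rejected") from exc
    return good, errors


raw_rows = [
    "r1,2014-01-01,10.5",
    "r2,not-a-date,3.0",   # bad date
    "r3,2014-01-02,abc",   # bad price
    "r4,2014-01-03,7.25",
]
good, errors = load(raw_rows, reject_limit=3)
print(len(good), "rows loaded;", len(errors), "rows logged as errors")
```

With a limit of 3 the load completes and the two bad rows land in `errors`; lowering the limit to 2 makes the same input abort.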
Lookup table for easy access to non-textual qualifiers
Define a qualifier lookup table that translates between Accumulo-style naming and SQL-style naming.
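A qualifier lookup can be pictured as a two-way mapping between SQL-friendly column names and raw qualifiers. The Python sketch below is a hypothetical model: the mapping contents, the byte values, and the function names are invented for illustration and do not describe how the connector stores its lookup table.

```python
# Hypothetical lookup: SQL-style name -> raw Accumulo "family:qualifier" bytes.
# The second entry has a non-textual qualifier that could not be typed in SQL.
LOOKUP = {
    "sale_date": b"cf1:date",
    "price": b"cf1:\x00\x01",
}


def to_accumulo(column, lookup=LOOKUP):
    """Translate a SQL column name; unmapped names pass through as UTF-8."""
    return lookup.get(column, column.encode("utf-8"))


def to_sql(qualifier, lookup=LOOKUP):
    """Translate a raw qualifier back to its SQL-style name."""
    reverse = {raw: name for name, raw in lookup.items()}
    if qualifier in reverse:
        return reverse[qualifier]
    return qualifier.decode("utf-8")


print(to_accumulo("price"))
print(to_sql(b"cf1:date"))
```

Names that are absent from the lookup, such as `recordkey`, are passed through unchanged in this model.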
Automatic statistics for better join planning
Run ANALYZE on a PXF-Accumulo table to update HAWQ’s optimizer with table-level and attribute-level statistics from the Accumulo table.
Mechanism for storing remote credentials
The mapping between a HAWQ user’s credentials and an Accumulo user’s credentials is entered once in HAWQ and automatically passed to the Accumulo connector at runtime.
Accumulo Connector - Advanced Features
Visibility labels for enhanced security
The Accumulo connector uses Accumulo’s built-in cell-level security to ensure that users can view only the information for which they have been granted access.
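Cell-level visibility can be illustrated with a small evaluator for label expressions built from `&`, `|`, and parentheses. This Python sketch is a simplified model written for this deck, not Accumulo's implementation: the label names are made up, and treating `&` as binding tighter than `|` in unparenthesized expressions is an assumption of the sketch.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def tokenize(expr):
    expr = expr.strip()
    pos, tokens = 0, []
    while pos < len(expr):
        match = TOKEN.match(expr, pos)
        if not match:
            raise ValueError(f"bad character at position {pos} in {expr!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(expr, auths):
    """Return True if a user holding `auths` may see a cell labeled `expr`."""
    tokens = tokenize(expr)
    if not tokens:
        return True  # an empty label is treated as visible to everyone
    pos = 0

    def parse_or():
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            rhs = parse_and()
            value = value or rhs
        return value

    def parse_and():
        nonlocal pos
        value = parse_term()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            rhs = parse_term()
            value = value and rhs
        return value

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if tok in ("&", "|", ")"):
            raise ValueError(f"unexpected token {tok!r}")
        return tok in auths  # a plain label: does the user hold it?

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


print(is_visible("admin|(sales&eu)", {"sales", "eu"}))
print(is_visible("admin|(sales&eu)", {"sales"}))
```

A scan run on behalf of a user would keep only the cells for which this check returns true for that user's authorizations.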
Custom iterators for increased performance
Predicate pushdown is implemented using stackable custom iterators, which speed up the comparison operations (<, <=, >, >=, ==, !=) in a query’s WHERE clause.
Intelligent range filtering
A comparison on recordkey narrows the Accumulo connector’s scan range, minimizing the amount of data scanned and resulting in faster scans.
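One way to picture range filtering is to fold the recordkey comparisons from the WHERE clause into a single pair of scan bounds. The Python sketch below is an assumption-laden model: `key_range`, `in_range`, and the bound representation are invented here and are not the connector's API.

```python
def key_range(filters):
    """Fold (op, value) comparisons on recordkey into scan bounds.

    Returns ((low, low_inclusive), (high, high_inclusive)); a bound of
    None means the range is open on that side.
    """
    low, low_inc = None, True
    high, high_inc = None, True
    for op, value in filters:
        # Equality pins both ends of the range to the same key.
        parts = [(">=", value), ("<=", value)] if op == "==" else [(op, value)]
        for o, v in parts:
            if o in (">", ">="):
                inc = o == ">="
                if low is None or v > low or (v == low and not inc):
                    low, low_inc = v, inc
            elif o in ("<", "<="):
                inc = o == "<="
                if high is None or v < high or (v == high and not inc):
                    high, high_inc = v, inc
            # "!=" cannot narrow a contiguous range, so it is ignored here
            # and left to value filtering.
    return (low, low_inc), (high, high_inc)


def in_range(key, bounds):
    (low, low_inc), (high, high_inc) = bounds
    if low is not None and (key < low or (key == low and not low_inc)):
        return False
    if high is not None and (key > high or (key == high and not high_inc)):
        return False
    return True


# Models: WHERE recordkey > 'r3' AND recordkey <= 'r7'
keys = ["r1", "r3", "r5", "r7", "r9"]
bounds = key_range([(">", "r3"), ("<=", "r7")])
scanned = [k for k in keys if in_range(k, bounds)]
print(scanned)  # ['r5', 'r7']
```

In this model only two of the five keys fall inside the range, so the remaining keys would never be read.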
Automatic type detection
Data types are detected automatically within the iterator, ensuring that the correct comparison operations are used.
PXF API
• Fragmenter – returns a list of data source fragments and their location
• Accessor – accesses a given list of fragments, reads them and returns records
• Resolver – deserializes each record according to a given schema or technique
Distributed execution threads
Distributed database servers
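The division of labor between the three plugin roles can be sketched with a toy pipeline. The real PXF API is not reproduced here; this Python model uses invented class and method names (`ListFragmenter`, `get_fragments`, `read`, `resolve`) and an in-memory dictionary in place of a data store, purely to show how the roles hand off to each other.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    source: str  # name of the data source being read
    index: int   # position of the fragment within the source
    host: str    # where the fragment lives


class ListFragmenter:
    """Returns the fragments of a source and their locations."""

    def __init__(self, store):
        self.store = store  # {host: [raw records]}, stands in for a data store

    def get_fragments(self, source):
        return [Fragment(source, i, host)
                for i, host in enumerate(sorted(self.store))]


class ListAccessor:
    """Reads the raw records of one fragment."""

    def __init__(self, store):
        self.store = store

    def read(self, fragment):
        yield from self.store[fragment.host]


class CsvResolver:
    """Deserializes one raw record according to a schema."""

    def __init__(self, schema):
        self.schema = schema  # list of (column name, conversion function)

    def resolve(self, raw):
        fields = raw.split(",")
        return {name: cast(field)
                for (name, cast), field in zip(self.schema, fields)}


def run_query(source, fragmenter, accessor, resolver):
    # Sequential here; in a parallel engine each fragment could be
    # handled by a different worker.
    rows = []
    for fragment in fragmenter.get_fragments(source):
        for raw in accessor.read(fragment):
            rows.append(resolver.resolve(raw))
    return rows


store = {"host-a": ["r1,10.5"], "host-b": ["r2,3.0", "r3,7.25"]}
schema = [("recordkey", str), ("price", float)]
rows = run_query("sales", ListFragmenter(store), ListAccessor(store),
                 CsvResolver(schema))
print(rows)
```

Because each role touches only its own concern, swapping the data store in this model means replacing the three small classes while `run_query` stays the same.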
PXF API
• AccumuloFragmenter – returns a list of Accumulo tablets and their locations for a given table
• AccumuloAccessor – accesses a given list of fragments, reads them and returns Accumulo records; uses filter pushdown when possible
• AccumuloResolver – converts each qualifier value into something that HAWQ can understand
Live Demo
Accumulo Table Contents
User Authorizations
$PHD_ROOT/conf/pxf-profiles.xml
Define Table in HAWQ
Setting Authorizations
Executing a Simple Query
A Query With a Single Pushdown Filter
A Query With Multiple Pushdown Filters
Setting Authorizations
Executing a Query as ‘foo’
Define a Lookup Table in Accumulo
Define a Lookup Table in HAWQ
Performing a Simple Query