pxf bdam 2016

Shivram Mani (Pivotal) - PXF: A Unified Access Framework for HDFS datasets

Upload: shivram-mani

Post on 19-Jan-2017

TRANSCRIPT

Page 1: PXF BDAM 2016

Shivram Mani (Pivotal)

PXF: A Unified Access Framework for HDFS datasets

Page 2: PXF BDAM 2016

Agenda

● Motivations
● PXF Introduction
● Architecture/Design
● Developer View
● Usage/Plugins
● Value Proposition to new applications
● What's coming

Page 3: PXF BDAM 2016

Motivations: SQL on Hadoop

RDBMS

?

various formats, storages supported on HDFS

● ANSI SQL
● Cost based optimizer
● Transactions
● ...

Foreign Tables!

Page 4: PXF BDAM 2016

What is PXF?

PXF is an extension framework that does the following:

● Provides a uniform tabular view of heterogeneous data sources

● Exploits parallelism for data access

● Offers a pluggable framework for custom connectors

● Provides built-in connectors for accessing data in HDFS files, Hive/HBase tables, etc.

Page 5: PXF BDAM 2016

PXF Communication

[Diagram labels: HAWQ segments; Native Tables via libhdfs3 (written in C); External Tables via HTTP, port: 51200; Apache Tomcat hosting the PXF Webapp (REST API); Java API; Java/Thrift]

Page 6: PXF BDAM 2016

Deployment Architecture

Master: HAWQ Master Node, NN, pxf, HBase Master
DN1: pxf, HAWQ seg1, HBase Region Server1
DN2: pxf, HAWQ seg2, HBase Region Server2
DN3: pxf, HAWQ seg3, HBase Region Server3
DN4: pxf, HAWQ seg4

* PXF needs to be installed on all DNs
* PXF is recommended to be installed on the NN

Page 7: PXF BDAM 2016

PXF Components

Fragmenter: Splits the dataset into partitions. Returns the locations of each partition.

Accessor: Understands and reads/writes the fragment. Returns records.

Resolver: Converts records to a consumable format (data types).
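
The division of labor among the three components can be illustrated with a toy pipeline. This is only a sketch in Python, not the PXF Java plugin API: the class names (LineFragmenter, LineAccessor, CsvResolver), the method names, and the in-memory DATASET are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple

# Hypothetical in-memory "dataset": host name -> raw lines stored on that host.
DATASET = {
    "dn1": ["1,apple", "2,banana"],
    "dn2": ["3,cherry"],
}


@dataclass
class Fragment:
    host: str  # location of the partition


class LineFragmenter:
    """Splits the dataset into partitions and returns their locations."""

    def get_fragments(self) -> List[Fragment]:
        return [Fragment(host) for host in sorted(DATASET)]


class LineAccessor:
    """Understands one fragment and returns its raw records."""

    def read(self, fragment: Fragment) -> Iterator[str]:
        yield from DATASET[fragment.host]


class CsvResolver:
    """Converts a raw record into typed fields."""

    def resolve(self, record: str) -> Tuple[int, str]:
        ident, name = record.split(",")
        return (int(ident), name)


def scan() -> List[Tuple[int, str]]:
    """Run fragmenter -> accessor -> resolver over the whole dataset."""
    fragmenter, accessor, resolver = LineFragmenter(), LineAccessor(), CsvResolver()
    rows = []
    for fragment in fragmenter.get_fragments():
        for record in accessor.read(fragment):
            rows.append(resolver.resolve(record))
    return rows
```

Keeping the three roles separate means a new storage system would mostly need a new fragmenter and accessor, while a new serialization format would mostly need a new resolver.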

Page 8: PXF BDAM 2016

Architecture - Read Data Flow

[Diagram nodes: HAWQ Master Node, NN, pxf; DN1, pxf, HAWQ seg1. PXF components shown: Fragmenter, Accessor, Resolver]

0. select * from ext_table (location: pxf://<location>:<port>/<path>)
1. getFragments() API
2. Fragments (JSON)
3. Split mapping (fragment -> segment)
4. Query dispatched to Segment 1,2,3… (Interconnect)
5. Read() REST
6. records
7. Records (stream)
8. query result
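
The split-mapping step (fragment -> segment) can be sketched as follows. The slides do not describe the actual allocation policy, so this Python sketch assumes a simple locality-first heuristic: prefer a segment on the host that holds the fragment, otherwise fall back to the least-loaded segment. The function name and the input shapes are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def map_fragments_to_segments(
    fragments: List[Tuple[str, str]],  # (fragment id, host holding the data)
    segments: List[Tuple[str, str]],   # (segment id, host the segment runs on)
) -> Dict[str, List[str]]:
    """Assign every fragment to one segment, preferring a co-located segment."""
    if not segments:
        raise ValueError("at least one segment is required")

    assignment: Dict[str, List[str]] = {seg_id: [] for seg_id, _ in segments}
    segments_by_host = defaultdict(list)
    for seg_id, host in segments:
        segments_by_host[host].append(seg_id)

    all_segment_ids = [seg_id for seg_id, _ in segments]
    for frag_id, host in fragments:
        # Local segments if any exist on that host, otherwise every segment.
        candidates = segments_by_host.get(host) or all_segment_ids
        # Least-loaded candidate wins; ties go to the first one listed.
        target = min(candidates, key=lambda seg: len(assignment[seg]))
        assignment[target].append(frag_id)
    return assignment
```

For example, with segments on dn1 and dn2, a fragment stored on dn1 goes to the dn1 segment, while a fragment on a host without a segment goes to whichever segment currently has the fewest fragments.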

Page 9: PXF BDAM 2016

Read Data Flow - Take 2

Page 10: PXF BDAM 2016

PXF Developer View

Page 11: PXF BDAM 2016

PXF Usage

Built-in with Plugins

● HDFS

● Hive

● HBase

Community (https://bintray.com/big-data/maven/pxf-plugins/view)

● Cassandra

● Accumulo

● Redis

● ...

CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name
    ( column_name data_type [, ...] )
LOCATION ('pxf://host[:port]/path-to-data?PROFILE=<profile-name>[&custom-option=value...]')
FORMAT '[TEXT | CSV | CUSTOM]' (<formatting_properties>);
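
To make the syntax concrete, here is a small Python sketch that assembles a LOCATION URI and the surrounding statement from their parts. The helper names (pxf_location, create_external_table) are hypothetical, as are the host name, path, table, and columns in the usage note; port 51200 and the HdfsTextSimple profile are the ones named in the slides. The sketch does no quoting or escaping, so it is only suitable for trusted, simple values.

```python
from typing import List, Mapping, Optional, Tuple


def pxf_location(host: str, path: str, profile: str,
                 port: Optional[int] = None,
                 options: Optional[Mapping[str, str]] = None) -> str:
    """Build pxf://host[:port]/path?PROFILE=<profile>[&option=value...]."""
    authority = host if port is None else f"{host}:{port}"
    params = [("PROFILE", profile)] + list((options or {}).items())
    query = "&".join(f"{key}={value}" for key, value in params)
    return f"pxf://{authority}/{path.lstrip('/')}?{query}"


def create_external_table(table: str, columns: List[Tuple[str, str]],
                          location: str, fmt: str, formatting: str,
                          writable: bool = False) -> str:
    """Render the CREATE EXTERNAL TABLE statement as a string."""
    kind = "WRITABLE" if writable else "READABLE"
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE {kind} EXTERNAL TABLE {table} ({cols})\n"
        f"LOCATION ('{location}')\n"
        f"FORMAT '{fmt}' ({formatting});"
    )
```

Calling pxf_location("namenode", "/data/sales.csv", "HdfsTextSimple", port=51200) yields pxf://namenode:51200/data/sales.csv?PROFILE=HdfsTextSimple, which can then be passed to create_external_table together with a column list and a format clause.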

Page 12: PXF BDAM 2016

PXF HDFS Plugin

Fragment - Splits (blocks)

● Supports read in multiple formats (see the profiles below)

● Supports write to Sequence Files

● Chunked read optimization

● Support for stats

Profile          Description
HdfsTextSimple   Read delimited single-line records (plain text)
HdfsTextMulti    Read delimited multiline records (plain text)
Avro             Read Avro records
JSON             Supports simple/pretty-printed JSON with field projection

Page 13: PXF BDAM 2016

PXF Hive Plugin

Fragment - Splits of the file stored in table

● Text based

● SequenceFile

● RCFile

● ORCFile

● Parquet

● Avro

*Complex types are converted to text

Partition Filtering

Metadata API *

Profile    Description
Hive       Read all Hive tables (all types)
HiveRC     Hive tables stored in RC (serialized with ColumnarSerDe/LazyBinaryColumnarSerDe)
HiveText   Faster access for Hive tables stored as Text

Page 14: PXF BDAM 2016

PXF HBase Plugin

Fragment - Regions

● Read Only. Uses Profile ‘Hbase’

● Filter push down to HBase scanner

○ (Operators: EQ, NE, LT, GT, LE, GE & AND)

● Direct Mapping

● Indirect Mapping

○ Lookup table - pxflookup

○ Maps attribute name to HBase <cf:qualifier>

(row key)   mapping
sales       id=cf1:saleid
sales       cmts=cf8:comments
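
The indirect mapping can be pictured as a dictionary lookup. The Python sketch below assumes each lookup entry has the form attribute=cf:qualifier keyed by table name, as in the sales example; the function name parse_lookup_rows and the returned structure are hypothetical and do not reflect how PXF stores or reads the pxflookup table internally.

```python
from typing import Dict, Iterable, Tuple

Lookup = Dict[str, Dict[str, Tuple[str, str]]]


def parse_lookup_rows(rows: Iterable[Tuple[str, str]]) -> Lookup:
    """Build {table: {attribute: (column family, qualifier)}} from
    (row key, mapping) pairs such as ("sales", "id=cf1:saleid")."""
    lookup: Lookup = {}
    for table, mapping in rows:
        attribute, _, target = mapping.partition("=")
        family, _, qualifier = target.partition(":")
        if not (attribute and family and qualifier):
            raise ValueError(f"malformed mapping: {mapping!r}")
        lookup.setdefault(table, {})[attribute] = (family, qualifier)
    return lookup
```

With the two sales rows as input, the attribute id resolves to column family cf1 and qualifier saleid, and cmts resolves to cf8 and comments, so a query can use short attribute names instead of the full HBase column names.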

Page 15: PXF BDAM 2016

Value Proposition of PXF

● Abstracts application from external Datasource/APIs/Versions

● Focus on one data layout

● Off the shelf support for various datasources

● Extensibility. Ease of supporting custom datasources

● Provides means for Filter push down

● Dataset statistics for performance optimization

Page 16: PXF BDAM 2016

PXF with Postgres

● Uses FDW (foreign data wrapper) callback functions that interact with PXF.

[Diagram labels: FDW; HTTP, port: 51200; Apache Tomcat hosting the PXF Webapp (REST API); Java API; Java/Thrift]

Page 17: PXF BDAM 2016

What's coming

● HA

● Schema Auto Discovery (Metadata)

● Support for more dataset statistics

● Time series data optimization

● More plugins (Gemfire, Solr, etc)

● Additional Filter push down support

● Custom Output Format

Page 18: PXF BDAM 2016

Contribution

Feature Areas: Custom Plugins (storage, formats), Push Down Filters, Custom Applications

Documentation: Wiki/Docs - cwiki.apache.org/confluence/display/HAWQ/PXF and http://hawq.incubator.apache.org/docs/pxf/javadoc

Code / Review: Github (Apache) - github.com/apache/incubator-hawq/tree/master/pxf

JIRA: issues.apache.org/jira/browse/HAWQ (Component = PXF)

Join Discussion/Ask Questions: Apache DLs - [email protected]@hawq.incubator.apache.org

Github (Field): github.com/Pivotal-Field-Engineering/pxf-field

Page 19: PXF BDAM 2016

Thank you!