scaling postgresql with gridsql
Post on 16-Apr-2017
Scaling PostgreSQL with GridSQL
Who Am I?
Jim Mlodgenski
Co-organizer of NYCPUG
Founder of Cirrus Technologies
Former Chief Architect of EnterpriseDB
Agenda
What is GridSQL?
Architecture
Query Flow
Scaling
Limitations
What is GridSQL?
Shared-nothing, distributed data architecture
Leverage the power of multiple commodity servers while appearing as a single database to the application
Essentially... Open Source Greenplum, Netezza or Teradata
GridSQL Details
Designed for Parallel Querying
Not just Read-Only, can execute UPDATE, DELETE
Data Loader for parallel loading
Standard connectivity via PostgreSQL compatible connectors: JDBC, ODBC, ADO.NET, libpq (psql)
What is GridSQL not?
A replication solution like Slony or Bucardo
A high availability solution like Streaming Replication in PostgreSQL 9.0
A scalable transactional solution like Postgres-XC
An elastic, eventually consistent NoSQL database
Configuration
Can be configured for multiple logical nodes per physical server
Take advantage of multi-core processors
Tables may be either replicated or partitioned
Replicated tables for static lookup data or dimensions
Partitioned tables for large fact tables
Partitioning
Tables may simultaneously use GridSQL Partitioning with Constraint Exclusion Partitioning
Large queries scan a much smaller subset of data by using subtables
Since each subtable is also partitioned across nodes, they are scanned in parallel
Queries execute much faster
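As a rough illustration of how the two partitioning schemes combine, the Python sketch below simulates them in memory. The node count, the monthly subtable layout, the modulo stand-in for a hash, and the function names (`node_for`, `subtable_for`, `load`, `scan`) are all assumptions made for illustration; the slides do not show GridSQL's actual hash function or storage layout.

```python
from collections import defaultdict
from datetime import date

NUM_NODES = 4  # assumed number of logical nodes


def node_for(key: int) -> int:
    """Pick a node from the partitioning key (modulo as a stand-in for a hash)."""
    return key % NUM_NODES


def subtable_for(d: date) -> tuple:
    """Pick a monthly subtable, mimicking constraint-exclusion child tables."""
    return (d.year, d.month)


def load(rows):
    """Place each (orderkey, orderdate, price) row on a node and in a subtable."""
    nodes = [defaultdict(list) for _ in range(NUM_NODES)]
    for row in rows:
        orderkey, orderdate, _price = row
        nodes[node_for(orderkey)][subtable_for(orderdate)].append(row)
    return nodes


def scan(nodes, year, month):
    """Scan only the matching subtable on each node.

    Returns the matching rows and the number of rows touched per node.
    In a real grid the per-node scans would run in parallel; here they
    run one after another.
    """
    results, touched = [], []
    for node in nodes:
        part = node.get((year, month), [])
        touched.append(len(part))
        results.extend(part)
    return results, touched


# 24 hypothetical orders spread over three months of 2010.
rows = [(k, date(2010, 1 + k % 3, 1 + k % 28), 10.0 * k) for k in range(24)]
nodes = load(rows)
matches, touched = scan(nodes, 2010, 2)
```

In this toy data set, a query restricted to one month touches only that month's subtable, and the work for that subtable is split across all four nodes rather than landing on one.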
Architecture
Loosely coupled, shared-nothing architecture
Data repositories: metadata database and GridSQL database
GridSQL processes: central coordinator and agents
Query Optimization
Cost-based optimizer that takes row shipping (expensive) into account
Looks for joins with replicated tables, which can be done locally
Looks for joins between tables on partitioned columns
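The join rules above can be sketched as a small classifier. This is only a simplified illustration, not GridSQL's optimizer: a real cost-based optimizer weighs estimated costs rather than applying fixed rules. The `Table` class, the `join_strategy` function, and the partitioning keys chosen for `lineitem` and `customer` are hypothetical, and the sketch assumes both partitioned tables are spread over the same nodes with the same hash function.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Table:
    name: str
    partition_key: Optional[str] = None  # None means the table is replicated

    @property
    def replicated(self) -> bool:
        return self.partition_key is None


def join_strategy(left: Table, left_col: str, right: Table, right_col: str) -> str:
    """Classify an equi-join as node-local or as needing row shipping."""
    if left.replicated or right.replicated:
        return "local"  # every node holds a full copy of the replicated side
    if left.partition_key == left_col and right.partition_key == right_col:
        return "local"  # equal key values hash to the same node
    return "ship rows"  # rows must move between nodes before joining


region = Table("region")                      # REPLICATED, as in the slides
orders = Table("orders", "o_orderkey")        # partitioned, as in the slides
lineitem = Table("lineitem", "l_orderkey")    # assumed partitioning key
customer = Table("customer", "c_custkey")     # assumed partitioning key

plan_a = join_strategy(orders, "o_orderkey", lineitem, "l_orderkey")
plan_b = join_strategy(orders, "o_custkey", customer, "c_custkey")
plan_c = join_strategy(customer, "c_regionkey", region, "r_regionkey")
```

Here `plan_a` and `plan_c` come out as local joins, while `plan_b` needs row shipping because `orders` is not partitioned on `o_custkey`.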
Aggregation
First set of aggregates done in parallel at the nodes
Intermediate results belonging to the same group are shipped to the same target node
Second aggregation done in parallel
Coordinator streams in node results, combines them on the fly, and sends them to the client result set
If ORDER BY is present, the coordinator performs a merge sort
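The coordinator's merge step can be pictured with Python's standard `heapq.merge`, which lazily combines inputs that are each already sorted. This is a minimal sketch under the assumption that every node returns its rows pre-sorted on the ORDER BY column; the function name `coordinator_merge` and the sample rows are made up for illustration.

```python
import heapq


def coordinator_merge(node_results, key):
    """Lazily merge per-node result streams that are each already sorted by key."""
    return heapq.merge(*node_results, key=key)


# Hypothetical sorted results from three nodes: (group, value) pairs.
node_results = [
    [("A", 10), ("D", 40)],
    [("B", 20), ("E", 50)],
    [("C", 30)],
]
merged = list(coordinator_merge(node_results, key=lambda row: row[0]))
```

Because the merge is lazy, rows can be handed to the client as soon as they are known to be next in order, without buffering every node's full result first.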
Two Phase Aggregation
SUM
First phase: SUM(stat1)
Second phase: SUM2(SUM(stat1))
AVG
First phase: SUM(stat1) / COUNT(stat1)
Second phase: SUM2(SUM(stat1)) / SUM2(COUNT(stat1))
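The decomposition above can be checked with a small in-memory simulation. This is a sketch, not GridSQL code: `phase_one` and `phase_two` are hypothetical names, the two "nodes" are plain Python lists of `(group, stat1)` rows, and the second phase runs in one place rather than being spread across target nodes.

```python
from collections import defaultdict


def phase_one(node_rows):
    """On one node: partial SUM(stat1) and COUNT(stat1) per group."""
    partial = defaultdict(lambda: [0.0, 0])
    for group, stat1 in node_rows:
        partial[group][0] += stat1
        partial[group][1] += 1
    return dict(partial)


def phase_two(partials):
    """Combine partials: SUM2(SUM(stat1)) and SUM2(COUNT(stat1)) per group."""
    totals = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for group, (s, c) in partial.items():
            totals[group][0] += s
            totals[group][1] += c
    # AVG is derived only at the end, from the combined sum and count.
    return {g: {"sum": s, "avg": s / c} for g, (s, c) in totals.items()}


node_a = [("x", 1.0), ("x", 3.0), ("y", 10.0)]
node_b = [("x", 8.0), ("y", 20.0), ("y", 30.0)]
result = phase_two([phase_one(node_a), phase_one(node_b)])
```

Carrying the sum and the count separately is what keeps AVG correct. For group `x`, averaging the two per-node averages (2.0 and 8.0) would give 5.0, whereas the combined sum 12.0 over the combined count 3 gives the true average 4.0.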
Creating Tables
Tables can be partitioned or replicated
CREATE TABLE region (
    r_regionkey INTEGER NOT NULL,
    r_name CHAR(25) NOT NULL,
    r_comment VARCHAR(152)
) REPLICATED;
CREATE TABLE orders (
    o_orderkey INTEGER NOT NULL,
    o_custkey INTEGER NOT NULL,
    o_orderstatus CHAR(1) NOT NULL,
    o_totalprice DECIMAL(15,2) NOT NULL,
    o_orderdate DATE NOT NULL,
    o_orderpriority CHAR(15) NOT NULL,
    o_clerk CHAR(15) NOT NULL,
    o_shippriority INTEGER NOT NULL,
    o_comment VARCHAR(79) NOT NULL
) PARTITIONING KEY o_orderkey ON ALL;
DBT3: Query 1
SELECT
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
FROM lineitem
WHERE l_shipdate