TRANSCRIPT
--- Page 1 ---
Really Big Elephants
Data Warehousing with PostgreSQL

Josh Berkus, MySQL User Conference 2011
--- Page 2 ---
Included/Excluded

I will cover:
● advantages of Postgres for DW
● configuration
● tablespaces
● ETL/ELT
● windowing
● partitioning
● materialized views

I won't cover:
● hardware selection
● EAV / blobs
● denormalization
● DW query tuning
● external DW tools
● backups & upgrades
--- Page 3 ---
What is a “data warehouse”?
--- Page 4 ---
synonyms etc.
● Business Intelligence
  ● also BI/DW
● Analytics database
● OnLine Analytical Processing (OLAP)
● Data Mining
● Decision Support
--- Page 5 ---
OLTP vs DW

| OLTP | DW |
|---|---|
| many single-row writes | few large batch imports |
| current data | years of data |
| queries generated by user activity | queries generated by large reports |
| < 1s response times | queries can run for hours |
| 0.5 to 5x RAM | 5x to 2000x RAM |
--- Page 6 ---
OLTP vs DW

| OLTP | DW |
|---|---|
| 100 to 1000 users | 1 to 10 users |
| constraints | no constraints |
--- Page 7 ---
Why use PostgreSQL for
data warehousing?
--- Page 8 ---
Complex Queries

```sql
SELECT
  CASE WHEN ((SUM(inventory.closed_on_hand) + SUM(changes.received) + SUM(changes.adjustments)
              + SUM(changes.transferred_in - changes.transferred_out)) <> 0)
       THEN ROUND((CAST(SUM(changes.sold_and_closed + changes.returned_and_closed) AS numeric) * 100)
              / CAST(SUM(starting.closed_on_hand) + SUM(changes.received) + SUM(changes.adjustments)
                     + SUM(changes.transferred_in - changes.transferred_out) AS numeric), 5)
       ELSE 0 END AS "Percent_Sold",
  CASE WHEN (SUM(changes.sold_and_closed) <> 0)
       THEN ROUND(100 * ((SUM(changes.closed_markdown_units_sold) * 1.0)
              / SUM(changes.sold_and_closed)), 5)
       ELSE 0 END AS "Percent_of_Units_Sold_with_Markdown",
  CASE WHEN (SUM(changes.sold_and_closed * _sku.retail_price) <> 0)
       THEN ROUND(100 * (SUM(changes.closed_markdown_dollars_sold) * 1.0)
              / SUM(changes.sold_and_closed * _sku.retail_price), 5)
       ELSE 0 END AS "Markdown_Percent",
  '0' AS "Percent_of_Total_Sales",
  CASE WHEN SUM((changes.sold_and_closed + changes.returned_and_closed) * _sku.retail_price) IS NULL
       THEN 0
       ELSE SUM((changes.sold_and_closed + changes.returned_and_closed) * _sku.retail_price)
       END AS "Net_Sales_at_Retail",
  '0' AS "Percent_of_Ending_Inventory_at_Retail",
  SUM(inventory.closed_on_hand * _sku.retail_price) AS "Ending_Inventory_at_Retail",
  "_store"."label" AS "Store",
  "_department"."label" AS "Department",
  "_vendor"."name" AS "Vendor_Name"
FROM inventory
JOIN inventory AS starting
  ON inventory.warehouse_id = starting.warehouse_id AND inventory.sku_id = starting.sku_id
LEFT OUTER JOIN (
    SELECT warehouse_id, sku_id, sum(received) AS received,
           sum(transferred_in) AS transferred_in, sum(transferred_out) AS transferred_out,
           sum(adjustments) AS adjustments, sum(sold) AS sold
    FROM movement
    WHERE movement.movement_date BETWEEN '2010-08-05' AND '2010-08-19'
    GROUP BY sku_id, warehouse_id
) AS changes
  ON inventory.warehouse_id = changes.warehouse_id AND inventory.sku_id = changes.sku_id
JOIN _sku ON _sku.id = inventory.sku_id
JOIN _warehouse ON _warehouse.id = inventory.warehouse_id
JOIN _location_hierarchy AS _store ON _store.id = _warehouse.store_id AND _store.type = 'Store'
JOIN _product ON _product.id = _sku.product_id
JOIN _merchandise_hierarchy AS _department
  ON _department.id = _product.department_id AND _department.type = 'Department'
JOIN _vendor AS _vendor ON _vendor.id = _sku.vendor_id
```
--- Page 9 ---
Complex Queries
● JOIN optimization
  ● 5 different JOIN types
  ● approximate planning for 20+ table joins
● subqueries in any clause
  ● plus nested subqueries
● windowing queries
● recursive queries
--- Page 10 ---
Big Data Features
● big tables → partitioning
● big databases → tablespaces
● big backups → PITR
● big updates → binary replication
● big queries → resource control
--- Page 11 ---
Extensibility
● add data analysis functionality from external libraries inside the database
  ● financial analysis
  ● genetic sequencing
  ● approximate queries
● create your own:
  ● data types, functions
  ● aggregates, operators
--- Page 12 ---
Community
● lots of experience with large databases
● blogs, tools, online help
“I'm running a partitioning scheme using 256 tables with a maximum of 16 million rows (namely IPv4-addresses) and a current total of about 2.5 billion rows, there are no deletes though, but lots of updates.”
“I use PostgreSQL basically as a data warehouse to store all the genetic data that our lab generates … With this configuration I figure I'll have ~3TB for my main data tables and 1TB for indexes. ”
--- Page 13 ---
Sweet Spot

[Bar chart comparing the sweet-spot data sizes of MySQL, PostgreSQL, and dedicated DW databases on a 0 to 30 scale]
--- Page 14 ---
DW Databases
● Vertica
● Greenplum
● Aster Data
● Infobright
● Teradata
● Hadoop/HBase
● Netezza
● HadoopDB
● LucidDB
● MonetDB
● SciDB
● Paraccel
--- Page 15 ---
--- Page 16 ---
How do I configure PostgreSQL for
data warehousing?
--- Page 17 ---
General Setup
● Latest version of PostgreSQL
● System with lots of drives
  ● 6 to 48 drives
    – or 2 to 12 SSDs
● High-throughput RAID
● Write-ahead log (WAL) on separate disk(s)
  ● 10 to 50 GB space
--- Page 18 ---
separate the DW workload onto its own server
--- Page 19 ---
Settings

few connections:
max_connections = 10 to 40

raise those memory limits!
shared_buffers = 1/8 to 1/4 of RAM
work_mem = 128MB to 1GB
maintenance_work_mem = 512MB to 1GB
temp_buffers = 128MB to 1GB
effective_cache_size = 3/4 of RAM
wal_buffers = 16MB
--- Page 20 ---
No autovacuum

autovacuum = off
vacuum_cost_delay = 0

● do your VACUUMs and ANALYZEs as part of the batch load process
  ● usually several of them
● also maintain tables by partitioning
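With autovacuum off, the batch job has to do the maintenance itself. A minimal sketch of the tail of a load script (table names hypothetical):

```sql
-- run as the last steps of the nightly batch load
VACUUM ANALYZE import_weblog;   -- reclaim space and refresh stats on the staging table
ANALYZE sales_2011_06;          -- fresh planner statistics on the partition just loaded
```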
--- Page 21 ---
What are tablespaces?
--- Page 22 ---
logical data extents
● lets you put some of your data on specific devices / disks

```sql
CREATE TABLESPACE history_log LOCATION '/mnt/san2/history_log';

ALTER TABLE history_log SET TABLESPACE history_log;
```
--- Page 23 ---
tablespace reasons
● parallelize access
  ● your largest “fact table” on one tablespace
  ● its indexes on another
    – not as useful if you have a good SAN
● temp tablespace for temp tables
● move key join tables to SSD
● migrate to new storage one table at a time
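For example, splitting a fact table from its indexes might look like this (a sketch; all paths and table names are hypothetical):

```sql
CREATE TABLESPACE fact_data LOCATION '/mnt/array1/fact_data';
CREATE TABLESPACE fact_indexes LOCATION '/mnt/array2/fact_indexes';
CREATE TABLESPACE temp_space LOCATION '/mnt/array3/temp';

-- fact table on one set of spindles, its index on another
ALTER TABLE sales_facts SET TABLESPACE fact_data;
CREATE INDEX sales_facts_date_idx ON sales_facts (sell_date)
    TABLESPACE fact_indexes;

-- send temporary tables and sorts to their own disks
SET temp_tablespaces = 'temp_space';
```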
--- Page 24 ---
What is ETL and how do I do it?
--- Page 25 ---
Extract, Transform, Load
● how you turn external raw data into normalized database data
  ● Apache logs → web analytics DB
  ● CSV POS files → financial reporting DB
  ● OLTP server → 10-year data warehouse
● also called ELT when the transformation is done inside the database
  ● PostgreSQL is particularly good for ELT
--- Page 26 ---
L: INSERT
● batch INSERTs into hundreds or thousands of rows per transaction
  ● row-at-a-time is very slow
● create and load import tables in one transaction
● add indexes and constraints after load
● insert several streams in parallel
  ● but not more than CPU cores
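The batching advice above might look like this in practice (a sketch; table and column names are hypothetical):

```sql
BEGIN;
-- create and load the import table in one transaction
CREATE TABLE import_sales (sell_date timestamptz, seller_id int, sale_amount numeric);
-- multi-row VALUES: hundreds or thousands of rows per statement, not one per commit
INSERT INTO import_sales VALUES
    ('2011-06-01 09:14', 101, 19.99),
    ('2011-06-01 09:15', 102, 5.25),
    ('2011-06-01 09:15', 101, 43.00);
COMMIT;

-- only after the load: indexes and constraints
CREATE INDEX import_sales_date ON import_sales (sell_date);
ALTER TABLE import_sales ADD CHECK (sale_amount >= 0);
```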
--- Page 27 ---
L: COPY
● Powerful, efficient delimited-file loader
  ● almost bug-free – we use it for backup
  ● 3-5X faster than INSERTs
  ● works with most delimited files
● Not fault-tolerant
  ● also have to know structure in advance
  ● try pg_loader for better COPY
--- Page 28 ---
L: COPY

```sql
COPY weblog_new FROM '/mnt/transfers/weblogs/weblog-20110605.csv' WITH CSV;

COPY traffic_snapshot FROM 'traffic_20110605192241'
    WITH DELIMITER '|' NULL AS 'N';

\copy weblog_summary_june TO 'Desktop/weblog-june2011.csv' WITH CSV HEADER
```
--- Page 29 ---
L: in 9.1: FDW

```sql
CREATE FOREIGN TABLE raw_hits (
    hit_time TIMESTAMP,
    page TEXT
)
SERVER file_fdw
OPTIONS (format 'csv', delimiter ';', filename '/var/log/hits.log');
```
--- Page 30 ---
L: in 9.1: FDW

```sql
CREATE TABLE hits_2011041617 AS
SELECT page, count(*)
FROM raw_hits
WHERE hit_time > '2011-04-16 16:00:00'
  AND hit_time <= '2011-04-16 17:00:00'
GROUP BY page;
```
--- Page 31 ---
T: temporary tables

```sql
CREATE TEMPORARY TABLE sales_records_june_rollup
ON COMMIT DROP AS
SELECT seller_id, location, sell_date,
       sum(sale_amount), array_agg(item_id)
FROM raw_sales
WHERE sell_date BETWEEN '2011-06-01'
                    AND '2011-06-30 23:59:59.999'
GROUP BY seller_id, location, sell_date;
```
--- Page 32 ---
in 9.1: unlogged tables
● like MyISAM without the risk

```sql
CREATE UNLOGGED TABLE cleaned_log_import AS
SELECT hit_time, page
FROM raw_hits, hit_watermark
WHERE hit_time > last_watermark
  AND is_valid(page);
```
--- Page 33 ---
T: stored procedures
● multiple languages
  ● SQL, PL/pgSQL
  ● PL/Perl, PL/Python, PL/PHP
  ● PL/R, PL/Java
● allows you to use external data processing libraries in the database
● custom aggregates, operators, more
--- Page 34 ---
```sql
CREATE OR REPLACE FUNCTION normalize_query ( queryin text )
RETURNS TEXT LANGUAGE PLPERL STABLE STRICT AS $f$
# this function "normalizes" queries by stripping out constants.
# some regexes by Guillaume Smet under The PostgreSQL License.
local $_ = $_[0];
# first clean up the whitespace
s/\s+/ /g; s/\s,/,/g; s/,(\S)/, $1/g; s/^\s//g; s/\s$//g;
# remove any double quotes and quoted text
s/\\'//g; s/'[^']*'/''/g; s/''('')+/''/g;
# remove TRUE and FALSE
s/(\W)TRUE(\W)/$1BOOL$2/gi; s/(\W)FALSE(\W)/$1BOOL$2/gi;
# remove any bare numbers or hex numbers
s/([^a-zA-Z_\$-])-?([0-9]+)/${1}0/g; s/([^a-z_\$-])0x[0-9a-f]{1,10}/${1}0x/ig;
# normalize any IN statements
s/(IN\s*)\([\'0x,\s]*\)/${1}(...)/ig;
# return the normalized query
return $_;
$f$;
```
--- Page 35 ---
```sql
CREATE OR REPLACE FUNCTION f_graph2() RETURNS text AS '
sql <- paste("SELECT id as x, hit as y FROM mytemp LIMIT 30", sep="");
str <- c(pg.spi.exec(sql));
mymain <- "Graph 2";
mysub <- paste("The worst offender is: ", str[1,3], " with ", str[1,2], " hits", sep="");
myxlab <- "Top 30 IP Addresses";
myylab <- "Number of Hits";
pdf(''/tmp/graph2.pdf'');
plot(str, type="b", main=mymain, sub=mysub, xlab=myxlab, ylab=myylab, lwd=3);
mtext("Probes by intrusive IP Addresses", side=3);
dev.off();
print(''DONE'');
' LANGUAGE plr;
```
--- Page 36 ---
--- Page 37 ---
ELT Tips
● bulk insert into a new table instead of updating/deleting an existing table
● update all columns in one operation instead of one at a time
● use views and custom functions to simplify your queries
● inserting into your long-term tables should be the very last step – no updates after!
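The "new table instead of update" tip can be sketched like this (hypothetical table and column names):

```sql
-- instead of UPDATEing raw_sales in place, build the cleaned version fresh
CREATE TABLE sales_cleaned AS
SELECT seller_id,
       lower(trim(location)) AS location,   -- all column fixes in one pass
       sell_date,
       sale_amount
FROM raw_sales
WHERE sale_amount IS NOT NULL;

-- very last step: append to the long-term table, then never update it
INSERT INTO sales_history SELECT * FROM sales_cleaned;
```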
--- Page 38 ---
What's a windowing query?
--- Page 39 ---
regular aggregate
--- Page 40 ---
windowing function
--- Page 41 ---
```sql
CREATE TABLE events (
    event_id   INT,
    event_type TEXT,
    start      TIMESTAMPTZ,
    duration   INTERVAL,
    event_desc TEXT
);
```
--- Page 42 ---
```sql
SELECT MAX(concurrent)
FROM (
    SELECT SUM(tally) OVER (ORDER BY start) AS concurrent
    FROM (
        SELECT start, 1::INT AS tally FROM events
        UNION ALL
        SELECT (start + duration), -1 FROM events
    ) AS event_vert
) AS ec;
```
--- Page 43 ---
```sql
UPDATE partition_name SET drop_month = dropit
FROM (
    SELECT round_id,
        CASE WHEN ( ( row_number() OVER (PARTITION BY team_id ORDER BY team_id, total_points) )
                    <= ( drop_lowest ) )
             THEN 0 ELSE 1 END AS dropit
    FROM (
        SELECT team.team_id, round.round_id, month_points AS total_points,
            row_number() OVER (
                PARTITION BY team.team_id, kal.positions
                ORDER BY team.team_id, kal.positions, month_points DESC
            ) AS ordinal,
            at_least, numdrop AS drop_lowest
        FROM partition_name AS rdrop
        JOIN round USING (round_id)
        JOIN team USING (team_id)
        JOIN pick ON round.round_id = pick.round_id
            AND pick.pick_period @> this_period
        LEFT OUTER JOIN keep_at_least kal
            ON rdrop.pool_id = kal.pool_id
            AND pick.position_id = ANY ( kal.positions )
        WHERE rdrop.pool_id = this_pool
            AND team.team_id = this_team
    ) AS ranking
    WHERE ordinal > at_least OR at_least IS NULL
) AS droplow
WHERE droplow.round_id = partition_name.round_id
    AND partition_name.pool_id = this_pool
    AND dropit = 0;
```
--- Page 44 ---
```sql
SELECT round_id,
    CASE WHEN ( ( row_number() OVER (PARTITION BY team_id ORDER BY team_id, total_points) )
                <= ( drop_lowest ) )
         THEN 0 ELSE 1 END AS dropit
FROM (
    SELECT team.team_id, round.round_id, month_points AS total_points,
        row_number() OVER (
            PARTITION BY team.team_id, kal.positions
            ORDER BY team.team_id, kal.positions, month_points DESC
        ) AS ordinal
```
--- Page 45 ---
stream processing SQL
● replace multiple queries with a single query
  ● avoid scanning large tables multiple times
● replace pages of application code
  ● and MB of data transmission
● SQL alternative to map/reduce
  ● (for some data mining tasks)
--- Page 46 ---
How do I partition my tables?
--- Page 47 ---
Postgres partitioning
● based on table inheritance and constraint exclusion
● partitions are also full tables
● explicit constraints define the range of the partition
● triggers or RULEs handle insert/update
--- Page 48 ---
```sql
CREATE TABLE sales (
    sell_date   TIMESTAMPTZ NOT NULL,
    seller_id   INT NOT NULL,
    item_id     INT NOT NULL,
    sale_amount NUMERIC NOT NULL,
    narrative   TEXT
);
```
--- Page 49 ---
```sql
CREATE TABLE sales_2011_06 (
    CONSTRAINT partition_date_range
        CHECK ( sell_date >= '2011-06-01'
            AND sell_date < '2011-07-01' )
) INHERITS ( sales );
```
--- Page 50 ---
```sql
CREATE FUNCTION sales_insert ()
RETURNS trigger LANGUAGE plpgsql AS $f$
BEGIN
    CASE
        WHEN NEW.sell_date < '2011-06-01' THEN
            INSERT INTO sales_2011_05 VALUES (NEW.*);
        WHEN NEW.sell_date < '2011-07-01' THEN
            INSERT INTO sales_2011_06 VALUES (NEW.*);
        WHEN NEW.sell_date >= '2011-07-01' THEN
            INSERT INTO sales_2011_07 VALUES (NEW.*);
        ELSE
            INSERT INTO sales_overflow VALUES (NEW.*);
    END CASE;
    RETURN NULL;
END;
$f$;

CREATE TRIGGER sales_insert BEFORE INSERT ON sales
FOR EACH ROW EXECUTE PROCEDURE sales_insert();
```
--- Page 51 ---
Postgres partitioning

Good for:
● “rolling off” data
● DB maintenance
● queries which use the partition key
● under 300 partitions
● insert performance

Bad for:
● administration
● queries which do not use the partition key
● JOINs
● over 300 partitions
● update performance
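Queries only benefit when the planner can prove which partitions to skip: constraint exclusion must be enabled (the default from 8.4 on), and the WHERE clause must use the partition key directly. A sketch against the `sales` partitioning from the earlier slides:

```sql
SET constraint_exclusion = partition;   -- the 8.4+ default

-- the CHECK constraints let the planner scan only sales_2011_06:
EXPLAIN SELECT sum(sale_amount)
FROM sales
WHERE sell_date >= '2011-06-01' AND sell_date < '2011-07-01';
```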
--- Page 52 ---
you need a data expiration policy
● you can't plan your DW otherwise
  ● sets your storage requirements
  ● lets you project how queries will run when the database is “full”
● will take a lot of meetings
  ● people don't like talking about deleting data
--- Page 53 ---
you need a data expiration policy
● raw import data: 1 month
● detail-level transactions: 3 years
● detail-level web logs: 1 year
● rollups: 10 years
--- Page 54 ---
What's a materialized view?
--- Page 55 ---
query results as table
● calculate once, read many times
  ● complex/expensive queries
  ● frequently referenced
● not necessarily a whole query
  ● often part of a query
● manually maintained in PostgreSQL
  ● automagic support not complete yet
--- Page 56 ---
```sql
SELECT page, COUNT(*) AS total_hits
FROM hit_counter
WHERE date_trunc('day', hit_date)
      BETWEEN now() - INTERVAL '7 days' AND now()
GROUP BY page
ORDER BY total_hits DESC LIMIT 10;
```
--- Page 57 ---
```sql
CREATE TABLE page_hits (
    page       TEXT,
    hit_day    DATE,
    total_hits INT,
    CONSTRAINT page_hits_pk PRIMARY KEY (hit_day, page)
);
```
--- Page 58 ---
each day:

```sql
INSERT INTO page_hits
SELECT page,
       date_trunc('day', hit_date) AS hit_day,
       COUNT(*) AS total_hits
FROM hit_counter
WHERE date_trunc('day', hit_date)
      = date_trunc('day', now() - INTERVAL '1 day')
GROUP BY page, date_trunc('day', hit_date);
```
--- Page 59 ---
```sql
SELECT page, total_hits
FROM page_hits
WHERE hit_day BETWEEN now() - INTERVAL '7 days' AND now();
```
--- Page 60 ---
maintaining matviews

BEST: update matviews at batch load time

GOOD: update matviews according to clock/calendar

BAD for DW: update matviews using a trigger
--- Page 61 ---
matview tips
● matviews should be small
  ● 1/10 to 1/4 of RAM
● each matview should support several queries
  ● or one really really important one
● truncate + insert, don't update
● index matviews like crazy
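The "truncate + insert" tip, applied to the `page_hits` matview from the previous slides, might be sketched as:

```sql
BEGIN;
TRUNCATE page_hits;
INSERT INTO page_hits
SELECT page,
       date_trunc('day', hit_date)::date AS hit_day,
       count(*) AS total_hits
FROM hit_counter
GROUP BY page, date_trunc('day', hit_date)::date;
-- TRUNCATE takes an exclusive lock, so keep this transaction short
COMMIT;
```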
--- Page 62 ---
Contact
● Josh Berkus: [email protected]
● blog: blogs.ittoolbox.com/database/soup
● PostgreSQL: www.postgresql.org
● pgexperts: www.pgexperts.com

Upcoming Events
● pgCon: Ottawa: May 17-20
● OpenSourceBridge: Portland: June

This talk is copyright 2010 Josh Berkus and is licensed under the Creative Commons Attribution License. Special thanks for materials to: Elein Mustain (PL/R), Hitoshi Harada and David Fetter (windowing functions), Andrew Dunstan (file_fdw)