Grokking TechTalk #20: PostgreSQL Internals 101
TRANSCRIPT
![Page 1: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/1.jpg)
Huy Nguyen
CTO, Cofounder - Holistics Software
Cofounder, Grokking Vietnam
PostgreSQL Internals 101
/post:gres:Q:L/
![Page 2: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/2.jpg)
About Me
Education:
● Pho Thong Nang Khieu, Tin 04-07
● National University of Singapore (NUS), Computer Science Major.
Work:
● Software Engineer Intern, SenseGraphics (Stockholm, Sweden)
● Software Engineer Intern, Facebook (California, US)
● Data Infrastructure Engineer, Viki (Singapore)
Now:
● Co-founder & CTO, Holistics Software
● Co-founder, Grokking Vietnam
[email protected] facebook.com/huy bit.ly/huy-linkedin
![Page 3: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/3.jpg)
● This talk covers a very small part of PostgreSQL concepts/internals
● As with any RDBMS, PostgreSQL is a complex system, and it’s still evolving.
● Mainly revolves around explaining Uber’s “MySQL vs PostgreSQL” article.
● Not Covered: Memory Management, Query Planning, Replication, etc...
Agenda
● Uber’s Article
● Table Heap
● B-Tree Index
● MVCC
● MySQL Structure
● PostgreSQL vs MySQL (Uber Use-case)
● Index-only Scan
● Heap-only Tuple (HOT)
![Page 4: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/4.jpg)
Uber migrating from PostgreSQL to MySQL
![Page 5: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/5.jpg)
Uber’s Use Case
● Table with lots of indexes (covering all or almost all columns)
● Lots of UPDATEs
⇒ MySQL handles this better than PostgreSQL
● Read more here
![Page 6: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/6.jpg)
● Everything lives under the data directory ($PGDATA), e.g. /var/lib/postgresql/9.x/main
● Each database is a folder under $PGDATA/base, named after its OID
Physical Structure
http://www.interdb.jp/pg/pgsql01.html
![Page 7: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/7.jpg)
demodb=# select oid, relname, relfilenode from pg_class where relname = 'test';
  oid   | relname | relfilenode
--------+---------+-------------
 416854 | test    |      416854
(1 row)
Physical Structure
Each table’s data is stored in one or more files named after its relfilenode (at most 1 GB each).
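To make the file layout concrete, here is a minimal Python sketch that maps a block number to the segment file that would hold it. It assumes the default 8 KB block size, 1 GB segments, and the default tablespace; the database OID 16384 in the usage note is a made-up value, and `segment_for_block` is a hypothetical helper, not a PostgreSQL API.

```python
BLOCK_SIZE = 8192                      # default page size in bytes (assumed)
SEGMENT_SIZE = 1024 ** 3               # 1 GB per segment file (assumed default)
BLOCKS_PER_SEGMENT = SEGMENT_SIZE // BLOCK_SIZE


def segment_for_block(db_oid, relfilenode, block_no):
    """Return (path relative to $PGDATA, byte offset) for a table block."""
    segment_no, block_in_segment = divmod(block_no, BLOCKS_PER_SEGMENT)
    # The first segment has no suffix; later ones are named .1, .2, ...
    suffix = "" if segment_no == 0 else f".{segment_no}"
    path = f"base/{db_oid}/{relfilenode}{suffix}"
    return path, block_in_segment * BLOCK_SIZE
```

With the relfilenode from the query above, `segment_for_block(16384, 416854, 0)` points at `base/16384/416854`, while a block just past the first gigabyte lands in `base/16384/416854.1`.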
![Page 8: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/8.jpg)
TRUNCATE table;
vs
DELETE FROM table;
![Page 9: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/9.jpg)
demodb=# select oid, relname, relfilenode from pg_class where relname = 'test';
  oid   | relname | relfilenode
--------+---------+-------------
 416854 | test    |      416854
(1 row)

demodb=# truncate test;
TRUNCATE TABLE
INSERT 0 1

demodb=# select oid, relname, relfilenode from pg_class where relname = 'test';
  oid   | relname | relfilenode
--------+---------+-------------
 416854 | test    |      416857
(1 row)
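The session above suggests why TRUNCATE and DELETE differ: TRUNCATE gives the table a brand-new file, while DELETE leaves the file in place and only marks tuples as dead. A toy Python model of that difference, in which `ToyTable` and its methods are illustrative names and not anything PostgreSQL exposes:

```python
class ToyTable:
    """Toy stand-in for a table: an OID, a relfilenode, and a list of tuples."""

    def __init__(self, oid):
        self.oid = oid
        self.relfilenode = oid      # initially the file is named after the OID
        self.tuples = []

    def insert(self, row):
        self.tuples.append({"row": row, "dead": False})

    def delete_all(self):
        # DELETE: every tuple stays in the file and is only marked dead.
        for t in self.tuples:
            t["dead"] = True

    def truncate(self, new_relfilenode):
        # TRUNCATE: switch to a fresh, empty file; the OID does not change.
        self.relfilenode = new_relfilenode
        self.tuples = []


table = ToyTable(416854)
for name in ["Alice", "Bob", "Charlie"]:
    table.insert(name)
table.delete_all()
tuples_after_delete = len(table.tuples)     # dead tuples still occupy space
table.truncate(416857)
```

In this model the three deleted tuples still sit in the file after `delete_all()`, whereas `truncate()` leaves an empty file under a new relfilenode, matching the 416854 → 416857 change shown above.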
![Page 10: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/10.jpg)
Tuple Address (ctid)
| ctid | id | name |
|---|---|---|
| (0, 2) | 1 | Alice |
| (0, 5) | 2 | Bob |
| (1, 3) | 3 | Charlie |

ctid (tuple ID): a pair of (block number, position within the block) that locates the tuple in the data file.
![Page 11: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/11.jpg)
Heap Table Structure
Page: a fixed-size block of data, 8 KB by default.
Line pointers: 4-byte entries, each holding the offset and length of one tuple within the page.
For tuples larger than about 2 KB, a special storage mechanism called TOAST is used.
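The page layout can be sketched as a toy Python class: line pointers grow forward from the page header while tuple data grows backward from the end of the page. This is a simplified model that ignores alignment and tuple headers; the 24-byte header size is an assumption of the sketch, and `ToyPage` is a hypothetical name.

```python
PAGE_SIZE = 8192          # default page size
HEADER_SIZE = 24          # page header size assumed for this sketch
LINE_POINTER_SIZE = 4     # each line pointer takes 4 bytes


class ToyPage:
    def __init__(self):
        self.line_pointers = []       # (offset, length) of each tuple
        self.lower = HEADER_SIZE      # end of the line-pointer array
        self.upper = PAGE_SIZE        # start of the tuple data area

    def free_space(self):
        return self.upper - self.lower

    def add_tuple(self, data):
        """Store a tuple; return its 1-based position in the page, or None if full."""
        if len(data) + LINE_POINTER_SIZE > self.free_space():
            return None
        self.upper -= len(data)               # tuple data grows downward
        self.lower += LINE_POINTER_SIZE       # line pointers grow upward
        self.line_pointers.append((self.upper, len(data)))
        return len(self.line_pointers)
```

The value returned by `add_tuple` plays the role of the second component of a ctid: together with the block number it identifies the line pointer, which in turn gives the tuple's offset inside the page.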
![Page 12: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/12.jpg)
● Problem: someone is reading data while someone else is writing to it
● The reader might see an inconsistent piece of data
● MVCC: allows reads and writes to happen concurrently by keeping multiple versions of a row
MVCC - Multi-version Concurrency Control
![Page 13: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/13.jpg)
MVCC - Table
| xmin | xmax | id | name |
|---|---|---|---|
| 1 | 5 | 1 | Alice |
| 2 | 3 | 2 | Bob |
| 3 | | 2 | Robert |
| 4 | | 3 | Charlie |

1. INSERT Alice
2. INSERT Bob
3. UPDATE Bob → Robert
4. INSERT Charlie
5. DELETE Alice
● xmin: ID of the transaction that inserted this tuple
● xmax: ID of the transaction that deleted or replaced this tuple (empty while the tuple is live)
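The xmin/xmax rule can be checked against the table above with a short Python sketch. It assumes every transaction has committed and that a reader sees exactly the transactions with IDs up to its snapshot ID; real visibility checks also handle in-progress and aborted transactions. `TupleVersion` and `visible_names` are illustrative names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TupleVersion:
    xmin: int                 # transaction that inserted this version
    xmax: Optional[int]       # transaction that removed it, or None if live
    id: int
    name: str


# Heap contents after the five transactions listed above.
heap = [
    TupleVersion(1, 5, 1, "Alice"),
    TupleVersion(2, 3, 2, "Bob"),
    TupleVersion(3, None, 2, "Robert"),
    TupleVersion(4, None, 3, "Charlie"),
]


def visible_names(heap, snapshot_xid):
    """Names visible to a reader that sees all transactions <= snapshot_xid."""
    return [
        t.name
        for t in heap
        if t.xmin <= snapshot_xid and (t.xmax is None or t.xmax > snapshot_xid)
    ]
```

A reader whose snapshot was taken right after transaction 2 still sees Bob, while a reader after transaction 5 sees only Robert and Charlie, even though all four tuple versions are physically present in the heap.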
![Page 14: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/14.jpg)
INSERT
http://www.interdb.jp/pg/pgsql05.html
![Page 15: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/15.jpg)
DELETE
http://www.interdb.jp/pg/pgsql05.html
![Page 16: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/16.jpg)
UPDATE
http://www.interdb.jp/pg/pgsql05.html
![Page 17: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/17.jpg)
Because each UPDATE creates a new tuple (and marks the old tuple as deleted), lots of UPDATEs quickly increase the table’s physical size.
Table Bloat
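A rough Python sketch of how bloat accumulates: one logical row is updated 1,000 times, and every update appends a new version instead of overwriting the old one. The dictionary layout and `update_row` are assumptions of this toy model, not PostgreSQL structures.

```python
def update_row(heap, position, new_value, xid):
    """Mark the current version as removed by xid and append a new version."""
    heap[position]["xmax"] = xid
    heap.append({"xmin": xid, "xmax": None, "value": new_value})
    return len(heap) - 1          # position of the new live version


heap = [{"xmin": 1, "xmax": None, "value": 0}]
current = 0
for xid in range(2, 1002):        # 1,000 updates of the same logical row
    current = update_row(heap, current, xid, xid)

live = sum(1 for t in heap if t["xmax"] is None)
dead = len(heap) - live
```

The table still holds a single live row, but the heap now stores 1,001 tuple versions, 1,000 of them dead until something cleans them up.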
![Page 18: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/18.jpg)
Index (B-tree)
Balanced search tree.
Root node and inner nodes contain keys and pointers to lower-level nodes.
Leaf nodes contain keys and pointers to the heap (ctid).
When the table gets a new tuple, a corresponding entry is added to the index tree.
[Diagram: B-tree index whose leaf entries point to heap tuples by ctid]
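As a rough illustration of the leaf level, the Python sketch below keeps index entries in a sorted list (standing in for the B-tree leaves) and maps each key to a ctid, which is then used to fetch the row from a heap dictionary. The ctid values reuse the earlier Alice/Bob/Charlie example; `ToyIndex` is a hypothetical name and a sorted list is a simplification of a real B-tree.

```python
import bisect


class ToyIndex:
    """Sorted (key, ctid) entries standing in for the B-tree leaf level."""

    def __init__(self):
        self.keys = []
        self.ctids = []

    def insert(self, key, ctid):
        pos = bisect.bisect_right(self.keys, key)
        self.keys.insert(pos, key)
        self.ctids.insert(pos, ctid)

    def lookup(self, key):
        """Return the ctids of all entries with this key (binary search)."""
        lo = bisect.bisect_left(self.keys, key)
        hi = bisect.bisect_right(self.keys, key)
        return self.ctids[lo:hi]


heap = {(0, 2): (1, "Alice"), (0, 5): (2, "Bob"), (1, 3): (3, "Charlie")}
name_index = ToyIndex()
for ctid, (_id, name) in heap.items():
    name_index.insert(name, ctid)

rows = [heap[ctid] for ctid in name_index.lookup("Bob")]
```

The index never stores the row itself, only the physical address, so the final step is always a heap read at that ctid.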
![Page 19: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/19.jpg)
Write Amplifications
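● Each UPDATE inserts a new tuple into the heap.
● New index tuples must be written to point at the new tuple
● ⇒ multiple writes for one logical update
● Extra overhead in the Write-Ahead Log (WAL)
● The extra WAL is carried over the network
● and applied on the slave (replica)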
[Diagram: B-tree index whose leaf entries point to heap tuples by ctid]
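The write count can be sketched in Python as follows. The model assumes the non-HOT case, where the new tuple version gets a new ctid and every index therefore needs a new entry, even if its column did not change. Column names and `update_postgres_style` are made up for illustration.

```python
def update_postgres_style(heap, indexes, new_row):
    """Append a new tuple version and add an entry to every index.

    Returns the number of physical writes in this toy model.
    """
    new_ctid = len(heap)          # position in the list stands in for a ctid
    heap.append(new_row)
    writes = 1                    # one heap write
    for column, index in indexes.items():
        index.setdefault(new_row[column], []).append(new_ctid)
        writes += 1               # one write per index
    return writes


heap = [{"id": 1, "email": "a@example.com", "city": "Hanoi", "status": "new"}]
indexes = {
    "id": {1: [0]},
    "email": {"a@example.com": [0]},
    "city": {"Hanoi": [0]},
}
# Only the unindexed "status" column changes.
writes = update_postgres_style(heap, indexes, {**heap[0], "status": "done"})
```

One logical change to a single column turns into four writes here (one heap tuple plus three index entries), and each of those also has to be logged in the WAL and replayed on the replica.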
![Page 20: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/20.jpg)
MySQL / InnoDB
● MVCC: Inline update of tuples
● Table Layout: B+ tree on Primary Key
● Index: points to primary key
![Page 21: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/21.jpg)
An InnoDB table is itself a B+ tree keyed on the primary key.
Leaf nodes contain the actual row data.
MySQL Table (B+ tree)
[Diagram: B+ tree keyed on the primary key; leaf nodes hold the row data]
![Page 22: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/22.jpg)
MySQL Index
● MySQL: a secondary index entry stores the primary key value, not a physical address
● A lookup on secondary index requires 2 index traversals: secondary index + primary index.
[Diagram: secondary index B-tree whose leaf entries point into the table by primary key]
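The two-step lookup can be sketched in Python with plain dictionaries standing in for the two B+ trees. The table contents and function names below are invented for illustration.

```python
# Clustered index: primary key -> full row (the table itself).
primary = {
    1: {"id": 1, "name": "Alice", "city": "Hanoi"},
    2: {"id": 2, "name": "Bob", "city": "Saigon"},
    3: {"id": 3, "name": "Charlie", "city": "Hue"},
}

# Secondary index on name: key -> primary key value.
secondary_by_name = {"Alice": 1, "Bob": 2, "Charlie": 3}


def innodb_lookup_by_name(name):
    pk = secondary_by_name[name]      # traversal 1: secondary index
    return primary[pk]                # traversal 2: primary-key B+ tree


def innodb_update_city(pk, new_city):
    # The row is changed in place; the secondary index on name still
    # points at the same primary key and is left untouched.
    primary[pk]["city"] = new_city


innodb_update_city(2, "Da Nang")
row = innodb_lookup_by_name("Bob")
```

The read pays for a second traversal, but an update that does not touch the indexed column leaves the secondary index alone, which is the trade-off the Uber use case leaned on.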
![Page 23: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/23.jpg)
https://blog.jcole.us/2013/01/10/btree-index-structures-in-innodb/
![Page 24: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/24.jpg)
PostgreSQL vs MySQL (Uber case)
| | PostgreSQL | MySQL |
|---|---|---|
| MVCC | New tuple per UPDATE | In-place update of tuple (with rollback segments) |
| Index Lookup | Stores physical address (ctid) | By primary key |
| Table Layout | Heap-table structure | Primary-key table structure |
![Page 25: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/25.jpg)
PostgreSQL vs MySQL (Uber case)
| | PostgreSQL | MySQL |
|---|---|---|
| select on primary key | log(n) + heap read | log(n) + direct read |
| update | Update all indexes; 1 data write | Do not update indexes; 2 data writes |
| select on index key | log(n) + O(1) heap read | log(n) + log(n) primary index read |
| sequential scan | Page sequential scan | Index-order scan |
![Page 26: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/26.jpg)
Index-only Scan (Covering Index)
Index on (product_id, revenue)
SELECT SUM(revenue) FROM table WHERE product_id = 123
If the index itself has all the data needed, no Heap Table lookup is required.
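A small Python sketch of the idea: the index on (product_id, revenue) is modeled as a sorted list of pairs, and the sum is computed from the index entries alone. The sample values and `sum_revenue` are made up, and the sketch leaves out tuple visibility checks.

```python
import bisect

# Index on (product_id, revenue), kept in sorted order like B-tree leaves.
index = sorted([(123, 10), (456, 7), (123, 5), (99, 1)])


def sum_revenue(index, product_id):
    """Sum revenue for one product using only the index entries."""
    # (product_id,) sorts before every (product_id, revenue) pair,
    # so this finds the first entry for that product.
    start = bisect.bisect_left(index, (product_id,))
    total = 0
    for pid, revenue in index[start:]:
        if pid != product_id:
            break
        total += revenue
    return total
```

Because both the filter column and the aggregated column live in the index, `sum_revenue(index, 123)` never touches the table rows.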
![Page 27: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/27.jpg)
Visibility Map
One bit per table page
VM[i] is set: all tuples in page i are visible to all current transactions
VM bits are only set by VACUUM (and cleared by any data-modifying operation on the page)
https://www.slideshare.net/pgdayasia/introduction-to-vacuum-freezing-and-xid
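The visibility map is what lets an index-only scan skip the heap safely. The Python sketch below is a toy model with invented data: for each index entry, the scan trusts the index when the page's bit is set and otherwise falls back to a heap fetch to check visibility.

```python
def index_only_scan(index_entries, visibility_map, heap):
    """Return (values, heap_fetches) for a toy index-only scan.

    index_entries: list of (value, (block, position)) pairs.
    visibility_map: block number -> True if the page is all-visible.
    heap: ctid -> {"visible": bool}, consulted only when the bit is not set.
    """
    values, heap_fetches = [], 0
    for value, (block, position) in index_entries:
        if visibility_map.get(block, False):
            values.append(value)            # page is all-visible: no heap read
        else:
            heap_fetches += 1               # must check the tuple in the heap
            if heap[(block, position)]["visible"]:
                values.append(value)
    return values, heap_fetches


heap = {(0, 1): {"visible": True}, (1, 1): {"visible": False}, (1, 2): {"visible": True}}
index_entries = [(10, (0, 1)), (20, (1, 1)), (30, (1, 2))]
visibility_map = {0: True, 1: False}
values, heap_fetches = index_only_scan(index_entries, visibility_map, heap)
```

In this example only page 0 is all-visible, so two of the three entries still cost a heap fetch, which is why a recently vacuumed table benefits most from index-only scans.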
![Page 28: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/28.jpg)
Heap-only Tuple (HOT)
● No index needs to be updated
Conditions:
● Must not update a column that’s indexed
● New tuple must be in the same page
http://slideplayer.com/slide/9883483/
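The two HOT conditions can be written down as a small predicate. This Python sketch is only a restatement of the conditions above; `can_use_hot` and its parameters are hypothetical, and free space is simplified to a byte count.

```python
def can_use_hot(changed_columns, indexed_columns, page_free_space, new_tuple_size):
    """True if an update qualifies as heap-only under the two conditions."""
    touches_index = bool(set(changed_columns) & set(indexed_columns))
    fits_in_same_page = new_tuple_size <= page_free_space
    return not touches_index and fits_in_same_page
```

For example, updating only `status` on a table indexed on `id` and `email` qualifies as long as the page has room, while updating `email`, or updating anything on a full page, does not.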
![Page 29: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/29.jpg)
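● Cleans up dead tuples
● Freezes old tuples (prevents transaction ID wraparound)
● Plain VACUUM only marks the space of dead tuples as reusable; it generally does not shrink the file
● VACUUM FULL rewrites the table and returns disk space to the OS, but takes an exclusive lock that blocks reads and writes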
VACUUM
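A toy Python model of the difference between the two commands, reusing the xmin/xmax table from the MVCC slide. It assumes a tuple can be reclaimed once its xmax is older than the oldest transaction still running; `vacuum` and `vacuum_full` are illustrative functions, not the real algorithms.

```python
def vacuum(heap, oldest_active_xid):
    """Free slots of dead tuples that no running transaction can still see.

    The heap keeps its length: freed slots become None and can be reused.
    """
    reclaimed = 0
    for i, t in enumerate(heap):
        if t is not None and t["xmax"] is not None and t["xmax"] < oldest_active_xid:
            heap[i] = None
            reclaimed += 1
    return reclaimed


def vacuum_full(heap, oldest_active_xid):
    """Rewrite the table into a new, compact list without reclaimable tuples."""
    vacuum(heap, oldest_active_xid)
    return [t for t in heap if t is not None]


heap = [
    {"xmin": 1, "xmax": 5, "name": "Alice"},
    {"xmin": 2, "xmax": 3, "name": "Bob"},
    {"xmin": 3, "xmax": None, "name": "Robert"},
    {"xmin": 4, "xmax": None, "name": "Charlie"},
]
reclaimed = vacuum(heap, oldest_active_xid=4)    # only Bob's old version is reclaimable
size_after_vacuum = len(heap)
compact = vacuum_full(heap, oldest_active_xid=6)
```

After the plain `vacuum` call the heap is still four slots long with one slot free for reuse; only the rewrite in `vacuum_full` produces a physically smaller table.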
![Page 30: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/30.jpg)
● Add a new column (safe)
● Add a column with a default (unsafe before PostgreSQL 11, where it rewrites the whole table)
● Add a column that is non-nullable (unsafe)
● Drop a column (safe)
● Add a default value to an existing column (safe)
● Add an index (unsafe: blocks writes unless CREATE INDEX CONCURRENTLY is used)
Safe & Unsafe Operations In PostgreSQL
http://leopard.in.ua/2016/09/20/safe-and-unsafe-operations-postgresql
![Page 31: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/31.jpg)
References
● Why Uber Engineering switched from PostgreSQL to MySQL - https://eng.uber.com/mysql-migration/
● PostgreSQL Documentations - https://www.postgresql.org/docs/current/static/
● The Internals of PostgreSQL - http://www.interdb.jp/pg/
● http://leopard.in.ua/2016/09/20/safe-and-unsafe-operations-postgresql
● http://slideplayer.com/slide/9883483/
● https://www.slideshare.net/pgdayasia/introduction-to-vacuum-freezing-and-xid
![Page 32: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/32.jpg)
Huy Nguyen
![Page 33: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/33.jpg)
Physical Structure
https://www.postgresql.org/docs/current/static/storage-file-layout.html
![Page 34: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/34.jpg)
Transaction Isolation
BEGIN TRANSACTION;
SELECT * FROM table;
SELECT pg_sleep(10);
SELECT * FROM table;
COMMIT;
Under READ COMMITTED, the second SELECT may return different data. A concurrent transaction may update the record, delete it, or insert new records, and once it commits, the second SELECT sees those changes.
Under REPEATABLE READ, the second SELECT is guaranteed to see the rows it saw in the first SELECT, unchanged. The SQL standard allows new rows inserted by a concurrent transaction during those 10 seconds to show up; PostgreSQL’s snapshot-based implementation does not show them either.
Under SERIALIZABLE, the second SELECT is guaranteed to see exactly the same rows as the first. No row appears changed or deleted, and no new row inserted by a concurrent transaction appears.
https://stackoverflow.com/questions/4034976/difference-between-read-commit-and-repeatable-read
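The difference between the first two levels comes down to when the snapshot is taken. The Python sketch below models PostgreSQL-style behavior under simplifying assumptions: the whole table is versioned as one list per commit, READ COMMITTED takes a fresh snapshot for every statement, and REPEATABLE READ keeps the snapshot from its first statement. `ToyDatabase` and `ToyTransaction` are hypothetical names.

```python
class ToyDatabase:
    def __init__(self):
        self.versions = []            # committed table states, oldest first

    def commit(self, rows):
        self.versions.append(list(rows))

    def latest(self):
        return len(self.versions) - 1


class ToyTransaction:
    def __init__(self, db, isolation):
        self.db = db
        self.isolation = isolation
        self.snapshot = None

    def select(self):
        # READ COMMITTED: new snapshot per statement.
        # REPEATABLE READ: snapshot fixed at the first statement.
        if self.isolation == "READ COMMITTED" or self.snapshot is None:
            self.snapshot = self.db.latest()
        return self.db.versions[self.snapshot]


db = ToyDatabase()
db.commit(["Alice"])
rc = ToyTransaction(db, "READ COMMITTED")
rr = ToyTransaction(db, "REPEATABLE READ")
first_rc, first_rr = rc.select(), rr.select()
db.commit(["Alice", "Bob"])           # a concurrent transaction commits
second_rc, second_rr = rc.select(), rr.select()
```

Both transactions see only Alice at first; after the concurrent commit, the READ COMMITTED transaction sees Bob as well, while the REPEATABLE READ transaction keeps seeing its original snapshot.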
![Page 35: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/35.jpg)
PostgreSQL Processes
There are multiple processes handling different use cases.
● postmaster process: handles database cluster management.
● Many backend processes (one for each connection)
● Background processes: stats collector, autovacuum, checkpoint, WAL writer, etc.
http://www.interdb.jp/pg/pgsql02.html
![Page 36: Grokking TechTalk #20: PostgreSQL Internals 101](https://reader033.vdocuments.net/reader033/viewer/2022052405/5a647ccd7f8b9a27568b4f8d/html5/thumbnails/36.jpg)
Database Cluster
● Database cluster: a collection of databases managed by a single PostgreSQL server instance on one machine.
● A database contains many database objects (schema, table, index, view, function, etc.)
● Each object is identified by an OID
Database Cluster
Database 1, Database 2, ..., Database n
tables, indexes, views, materialized views, functions, schema, sequences, ...
role (user/group