
Key/Value Pair versus hstore: Benchmarking Entity-Attribute-Value Structures in PostgreSQL

HSR Hochschule für Technik Rapperswil

Institut für Software

Oberseestrasse 10

Postfach 1475



Table of Contents

Table of Figures

Tables

Figures

Listings

List Of Abbreviations

1 Introduction

1.1 Project description

1.2 Restrictions on the scope of the project

2 Overview

2.1 PostgreSQL

2.2 Key Value Pair

2.3 Hstore

2.3.1 Functions

2.3.2 Working principle

2.4 Benchmark Tools

2.4.1 Pgbench

2.4.2 HSR Texas Geo Database Benchmark

3 Benchmark proposal

Appendix





Table of Figures

Tables

Table 1: KVP additional information table

Table 2: KVP table

Table 3: Columns of a test dataset record

Table 4: Number of dataset length

Table 5: Input parameters for test data generator

Table 6: Input parameters for benchmark application

Table 7: Hardware specification of system under test

Table 8: Software specification of system under test

Table 9: Hstore table abstract

Table 10: KVP table abstract

Table 11: Hstore tuple examples

Figures





Figure 17: Overview of KVP with index on key and combined index against hstore

Figure 18: 10 to 2.5K: KVP with index on key and combined index against hstore

Figure 19: Index size overview

Figure 20: Index size for 10 to 5000 records

Listings

Listing 1: Hstore data type definition

Listing 2: Registering a PostgreSQL operator

Listing 3: Defining a PostgreSQL function

Listing 4: KVP Benchmark Table

Listing 5: Hstore Benchmark Table

Listing 6: KVP index

Listing 7: Hstore index

Listing 8: KVP select example

Listing 9: KVP select example

Li i 10 T f d d KVP SQL 12




List Of Abbreviations

Abbreviation    Description

bash            Bourne-again shell

CPU             Central Processing Unit

DB2             Commercial relational database management system developed by IBM

GiST            Generalized Search Tree

KVP             Key Value Pair

ms              milliseconds

MSSQL           Commercial relational database management system produced by Microsoft

OpenFTS         Open source full text search engine

Oracle          Commercial object-relational database management system produced by Oracle Corporation

PgSQL           PostgreSQL, an open source object-relational database

PostGIS         Adds support for geographic objects to the PostgreSQL object-relational database


1 Introduction

This chapter describes the scope of the project, its boundaries and its restrictions. The overall goal is to benchmark the performance of key value pairs stored in plain PostgreSQL tables against the PostgreSQL hstore data type.

1.1 Project description

As part of this term paper, a project evolved to benchmark PostgreSQL key value pairs, further referred to as KVP, against PostgreSQL in combination with the hstore data type, further referred to as hstore (probably an abbreviation for hash storage structure, à la Perl hash). Hstore has been part of the PostgreSQL distribution since version 8.2 as an additional module; it is a storage for semi-structured data with GiST index access. The PostgreSQL core distribution does not know of key value pair (KVP) information in a single attribute. That means it is not possible to store an associative array, e.g. {surname : John, name : Smith}, in an attribute and query John's name. This additional functionality was introduced by Oleg Bartunov and Teodor Sigaev and enhanced by Andrew Gierth under the name hstore. Hstore is an extension for PostgreSQL that provides a new data type and a set of functions to store and query KVP information. Dictionary and associative array are the general terms for this abstract data type (ADT), which stores key value pairs (KVP). It handles pairs, also known as items, of keys and their corresponding values. Most modern scripting languages support dictionaries or associative arrays as a primary container type. KVP is also called the entity attribute value (EAV) model or the object attribute value model.


2 Overview

Today PostgreSQL has a huge community, not only because it is free, but also because it has many extensions, such as the geospatial extension PostGIS or the hstore module mentioned before. This chapter first describes what PostgreSQL is, then explains the difference between key value pairs (KVP) as a table structure and KVP using the hstore data type.

Subchapters 2.2 Key Value Pair and 2.3 Hstore show that KVP stores the key and one or more related values in different table columns, whereas hstore introduces a new abstract data type that stores an associative array, in the form of unique keys and related values, within a single table column. KVP is suitable for easy data storage and data capture, for rows with many attributes that are rarely examined, and for semi-structured data.

2.1 PostgreSQL

PostgreSQL is an open source relational database, or more precisely an object-relational database according to (PostgreSQL Global Development Group). In its life of over 15 years, PostgreSQL has earned a proven standing in different application fields. This is because it implements a set of capabilities that are well known from proprietary software vendors like Oracle, IBM DB2 or Microsoft, and it also provides features such as scalability, maintainability and asynchronous replication. This and more makes PostgreSQL a real competitor to proprietary software vendors in companies of different sizes, and as of today PostgreSQL is an enterprise class database (PostgreSQL Global Development Group).





enhanced with an identifier and an additional table is needed to store additional information.

From this it follows that the base schema of the KVP structure looks like the following.

Table: bench_kvp_info

id : Integer | attribute_1 : Text | ... | attribute_n : Text

Table 1: KVP additional information table

Table: bench_kvp

id_fk : Integer | key : Text | value : Text

Table 2: KVP table

The bench_kvp_info table holds a unique identifier for a specific entity and the fixed data that belongs to it. For example, it could hold the information about a restaurant such as street, postal code, phone number and so on. The bench_kvp table stores additional information that is not foreseeable, such as details that describe the restaurant more specifically, like the type of cuisine, a URL to its homepage, and so on.

Key value pairs are pieces of information that specify an entry more exactly but are not necessarily mandatory for all data in the information table. This structure allows new non-mandatory information to be added easily without touching the table schema.

In PostgreSQL it can be set up as follows:

CREATE TABLE bench_kvp_info (
    id integer PRIMARY KEY,





Inserting a tuple is as easy as creating an attribute of type hstore:

INSERT INTO bench_hstore(kvp_hstore) VALUES (
    hstore('id=>1, surname=>McNeal, forename=>Bob')
);

INSERT INTO bench_hstore(kvp_hstore) VALUES (
    hstore('id=>2, surname=>Gates')
);

Figure 3: hstore insert SQL example

2.3.1 Functions

As you can see in the above example, the length of the array may vary from tuple to tuple. It is important to see that each entry of the associative array consists of a key and a value, and that entries are separated by commas, e.g.

hstore('id=>2, surname=>Gates')

hstore('<key>=><value>, ..., <key>=><value>')

This means that we have two different unique keys, id and surname, and each unique key has a value: for id it is 2 and for surname it is Gates. Unique means that a key can only be defined once in a tuple. For example, the key surname can only appear once in the same tuple; the following hstore is not allowed:





Or maybe you want to know all possible keys in an hstore:

SELECT skeys(kvp_hstore) AS keys FROM bench_hstore GROUP BY keys;

To get each key only once in the result list, a GROUP BY clause on keys needs to be added to the statement.

2.3.2 Working principle

Hstore is implemented in C as a PostgreSQL add-on and provides an SQL script to install the data type and all the PostgreSQL functions. Hstore tries to build a buffer over all the keys in the hstore value, provided they are in alphabetical order. If they are not, some special functions sort the array first to bring it into alphabetical order. Hstore as a data type is defined as follows:

CREATE TYPE hstore (
    INTERNALLENGTH = -1,
    INPUT = hstore_in,
    OUTPUT = hstore_out,
    RECEIVE = hstore_recv,
    SEND = hstore_send,
    STORAGE = extended
);

Listing 1: Hstore data type definition

The important parameter is INPUT, which is linked to a C function. The hstore_in function parses the hstore string into a C structure that holds the key, the value, the lengths of the key and the value, as well as the position in the array. The position is needed because the array is not really





For more information please visit the official PostgreSQL hstore documentation.

2.4 Benchmark Tools

Two programs are worth mentioning for benchmarking PostgreSQL. Both pgbench and the HSR Texas Geo Database Benchmark run SQL statements in sequential mode against the database under test. For the test proposed in this paper, a dedicated benchmark tool has been written to test the desired hypotheses.

2.4.1 Pgbench

Pgbench is shipped in the PostgreSQL distribution package and runs tests on a PostgreSQL instance in sequential mode. Sequential mode means that the same SQL statements are run over and over, possibly in multiple concurrent database sessions, which matches the multi-processing architecture. At the end of the benchmark it reports the average number of transactions per second. Pgbench provides its own scripting language to customize the test scripts, so that you can use your own data sets and test SQL statements. In addition, it includes some industry-standard test cases, which let you compare PostgreSQL with other database products (Smith, 2010, p. 189).

Custom scripts in pgbench allow you to create your own test scripts. They can handle statements with variables, which are known in Java as prepared statements. It is possible to define a SELECT statement like this:





tor the behavior. For all these queries the program provides test data that comes from Texas, USA. A test script looks like this:

SELECT count(*) FROM {dataset polygons} pg
WHERE ST_Intersects(@bbox, pg.geo);

This SELECT statement counts all polygons that intersect with a given bounding box @bbox.

In general, the benchmark program is based on a cube. Different queries can be run on different systems by using different datasets. Each dataset is installed on each system, and all queries are run on all systems for every dataset. This guarantees that the different systems can be compared, because all of them use the same data and statements.

Figure 4: HSR Texas Geo Database Benchmark Cube

Source: (Krummenacher, 2009)
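The cube can be read as the Cartesian product of systems, datasets and queries. The following is a minimal sketch in Python under that reading; the system, dataset and query names are hypothetical placeholders and are not taken from the HSR Texas Geo Database Benchmark itself.

```python
from itertools import product

# Hypothetical names, chosen only for illustration.
systems = ["postgis", "other_db"]
datasets = ["texas_small", "texas_large"]
queries = ["count_intersecting_polygons", "count_points_in_bbox"]


def benchmark_runs(systems, datasets, queries):
    """Return every (system, dataset, query) combination of the cube."""
    return list(product(systems, datasets, queries))


runs = benchmark_runs(systems, datasets, queries)
for system, dataset, query in runs:
    print(f"run {query} on {system} with {dataset}")
```

Because every system sees every dataset and every query, the number of runs is simply the product of the three dimensions.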


3 Benchmark proposal

Before we can have a look at the benchmarking utility and the results, we first need to consider what the ingredients of a benchmark are. A benchmark can be divided into three different processes, which are described in this chapter.

3.1 Terms

The term benchmark is closely related to the term test. In the Cambridge dictionary, benchmark is defined as follows:

"a level of quality which can be used as a standard when comparing other things" (Cambridge University Press).

"That by which the existence, quality, or genuineness of anything is or may be determined; [...]" (Oxford University Press).

The definitions suggest that two different things of approximately the same kind need to be put in contrast to each other. That means that the things need to be converted into a form that makes them comparable. At this point the term test comes into play, which describes exactly this transformation:

"an act of using something to find out whether it is working correctly or how effective it is" (Cambridge University Press)





So some questions need to be considered here:

- What exactly do I want to test?
- What should the data be, and can they be transformed in the test process?
- Does the data fit into the given environment?
- Etc.

All these questions are fundamental and often seem very easy to answer at first. However, finding the right data that fit into the environment and the test process is not that easy, e.g.

- test data has a wrong encoding and cannot be loaded,
- data does not cover the whole test design, and no significant result can be achieved,
- etc.

3.3 Execution phase

The test execution phase defines the design of the chosen test technique. In general it can be divided into the following fields:

Load Testing: measures and establishes benchmarks for the system under test by pushing transactions to the system. The load can be incremental, or it can be a set amount that is proportional to the values of the system.

Performance Testing: a test that is run repeatedly until acceptable performance levels are achieved through database tuning activities.





3.5 Performance Benchmark Design

To benchmark KVP and hstore in PostgreSQL, the decision has been taken in favor of a performance test. In this benchmark it is not important how stable and scalable PostgreSQL is; it is more interesting how KVP and hstore perform in a given preconfigured PostgreSQL environment.

3.5.1 Table Schema

As described in chapters 2.2 Key Value Pair and 2.3 Hstore, the table schemas need to be defined in such a way that the comparison between KVP and hstore is fair. The goal of the schema definitions is that both represent an associative array for the data that needs to be stored. It is not important to have an identical representation of the key value pairs in the database; what matters is that the same kind of information, at the same granularity, is stored and queried. It means that the data are not foreseeable, in the sense that additional information could be provided for a specific data record.

In this benchmark we use the following table schemas to represent the associative array in a database table.

For KVP

CREATE TABLE bench_kvp_id (bench_id BIGINT PRIMARY KEY);

CREATE TABLE bench_kvp (
    bench_id BIGINT REFERENCES bench_kvp_id(bench_id),
    key TEXT NOT NULL,





CREATE INDEX kvpidx1 ON bench_kvp (key);

CREATE INDEX kvpidx2 ON bench_kvp (key, value);

Listing 6: KVP index

The index for KVP shall be tested in two different ways: firstly with a single index on the key attribute, and secondly with a combined index on the attributes key and value.

For hstore an index can be created as follows:

CREATE INDEX hidx ON bench_hstore USING GIST(bench_hstore);

Listing 7: Hstore index

3.5.2 Statements

To query a tuple based on a key value pair, KVP and hstore each have their own SELECT statement. Because KVP needs a new tuple for each key value pair, we first have to find the unique identifier that belongs to the key value pair, and then we can select the information we need. This example selects all the information of a person with the surname McNeal:

SELECT * FROM bench_kvp WHERE bench_id = (
    SELECT bench_id FROM bench_kvp
    WHERE key = 'surname' AND value = 'McNeal');

Listing 8: KVP select example
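The two-step lookup of the KVP statement can be tried without a PostgreSQL server. The sketch below uses Python's built-in sqlite3 module purely as a stand-in, so the column types differ from the benchmark schema; the sample persons come from the hstore insert example, and the ORDER BY clause is added only to make the printed result deterministic.

```python
import sqlite3

# SQLite is a stand-in here so the sketch runs without a server;
# the benchmark itself targets PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bench_kvp_id (bench_id INTEGER PRIMARY KEY);
    CREATE TABLE bench_kvp (
        bench_id INTEGER REFERENCES bench_kvp_id(bench_id),
        key TEXT NOT NULL,
        value TEXT
    );
""")

# One tuple per key value pair, grouped by the identifier of the person.
people = {
    1: {"surname": "McNeal", "forename": "Bob"},
    2: {"surname": "Gates"},
}
for bench_id, pairs in people.items():
    conn.execute("INSERT INTO bench_kvp_id (bench_id) VALUES (?)", (bench_id,))
    conn.executemany(
        "INSERT INTO bench_kvp (bench_id, key, value) VALUES (?, ?, ?)",
        [(bench_id, key, value) for key, value in pairs.items()],
    )

# Inner query: find the identifier for the pair surname = McNeal.
# Outer query: fetch every key value pair stored for that identifier.
rows = conn.execute("""
    SELECT * FROM bench_kvp WHERE bench_id = (
        SELECT bench_id FROM bench_kvp
        WHERE key = 'surname' AND value = 'McNeal')
    ORDER BY key
""").fetchall()
print(rows)
```

The result contains both tuples of person 1, which shows why the KVP variant always needs the subquery before it can return a complete record.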

With hstore, we first need to convert the attribute that stores the hstore string into an hstore





surname : Text     Mandatory. A fancy name.

forename : Text    Optional. A fancy name. Can be empty to have a variable KVP length.

zip : Integer      Optional. A number between 1000 and 9000. Can be empty to have a variable KVP length.

comment : Text     Optional. A dummy text. Can be empty to have a variable KVP length.

Table 3: Columns of a test dataset record

and an excerpt of a test file looks as follows:

id,forename,surname,zip,comment
1,cucyp,,6593,lorem ipsum dolor sit amet consetetur sadipscing elitr sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat sed diam voluptua at vero eos et accusam et justo duo dolores et ea rebum stet clita kasd gubergren no sea takimata sanctus est lorem ipsum dolor sit amet
2,kasarzyc,ecnalehad,8463,
3,inwa,,,

Figure 5: Example of a test data file

The test data records need to be transformed into valid KVP SQL statements like these:

INSERT INTO bench_kvp_info(id) VALUES (1);

INSERT INTO bench_kvp(id, key, value) VALUES (1, 'id', '1');

INSERT INTO bench_kvp(id, key, value) VALUES (1, 'forename', 'cucyp');

INSERT INTO bench_kvp(id, key, value) VALUES (1, 'zip', '6593');

INSERT INTO bench_kvp(id, key, value) VALUES (1, 'comment', 'lorem ipsum dolor sit amet consetetur sadipscing elitr ...');
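The transformation from a test data record to KVP statements could look roughly like the following Python sketch. It is not the code of the original tool: the function name is hypothetical, and the %s placeholders assume a DB-API driver with the "format" parameter style (psycopg2 is one example), since the paper does not name the driver used.

```python
import csv
import io


def record_to_kvp_inserts(record):
    """Turn one CSV record (a dict) into parameterized KVP INSERT statements.

    Empty fields are skipped, so records yield a variable number of pairs.
    """
    record_id = int(record["id"])
    statements = [("INSERT INTO bench_kvp_info(id) VALUES (%s);", (record_id,))]
    for key, value in record.items():
        if value == "":
            continue  # optional column left empty: no key value pair
        statements.append(
            ("INSERT INTO bench_kvp(id, key, value) VALUES (%s, %s, %s);",
             (record_id, key, value))
        )
    return statements


# Two records taken from the example test data file.
sample = "id,forename,surname,zip,comment\n2,kasarzyc,ecnalehad,8463,\n3,inwa,,,\n"
for record in csv.DictReader(io.StringIO(sample)):
    for sql, params in record_to_kvp_inserts(record):
        print(sql, params)
```

Passing the values as parameters rather than pasting them into the SQL string leaves quoting to the driver, which matters for the free-text comment column.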





Datasets

10 records 100 records 500 records




Table 4: Number of dataset length

Additionally, for KVP and hstore one test cycle, meaning a test phase and a benchmark phase, takes place once with a database index and once without for each dataset length. In total 144 test cycles were executed:

Figure 6: Number of test cycles

# of length defines the number of different dataset lengths that need to be run; # of types means the number of different data structures that need to be tested, in our case KVP and hstore. # of indices means that the test is run once with indexed data and once without. Warm start means that each configuration is run three times, with the transaction time measured in the third round. This results in 12 × 2 × 2 × 3 = 144 test cycles.
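The cycle count given in the formula belonging to Figure 6 is a plain product of its four factors. A minimal Python sketch of that arithmetic, with the factor labels paraphrased from the formula:

```python
from math import prod

# Factors as given in the cycle formula of the paper.
factors = {
    "dataset lengths": 12,
    "types (KVP, hstore)": 2,
    "index settings (with, without)": 2,
    "warm start runs": 3,
}

total_cycles = prod(factors.values())
print(total_cycles)
```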

3.6 Test Application

For benchmarking KVP and hstore, a dedicated application has been written in Python. It supports all three phases: Generate / Preprocessing, Execution, and Benchmark / Analysis. It has been written

12 [# of length] × 2 [# of types] × 2 [# of indices] × 3 [warm start] = 144 [cycles]





Figure 7: Test / benchmark application incl. test data generator

[Diagram labels: data generator, .csv data set, benchmark, task queue, response queue, n standalone processes (adapters) for KVP and hstore, log and graphs]





Parameter               Description

-t or --type            String: Type of database test. Currently pgsql and pgsqlhstore are supported.

-x or --processes       Integer: The number of parallel processes that should be allocated. If this parameter is not set, the software tries to determine the maximum number of parallel processes of the CPU architecture.

-s or --server          String: Server or hostname where the database runs, e.g. localhost.

-p or --port            Integer: Port of the database server.

-d or --database        String: Database name.

-u or --user            String: A user who has the rights to create tables, do insertions and run queries.

-w or --password        String: User's password.

-a or --data            String: File that includes all test records.

-i or --index           Boolean: Defines whether an index on the tables should be allocated or not.

-n or --no-hot-start    Boolean: The flag defines if a hot start is required. If it is not set, the test runs 2 times before the transaction time is measured in the third round.

-l or --log             Boolean: Whether a log file should be created or not.
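A command line interface with these parameters could be declared with Python's argparse module as sketched below. This is an illustration, not the original benchmark.py: the defaults are assumptions, the password flag follows the -w spelling used in the example invocations of step_3.sh, and the -g flag that appears in those invocations is left out because it is not described in the table.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented parameters (illustrative only)."""
    parser = argparse.ArgumentParser(description="KVP vs. hstore benchmark (sketch)")
    parser.add_argument("-t", "--type", choices=["pgsql", "pgsqlhstore"], required=True)
    parser.add_argument("-x", "--processes", type=int, default=None)
    parser.add_argument("-s", "--server", default="localhost")  # assumed default
    parser.add_argument("-p", "--port", type=int, default=5432)  # assumed default
    parser.add_argument("-d", "--database")
    parser.add_argument("-u", "--user")
    parser.add_argument("-w", "--password")
    parser.add_argument("-a", "--data")
    parser.add_argument("-i", "--index", action="store_true")
    parser.add_argument("-n", "--no-hot-start", action="store_true")
    parser.add_argument("-l", "--log", action="store_true")
    return parser


# Arguments modeled on the hstore run without index from step_3.sh.
args = build_parser().parse_args(
    ["-t", "pgsqlhstore", "-s", "localhost", "-p", "5432", "-d", "benchmark",
     "-u", "benchmark", "-w", "benchmark", "-a", "data/testdata_10.csv", "-l", "-n"]
)
print(args)
```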


4 Benchmark May 2011

This chapter specifies the hardware and software used for the system under test and describes the results as well as the findings.

4.1 Technical specification

The server on which KVP and hstore are tested has the following hardware specification:

Type               Comment

Processor

CPU                Intel(R) Xeon(R) CPU E5520 @ 2.27GHz

Instruction set    64-bit

# of cores         4

# of threads       8

# of CPUs          2

Memory

Total              24732476 kB (about 24 GB)

Speed              1066 MHz

Idle mode          Only Ubuntu and PostgreSQL are running on the same hard disk, whereas 23677164 kB (about 23 GB) of RAM is free





4.2.1 Preparing Database

To prepare the database, you first need to log in as a user that has the privilege to create a database, such as the user postgres:

sudo su postgres

As the postgres user you can run the first script, called step_1.sh, which creates a new database user benchmark and a database called benchmark, and runs the hstore script that installs the hstore data type and a set of PostgreSQL functions.

./step_1.sh

The content of the step_1.sh script is as follows:

#!/bin/bash
# first login as user with privilege to create a database e.g. sudo su postgres

# create new user benchmark
createuser -l -D -R -S benchmark

# alter user's password
psql -U postgres -c "ALTER USER benchmark WITH PASSWORD 'benchmark'"

# create new database benchmark
createdb -U postgres benchmark

# create language plpgsql on benchmark database
createlang -U postgres -d benchmark plpgsql





# create test data sets
# - for 500
python generator.py -a 500 -t words -l 50
mv data/testdata.csv data/testdata_500.csv
# etc.

Figure 9: Test data generation script

4.2.3 Executing Benchmark

Now all prerequisites are fulfilled and the benchmark can be started. An example script is available for this step as well. Run the following command to benchmark KVP and hstore, once with index and once without, based on the datasets generated by the previous script.

./step_3.sh

The step_3.sh script includes the following statements.

#!/bin/bash
############
# for hstore
############
# - for 10
# - without index
python benchmark.py -t pgsqlhstore -s localhost -p 5432 -d benchmark -u benchmark -w benchmark -a data/testdata_10.csv -l -g -n
mv output/1.png output/hstore_10_1.png





mv output/log_summary.csv output/kvp_10_log_summary.csv
psql -d benchmark -c "EXPLAIN ANALYZE SELECT * FROM bench_kvp WHERE bench_id = (SELECT bench_id FROM bench_kvp WHERE key = 'id' AND value = '7');" > output/analyze.log
mv output/analyze.log output/kvp_10_analyze.log

# - with index
python benchmark.py -t pgsql -s localhost -p 5432 -d benchmark -u benchmark -w benchmark -a data/testdata_10.csv -i -l -g -n
mv output/1.png output/kvp_10_index_1.png
mv output/2.png output/kvp_10_index_2.png
mv output/log.csv output/kvp_10_index_log.csv
mv output/log_summary.csv output/kvp_10_index_log_summary.csv
psql -d benchmark -c "EXPLAIN ANALYZE SELECT * FROM bench_kvp WHERE bench_id = (SELECT bench_id FROM bench_kvp WHERE key = 'id' AND value = '7');" > output/analyze.log
mv output/analyze.log output/kvp_10_index_analyze.log
# etc.

Figure 10: Benchmark script example

4.3 Results

The test was executed in May 2011 based on the test design described in chapter 3 and the hardware specification in chapter 4.1. All the test logs were aggregated into a single file showing the start and end time as well as the duration and the average time in seconds per SELECT statement. The detailed aggregation and the full extent dia-





Figure 11: Overview KVP vs. hstore benchmark

From 10 to approximately 500 records, KVP is much faster at querying a key value pair. Beyond that, hstore demonstrates its strength, especially with more than 2500 tuples in the hstore table: compared to the KVP table with an index, hstore is faster by a factor of 4.04, and without an index by a factor of 7.9. This sounds like KVP is a performance killer, which is not true, because if we look at the absolute querying time per SELECT statement, then KVP needs in





The situation changes if we use a combined index for the KVP table on the attributes key and value. Hstore is still a bit faster than the KVP schema, but the gap in the average SELECT transaction time to the indexed KVP table shrinks considerably. Overall we can say that the combined index comes very close to the hstore schema. Although the combined index gives some performance boost, we can still see the problem that the more tuples we store in the KVP table, the larger the difference between hstore and KVP becomes. From this we conclude that the more arbitrary data needs to be stored, the faster the combined index grows and the longer PostgreSQL needs to find the right key value pair combination. But more on that later in chapter 4.4 Findings.





Figure 14: Benchmark KVP hstore from 10 to 2.5K with combined index

Taking a closer look at the data sets between 10 and 2500 tuples, we see that the average transaction time on a data set of 2500 records shrank from 0.01512 seconds (15.12 ms) to 0.00948 seconds (9.48 ms), which means that the combined index is faster by a factor of 1.6.
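The factor of 1.6 is the ratio of the two average transaction times quoted above. A short Python check of that arithmetic, using only the two values from the text:

```python
previous_ms = 15.12        # average time before the combined index, 2500 records
combined_index_ms = 9.48   # average time with the combined index, 2500 records

speedup = previous_ms / combined_index_ms
print(f"speedup: {speedup:.2f}")
```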




Small data sets show the opposite picture. From 10 to 500 records the difference between hstore and KVP with a combined index is negative. This means that KVP is faster than hstore. From roughly 500 records upwards hstore is faster, even if only by a little.

Figure 16: 10 to 2.5K: Difference between KVP single and combined index





Figure 18: 10 to 2.5K: KVP with index on key and combined index against hstore

Analyzing the indices of hstore and KVP shows that KVP needs a lot more disk space to build the index. This holds for both an index on the attribute key and a combined index on the attributes key and value, whereby the combined index needs more disk space than the index on the attribute key





The difference is significant especially in the range of 1000 tuples and more. At 1000 tuples, KVP with an index on the attribute key needs 3.73 times more disk space than the GiST index on an hstore, and 3.55 times more with the combined index.

Figure 20: Index size for 10 to 5000 records





duo dolores et ea rebum stet clita kasd gubergren no sea takimata sanctus est lorem ipsum dolor sit amet ", "surname"=>"ebsaveq", "forename"=>"maeznidus"

2       "id"=>"2", "zip"=>"6489", "comment"=>"lorem ipsum dolor sit amet consetetur sadipscing elitr sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat sed diam voluptua at vero eos et accusam et justo duo dolores et ea rebum stet clita kasd gubergren no sea takimata sanctus est lorem ipsum dolor sit amet ", "surname"=>"epofod", "forename"=>"teer"

2500    "id"=>"2500", "comment"=>"lorem ipsum dolor sit amet consetetur sadipscing elitr sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat sed diam voluptua at vero eos et accusam et justo duo dolores et ea rebum stet clita kasd gubergren no sea takimata sanctus est lorem ipsum dolor sit amet ", "forename"=>"sorietet"

Table 9: Hstore table abstract

What we need to do is analyze an example query. As an example we take this one:

EXPLAIN ANALYZE SELECT * FROM bench_hstore
WHERE hstore(bench_hstore)->'id' = '1735';

Listing 12: Explain Analyze statement for hstore without index





Bitmap Heap Scan on bench_hstore (cost=4.27..11.33 rows=2 width=218) (actual time=0.481..0.534 rows=1 loops=1)
  Recheck Cond: (bench_hstore @> '"id"=>"1735"'::hstore)
  -> Bitmap Index Scan on hidx_2_5k (cost=0.00..4.27 rows=2 width=0) (actual time=0.308..0.308 rows=70 loops=1)
       Index Cond: (bench_hstore @> '"id"=>"1735"'::hstore)
Total runtime: 0.721 ms
(5 rows)

Listing 15: Output of Explain Analyze statement for hstore with index

Now let's have a look at the KVP tables. Remember that the schema has two different tables, but we only need the table with the attributes key and value. For the same number of array entries, a multiple of tuples is stored in the KVP table, because each key value pair entry in the array needs to be a separate tuple in the table. If we have the keys id, surname, forename, zip, and comment for an array of 2500 records and each key has an assigned value, this results in 12500 tuples in the database table. Compared to hstore this is 5 times more tuples or, to be more precise, the sum of all filled keys:

tuples = Σ values in the array, whereas value ≠ null

The database table then contains a separate tuple for each key value pair:

Database table: bench_kvp

bench_id : BIGINT | key : TEXT NOT NULL | value : TEXT





2       surname     epofod

2       forename    teer

2500    id          2500

2500    comment     lorem ipsum dolor sit amet consetetur sadipscing elitr sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat sed diam voluptua at vero eos et accusam et justo duo dolores et ea rebum stet clita kasd gubergren no sea takimata sanctus est lorem ipsum dolor sit amet

2500    forename    sorietet

Table 10: KVP table abstract
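The number of KVP tuples discussed above (one tuple per filled key) can be sketched in Python as follows. The function name and the sample records are hypothetical; empty values are modeled as None.

```python
def kvp_tuple_count(records):
    """Count KVP tuples: one per key whose value is not empty (None)."""
    return sum(
        1
        for record in records
        for value in record.values()
        if value is not None
    )


# Hypothetical records: every key filled vs. optional keys left empty.
full = {"id": "1", "surname": "a", "forename": "b", "zip": "1000", "comment": "c"}
sparse = {"id": "2500", "surname": None, "forename": "sorietet", "zip": None, "comment": "c"}

print(kvp_tuple_count([full] * 2500))   # all five keys filled in 2500 records
print(kvp_tuple_count([full, sparse]))  # optional keys left empty reduce the count
```

With all five keys filled, 2500 records expand to 12500 tuples, which matches the worst case described in the text; sparse records produce fewer.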

Taking the same key value pair we used for hstore gives the following KVP SQL statement:

EXPLAIN ANALYZE SELECT * FROM bench_kvp
WHERE bench_id = (
    SELECT bench_id FROM bench_kvp
    WHERE key = 'id' AND value = '1735'
);

Listing 16: Explain Analyze statement for KVP

which results in the following EXPLAIN output:





The difficulty with building one's own hstore-like database schema is that more sequences, and therefore more page reads from disk, are needed, which results in higher cost units. The smaller size of the tuples in bytes seems to compensate for the number of pages to be read from disk, because at first glance it looks as if, in the best case, only 60 bytes for the first sequence and 8 bytes for the second sequence are needed. That is not the whole picture: if the first read in the second sequence finds the key/value pair, then 8 bytes are consumed. With the identifier found, the first sequence reads all the tuples that match it. In our case we have 5 key/value pair combinations, which means 5 tuples in the KVP table. Each tuple consumes 60 bytes, which makes 300 bytes for all 5 tuples; adding the 8 bytes for the second sequence gives a total of 308 bytes. By comparison, hstore uses only 213 bytes in the best case.
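The byte count can be written out as a small calculation. The sketch below simply restates the figures from the text (60 bytes per KVP tuple, 8 bytes for the identifier, 213 bytes for hstore); the constant and function names are made up for illustration.

```python
KVP_TUPLE_BYTES = 60         # width of one bench_kvp tuple, as stated in the text
ID_LOOKUP_BYTES = 8          # width of the bench_id found by the second sequence
HSTORE_BEST_CASE_BYTES = 213 # best-case figure for hstore, as stated in the text
PAIRS_PER_RECORD = 5         # id, surname, forename, zip, comment


def kvp_bytes_read(pairs_per_record):
    """Bytes needed to find the identifier and then read all of its tuples."""
    return pairs_per_record * KVP_TUPLE_BYTES + ID_LOOKUP_BYTES


total = kvp_bytes_read(PAIRS_PER_RECORD)
difference = total - HSTORE_BEST_CASE_BYTES

print(total, difference)
```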

Using an index on the key attribute can speed up finding the unique identifier for a given key/value pair. Analyzing this technique shows that we get an additional sequence.

Seq Scan on bench_kvp (cost=199.48..406.88 rows=3 width=60)
    (actual time=2.268..2.730 rows=2 loops=1)
  Filter: (bench_id = $0)
  InitPlan 1 (returns $0)
    -> Bitmap Heap Scan on bench_kvp
         (cost=62.99..199.48 rows=1 width=8)
         Recheck Cond: (key = 'id'::text)
         Filter: (value = '1735'::text)
         -> Bitmap Index Scan on kvpidx
              (cost=0.00..62.99 rows=2499 width=0)
              Index Cond: (key = 'id'::text)





This factor results because no additional sequence is needed, as was the case when using only an index on the attribute key. The first index scan can directly find the unique identifier for a given key/value pair, and afterwards all tuples for that identifier are read. The number of bytes needed to gather the data is exactly the same as for the first index alternative. In contrast to the first alternative, the combined index on the attributes key and value will grow very fast, because a new index entry is needed for each key/value pair combination. It is in the nature of the key/value pair philosophy that only arbitrary, unforeseen information is stored as key/value pairs, and it is therefore very unlikely that the same key/value pair appears in the table more than once.

Seq Scan on bench_kvp (cost=8.27..215.91 rows=3 width=60)
  Filter: (bench_id = $0)
  InitPlan 1 (returns $0)
    -> Index Scan using kvpidx2 on bench_kvp
         (cost=0.00..8.27 rows=1 width=8)
         (actual time=0.048..0.049 rows=1 loops=1)
         Index Cond: ((key = 'id'::text) AND (value = '1735'::text))
Total runtime: 2.028 ms
(6 rows)

Listing 21: Output of Explain Analyze statement for KVP with combined index
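The difference between the two index alternatives can be illustrated with a small model. The Python sketch below uses plain dictionaries as stand-ins for the indices, which is an assumption made for illustration only: PostgreSQL's indices are not hash maps of this kind, and the generated rows are made up. It merely shows how many candidate tuples each alternative has to look at for the pair id = 1735, and how many distinct entries each index holds.

```python
from collections import defaultdict

KEYS = ("id", "surname", "forename", "zip", "comment")

# Made-up KVP rows: (bench_id, key, value), 5 pairs for each of 2500 records.
rows = [
    (i, key, str(i) if key == "id" else f"{key}{i}")
    for i in range(1, 2501)
    for key in KEYS
]

by_key = defaultdict(list)        # stand-in for an index on (key)
by_key_value = defaultdict(list)  # stand-in for a combined index on (key, value)
for pos, (bench_id, key, value) in enumerate(rows):
    by_key[key].append(pos)
    by_key_value[(key, value)].append(pos)

# Key-only index: every tuple with key 'id' is a candidate and must be filtered.
candidates_key_only = by_key["id"]
matches = [pos for pos in candidates_key_only if rows[pos][2] == "1735"]

# Combined index: the lookup leads straight to the matching tuple.
candidates_combined = by_key_value[("id", "1735")]

print(len(candidates_key_only), len(candidates_combined))
print(len(by_key), len(by_key_value))
```

In this model the key-only index yields every tuple with the key id as a candidate, which then still has to be filtered by value, while the combined index leads directly to the single matching tuple but holds one entry per key/value combination.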

Lastly, we need to take a brief look at the hstore implementation. For that, please refer to chapter 2.3.2 Working principle.





4.5 Conclusion

As described in the previous chapters, hstore performs much faster than the KVP schema described in chapter 2.2. Data stored as type hstore is not trapped in the database and can be migrated to another schema with a minimum of effort, because it is stored as a string in the form of an associative array. In addition, hstore provides PostgreSQL functions to transform the associative array into a column/row table, as known from every database management system. The keys of the associative array are transposed to columns, each row in the array becomes a tuple in the column/row table, and the values become the values of that tuple. Therefore, concern about a later migration should not be a reason against using hstore.
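The transposition described above can be sketched in a few lines. The following Python example does not call hstore's own functions; it only mimics the idea with dictionaries as associative arrays, and the function name to_table and the sample records are assumptions.

```python
def to_table(records):
    """Transpose associative arrays into a column list and one tuple per record."""
    # Every key that occurs in any record becomes a column.
    columns = sorted({key for record in records for key in record})
    # Each record becomes one tuple; keys it does not have are filled with None.
    rows = [tuple(record.get(column) for column in columns) for record in records]
    return columns, rows


records = [
    {"id": "1", "surname": "epofod", "forename": "Teer"},
    {"id": "2", "forename": "sorietet", "zip": "8640"},
]

columns, rows = to_table(records)
print(columns)
for row in rows:
    print(row)
```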

Also, the way hstore is implemented needs much less disk space for the indices and costs less performance. This is shown by the explain analysis in chapter 4.4 and the graphs on pages 24 and 25. The cost of reading data is much lower than that of a KVP schema. As a reminder, the cost is a factor for reading a page from disk: the higher it is, the more has to be read from disk and the slower the query will be. In addition, hstore buffers all the keys and values to provide faster reads, and alongside each buffer entry it stores the position of the key and value in the string, which is the value of type hstore in the database table that represents the associative array. That means that once hstore has found the key, it does not need to search through the string for the value, because it already knows its position in the string.
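The idea of storing positions next to the data can be sketched as follows. This is a simplified model in Python and not hstore's actual storage format: the layout, the function names pack and lookup, and the sample pairs are assumptions chosen only to show why a known position makes it unnecessary to parse the string for the value.

```python
def pack(pairs):
    """Concatenate all keys and values into one string and record their positions."""
    data = ""
    entries = []  # one (key_pos, key_len, value_pos, value_len) entry per pair
    for key, value in sorted(pairs.items()):
        key_pos = len(data)
        data += key
        value_pos = len(data)
        data += value
        entries.append((key_pos, len(key), value_pos, len(value)))
    return data, entries


def lookup(data, entries, wanted_key):
    """Find a value by comparing keys via their positions, then slice it out."""
    for key_pos, key_len, value_pos, value_len in entries:
        if data[key_pos:key_pos + key_len] == wanted_key:
            # The value's position is already known, so no parsing is needed.
            return data[value_pos:value_pos + value_len]
    return None


data, entries = pack({"id": "1735", "zip": "8640"})
print(data, entries)
print(lookup(data, entries, "zip"))
```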

It should be noted, however, that for small datasets hstore is not the preferable method of storing key/value pairs, especially for array sizes from 1 to roughly 500 records. At this size a KVP



Bibliography

Bartunov, O., Sigaev, T., & Gierth, A. (n.d.). PostgreSQL 9.0: hstore. Retrieved May 1, 2011, from http://www.postgresql.org/docs/9.0/static/hstore.html

Cambridge University Press. (n.d.). Cambridge Dictionary Online. Retrieved April 26, 2011, from http://dictionary.cambridge.org/

Gonnerman, C. (2003). Python Name Generators. Retrieved April 27, 2011, from Alderon's Tower: http://tower.newcenturycomputers.net/namegen.html

Krummenacher, R. (2009, December 21). HSR Texas Geo Database Benchmark. Retrieved June 3, 2011, from Wiki GISpunkt HSR: http://www.gis.hsr.ch/wiki/HSR_Texas_Geo_Database_Benchmark

Nasby, J. (2010, May 13). Introduction to VACUUM, ANALYZE, EXPLAIN, and COUNT. Retrieved May 27, 2011, from PostgreSQL wiki: http://wiki.postgresql.org/wiki/Introduction_to_VACUUM,_ANALYZE,_EXPLAIN,_and_COUNT

Oxford University Press. (n.d.). Oxford English Dictionary. Retrieved April 28, 2011, from http://www.oed.com

PostgreSQL Global Development Group. (n.d.). PostgreSQL: About. Retrieved May 26, 2011, from http://www.postgresql.org/About



Appendix

Benchmark with KVP index on attribute key

Benchmark of hstore and KVP, once with an index (w) and once without an index (o). For KVP, an index on the attribute key has been chosen.







Benchmark with combined KVP index

Benchmark of hstore and KVP, once with an index (w) and once without an index (o). For KVP, a combined index on the attributes key and value has been chosen.






Differences between KVP and hstore






Average SELECT time for KVP and hstore






KVP and hstore index sizes



