benchmark of kvp vs. hstore - doc

Upload: lkjdgfdsgfdh

Post on 03-Jun-2018


  • 8/12/2019 Benchmark of KVP vs. Hstore - Doc

    1/50

    Key/Value Pair versus hstore: Benchmarking Entity-Attribute-Value Structures in PostgreSQL

    HSR Hochschule für Technik Rapperswil

    Institut für Software

    Oberseestrasse 10

    Postfach 1475


    Table of Contents

    Table of Figures
    Tables
    Figures
    Listings
    List Of Abbreviations
    1 Introduction
    1.1 Project description
    1.2 Restrictions on the scope of the project
    2 Overview
    2.1 PostgreSQL
    2.2 Key Value Pair
    2.3 Hstore
    2.3.1 Functions
    2.3.2 Working principle
    2.4 Benchmark Tools
    2.4.1 Pgbench
    2.4.2 HSR Texas Geo Database Benchmark
    3 Benchmark proposal


    Appendix


    Table of Figures

    Tables

    Table 1: KVP additional information table
    Table 2: KVP table
    Table 3: Columns of a test dataset record
    Table 4: Number of dataset length
    Table 5: Input parameters for test data generator
    Table 6: Input parameters for benchmark application
    Table 7: Hardware specification of system under test
    Table 8: Software specification of system under test
    Table 9: Hstore table abstract
    Table 10: KVP table abstract
    Table 11: Hstore tuple examples

    Figures


    Figure 17: Overview of KVP with index on key and combined index against hstore
    Figure 18: 10 to 2.5K: KVP with index on key and combined index against hstore
    Figure 19: Index size overview
    Figure 20: Index size for 10 to 5000 records

    Listings

    Listing 1: Hstore data type definition
    Listing 2: Registering a PostgreSQL operator
    Listing 3: Defining a PostgreSQL function
    Listing 4: KVP Benchmark Table
    Listing 5: Hstore Benchmark Table
    Listing 6: KVP index
    Listing 7: Hstore index
    Listing 8: KVP select example
    Listing 9: KVP select example

    Li i 10 T f d d KVP SQL 12


    List Of Abbreviations

    Abbreviation   Description
    bash           Bourne-again shell
    CPU            Central Processing Unit
    DB2            Commercial relational database management system developed by IBM
    GiST           Generalized Search Tree
    KVP            Key Value Pair
    ms             Milliseconds
    MSSQL          Commercial relational database management system produced by Microsoft
    OpenFTS        Open Source Full Text Search engine
    Oracle         Commercial object-relational database management system produced by Oracle Corporation
    PgSQL          PostgreSQL, open source object-relational database
    PostGIS        Adds support for geographic objects to the PostgreSQL object-relational database


    1 Introduction

    The following chapter describes the scope of the project, its boundaries, and its restrictions. The overall goal is to benchmark the performance of key-value pairs stored in plain PostgreSQL tables against the PostgreSQL hstore data type.

    1.1 Project description

    As part of this term paper, a project was set up to benchmark PostgreSQL key-value pairs, referred to below as KVP, against PostgreSQL in combination with the hstore data type, referred to below as hstore (probably an abbreviation for hash storage structure, à la Perl hash). Hstore has been part of the PostgreSQL distribution since version 8.2 as an additional module; it provides storage for semi-structured data with GiST index access. The PostgreSQL core distribution has no notion of key-value pair (KVP) information in a single attribute. That means it is not possible to store an associative array, e.g. {surname : John, name : Smith}, in an attribute and query John's name. This additional functionality was introduced by Oleg Bartunov and Teodor Sigaev and enhanced by Andrew Gierth under the name hstore. Hstore is an extension for PostgreSQL that provides a new data type and a set of functions to store and query KVP information. Dictionaries or associative arrays are the general terms for collections of key-value pairs (KVP); they are abstract data types (ADT). They handle pairs, also known as items, consisting of keys and their corresponding values. Most modern scripting languages support dictionaries or associative arrays as a primary container type. KVP is also called the entity-attribute-value model (EAV) or the object-attribute-value model.


    2 Overview

    Today PostgreSQL has a huge community, not only because it is free, but also because it has many extensions, such as the geospatial extension PostGIS or the hstore module mentioned before. The following chapter first describes what PostgreSQL is, then explains the difference between key-value pairs (KVP) as a table structure and KVP using the hstore data type.

    Subchapters 2.2 Key Value Pair and 2.3 Hstore show that KVP stores the key and one or more related values in different table columns, whereas hstore introduces a new abstract data type that stores an associative array, in the form of unique keys and related values, within a single table column. KVP is suitable for easy data storage and data capture, for rows with many attributes that are rarely examined, and for semi-structured data.

    2.1 PostgreSQL

    PostgreSQL is an open source relational database, or more precisely an object-relational database according to the PostgreSQL Global Development Group. Over its life of more than 15 years, PostgreSQL has earned a proven standing in different application fields. This is because it implements a set of capabilities well known from proprietary software vendors such as Oracle, IBM DB2, or Microsoft, and it provides further features such as scalability, maintainability, and asynchronous replication. This and much more puts PostgreSQL in the position of a real competitor to proprietary software vendors in companies of different sizes; as of today, PostgreSQL is an enterprise-class database (PostgreSQL Global Development Group).


    enhanced with an identifier and an additional table is needed to store additional information. From this it follows that the base schema of the KVP structure looks like the following.

    Table: bench_kvp_info

    id : Integer | attribute_1 : Text | attribute_n : Text

    Table 1: KVP additional information table

    Table: bench_kvp

    id_fk : Integer | key : Text | value : Text

    Table 2: KVP table

    The bench_kvp_info table holds a unique identifier for a specific entity together with its fixed data. For example, it could hold information about a restaurant such as street, postal code, phone number, and so on. The bench_kvp table additionally stores information that is not foreseeable, such as details that describe the restaurant more specifically: the type of cuisine, a URL to its homepage, and so on.

    Key-value pairs are information that specifies an entity more exactly but is not necessarily mandatory for all data in the information table. This structure makes it easy to add new non-mandatory information without touching the table schema.

    In PostgreSQL it can be set up as follows:

    CREATE TABLE bench_kvp_info (
        id integer PRIMARY KEY,


    Inserting a tuple is as easy as creating an attribute of type hstore:

    INSERT INTO bench_hstore (kvp_hstore) VALUES (
        hstore('id=>1, surname=>McNeal, forename=>Bob')
    );
    INSERT INTO bench_hstore (kvp_hstore) VALUES (
        hstore('id=>2, surname=>Gates')
    );

    Figure 3: hstore insert SQL example

    2.3.1 Functions

    As the example above shows, the length of the array may vary from tuple to tuple. It is important to note that in each associative array a key and its value are joined by => and the pairs are separated by commas, e.g.

    hstore(id=>2, surname=>Gates)

    hstore(key1 => value1, ..., keyN => valueN)

    means that we have two different unique keys, id and surname, and each unique key has a value: for id it is 2 and for surname it is Gates. Unique means that a key can be defined only once in a tuple. For example, the key surname can appear only once in the same tuple; the following hstore is not allowed:


    Or maybe you want to know all possible keys in an hstore:

    SELECT skeys(kvp_hstore) AS keys FROM bench_hstore GROUP BY keys;

    To get each key only once in the result list, a GROUP BY clause on keys needs to be added to the statement.
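The effect of listing every key exactly once can be illustrated outside the database. The following Python sketch is only an analogy, assuming each hstore value is modeled as a plain dict; the sample rows and the helper name distinct_keys are hypothetical and not part of the benchmark.

```python
def distinct_keys(records):
    """Return every key that occurs in at least one record, each exactly once."""
    keys = set()
    for record in records:
        keys.update(record.keys())
    return sorted(keys)


# Hypothetical rows of bench_hstore, modeled as dicts of varying length.
rows = [
    {"id": "1", "surname": "McNeal", "forename": "Bob"},
    {"id": "2", "surname": "Gates"},
]

print(distinct_keys(rows))
```

The set plays the role of the GROUP BY clause: a key that occurs in many tuples is reported only once.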

    2.3.2 Working principle

    Hstore is implemented in C as a PostgreSQL add-on and provides an SQL script to install the data type and all its PostgreSQL functions. Hstore tries to build a buffer over all the keys in the hstore value, provided they are in alphabetical order. If they are not, some special functions sort the array into alphabetical order. The hstore data type is defined as follows:

    CREATE TYPE hstore (
        INTERNALLENGTH = -1,
        INPUT = hstore_in,
        OUTPUT = hstore_out,
        RECEIVE = hstore_recv,
        SEND = hstore_send,
        STORAGE = extended
    );

    Listing 1: Hstore data type definition

    The important parameter is INPUT, which is linked to a C function. The hstore_in function parses the hstore string into a C structure that holds the key, the value, and the lengths of the key and value, as well as the position in the array. The position is needed because the array is not really


    For more information, see the official PostgreSQL hstore documentation.

    2.4 Benchmark Tools

    Two programs are currently worth mentioning for benchmarking PostgreSQL. Both pgbench and the HSR Texas Geo Database Benchmark run SQL statements in sequential mode against the database under test. For the tests proposed in this paper, a custom benchmark tool has been written to examine the stated hypotheses.

    2.4.1 Pgbench

    Pgbench is shipped in the PostgreSQL distribution package and runs tests on a PostgreSQL instance in sequential mode. Sequential mode means that the same SQL statements are run over and over, possibly in multiple concurrent database sessions, which exercises the multi-process architecture. At the end of the benchmark it reports the average number of transactions per second. Pgbench provides its own scripting language to customize the test scripts for custom data sets and test SQL statements. In addition, it includes some industry-standard test cases, which let you compare PostgreSQL with other database products (Smith, 2010, p. 189).

    Custom scripts in pgbench allow you to create your own test scenarios. They can handle statements with variables, similar to what Java calls prepared statements. It is possible to define a SELECT statement like this:


    tor the behavior. For all these queries, the program provides test data that comes from Texas, USA. A test script looks like this:

    SELECT count(*) FROM {dataset polygons} pg
    WHERE ST_Intersects(@bbox, pg.geo);

    This SELECT statement counts all polygons that intersect with a given bounding box @bbox.

    In general, the benchmark program is based on a cube. Different queries can be run on different systems using different datasets. Each dataset is installed on each system, and all queries are run on all systems once per dataset. This guarantees that the different systems can be compared, because all of them use the same data and statements.

    Figure 4: HSR Texas Geo Database Benchmark Cube. Source: (Krummenacher, 2009)
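The cube can be read as the Cartesian product of its three axes. The following Python sketch illustrates this under stated assumptions: the axis labels are hypothetical placeholders, not the queries, systems, or datasets of the actual HSR benchmark.

```python
from itertools import product

# Hypothetical axis labels; the real benchmark defines its own.
queries = ["count_intersecting_polygons", "another_query"]
systems = ["system_a", "system_b"]
datasets = ["dataset_small", "dataset_large"]

# Every query is run on every system for every dataset.
runs = list(product(queries, systems, datasets))

for query, system, dataset in runs:
    print(f"run {query} on {system} with {dataset}")
```

Because each cell of the cube uses the same data and statements, two systems can be compared cell by cell.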


    3 Benchmark proposal

    Before we can look at the benchmarking utility and the results, we first need to consider what the ingredients of a benchmark are. Benchmarking can be divided into three different processes, which are described in this chapter.

    3.1 Terms

    The term benchmark is closely related to the term test. The Cambridge dictionary defines benchmark as follows:

    "a level of quality which can be used as a standard when comparing other things" (Cambridge University Press).

    "That by which the existence, quality, or genuineness of anything is or may be determined; [...]" (Oxford University Press).

    The definitions suggest that two different things from approximately the same domain need to be contrasted with each other. That means the things need to be converted into a form that makes them comparable. At this point the term test comes into play, which performs exactly this transformation:

    "an act of using something to find out whether it is working correctly or how effective it is" (Cambridge University Press)


    So some questions need to be considered here:

    What exactly do I want to test?

    What should the data be, and can they be transformed in the test process?

    Does the data fit into the given environment?

    Etc.

    All these questions are fundamental and often seem very easy to answer at first. However, finding the right data that fit into the environment and the test process is not that easy, e.g.

    the test data has a wrong encoding and cannot be loaded,

    the data does not cover the whole test design, so no significant result can be achieved,

    etc.

    3.3 Execution phase

    The test execution phase defines the design of the chosen test technique. In general, it can be divided into the following fields:

    Load Testing: measures and establishes benchmarks for the system under test by pushing transactions to the system. The load can be incremental or can be a set amount that is proportional to the values of the system.

    Performance Testing: is run repeatedly until acceptable performance levels are achieved through database tuning activities.


    3.5 Performance Benchmark Design

    To benchmark KVP and hstore in PostgreSQL, a performance test was chosen. In this benchmark it is not important how stable and scalable PostgreSQL is; what matters is how KVP and hstore perform on a given preconfigured PostgreSQL environment.

    3.5.1 Table Schema

    As described in chapters 2.2 Key Value Pair and 2.3 Hstore, the table schemas need to be defined in such a way that the comparison between KVP and hstore is fair. The goal of the schema definitions is that both represent an associative array for the data that needs to be stored. It is not important to have an identical representation of the key-value pairs in the database; what matters is the kind of information, and its granularity, that needs to be stored and queried. This means that the data are not foreseeable, in the sense that additional information could be provided for a specific data record.

    In this benchmark we use the following table schemas to represent the associative array in a database table.

    For KVP

    CREATE TABLE bench_kvp_id (bench_id BIGINT PRIMARY KEY);
    CREATE TABLE bench_kvp (
        bench_id BIGINT REFERENCES bench_kvp_id(bench_id),
        key TEXT NOT NULL,


    CREATE INDEX kvpidx1 ON bench_kvp (key);
    CREATE INDEX kvpidx2 ON bench_kvp (key, value);

    Listing 6: KVP index

    Indexes for KVP are tested in two different ways: first with a single index on the key attribute, and second with a combined index on the attributes key and value.

    For hstore an index can be created as follows:

    CREATE INDEX hidx ON bench_hstore USING GIST(bench_hstore);

    Listing 7: Hstore index

    3.5.2 Statements

    To query a tuple based on a key-value pair, KVP and hstore each need their own SELECT statement. Because KVP needs a new tuple for each key-value pair, we first have to find the unique identifier belonging to the key-value pair, and then we can select the information we need. This example selects all the information of a person with the surname McNeal:

    SELECT * FROM bench_kvp WHERE bench_id = (
        SELECT bench_id FROM bench_kvp
        WHERE key = 'surname' AND value = 'McNeal');

    Listing 8: KVP select example
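The two-step KVP lookup can be tried without a PostgreSQL server. The following Python sketch uses the standard-library sqlite3 module as a stand-in, which is an assumption made purely for illustration: the benchmark itself targets PostgreSQL, the value column and the sample rows are hypothetical, and SQLite differs from PostgreSQL in types and error handling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bench_kvp_id (bench_id INTEGER PRIMARY KEY);
    CREATE TABLE bench_kvp (
        bench_id INTEGER REFERENCES bench_kvp_id(bench_id),
        key TEXT NOT NULL,
        value TEXT
    );
""")

# Hypothetical sample data: one tuple per key-value pair.
conn.executemany("INSERT INTO bench_kvp_id VALUES (?)", [(1,), (2,)])
conn.executemany(
    "INSERT INTO bench_kvp VALUES (?, ?, ?)",
    [(1, "surname", "McNeal"), (1, "forename", "Bob"), (2, "surname", "Gates")],
)

# Step 1 (inner query): find the identifier that owns the key-value pair.
# Step 2 (outer query): fetch all pairs that belong to this identifier.
query = """
    SELECT key, value FROM bench_kvp WHERE bench_id = (
        SELECT bench_id FROM bench_kvp
        WHERE key = ? AND value = ?)
    ORDER BY key
"""
person = dict(conn.execute(query, ("surname", "McNeal")).fetchall())
print(person)
```

One design point is worth noting: in PostgreSQL a scalar subquery that returns more than one row raises an error, so the statement in this form only works when the searched key-value pair identifies exactly one entity; otherwise IN would be needed instead of =.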

    By using hstore we need first to convert the attribute which stores the hstore string into a hstore


    surname : Text     Mandatory. A fancy name.
    forename : Text    Optional. A fancy name. Can be empty to have a variable KVP length.
    zip : Integer      Optional. A number between 1000 and 9000. Can be empty to have a variable KVP length.
    comment : Text     Optional. A dummy text. Can be empty to have a variable KVP length.

    Table 3: Columns of a test dataset record

    An excerpt of a test file looks as follows:

    id,forename,surname,zip,comment
    1,cucyp,,6593,lorem ipsum dolor sit amet consetetur sadipscing elitr sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat sed diam voluptua at vero eos et accusam et justo duo dolores et ea rebum stet clita kasd gubergren no sea takimata sanctus est lorem ipsum dolor sit amet
    2,kasarzyc,ecnalehad,8463,
    3,inwa,,,

    Figure 5: Example of a test data file

    The test data records need to be transformed into valid KVP SQL statements like this:

    INSERT INTO bench_kvp_info(id) VALUES (1);
    INSERT INTO bench_kvp(id, key, value) VALUES (1, 'id', '1');
    INSERT INTO bench_kvp(id, key, value) VALUES (1, 'forename', 'cucyp');
    INSERT INTO bench_kvp(id, key, value) VALUES (1, 'zip', '6593');
    INSERT INTO bench_kvp(id, key, value) VALUES (1, 'comment', 'lorem ipsum dolor sit amet consetetur sadipscing elitr ...');
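The transformation from a CSV record to KVP rows can be sketched in a few lines of Python. This is a minimal illustration, not the paper's generator or adapter code: the helper name record_to_kvp_rows is hypothetical, and the sketch assumes that empty fields are simply skipped, which is what gives each record its variable KVP length.

```python
import csv
import io


def record_to_kvp_rows(record):
    """Turn one CSV record into (id, key, value) rows, skipping empty fields."""
    entity_id = int(record["id"])
    return [(entity_id, key, value) for key, value in record.items() if value]


# Two records taken from the example test data file.
sample = io.StringIO(
    "id,forename,surname,zip,comment\n"
    "2,kasarzyc,ecnalehad,8463,\n"
    "3,inwa,,,\n"
)

for record in csv.DictReader(sample):
    for row in record_to_kvp_rows(record):
        print(row)
```

In a real loader, rows like these would typically be passed to the database driver as query parameters rather than pasted into SQL strings, so that quoting of values such as the comment text is handled by the driver.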


    Datasets

    10 records 100 records 500 records

    1000 records 2500 records 5000 records

    10000 records 20000 records 35000 records

    50000 records 100000 records 250000 records

    Table 4: Number of dataset length

    Additionally, for KVP and hstore one test cycle, meaning test phase and benchmark phase, takes place once with a database index and once without for each dataset length. With the 12 dataset lengths of Table 4, this gives 48 different configurations:

    Figure 6: Number of test cycles

    # of length is the number of different dataset lengths that need to be run; # of types is the number of different data sources that need to be tested, in our case KVP and hstore. # of indices means that the test is run once with indexed data and once without. Because every configuration is executed three times (warm start), this results in 144 test cycles.
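The cycle count can be checked by multiplying out the factors named for Figure 6. The following Python sketch takes the dataset lengths from Table 4 and assumes three runs per configuration (two warm-up runs plus the measured run), as suggested by the warm start factor; the variable names are illustrative only.

```python
from itertools import product

# Dataset lengths as listed in Table 4.
lengths = [10, 100, 500, 1000, 2500, 5000,
           10000, 20000, 35000, 50000, 100000, 250000]
types = ["kvp", "hstore"]
indexed = [False, True]
runs_per_configuration = 3  # assumption: two warm-up runs plus the measured run

configurations = list(product(lengths, types, indexed))
total_cycles = len(configurations) * runs_per_configuration

print(len(configurations), total_cycles)
```

Under these assumptions there are 12 x 2 x 2 = 48 configurations and 48 x 3 = 144 cycles.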

    3.6 Test Application

    For benchmarking KVP and hstore, a custom application has been written in Python. It supports all three phases: Generate / Preprocessing, Execution, and Benchmark / Analysis. It has been written

    12 [# of length] × 2 [# of types] × 2 [# of indices] × 3 [warm start] = 144 [cycles]


    Figure 7: Test / benchmark application incl. test data generator

    [Diagram with the components: Data generator, .csv data set, Benchmark, Task queue, Response queue, n standalone processes (Adapters) for hstore and KVP, Log and Graphs.]


    Parameter               Description
    -t or --type            String: Type of database test. Currently pgsql and pgsqlhstore are supported.
    -x or --processes       Integer: The number of parallel processes that should be allocated. If this parameter is not set, the software tries to determine the maximum number of parallel processes of the CPU architecture.
    -s or --server          String: Server or hostname where the database runs, e.g. localhost.
    -p or --port            Integer: Port of the database server.
    -d or --database        String: Database name.
    -u or --user            String: A user who has the rights to create tables, do insertions, and run queries.
    -w or --password        String: User's password.
    -a or --data            String: File that includes all test records.
    -i or --index           Boolean: Defines whether an index on the tables should be allocated or not.
    -n or --no-hot-start    Boolean: Defines whether a hot start is required. If it is not set, the test runs 2 times before the transaction time is measured in the 3rd round.
    -l or --log             Boolean: Whether a log file should be created or not.
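A command-line interface with these parameters could be declared with Python's argparse module as sketched below. This is not the author's code. The short option -w for the password is taken from the example call in step_3.sh, because -p is already used for the port; the default values for server and port and the helper name build_parser are assumptions.

```python
import argparse


def build_parser():
    """Declare the benchmark parameters from the table (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="KVP vs. hstore benchmark (sketch)")
    parser.add_argument("-t", "--type", choices=["pgsql", "pgsqlhstore"], required=True)
    parser.add_argument("-x", "--processes", type=int, default=None)
    parser.add_argument("-s", "--server", default="localhost")  # assumed default
    parser.add_argument("-p", "--port", type=int, default=5432)  # assumed default
    parser.add_argument("-d", "--database")
    parser.add_argument("-u", "--user")
    parser.add_argument("-w", "--password")
    parser.add_argument("-a", "--data")
    parser.add_argument("-i", "--index", action="store_true")
    parser.add_argument("-n", "--no-hot-start", action="store_true")
    parser.add_argument("-l", "--log", action="store_true")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is self-contained.
args = build_parser().parse_args([
    "-t", "pgsqlhstore", "-s", "localhost", "-p", "5432", "-d", "benchmark",
    "-u", "benchmark", "-w", "benchmark", "-a", "data/testdata_10.csv", "-l", "-n",
])
print(args.type, args.port, args.index, args.no_hot_start)
```

Boolean parameters are modeled as flags that are false unless given, which matches the way -i, -l, and -n are used without values in the example scripts.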


    4 Benchmark May 2011

    The following chapter specifies the hardware and software used for the system under test and describes the results as well as the findings.

    4.1 Technical specification

    The server on which KVP and hstore are tested has the following hardware specification:

    Type                 Comment
    Processor
      CPU                Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
      Instruction set    64-bit
      # of cores         4
      # of threads       8
      # of CPUs          2
    Memory
      Total              24732476 kB (about 24 GB)
      Speed              1066 MHz
      Idle mode          Only Ubuntu and PostgreSQL are running on the same hard disk; 23677164 kB (about 23 GB) of RAM is free


    4.2.1 Preparing Database

    To prepare the database, you first need to log in as a user that has the privilege to create a database, such as the user postgres:

    sudo su postgres

    As the postgres user you can run the first script, step_1.sh, which creates a new database user benchmark and a database called benchmark, and runs the hstore script that installs the hstore data type and a set of PostgreSQL functions.

    ./step_1.sh

    The content of the step_1.sh script is as follows:

    #!/bin/bash
    # first login as user with privilege to create a database e.g. sudo su postgres

    # create new user benchmark
    createuser -l -D -R -S benchmark

    # alter user's password
    psql -U postgres -c "ALTER USER benchmark WITH PASSWORD 'benchmark'"

    # create new database benchmark
    createdb -U postgres benchmark

    # create language plpgsql on benchmark database
    createlang -U postgres -d benchmark plpgsql


    # create test data sets
    # - for 500
    python generator.py -a 500 -t words -l 50
    mv data/testdata.csv data/testdata_500.csv
    # etc.

    Figure 9: Test data generation script

    4.2.3 Executing Benchmark

    Now all prerequisites are fulfilled and the benchmark can be started. An example script is available for this step as well. Run the following command to benchmark KVP and hstore, once with an index and once without, based on the datasets generated by the previous script.

    ./step_3.sh

    The step_3.sh script includes the following statements.

    #!/bin/bash
    ############
    # for hstore
    ############
    # - for 10
    # - without index
    python benchmark.py -t pgsqlhstore -s localhost -p 5432 -d benchmark -u benchmark -w benchmark -a data/testdata_10.csv -l -g -n
    mv output/1.png output/hstore_10_1.png


    mv output/log_summary.csv output/kvp_10_log_summary.csv
    psql -d benchmark -c "EXPLAIN ANALYZE SELECT * FROM bench_kvp WHERE bench_id = (SELECT bench_id FROM bench_kvp WHERE key = 'id' AND value = '7');" > output/analyze.log
    mv output/analyze.log output/kvp_10_analyze.log

    # - with index
    python benchmark.py -t pgsql -s localhost -p 5432 -d benchmark -u benchmark -w benchmark -a data/testdata_10.csv -i -l -g -n
    mv output/1.png output/kvp_10_index_1.png
    mv output/2.png output/kvp_10_index_2.png
    mv output/log.csv output/kvp_10_index_log.csv
    mv output/log_summary.csv output/kvp_10_index_log_summary.csv
    psql -d benchmark -c "EXPLAIN ANALYZE SELECT * FROM bench_kvp WHERE bench_id = (SELECT bench_id FROM bench_kvp WHERE key = 'id' AND value = '7');" > output/analyze.log
    mv output/analyze.log output/kvp_10_index_analyze.log
    # etc.

    Figure 10: Benchmark script example
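Since the script repeats the same call for every combination of type, dataset, and index setting, the command lines can also be generated. The following Python sketch only builds the argument lists and does not execute anything; the flags and connection values are copied from the example calls in step_3.sh, while the helper name benchmark_command is hypothetical.

```python
def benchmark_command(db_type, records, with_index):
    """Build the argument list for one benchmark.py run (flags as in step_3.sh)."""
    command = [
        "python", "benchmark.py", "-t", db_type,
        "-s", "localhost", "-p", "5432", "-d", "benchmark",
        "-u", "benchmark", "-w", "benchmark",
        "-a", f"data/testdata_{records}.csv",
    ]
    if with_index:
        command.append("-i")  # only present for the indexed runs
    command += ["-l", "-g", "-n"]
    return command


for db_type in ("pgsqlhstore", "pgsql"):
    for with_index in (False, True):
        print(" ".join(benchmark_command(db_type, 10, with_index)))
```

A list of arguments in this form could be handed to subprocess.run, which avoids shell quoting issues; the renaming of the output files would still have to follow each run.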

    4.3 Results

    The test was executed in May 2011 based on the test design described in chapter 3 and the hardware specification in chapter 4.1. All test logs were aggregated into a single file showing the start and end time as well as the duration and the average time in seconds per SELECT statement. The detailed aggregation and the full-extent dia-


    Figure 11: Overview KVP vs. hstore benchmark

    From 10 to approximately 500 records, KVP is much faster at querying a key value pair. Beyond
    that, hstore demonstrates its strength, especially with more than 2500 tuples in the hstore
    table: hstore is then faster by a factor of 4.04 when the KVP table uses an index, and by a
    factor of 7.9 when it does not. This makes KVP sound like a performance killer, which is not
    true, because if we look at the absolute querying time per SELECT statement, then KVP needs in


    The situation changes if we use a combined index on the attributes key and value for the KVP
    table. Hstore is still slightly faster than the KVP schema, but the difference in average
    SELECT transaction time compared to the indexed KVP table shrinks considerably. Overall, the
    combined index comes very close to the hstore schema. Although the combined index gives a
    performance boost, the underlying problem remains: the more tuples we store in the KVP table,
    the larger the difference between hstore and KVP becomes. We conclude that the more arbitrary
    data needs to be stored, the faster the combined index grows and the longer PostgreSQL needs
    to find the right key value pair combination. More on that in chapter 4.4 Findings.


    Figure 14: Benchmark KVP hstore from 10 to 2.5K with combined index

    Taking a closer look at the data sets between 10 and 2500 tuples, we see that the average
    transaction time on a data set of 2500 tuples shrank from 0.01512 seconds (15.12 ms) to
    0.00948 seconds (9.48 ms). The combined index is therefore faster by a factor of about 1.6.
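The factor quoted above follows directly from the two measured averages. As a small worked check in Python (the function and variable names are illustrative only; the two timings are the ones given in the text):

```python
def speedup(before_s, after_s):
    """Ratio of the old average time to the new average time."""
    return before_s / after_s


# Average time per SELECT on the 2500-tuple data set, as quoted in the text.
before_s = 0.01512  # seconds, before switching to the combined index
after_s = 0.00948   # seconds, with the combined index

factor = speedup(before_s, after_s)
print(f"{factor:.2f}")  # prints 1.59, i.e. roughly 1.6
```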


    Small data sets show the opposite picture. From 10 to 500 records, the difference between
    hstore and KVP with a combined index is negative, which means that KVP is faster than hstore.
    From roughly 500 records upwards, hstore is faster, even if only slightly.

    Figure 16: 10 to 2.5K: Difference between KVP single and combined index


    Figure 18: 10 to 2.5K: KVP with index on key and combined index against hstore

    Analyzing the indices of hstore and KVP shows that KVP needs considerably more disk space to
    build its index. This holds both for an index on the attribute key and for a combined index on
    the attributes key and value, where the combined index needs more disk space than the index on
    the attribute key


    The difference is especially significant at 1000 tuples and more. At 1000 tuples, KVP with an
    index on the attribute key needs 3.73 times more disk space than the GiST index on an hstore,
    and 3.55 times more with the combined index.

    Figure 20: Index size for 10 to 5000 records


    duo dolores et ea rebum stet clita kasd gubergren no sea takimata sanctus est lorem ipsum dolor sit amet ", "surname"=>"ebsaveq", "forename"=>"maeznidus"

    2       "id"=>"2", "zip"=>"6489", "comment"=>"lorem ipsum dolor sit amet consetetur sadipscing elitr sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat sed diam voluptua at vero eos et accusam et justo duo dolores et ea rebum stet clita kasd gubergren no sea takimata sanctus est lorem ipsum dolor sit amet ", "surname"=>"epofod", "forename"=>"teer"

    2500    "id"=>"2500", "comment"=>"lorem ipsum dolor sit amet consetetur sadipscing elitr sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat sed diam voluptua at vero eos et accusam et justo duo dolores et ea rebum stet clita kasd gubergren no sea takimata sanctus est lorem ipsum dolor sit amet ", "forename"=>"sorietet"

    Table 9: Hstore table abstract

    To understand the results, we analyze an example query:

    EXPLAIN ANALYZE SELECT * FROM bench_hstore
    WHERE hstore(bench_hstore)->'id' = '1735';

    Listing 12: Explain Analyze statement for hstore without index


    Bitmap Heap Scan on bench_hstore
        (cost=4.27..11.33 rows=2 width=218) (actual time=0.481..0.534 rows=1 loops=1)
      Recheck Cond: (bench_hstore @> '"id"=>"1735"'::hstore)
      -> Bitmap Index Scan on hidx_2_5k
           (cost=0.00..4.27 rows=2 width=0)
           (actual time=0.308..0.308 rows=70 loops=1)
           Index Cond: (bench_hstore @> '"id"=>"1735"'::hstore)
    Total runtime: 0.721 ms
    (5 rows)

    Listing 15: Output of Explain Analyze statement for hstore with index

    Now let's have a look at the KVP tables. Remember that the schema has two different tables,
    but we only need the table with the attributes key and value. For the same number of array
    entries, the KVP table stores a multiple of tuples, because each key value pair entry in the
    array needs a separate tuple in the table. If we have the keys id, surname, forename, zip, and
    comment for an array of 2500 entries, and each key in this array has an assigned value, this
    results in 12500 tuples in the database table. Compared to hstore, that is 5 times more tuples
    or, to be more precise, the sum of all filled keys:

    tuples = Σ values in the array, where value ≠ null

    The database table then contains a separate tuple for each key value pair:

    Database table: bench_kvp

    bench_id : BIGINT    key : TEXT NOT NULL    value : TEXT


    2       surname     epofod
    2       forename    teer
    2500    id          2500
    2500    comment     lorem ipsum dolor sit amet consetetur sadipscing elitr sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat sed diam voluptua at vero eos et accusam et justo duo dolores et ea rebum stet clita kasd gubergren no sea takimata sanctus est lorem ipsum dolor sit amet
    2500    forename    sorietet

    Table 10: KVP table abstract
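The relationship between one record and its KVP tuples can be sketched in a few lines of Python. This is only an illustration of the counting rule (one tuple per filled key), not the benchmark's loader. The helper `to_kvp_rows` is hypothetical, the comment text is shortened, and representing the unfilled keys of record 2500 as `None` is an assumption.

```python
def to_kvp_rows(bench_id, record):
    """Return one (bench_id, key, value) tuple per key that has a value."""
    return [(bench_id, key, value)
            for key, value in record.items()
            if value is not None]


# Record 2500 as it appears in Tables 9 and 10: only id, comment and
# forename are filled. Missing keys are modelled as None (assumption).
record_2500 = {
    "id": "2500",
    "zip": None,
    "comment": "lorem ipsum dolor sit amet ...",  # shortened
    "surname": None,
    "forename": "sorietet",
}

rows = to_kvp_rows(2500, record_2500)
for row in rows:
    print(row)

# Upper bound from the text: 2500 records with all 5 keys filled.
max_tuples = 2500 * 5
print(len(rows), max_tuples)
```

In hstore the same record occupies a single tuple, while the KVP table needs three tuples for it and up to five for a fully filled record.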

    Taking the same key value pair we used for hstore gives the following KVP SQL statement:

    EXPLAIN ANALYZE SELECT * FROM bench_kvp
    WHERE bench_id = (
        SELECT bench_id FROM bench_kvp
        WHERE key = 'id' AND value = '1735'
    );

    Listing 16: Explain Analyze statement for KVP

    which results in the following EXPLAIN output:


    The difficulty with building your own hstore-like database schema is that more sequences, and
    therefore more page reads from disk, are needed, which results in higher cost units. The
    smaller size of the tuples in bytes only appears to compensate for the number of pages to be
    read from disk. At first glance it looks as if, in the best case, only 60 bytes for the first
    sequence and 8 bytes for the second sequence are needed. That is not the case: if the first
    read in the second sequence finds the key value pair, 8 bytes are consumed. With the
    identifier found, the first sequence then reads all tuples that match it. In our case we have
    5 key value pair combinations, which means 5 tuples in the KVP table. Each tuple consumes 60
    bytes, which is 300 bytes for all 5 tuples; adding the 8 bytes for the second sequence gives a
    total of 308 bytes. In comparison, hstore uses only 213 bytes in the best case.
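The byte count in this paragraph is simple arithmetic and can be written out as a short Python check. The widths are the figures quoted in the text; the constant and function names are illustrative only.

```python
# Tuple widths in bytes, as quoted in the text.
KVP_TUPLE_WIDTH = 60       # one key value tuple in bench_kvp
KVP_ID_WIDTH = 8           # the bench_id returned by the inner lookup
HSTORE_TUPLE_WIDTH = 213   # one hstore tuple in the best case


def kvp_bytes_read(pairs_per_record):
    """Bytes read for one record: all of its KVP tuples plus the id lookup."""
    return pairs_per_record * KVP_TUPLE_WIDTH + KVP_ID_WIDTH


kvp_total = kvp_bytes_read(5)
print(kvp_total, HSTORE_TUPLE_WIDTH)
print(round(kvp_total / HSTORE_TUPLE_WIDTH, 2))
```

With 5 pairs per record the KVP lookup reads 308 bytes against 213 bytes for hstore, so the narrower tuples do not pay off once every tuple of the record has to be fetched.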

    Using an index on the key attribute can speed up finding the unique identifier for a given key
    value pair. Analyzing this technique shows that we have an additional sequence.

    Seq Scan on bench_kvp (cost=199.48..406.88 rows=3 width=60)
        (actual time=2.268..2.730 rows=2 loops=1)
      Filter: (bench_id = $0)
      InitPlan 1 (returns $0)
        -> Bitmap Heap Scan on bench_kvp
             (cost=62.99..199.48 rows=1 width=8)
             (actual time=0.925..1.227 rows=1 loops=1)
             Recheck Cond: (key = 'id'::text)
             Filter: (value = '1735'::text)
             -> Bitmap Index Scan on kvpidx
                  (cost=0.00..62.99 rows=2499 width=0)
                  (actual time=0.373..0.373 rows=2500 loops=1)
                  Index Cond: (key = 'id'::text)


    This factor arises because no additional sequence is needed, as was the case when using only
    an index on the attribute key. The first index scan can directly find the unique identifier
    for a given key value pair and then read all tuples for that identifier. The bytes needed to
    gather the data are exactly the same as for the first index alternative. In contrast to the
    first alternative, the combined index on the attributes key and value will grow very fast,
    because each key value pair combination needs a new entry in the index. It is in the nature of
    the key value pair philosophy that only arbitrary, unforeseen information is stored as key
    value pairs, so it is very unlikely that an identical key value pair appears in the table
    twice.

    Seq Scan on bench_kvp (cost=8.27..215.91 rows=3 width=60)
        (actual time=1.376..1.954 rows=5 loops=1)
      Filter: (bench_id = $0)
      InitPlan 1 (returns $0)
        -> Index Scan using kvpidx2 on bench_kvp
             (cost=0.00..8.27 rows=1 width=8)
             (actual time=0.048..0.049 rows=1 loops=1)
             Index Cond: ((key = 'id'::text) AND (value = '1735'::text))
    Total runtime: 2.028 ms
    (6 rows)

    Listing 21: Output of Explain Analyze statement for KVP with combined index

    Lastly, we need to take a brief look at the hstore implementation. For that, please refer to
    chapter 2.3.2 Working principle.


    4.5 Conclusion

    As described in the previous chapters, hstore performs much faster than the KVP schema
    described in chapter 2.2. Data stored as type hstore is not trapped in the database and can be
    migrated to another schema with minimal effort, because it is stored as a string in the form
    of an associative array. In addition, hstore provides PostgreSQL functions to transform the
    associative array into a conventional table of columns and rows, as known from every database
    management system. The keys in the associative array are transposed to columns, each row in
    the array becomes a tuple in that table, and the values become the values of the tuple.
    Therefore, concern about a later migration should not be a reason to avoid hstore.

    In addition, the way hstore is implemented needs much less disk space for the indices and
    costs less performance. This is shown by the explain analysis in chapter 4.4 and the graphs on
    pages 24 and 25. The cost of reading data is much lower than with a KVP schema. As a reminder,
    the cost is a factor for reading a page from disk: the higher it is, the more has to be read
    from disk and the slower the query will be. Furthermore, hstore buffers all keys and values to
    provide faster reads, and along with each buffer entry it stores the position of the key and
    value in the string of type hstore that represents the associative array in the database
    table. This means that once hstore has found the key, it does not need to search the string
    for the value, because it already knows its position.

    It should be noted, however, that for small data sets hstore is not the preferable method to
    store key value pairs, especially with an array size of 1 to roughly 500 records. At this size
    a KVP


    Bibliography

    Bartunov, O., Sigaev, T., & Gierth, A. (n.d.). PostgreSQL 9.0: hstore. Retrieved May 1, 2011, from

    http://www.postgresql.org/docs/9.0/static/hstore.html

    Cambridge University Press. (n.d.). Cambridge Dictionary Online. Retrieved April 26, 2011, from

    http://dictionary.cambridge.org/

    Gonnerman, C. (2003). Python Name Generators. Retrieved April 27, 2011, from Alderon's Tower:

    http://tower.newcenturycomputers.net/namegen.html

    Krummenacher, R. (2009, December 21). HSR Texas Geo Database Benchmark. Retrieved June 3,

    2011, from Wiki GISpunkt HSR:

    http://www.gis.hsr.ch/wiki/HSR_Texas_Geo_Database_Benchmark

    Nasby, J. (2010, May 13). Introduction to VACUUM, ANALYZE, EXPLAIN, and COUNT. Retrieved

    May 27, 2011, from PostgreSQL wiki:

    http://wiki.postgresql.org/wiki/Introduction_to_VACUUM,_ANALYZE,_EXPLAIN,_and_COUNT

    Oxford University Press. (n.d.). Oxford English Dictionary. Retrieved April 28, 2011, from

    http://www.oed.com

    PostgreSQL Global Development Group. (n.d.). PostgreSQL: About. Retrieved May 26, 2011, from

    http://www.postgresql.org/About


    Appendix

    Benchmark with KVP index on attribute key

    Benchmark of hstore and KVP, once with index (w) and once without index (o). For KVP, an index
    on the attribute key was chosen.


    Benchmark with combined KVP index

    Benchmark of hstore and KVP, once with index (w) and once without index (o). For KVP, a
    combined index on the attributes key and value was chosen.


    Differences between KVP and hstore


    Average SELECT time for KVP and hstore


    KVP and hstore index sizes
