A Survey of Open Source Tools for Business Intelligence

Christian Thomsen and Torben Bach Pedersen
Department of Computer Science
Aalborg University

DDBW20: January 2006 2

Agenda

About the survey
Criteria
ETL tools
OLAP servers
OLAP clients
DBMSs
Summary

The EIAO Project

The European Internet Accessibility Observatory (EIAO) project will…
  … build a crawler to automatically collect data about accessibility of web resources
    For example for blind people
  … apply BI technologies to analyze the data
  … be based on open source software
More details are available at eiao.net
At Aalborg University, we build the data warehouse for the EIAO project

Motivation

The open source products are not very visible within the BI field

The data warehouse for EIAO will be based on open source tools

Survey done to get an overview of available products

The survey

We considered ETL, OLAP servers, OLAP clients, and DBMSs
  Data mining tools not considered (yet)
We searched the Internet for open source tools
  Collected data by examining source code, homepages and documentation
  Done in November–December 2004

Criteria

General
  Availability for different platforms
  License

ETL tools
  Supported data sources and targets
  ROLAP or MOLAP
  Incremental load in automatic fashion
  Methods to specify ETL process
  Supported data cleansing

Criteria – continued

OLAP servers
  ROLAP, MOLAP, HOLAP
  Handling of “large” data sets
  Supported DBMSs
  Use of aggregates
  APIs and query languages

Criteria – continued

OLAP clients
  Supported OLAP servers
  APIs and query languages
  Prescheduled reports
  Export facilities

DBMSs
  Handling of “large” data sets
  Materialized views
  Bitmap indexes
  Replication and partitioning
  Star joins

ETL tools

Considered products
  Bee 1.1.0 (June 2005: 2.0.0)
  CloverETL 1.1.2 (May 2005: pre 1.1.5)
  Octopus 3.0.1 (June 2005: 3.4.1)
Many “empty” projects on the web were disregarded
Bee uses Perl’s DBI, the others use JDBC
All the products are ROLAP oriented
There seems to be no support for automatic, incremental load
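
Since none of the surveyed tools seems to automate incremental loading, that logic would have to be written by hand. The following is only a minimal sketch of one common approach (a "high-water mark" on a monotonically increasing key), not something taken from Bee, CloverETL or Octopus. It uses Python's sqlite3 as a stand-in for both the source and the warehouse; the tables page and fact_page and their columns are hypothetical.

```python
import sqlite3

def incremental_load(src, dw):
    """Copy only source rows that are newer than the last loaded id."""
    # The watermark is the highest source id already present in the warehouse.
    (watermark,) = dw.execute(
        "SELECT COALESCE(MAX(src_id), 0) FROM fact_page"
    ).fetchone()
    rows = src.execute(
        "SELECT id, url, barriers FROM page WHERE id > ? ORDER BY id",
        (watermark,),
    ).fetchall()
    dw.executemany(
        "INSERT INTO fact_page (src_id, url, barriers) VALUES (?, ?, ?)", rows
    )
    dw.commit()
    return len(rows)

# Hypothetical source and warehouse, both in memory for the sake of the example.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE page (id INTEGER PRIMARY KEY, url TEXT, barriers INTEGER)")
src.executemany(
    "INSERT INTO page (url, barriers) VALUES (?, ?)",
    [("a.example", 3), ("b.example", 0)],
)
dw = sqlite3.connect(":memory:")
dw.execute(
    "CREATE TABLE fact_page (src_id INTEGER PRIMARY KEY, url TEXT, barriers INTEGER)"
)

first = incremental_load(src, dw)    # loads the two existing rows
src.execute("INSERT INTO page (url, barriers) VALUES (?, ?)", ("c.example", 5))
second = incremental_load(src, dw)   # loads only the newly added row
```

This approach only detects new rows. Updates and deletes in the source would need something else, for example timestamps or change logs.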

ETL tools – continued

Process specified in XML files
  Can be done by means of a GUI in Bee and Octopus
  … and soon also in CloverETL
The user must code the cleansing functionality (transformations)
  Octopus includes some basic transformations

ETL tools – continued

The open source products are (still) not as good as their commercial competitors

There is a lack of good documentation

OLAP servers

Considered products
  Lemur “0.0”
  Bee 1.1.0 (June 2005: 2.0.0)
  Mondrian 1.0.1 (April 2005: 1.1.5)
Now more products have been announced
  HydraCube
  Palo

OLAP servers – continued

Lemur is HOLAP oriented
  But is a research project and is not ready for commercial use
Bee and Mondrian are ROLAP oriented
  Bee uses MySQL
  Mondrian uses JDBC

OLAP servers – continued

Bee aims to be able to handle 50GB efficiently

Mondrian’s documentation states that it handles large data sets if the underlying DBMS does

It is not possible to choose which pre-computed aggregates to use
  Since May 2005, Mondrian has been using existing materialized views
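
To illustrate what using a pre-computed aggregate means in a ROLAP setting, here is a small sketch. It is not how Mondrian or Bee is implemented. It only shows the general idea of answering a roll-up query from an aggregate table when one exists and falling back to the fact table otherwise. The sales and agg_sales_month tables are hypothetical, and sqlite3 stands in for the underlying DBMS.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (day TEXT, month TEXT, amount INTEGER)")
db.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("2005-05-01", "2005-05", 10),
    ("2005-05-02", "2005-05", 20),
    ("2005-06-01", "2005-06", 5),
])

def table_exists(name):
    """Return True if a table with the given name exists in the database."""
    row = db.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchone()
    return row is not None

def monthly_totals():
    """Answer a month-level query, preferring the aggregate table if present."""
    if table_exists("agg_sales_month"):
        source = "aggregate"
        sql = "SELECT month, total FROM agg_sales_month ORDER BY month"
    else:
        source = "fact"
        sql = "SELECT month, SUM(amount) FROM sales GROUP BY month ORDER BY month"
    return source, db.execute(sql).fetchall()

before = monthly_totals()   # computed from the fact table
# Pre-compute the aggregate once; later queries read it directly.
db.execute(
    "CREATE TABLE agg_sales_month AS "
    "SELECT month, SUM(amount) AS total FROM sales GROUP BY month"
)
after = monthly_totals()    # same result, now read from the aggregate table
```

The aggregate table here is never refreshed, so it goes stale as soon as the fact table changes. Keeping it up to date is the job a DBMS-maintained materialized view would otherwise do.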

OLAP servers – continued

Mondrian’s API is similar to ADO MD
  Support for JOLAP and XMLA planned
Mondrian supports (to some extent) the MDX query language
Bee’s API(s) and query language(s) are not described in the documentation we found

New OLAP servers

Palo (opensourceolap.org)
  MOLAP oriented
  All data stored entirely in memory
  Preview expected at the end of August 2005
HydraCube (hydracube.sourceforge.net)
  MOLAP oriented
  Not an in-memory server
    Uses BerkeleyDB
  Supports MDX against distributed DB
  Performs aggregation at query time

OLAP clients

Considered products
  Bee 1.1.0 (June 2005: 2.0.0)
  JPivot 1.2.0 (August 2005: 1.4.0)
There seems to be no support for prescheduled reports in the products
Bee uses the Bee OLAP server
JPivot uses Mondrian (and MDX queries)
  The newest version also supports native XMLA access

OLAP clients – continued

Bee exports to Excel, PDF, PNG, PowerPoint, text, and XML formats
JPivot exports to Excel and PDF
Both support different kinds of 2D and 3D graphs
Both are used through web interfaces

OLAP clients – continued

Pictures from http://bee.insightstrategy.cz

DBMSs

Considered products
  MonetDB 4.4.2
  MySQL 4.1.2
  MaxDB 7.5
  PostgreSQL 8.0
These are the most visible
  Other open source products are available

DBMSs – continued

All of the DBMSs are designed to handle large data sets

All of them are used for BI in production environments

MaxDB, MySQL, and PostgreSQL support one-way replication
  A multiway solution is planned for PostgreSQL
No support for materialized views, bitmap indexes or star joins
  But PostgreSQL 8.1 will have support for bitmaps

DBMSs – continued

Partitioning not yet supported
  MySQL 5.1 will support partitioning
  MySQL can currently do some partitioning by using NDB Cluster
  PostgreSQL 8.1 will support partitioning into subtables
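
As a rough illustration of what partitioning into subtables buys, the sketch below routes rows into one subtable per year and answers a single-year query by touching only that subtable. It is an application-level imitation under stated assumptions, not the PostgreSQL or MySQL mechanism. The measurement_<year> tables and the helper functions are hypothetical, and sqlite3 is used only to keep the example self-contained.

```python
import sqlite3

db = sqlite3.connect(":memory:")
YEARS = (2004, 2005)

# One subtable per year plays the role of a partition.
for year in YEARS:
    db.execute(f"CREATE TABLE measurement_{year} (day TEXT, value INTEGER)")

def insert(day, value):
    """Route a row to the subtable that matches the year of its date."""
    year = int(day[:4])
    if year not in YEARS:
        raise ValueError(f"no partition for year {year}")
    db.execute(f"INSERT INTO measurement_{year} VALUES (?, ?)", (day, value))

def total_for_year(year):
    """Scan only the subtable for the requested year."""
    (total,) = db.execute(
        f"SELECT COALESCE(SUM(value), 0) FROM measurement_{year}"
    ).fetchone()
    return total

insert("2004-11-30", 4)
insert("2005-01-15", 7)
insert("2005-02-01", 1)
total_2005 = total_for_year(2005)   # reads measurement_2005 only
```

With DBMS-level partitioning, the routing and the choice of subtables to scan are handled by the database rather than by application code.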

Bizgres – a DBMS for BI

Bizgres project started in April 2005
  Current release: 0.7
Aims to make PostgreSQL the open source standard for data warehousing and BI
  Will build a complete DB platform for BI exclusively from free software

Bizgres – continued

Support for bitmap indexes and partitioning
  Will be included in PostgreSQL 8.1

Support for materialized views planned

More information at bizgres.org

ETL Summary

OLAP Server Summary

OLAP Client Summary

DBMS Summary

Summary

The different categories are not equally mature
  The DBMSs are very mature
  The ETL tools are not mature
Still a lot of work to do within the field of open source BI
  But a lot of activity takes place now
    New releases, new products, co-operation (Bizgres and Mondrian)

Acknowledgements

This work was supported by the European Internet Accessibility Observatory (EIAO) project, funded by the European Commission under Contract no. 004526

Thanks to EIAO partners for useful input to this survey