A Survey of Open Source Tools for Business Intelligence
Christian Thomsen and Torben Bach Pedersen
Department of Computer Science
Aalborg University
DDBW20: January 2006 2
Agenda
About the survey
Criteria
ETL tools
OLAP servers
OLAP clients
DBMSs
Summary
The EIAO Project
The European Internet Accessibility Observatory (EIAO) project will…
  … build a crawler to automatically collect data about accessibility of web resources
    For example, for blind people
  … apply BI technologies to analyze the data
  … be based on open source software
More details are available at eiao.net
At Aalborg University, we build the data warehouse for the EIAO project
Motivation
The open source products are not very visible within the BI field
The data warehouse for EIAO will be based on open source tools
Survey done to get an overview of available products
The survey
We considered ETL tools, OLAP servers, OLAP clients, and DBMSs
  Data mining tools not considered (yet)
We searched the Internet for open source tools
  Collected data by examining source code, homepages, and documentation
  Done in November-December 2004
Criteria
General
  Availability for different platforms
  License
ETL tools
  Supported data sources and targets
  ROLAP or MOLAP
  Incremental load in automatic fashion
  Methods to specify ETL process
  Supported data cleansing
Criteria – continued
OLAP servers
  ROLAP, MOLAP, HOLAP
  Handling of “large” data sets
  Supported DBMSs
  Use of aggregates
  APIs and query languages
Criteria – continued
OLAP clients
  Supported OLAP servers
  APIs and query languages
  Prescheduled reports
  Export facilities
DBMSs
  Handling of “large” data sets
  Materialized views
  Bitmap indexes
  Replication and partitioning
  Star joins
ETL tools
Considered products
  Bee 1.1.0 (June 2005: 2.0.0)
  CloverETL 1.1.2 (May 2005: pre 1.1.5)
  Octopus 3.0.1 (June 2005: 3.4.1)
Many “empty” projects on the web were disregarded
Bee uses Perl’s DBI, the others use JDBC
All the products are ROLAP oriented
There seems to be no support for automatic, incremental load
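To make the "incremental load" criterion concrete, the following is a minimal sketch of what such a load does: only rows added to the source since the previous run are copied. It is not taken from Bee, CloverETL, or Octopus. The table names `sales` and `sales_fact`, the use of an ever-increasing `id` as the watermark, and SQLite as the database are all assumptions made for illustration.

```python
import sqlite3

def incremental_load(source, target):
    """Copy only the source rows that are newer than what the target holds."""
    # Assumption: ids only grow, so the highest loaded id is the watermark.
    (watermark,) = target.execute(
        "SELECT COALESCE(MAX(id), 0) FROM sales_fact").fetchone()
    new_rows = source.execute(
        "SELECT id, amount FROM sales WHERE id > ? ORDER BY id",
        (watermark,)).fetchall()
    target.executemany(
        "INSERT INTO sales_fact (id, amount) VALUES (?, ?)", new_rows)
    target.commit()
    return len(new_rows)

# Hypothetical source system with two rows.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
source.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 10.0), (2, 20.0)])

# Hypothetical, initially empty data warehouse table.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE sales_fact (id INTEGER PRIMARY KEY, amount REAL)")

first = incremental_load(source, target)    # initial run copies both rows
source.execute("INSERT INTO sales VALUES (3, 30.0)")
second = incremental_load(source, target)   # later run copies only the new row
```

A tool with automatic support would track the watermark (or a change log) itself; here the user has to write that logic by hand.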
ETL tools – continued
Process specified in XML files
  Can be done by means of a GUI in Bee and Octopus
  … and soon also in CloverETL
The user must code the cleansing functionality (transformations)
  Octopus includes some basic transformations
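As an illustration of the kind of cleansing transformation a user would have to code, here is a small sketch. It does not use the API of any of the surveyed tools. The record layout, the function name `clean_customer`, and the country mapping are hypothetical.

```python
def clean_customer(record):
    """Return a cleansed copy of one source record (a dict of strings)."""
    cleaned = dict(record)
    # Collapse stray whitespace and normalise the capitalisation of the name.
    cleaned["name"] = " ".join(record["name"].split()).title()
    # Map inconsistent country spellings to one code (hypothetical mapping).
    country_codes = {"denmark": "DK", "danmark": "DK", "dk": "DK"}
    key = record["country"].strip().lower()
    cleaned["country"] = country_codes.get(key, record["country"].strip().upper())
    return cleaned

row = {"name": "  jane   DOE ", "country": "Danmark"}
cleaned_row = clean_customer(row)
```

In an ETL flow, a function like this would be applied to every extracted row before it is loaded into the target.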
ETL tools – continued
The open source products are (still) not as good as their commercial competitors
There is a lack of good documentation
OLAP servers
Considered products
  Lemur “0.0”
  Bee 1.1.0 (June 2005: 2.0.0)
  Mondrian 1.0.1 (April 2005: 1.1.5)
Now more products have been announced
  HydraCube
  Palo
OLAP servers – continued
Lemur is HOLAP oriented
  But it is a research project and is not ready for commercial use
Bee and Mondrian are ROLAP oriented
  Bee uses MySQL
  Mondrian uses JDBC
OLAP servers – continued
Bee aims to be able to handle 50GB efficiently
Mondrian’s documentation states that it handles large data sets if the underlying DBMS does
It is not possible to choose which pre-computed aggregates to use
  Since May 2005, Mondrian has used existing materialized views
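The idea behind using pre-computed aggregates can be sketched as follows: a query is answered from a small summary table when one exists, and from the fact table otherwise. This is not Mondrian's or Bee's implementation. The table names, the helper functions, and the use of SQLite (with an ordinary table standing in for a materialized view) are assumptions made for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales_fact (year INTEGER, product TEXT, amount REAL)")
db.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
               [(2004, "A", 10.0), (2004, "B", 5.0), (2005, "A", 7.0)])
# Pre-computed aggregate: total amount per year (stand-in for a materialized view).
db.execute("CREATE TABLE agg_sales_by_year AS "
           "SELECT year, SUM(amount) AS amount FROM sales_fact GROUP BY year")

def table_exists(name):
    """Check the SQLite catalog for a table with the given name."""
    row = db.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (name,)).fetchone()
    return row is not None

def total_by_year(year):
    """Answer from the aggregate when it exists, otherwise scan the fact table."""
    table = "agg_sales_by_year" if table_exists("agg_sales_by_year") else "sales_fact"
    (total,) = db.execute(
        f"SELECT SUM(amount) FROM {table} WHERE year = ?", (year,)).fetchone()
    return table, total

used_table, total_2004 = total_by_year(2004)
```

Both paths return the same total; the aggregate simply has far fewer rows to read, which is what matters for "large" data sets.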
OLAP servers – continued
Mondrian’s API is similar to ADO MD
  Support for JOLAP and XMLA planned
Mondrian supports (to some extent) the MDX query language
Bee’s API(s) and query language(s) are not described in the documentation we found
New OLAP servers
Palo (opensourceolap.org)
  MOLAP oriented
  All data stored entirely in memory
  Preview expected at the end of August 2005
HydraCube (hydracube.sourceforge.net)
  MOLAP oriented
  Not an in-memory server
    Uses BerkeleyDB
  Supports MDX against distributed DB
  Performs aggregation at query time
OLAP clients
Considered products
  Bee 1.1.0 (June 2005: 2.0.0)
  JPivot 1.2.0 (August 2005: 1.4.0)
There seems to be no support for prescheduled reports in the products
Bee uses the Bee OLAP server
JPivot uses Mondrian (and MDX queries)
  The newest version also supports native XMLA access
OLAP clients – continued
Bee exports to Excel, PDF, PNG, PowerPoint, text, and XML formats
JPivot exports to Excel and PDF
Both support different kinds of 2D and 3D graphs
Both are used through web interfaces
DBMSs
Considered products
  MonetDB 4.4.2
  MySQL 4.1.2
  MaxDB 7.5
  PostgreSQL 8.0
These are the most visible
  Other open source products are available
DBMSs – continued
All of the DBMSs are designed to handle large data sets
All of them are used for BI in production environments
MaxDB, MySQL, and PostgreSQL support one-way replication
  A multiway solution is planned for PostgreSQL
No support for materialized views, bitmap indexes or star joins
  But PostgreSQL 8.1 will have support for bitmaps
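For readers unfamiliar with bitmap indexes, the sketch below shows the general technique: one bitmap per distinct column value, with conjunctive predicates evaluated by a bitwise AND. It is a toy in-memory illustration, not how any of the surveyed DBMSs implements bitmaps. The sample rows and function names are made up.

```python
from collections import defaultdict

def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to an int whose bit i is set
    when row i holds that value."""
    index = defaultdict(int)
    for position, row in enumerate(rows):
        index[row[column]] |= 1 << position
    return index

def matching_rows(bitmap):
    """Return the row positions whose bits are set in `bitmap`."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

rows = [
    {"country": "DK", "year": 2004},
    {"country": "SE", "year": 2004},
    {"country": "DK", "year": 2005},
    {"country": "DK", "year": 2004},
]
by_country = build_bitmap_index(rows, "country")
by_year = build_bitmap_index(rows, "year")

# country = 'DK' AND year = 2004 becomes one bitwise AND of two bitmaps.
hits = matching_rows(by_country["DK"] & by_year[2004])
```

This is why bitmap indexes suit data warehouse queries that combine several low-cardinality dimension attributes.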
DBMSs – continued
Partitioning not yet supported
  MySQL 5.1 will support partitioning
    MySQL can currently do some partitioning by using NDB Cluster
  PostgreSQL 8.1 will support partitioning into subtables
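The general idea of partitioning a fact table into subtables can be sketched as below: rows are routed to one table per year, and a view presents them as a single logical table. This only illustrates the concept and does not reproduce the PostgreSQL or MySQL mechanisms. The table names, the per-year split, and the use of SQLite are assumptions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
YEARS = (2004, 2005)
for year in YEARS:
    db.execute(f"CREATE TABLE sales_{year} (year INTEGER, amount REAL)")
# A view over all subtables gives queries a single logical table.
db.execute("CREATE VIEW sales AS " +
           " UNION ALL ".join(f"SELECT year, amount FROM sales_{y}" for y in YEARS))

def insert_sale(year, amount):
    """Route a row to the subtable that holds its year."""
    if year not in YEARS:
        raise ValueError(f"no partition for year {year}")
    db.execute(f"INSERT INTO sales_{year} VALUES (?, ?)", (year, amount))

for year, amount in [(2004, 10.0), (2005, 7.0), (2004, 5.0)]:
    insert_sale(year, amount)

rows_2004 = db.execute("SELECT COUNT(*) FROM sales_2004").fetchone()[0]
total_all = db.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

A query restricted to one year only needs to touch that year's subtable, and old partitions can be dropped cheaply.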
Bizgres – a DBMS for BI
Bizgres project started in April 2005
  Current release: 0.7
Aims to make PostgreSQL the open source standard for data warehousing and BI
  Will build a complete DB platform for BI exclusively from free software
Bizgres – continued
Support for bitmap indexes and partitioning
  Will be included in PostgreSQL 8.1
Support for materialized views planned
More information at bizgres.org
Summary
The different categories are not equally mature
  The DBMSs are very mature
  The ETL tools are not mature
Still a lot of work to do within the field of open source BI
  But a lot of activity takes place now
    New releases, new products, co-operation (Bizgres and Mondrian)