data analysis clustering trees

8/18/2019 Data Analysis Clustering Trees


ELKI

ELKI (for Environment for DeveLoping KDD-Applications Supported by Index-Structures) is a knowledge discovery in databases (KDD, “data mining”) software framework developed for use in research and teaching by the database systems research unit of Professor Hans-Peter Kriegel at the Ludwig Maximilian University of Munich, Germany. It aims to enable the development and evaluation of advanced data mining algorithms and their interaction with database index structures.

1 Description

The ELKI framework is written in Java and built around a modular architecture. Most currently included algorithms belong to clustering, outlier detection[1] and database indexes. A key concept of ELKI is to allow the combination of arbitrary algorithms, data types, distance functions and indexes, and to evaluate these combinations. When developing new algorithms or index structures, the existing components can be reused and combined.

2 Objectives

The university project is developed for use in teaching and research. The source code is written with extensibility, readability and reusability in mind, but is also well-optimized for performance. Since the experimental evaluation of algorithms depends on many environmental factors, ELKI aims at providing a shared codebase with comparable implementations of many algorithms.

As a research project, it currently does not offer integration with business intelligence applications or an interface to common database management systems via SQL. The copyleft (AGPL) license may also be a hindrance to commercial usage. Furthermore, applying the algorithms requires knowledge of their usage and parameters, and study of the original literature. The target audience is students, researchers and software engineers.

3 Architecture

ELKI is modeled around a database core, which uses a vertical data layout that stores data in column groups (similar to column families in NoSQL databases).
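
The vertical layout can be pictured as one container per attribute, all keyed by the same object ID. The following is a minimal sketch of that idea only, assuming integer object IDs; the class `ColumnStore` and its methods are hypothetical names chosen for illustration and are not ELKI's API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of a vertical (column-oriented) layout; not ELKI code. */
final class ColumnStore {
    // One map per column, each keyed by the same integer object ID.
    private final Map<String, Map<Integer, Object>> columns = new HashMap<>();

    /** Stores a value for the given object in the named column. */
    public void put(String column, int objectId, Object value) {
        columns.computeIfAbsent(column, k -> new HashMap<>()).put(objectId, value);
    }

    /** Returns the value of one column for one object, or null if absent. */
    public Object get(String column, int objectId) {
        Map<Integer, Object> col = columns.get(column);
        return col == null ? null : col.get(objectId);
    }

    public static void main(String[] args) {
        ColumnStore db = new ColumnStore();
        db.put("vector", 0, new double[] {1.0, 2.0});
        db.put("label", 0, "a");
        // A result can be added as a new column without touching the others.
        db.put("outlierScore", 0, 0.25);
        System.out.println(db.get("label", 0) + " " + db.get("outlierScore", 0));
    }
}
```

In such a layout, an algorithm that only needs the vectors never has to load labels or scores, which is one plausible reason to prefer columns over rows here.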

This database core provides nearest neighbor search, range/radius search, and distance query functionality with index acceleration for a wide range of dissimilarity measures. Algorithms based on such queries (e.g. k-nearest-neighbor algorithm, local outlier factor and DBSCAN) can be implemented easily and benefit from the index acceleration. The database core also provides fast and memory-efficient data structures for object collections and associative structures such as nearest neighbor lists.
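
To make the two query types concrete, here is a minimal sketch of a range query and a k-nearest-neighbor query over a pluggable distance. It uses a plain linear scan, which is the fallback an index would replace; all names (`LinearScanQueries`, `Distance`, `range`, `knn`) are hypothetical and do not reflect ELKI's actual interfaces.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical linear-scan queries with a pluggable distance; not ELKI's API. */
final class LinearScanQueries {
    /** Any dissimilarity measure on double vectors. */
    interface Distance {
        double between(double[] a, double[] b);
    }

    /** Euclidean distance as one possible measure. */
    static final Distance EUCLIDEAN = (a, b) -> {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    };

    /** Range query: indices of all points within the radius of the query point. */
    static List<Integer> range(double[][] data, double[] query, double radius, Distance dist) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            if (dist.between(data[i], query) <= radius) {
                result.add(i);
            }
        }
        return result;
    }

    /** kNN query: indices of the k points closest to the query, nearest first. */
    static List<Integer> knn(double[][] data, double[] query, int k, Distance dist) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            ids.add(i);
        }
        ids.sort(Comparator.comparingDouble(i -> dist.between(data[i], query)));
        return new ArrayList<>(ids.subList(0, Math.min(k, ids.size())));
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {1, 0}, {5, 5}};
        double[] q = {0.9, 0};
        System.out.println(range(data, q, 1.0, EUCLIDEAN)); // indices within radius 1.0
        System.out.println(knn(data, q, 2, EUCLIDEAN));     // two nearest, closest first
    }
}
```

An algorithm such as DBSCAN only needs the `range` call, and a kNN-based method only needs `knn`, so swapping the scan for an index-backed implementation would not change the algorithm code.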

ELKI makes extensive use of Java interfaces, so that it can be extended easily in many places. For example, custom data types, distance functions, index structures, algorithms, input parsers, and output modules can be added and combined without modifying the existing code. This includes the possibility of defining a custom distance function and using existing indexes for acceleration.

ELKI uses a service loader architecture to allow publishing extensions as separate jar files.
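
In Java, the standard mechanism for this pattern is `java.util.ServiceLoader`; whether ELKI uses that class directly is not stated here, so the following is only a sketch of the general technique. The interface `DistancePlugin` and the class `ExtensionDiscovery` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

/** Hypothetical extension discovery via java.util.ServiceLoader; not ELKI code. */
final class ExtensionDiscovery {
    /** Hypothetical extension point that a separate jar could implement. */
    public interface DistancePlugin {
        String name();
    }

    /** Collects the names of all implementations registered on the classpath. */
    static List<String> discover() {
        List<String> names = new ArrayList<>();
        // Each extension jar lists its implementations in a file under
        // META-INF/services/ named after the interface's binary name.
        for (DistancePlugin p : ServiceLoader.load(DistancePlugin.class)) {
            names.add(p.name());
        }
        return names;
    }

    public static void main(String[] args) {
        // With no extension jar on the classpath, the list is empty.
        System.out.println(discover());
    }
}
```

The host application never names the implementing classes, so adding a jar to the classpath is enough to make a new component available.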

4 Visualization

The visualization module uses SVG for scalable graphics output, and Apache Batik for rendering of the user interface as well as lossless export into PostScript and PDF for easy inclusion in scientific publications in LaTeX. Exported files can be edited with SVG editors such as Inkscape. Since cascading style sheets are used, the graphics design can be restyled easily. Unfortunately, Batik is rather slow and memory intensive, so the visualizations do not scale well to large data sets.
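
The restyling point can be illustrated with a small sketch that emits an SVG scatter plot whose appearance is controlled only by a CSS class. It builds the SVG as a string and does not use Batik; the class `SvgScatter`, the method `render` and the CSS class name `pt` are hypothetical.

```java
/** Hypothetical SVG scatter plot writer styled via CSS classes; not ELKI or Batik code. */
final class SvgScatter {
    /** Renders points as circles; their appearance is controlled by the "pt" CSS class. */
    static String render(double[][] points, String css) {
        StringBuilder sb = new StringBuilder();
        sb.append("<svg xmlns=\"http://www.w3.org/2000/svg\" viewBox=\"0 0 100 100\">\n");
        sb.append("<style>").append(css).append("</style>\n");
        for (double[] p : points) {
            sb.append("<circle class=\"pt\" cx=\"").append(p[0])
              .append("\" cy=\"").append(p[1]).append("\" r=\"1.5\"/>\n");
        }
        sb.append("</svg>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        double[][] pts = {{10, 20}, {50, 60}};
        // Restyling only requires a different stylesheet; the geometry is unchanged.
        System.out.print(render(pts, ".pt { fill: steelblue; }"));
    }
}
```

Passing a different stylesheet string changes colors or strokes without regenerating any coordinates, which mirrors the separation of design and geometry described above.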

5 Awards

Version 0.4, presented at the “Symposium on Spatial and Temporal Databases” 2011, which included various methods for spatial outlier detection,[2] won the conference’s “best demonstration paper award”.

6 Included algorithms

Select included algorithms:[3]

• Cluster analysis:


    • K-means clustering
    • K-medians clustering
    • Expectation-maximization algorithm
    • Hierarchical clustering (including SLINK and CLINK)
    • Single-linkage clustering
    • DBSCAN (Density-Based Spatial Clustering of Applications with Noise, with full index acceleration for arbitrary distance functions)
    • OPTICS (Ordering Points To Identify the Clustering Structure), including the extensions OPTICS-OF, DeLi-Clu, HiSC, HiCO and DiSH
    • SUBCLU (Density-Connected Subspace Clustering for High-Dimensional Data)
    • Canopy clustering algorithm

• Anomaly detection:
    • LOF (Local outlier factor)
    • OPTICS-OF
    • DB-Outlier (Distance-Based Outliers)
    • LOCI (Local Correlation Integral)
    • LDOF (Local Distance-Based Outlier Factor)
    • EM-Outlier

• Spatial index structures:
    • R-tree
    • R*-tree
    • M-tree
    • k-d tree
    • Locality sensitive hashing

• Evaluation:
    • Receiver operating characteristic (ROC curve)
    • Scatter plot
    • Histogram
    • Parallel coordinates (also in 3D, using OpenGL)

• Other:
    • Apriori algorithm
    • Dynamic time warping
    • Principal component analysis
    • Multidimensional scaling

7 Version history

Version 0.1 (July 2008) contained several algorithms from cluster analysis and anomaly detection, as well as some index structures such as the R*-tree. The focus of the first release was on subspace clustering and correlation clustering algorithms.[4]

Version 0.2 (July 2009) added functionality for time series analysis, in particular distance functions for time series.[5]

Version 0.3 (March 2010) extended the choice of anomaly detection algorithms and visualization modules.[6]

Version 0.4 (September 2011) added algorithms for geo data mining and support for multi-relational database and index structures.[2]

Version 0.5 (April 2012) focused on the evaluation of cluster analysis results, adding new visualizations and some new algorithms.[7]

Version 0.6 (June 2013) introduced a new 3D adaptation of parallel coordinates for data visualization, apart from the usual additions of algorithms and index structures.[8]

Version 0.7 (August 2015) added support for uncertain data types, and algorithms for the analysis of uncertain data.[9]

8 Related applications

• Weka: A similar project by the University of Waikato, with a focus on classification algorithms.

• RapidMiner: A commercially available application (an older version is also available as open source) with a focus on machine learning.

• KNIME: An open source platform which integrates various components for machine learning and data mining.

9 External links

• Official web page of ELKI with download and documentation: http://elki.dbs.ifi.lmu.de/

10 References

[1] Hans-Peter Kriegel, Peer Kröger, Arthur Zimek (2009). “Outlier Detection Techniques (Tutorial)” (PDF). 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2009) (Bangkok, Thailand). Retrieved 2010-03-26. http://www.dbs.ifi.lmu.de/Publikationen/Papers/tutorial_slides.pdf


[2] Elke Achtert, Achmed Hettab, Hans-Peter Kriegel, Erich Schubert, Arthur Zimek (2011). Spatial Outlier Detection: Data, Algorithms, Visualizations. 12th International Symposium on Spatial and Temporal Databases (SSTD 2011). Minneapolis, MN: Springer. doi:10.1007/978-3-642-22922-0_41.

[3] Excerpt from “Data Mining Algorithms in ELKI 0.4”. Retrieved August 17, 2011. http://elki.dbs.ifi.lmu.de/wiki/Algorithms

[4] Elke Achtert, Hans-Peter Kriegel, Arthur Zimek (2008). ELKI: A Software System for Evaluation of Subspace Clustering Algorithms (PDF). Proceedings of the 20th international conference on Scientific and Statistical Database Management (SSDBM 08). Hong Kong, China: Springer. doi:10.1007/978-3-540-69497-7_41. http://www.dbs.ifi.lmu.de/~zimek/publications/SSDBM2008/elkipaper.pdf

[5] Elke Achtert, Thomas Bernecker, Hans-Peter Kriegel, Erich Schubert, Arthur Zimek (2009). ELKI in time: ELKI 0.2 for the performance evaluation of distance measures for time series (PDF). Proceedings of the 11th International Symposium on Advances in Spatial and Temporal Databases (SSTD 2009). Aalborg, Denmark: Springer. doi:10.1007/978-3-642-02982-0_35. http://www.dbs.ifi.lmu.de/~zimek/publications/SSTD2009/sstd09-elki-paper.pdf

[6] Elke Achtert, Hans-Peter Kriegel, Lisa Reichert, Erich Schubert, Remigius Wojdanowski, Arthur Zimek (2010). Visual Evaluation of Outlier Detection Models. 15th International Conference on Database Systems for Advanced Applications (DASFAA 2010). Tsukuba, Japan: Springer. doi:10.1007/978-3-642-12098-5_34.

[7] Elke Achtert, Sascha Goldhofer, Hans-Peter Kriegel, Erich Schubert, Arthur Zimek (2012). Evaluation of Clusterings - Metrics and Visual Support. 28th International Conference on Data Engineering (ICDE). Washington, DC. doi:10.1109/ICDE.2012.128.

[8] Elke Achtert, Hans-Peter Kriegel, Erich Schubert, Arthur Zimek (2013). Interactive Data Mining with 3D-Parallel-Coordinate-Trees. Proceedings of the ACM International Conference on Management of Data (SIGMOD). New York City, NY. doi:10.1145/2463676.2463696.

[9] Erich Schubert, Alexander Koos, Tobias Emrich, Andreas Züfle, Klaus Arthur Schmid, Arthur Zimek (2015). “A Framework for Clustering Uncertain Data” (PDF). Proceedings of the VLDB Endowment 8 (12): 1976–1987. http://www.vldb.org/pvldb/vol8/p1976-schubert.pdf


11 Text and image sources, contributors, and licenses

11.1 Text

• ELKI Source: https://en.wikipedia.org/wiki/ELKI?oldid=694513163 Contributors: Korath, Gaius Cornelius, JLaTondre, SmackBot, Timotheus Canens, Lambiam, Dicklyon, Pgr94, Harrigan, Cydebot, Varnent, JohnBlackburne, Mild Bill Hiccup, Auntof6, Qwfp, Addbot, Chzz, Yobot, Rodamaker, Chire, Mkiaeeha, Dexbot, Oritnk, Rober9876543210 and Anonymous: 12

11.2 Images

• File:ELKI_Screenshot.jpg Source: https://upload.wikimedia.org/wikipedia/commons/f/fa/ELKI_Screenshot.jpg License: CC0 Contributors: Own work Original artist: Chire

11.3 Content license

• Creative Commons Attribution-Share Alike 3.0

https://creativecommons.org/licenses/by-sa/3.0/
