2. Visualization in Data Mining
Motivation
Visualization for Data Mining
• Huge amounts of information
• Limited display capacity of output devices
Visual Data Mining (VDM) is a new approach for exploring very large data sets, combining traditional mining methods and information visualization techniques.
Levels of VDM
• No or very limited integration: corresponds to applying either traditional information visualization or automated data mining methods on their own.
• Loose integration: visualization and automated mining methods are applied sequentially. The result of one step can be used as input for another step.
• Full integration: automated mining and visualization methods are applied in parallel, and their results are combined.
Methods of Data Visualization
Different visualization methods are available depending on the type of data.
Data can be:
• Univariate
• Bivariate
• Multivariate
Univariate data
Measurements of a single quantitative variable.
Visualization characterizes the distribution of that variable.
Represented using the following methods:
• Histogram
• Pie Chart
Histogram
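As a rough illustration of how a histogram characterizes the distribution of a single variable, the sketch below counts values into equal-width bins. The function names `histogram` and `render` and the clamping of the maximum value into the last bin are assumptions made for this example, not something taken from the slides.

```python
def histogram(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when all values are equal (zero range).
    width = (hi - lo) / num_bins or 1.0
    counts = [0] * num_bins
    for v in values:
        # The maximum value is clamped into the last bin.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts


def render(counts):
    """Draw one text bar per bin, one '#' per counted value."""
    return "\n".join("#" * c for c in counts)


if __name__ == "__main__":
    print(render(histogram([1, 2, 2, 3, 3, 3, 4, 4, 5], 4)))
```

A plotting library would normally draw the bars; the text rendering here only keeps the sketch self-contained.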
Pie Chart
Bivariate Data
Consists of paired samples of two quantitative variables.
The two variables are related.
Represented using the following methods:
• Scatter plots
• Line graphs
Scatter plots
Line graphs
Multivariate Data
Multidimensional representation of multivariate data.
Represented using the following methods:
• Icon based methods
• Pixel based methods
• Dynamic parallel coordinate system
Icon based Methods
Pixel Based Methods
Approach:
Each attribute value is represented by one colored pixel (the value ranges of the attributes are mapped to a fixed color map).
The values of each attribute are presented in separate sub windows.
Examples: Dense Pixel Displays
Dense Pixel Display
Approach:
Each attribute value is represented by one colored pixel (the value ranges of the attributes are mapped to a fixed color map).
Different attributes are presented in separate sub windows.
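The pixel-based approach can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: the blue-to-red color map, the row-by-row pixel layout and the names `to_color` and `dense_pixel_display` are all hypothetical choices for this sketch. Published dense pixel techniques typically use more elaborate pixel arrangements.

```python
def to_color(value, lo, hi):
    """Map a value in [lo, hi] to an RGB color on a fixed blue-to-red color map."""
    t = 0.0 if hi == lo else (value - lo) / (hi - lo)
    return (round(255 * t), 0, round(255 * (1 - t)))


def dense_pixel_display(records, width):
    """Return one sub window (a 2-D grid of RGB pixels) per attribute."""
    subwindows = {}
    for name in records[0]:
        column = [r[name] for r in records]
        lo, hi = min(column), max(column)
        # One colored pixel per attribute value.
        pixels = [to_color(v, lo, hi) for v in column]
        # Lay the pixels out row by row, `width` pixels per row.
        subwindows[name] = [pixels[i:i + width] for i in range(0, len(pixels), width)]
    return subwindows


if __name__ == "__main__":
    data = [{"a": 0, "b": 10}, {"a": 5, "b": 10}, {"a": 10, "b": 20}, {"a": 10, "b": 30}]
    for attribute, grid in dense_pixel_display(data, width=2).items():
        print(attribute, grid)
```

Each record keeps the same pixel position in every sub window, which is what lets a viewer compare attributes visually.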
Visual Data Mining: Framework and Algorithm Development
Ganesh, M., Han, E.H., Kumar, V., Shekhar, S., & Srivastava, J. (1996).
Working Paper. Twin Cities, MN: University of Minnesota, Twin Cities Campus.
References
http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/10335/ftp:zSzzSzftp.cs.umn.eduzSzdeptzSzuserszSzkumarzSzdatavis.pdf/ganesh96visual.pdf
http://www.ailab.si/blaz/predavanja/ozp/gradivo/2002-Keim-Visualization%20in%20DM-IEEE%20Trans%20Vis.pdf
http://www.geocities.com/anand_palm/
Abstract
VDM refers to the use of visualization techniques in the data mining process to evaluate, monitor and guide data mining.
This paper provides a framework for VDM via the loose coupling of databases and visualization systems.
The paper applies VDM towards designing new algorithms that can learn decision trees by manually refining some of the decisions made by well-known algorithms such as C4.5.
Components of VQLBCI
The three major components of VQLBCI are Visual Representations, Computations and Events.
Visual Development of Algorithms
The most interesting use of visual data mining is the development of new insights and algorithms.
The figure below shows the ER diagram for learning classification decision trees.
This model allows the user to monitor the quality and impact of decisions made by the learning procedure.
The learning procedure can be refined interactively via a visual interface.
ER diagram for the search space of decision tree learning algorithm
General Framework
Learning a classification decision tree from a training data set can be regarded as a process of searching for the best decision tree that meets user-provided goal constraints.
The problem space of this search process consists of Model Candidates, Model Candidate Generator and Model Constraints.
Many existing classification-learning algorithms like C4.5 and CDP fit nicely within this search framework. New learning algorithms that fit user’s requirements can be developed by defining the components of the problem space.
General Framework
A Model Candidate corresponds to a partial classification decision tree. Each node of the decision tree is a Model Atom.
The search process is the process of finding a final model candidate that meets the user's goal specifications.
The Model Candidate Generator transforms the current model candidate into a new model candidate by selecting one model atom to expand from the expandable leaf model atoms.
Model Constraints (used by the Model Candidate Generator) provide controls and boundaries to the search space.
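The interplay of these components can be sketched as a generic search loop. This is only an illustration under assumptions: the function `search` and its four hook arguments are hypothetical names, not the paper's interface, and the toy tree in the usage example stands in for a real decision tree.

```python
def search(candidate, expandable_leaves, rank, expand, is_acceptable):
    """Expand one leaf model atom at a time until the candidate is acceptable.

    All hook arguments are hypothetical stand-ins:
      expandable_leaves(candidate) -> list of expandable leaf model atoms
      rank(leaves)                 -> leaves ordered by the traversal strategy
      expand(candidate, atom)      -> new model candidate (the generator step)
      is_acceptable(candidate)     -> acceptability constraint predicate
    """
    while not is_acceptable(candidate):
        leaves = expandable_leaves(candidate)
        if not leaves:
            break  # nothing left to expand
        atom = rank(leaves)[0]
        candidate = expand(candidate, atom)
    return candidate


if __name__ == "__main__":
    # Toy model candidate: a list of leaf depths; a leaf is expandable below depth 2.
    def toy_leaves(tree):
        return [d for d in tree if d < 2]

    def toy_expand(tree, atom):
        rest = list(tree)
        rest.remove(atom)
        return rest + [atom + 1, atom + 1]  # binary split

    print(search([0], toy_leaves, sorted, toy_expand, lambda t: not toy_leaves(t)))
```

Swapping the `rank` or `is_acceptable` hook changes the learning algorithm without touching the loop, which is the point of describing learning as a search over model candidates.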
Search Process
Acceptability Constraint
Model Constraints consist of Acceptability constraints, Expandability constraints and a Data-Entropy calculation function.
An Acceptability constraint predicate specifies when a model candidate is acceptable and thus allows the search process to stop. Examples:
A1) Total number of expandable leaf model atoms = 0.
A2) Overall error rate of the model candidate <= acceptable error rate.
A3) Total number of model atoms in the model candidate >= maximal allowable tree size.
A1 is used in C4.5 and CDP.
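The three example predicates translate directly into code. In the sketch below the class `CandidateStats` and its field names are hypothetical, and combining the enabled predicates with a logical OR is an assumption: the slides do not say how several acceptability constraints interact.

```python
from dataclasses import dataclass


@dataclass
class CandidateStats:
    """Summary of a model candidate (hypothetical structure for this sketch)."""
    expandable_leaves: int  # number of expandable leaf model atoms
    error_rate: float       # overall error rate of the model candidate
    size: int               # total number of model atoms


def a1(c):
    return c.expandable_leaves == 0


def a2(c, acceptable_error_rate):
    return c.error_rate <= acceptable_error_rate


def a3(c, max_tree_size):
    return c.size >= max_tree_size


def is_acceptable(c, acceptable_error_rate=None, max_tree_size=None):
    """Assumption: the search may stop as soon as any enabled predicate holds."""
    if a1(c):
        return True
    if acceptable_error_rate is not None and a2(c, acceptable_error_rate):
        return True
    if max_tree_size is not None and a3(c, max_tree_size):
        return True
    return False


if __name__ == "__main__":
    candidate = CandidateStats(expandable_leaves=3, error_rate=0.08, size=15)
    print(is_acceptable(candidate, acceptable_error_rate=0.10))
```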
Expandability Constraint
An Expandability constraint predicate specifies whether a leaf model atom is expandable or not. Examples:
• C4.5 uses E1 and E2
• CDP uses E2 and E3
Traversal Strategy
A traversal strategy ranks expandable leaf model atoms based on the model atom attributes. Examples:
• Increasing order of depth
• Decreasing order of depth
• Orders based on other model atom attributes
Steps in Visual Algorithm Development
No single algorithm is the best all the time; performance is highly data dependent.
By changing different predicates of the model constraints, users can construct new classification-learning algorithms.
This enables users to find an algorithm that works best on a given data set.
Two algorithms are developed: BF, based on the Best-First search idea, and CDP+, which is a modification of CDP.
BF
This algorithm is based on the Best-First search idea.
For Acceptability criteria, it includes A1 and A2 with a user-specified acceptable error rate.
The Traversal strategy chosen is T3.
In Best-First, expandable leaf model atoms are ranked in decreasing order of the number of misclassified training cases (local error rate * size of the subset training data set).
The traversal strategy expands the model atom that has the most misclassified training cases, thus reducing the overall error rate the most.
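The Best-First ranking can be sketched as a sort on the estimated number of misclassified cases. The dictionary keys `local_error_rate` and `subset_size` and the function names are hypothetical, chosen only for this illustration.

```python
def misclassified(atom):
    """Misclassified training cases: local error rate * size of the subset."""
    return atom["local_error_rate"] * atom["subset_size"]


def best_first_order(leaves):
    """Rank expandable leaf model atoms, most misclassified cases first."""
    return sorted(leaves, key=misclassified, reverse=True)


if __name__ == "__main__":
    leaves = [
        {"id": "a", "local_error_rate": 0.50, "subset_size": 10},
        {"id": "b", "local_error_rate": 0.10, "subset_size": 200},
        {"id": "c", "local_error_rate": 0.25, "subset_size": 40},
    ]
    print([atom["id"] for atom in best_first_order(leaves)])
```

Leaf "a" has the highest local error rate but covers few cases, so it is expanded last. Ranking by misclassified count rather than by error rate is what makes each expansion target the largest share of the overall error.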
CDP+
CDP+ is a modification of CDP.
CDP has dynamic pruning using expandability constraint E3.
Here, the depth is modified according to the size of the training data set of the model atom.
The modified depth is set in terms of B, t and T, where B is the branching factor of the decision tree, t is the size of the training data set belonging to the model atom, and T is the whole training data set.
Comparison of different classification learning algorithms
Experiment
The new BF and CDP+ algorithms are compared with the C4.5 and CDP algorithms.
Various metrics are selected to compare the efficiency, accuracy and final decision tree size of the classification algorithms.
The generation efficiency is measured in terms of the total number of nodes generated.
To compare the accuracy of the various algorithms, the mean classification error on the test data sets has been computed.
Classification error for 10 data sets
Nodes generated for 10 data sets
Final decision tree size
Results/Conclusion
CDP has accuracy comparable to C4.5 while generating considerably fewer nodes.
CDP+ has accuracy comparable to C4.5 while generating considerably fewer nodes.
CDP+ outperformed CDP in error rate and number of nodes generated.
Considering all performance metrics together, CDP+ is the best overall algorithm.
Considering classification accuracy alone, C4.5P is the winner.
Conclusion
Different datasets require different algorithms for best results.
Diverse user requirements put different constraints on the final decision tree.
The experiment shows that the Interactive Visual Data Mining Framework can help find the most suitable algorithm for a given data set and group of user requirements.
Data Mining for Selective Visualization of Large Spatial Datasets
Proceedings of the 14th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'02), Washington, DC, USA, November 2002.
Shashi Shekhar, Chang-Tien Lu, Pusheng Zhang, Rulin Liu
Computer Science & Engineering Department, University of Minnesota
References
http://citeseer.ist.psu.edu/cache/papers/cs/27216/http:zSzzSzwww-users.cs.umn.eduzSzzCz7EctluzSzPaperTalkFilezSzits02.pdf/shekhar02cubeview.pdf
http://www.cs.umn.edu/Research/shashi-group/
http://www.cs.umn.edu/Research/shashi-group/Book/sdb-chap1.pdf
http://www.cs.umn.edu/research/shashi-group/alan_planb.pdf
http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/27637/http:zSzzSzwww-users.cs.umn.eduzSzzCz7EpushengzSzpubzSzkdd2001zSzkdd.pdf/shekhar01detecting.pdf
Basic Terminology
• Spatial databases: alphanumeric data + geographical coordinates
• Spatial mining: mining of spatial databases
• Spatial data warehouse: contains geographical data
• Spatial outliers: observations that appear to be inconsistent with the remainder of the data set
Spatial Cluster
Contribution
• Propose and implement the CubeView visualization system
• General data cube operations
• Built on the concept of a spatial data warehouse to support data mining and data visualization
• Efficient and scalable spatial outlier detection algorithms
Challenges in spatial data mining
First, classical data mining deals with numbers and categories, whereas spatial data consists of more complex, extended objects such as points, lines and polygons.
Second, classical data mining works with explicit inputs, whereas spatial predicates and attributes are often implicit.
Third, classical data mining treats each input independently of other inputs, whereas spatial patterns often exhibit continuity and high autocorrelation among nearby features.
Application Domain
The Traffic Management Center of the Minnesota Department of Transportation (MNDOT) has a database that archives measurements from its sensor network.
The sensor network includes about nine hundred stations, each of which contains one to four loop detectors.
The stations measure volume and occupancy:
• Volume is the number of vehicles passing through a station in a 5-minute interval.
• Occupancy is the percentage of time a station is occupied by vehicles.
Basic Concepts
• Spatial Data Warehouse
• Spatial Data Mining
• Spatial Outliers Detection
Spatial Data Warehouse
• Employs a data cube structure
• Outputs: albums of maps
• Traffic data warehouse
  Measures: volume and occupancy
  Dimensions: time and space
Spatial Data Mining
• The process of discovering interesting and useful but implicit spatial patterns.
• The key goal is to partially "automate" knowledge discovery.
• Search for "nuggets" of information embedded in very large quantities of spatial data.
Spatial Outliers Detection
• Suspiciously deviating observations
• Local instability
• Each station has:
  Spatial attributes: time, space
  Non-spatial attributes: volume, occupancy
Basic Structure – CubeView
CubeView Visualization System
Each node in the cube corresponds to a visualization style:
• S: traffic volume of each station at all times
• TTD: time of the day
• TDW: day of the week
• STTD: daily traffic volume of each station
• TTD TDW S: traffic volume at each station at different times on different days
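Each node of the cube is, in effect, a group-by over some subset of the dimensions. The sketch below illustrates that idea with plain dictionaries. The record fields `station`, `time_of_day`, `day_of_week` and `volume`, the function name `aggregate`, and the use of a sum as the aggregate are all assumptions made for this example, not the CubeView implementation.

```python
from collections import defaultdict


def aggregate(records, dims):
    """Sum traffic volume grouped by the given dimensions (one node of the cube)."""
    totals = defaultdict(int)
    for r in records:
        key = tuple(r[d] for d in dims)
        totals[key] += r["volume"]
    return dict(totals)


if __name__ == "__main__":
    records = [
        {"station": "s1", "time_of_day": 8, "day_of_week": "Mon", "volume": 30},
        {"station": "s1", "time_of_day": 9, "day_of_week": "Mon", "volume": 20},
        {"station": "s2", "time_of_day": 8, "day_of_week": "Mon", "volume": 10},
        {"station": "s1", "time_of_day": 8, "day_of_week": "Tue", "volume": 5},
    ]
    print(aggregate(records, ["station"]))                 # the S node
    print(aggregate(records, ["station", "time_of_day"]))  # the S TTD node
```

Grouping by all three dimensions gives the most detailed node, and grouping by none gives the grand total at the opposite end of the lattice.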
Dimension Lattice
Data Mining Algorithms for Visualization
Problem Definition
Given a spatial graph $G = \{S, E\}$:
• $S = \{s_1, s_2, s_3, s_4, \ldots\}$: nodes (stations)
• $E$: edges (neighborhood of stations)
• $f(x)$: attribute value for a data record
• $N(x)$: fixed-cardinality set of neighbors of $x$
• $E_{y \in N(x)}(f(y))$: average attribute value of the neighbors of $x$
• $S(x) = f(x) - E_{y \in N(x)}(f(y))$: difference between the attribute value of each data object and the average attribute value of its neighbors
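Problem Definition (cont.)
$S(x)$: difference between the attribute value of each data object and the average attribute value of its neighbors.
Test for detecting an outlier: $\left| \frac{S(x) - \mu_s}{\sigma_s} \right| > \theta$, where $\mu_s$ and $\sigma_s$ are the mean and standard deviation of $S(x)$ over the data set and $\theta$ is the threshold for the chosen confidence level.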
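A minimal sketch of such a test follows. It assumes that a location is flagged when the standardized value of S(x) exceeds θ in absolute value, which fits a normal base distribution. The function name `spatial_outliers`, the dictionary-based graph and the sample readings are all hypothetical.

```python
from statistics import mean, pstdev


def spatial_outliers(f, neighbors, theta):
    """Flag locations whose difference from their neighborhood average is extreme.

    f         : dict mapping location -> attribute value f(x)
    neighbors : dict mapping location -> list of neighboring locations N(x)
    theta     : threshold on the standardized difference (assumed z-score test)
    """
    # S(x) = f(x) - average attribute value over N(x)
    s = {x: f[x] - mean([f[y] for y in neighbors[x]]) for x in f}
    mu = mean(list(s.values()))
    sigma = pstdev(list(s.values()))
    if sigma == 0:
        return []  # every location matches its neighborhood equally well
    return [x for x in f if abs((s[x] - mu) / sigma) > theta]


if __name__ == "__main__":
    # Five stations on a ring, each with two neighbors (fixed cardinality).
    volume = {"s1": 10, "s2": 11, "s3": 50, "s4": 12, "s5": 11}
    ring = {
        "s1": ["s5", "s2"], "s2": ["s1", "s3"], "s3": ["s2", "s4"],
        "s4": ["s3", "s5"], "s5": ["s4", "s1"],
    }
    print(spatial_outliers(volume, ring, theta=1.5))
```

With only five stations the standardized scores stay small, so the example uses a low θ. A real data set with hundreds of stations would support a stricter threshold.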
Few points
First, the neighborhood can be selected based on a fixed cardinality or a fixed graph distance or a fixed Euclidean distance.
Second, the choice of neighborhood aggregate function can be mean, variance, or auto-correlation.
Third, the choice for comparing a location with its neighbors can be either just a number or a vector of attribute values.
Finally, the base distribution for the test statistic can be chosen to be a normal distribution.
Algorithms
• Test Parameters Computation (TPC) Algorithm
• Route Outlier Detection (ROD) Algorithm
Software
http://www.cs.umn.edu/research/shashi-group/vis/traffic_volumemap2.htm
http://www.cs.umn.edu/research/shashi-group/vis/DataCube.htm
Visualization and Data Mining techniques
Thank you!!!!