Data Mining Solutions (Westphal & Blaxton, 1998)
Dr. K. Palaniappan
Oct. 7, 1999 Dr. K. Palaniappan 2
Data Mining Tasks (review)
  (1) Classification or identification - automatically label input records
  (2) Estimation or regression - predict magnitude of response or other missing field given input records
  (3) Segmentation or clustering - group the input records into meaningful sub-populations
  (4) Description or visualization - looking for gems and diamonds among pebbles
Data Mining Tasks/Algorithms (handout)
  Classification - supervised induction
    Analyze historical data to create a model that can predict future behavior
    Common tools - neural networks, decision trees, if-then-else rules
  Clustering - partition database into similar groups or segments
    Expert interpretation and modification of clusters needed
  Association - establish relationships about items occurring together
  Sequence Discovery - identification of associations over time
  Visualization - understanding relationships using visual methods
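As a rough illustration of the association task (items occurring together), the sketch below counts how often each pair of items shares a transaction and keeps the pairs above a support threshold. The basket data, the function names and the 0.5 threshold are illustrative assumptions, not taken from the handout.

```python
from collections import Counter
from itertools import combinations


def pair_counts(transactions):
    """Count how often each unordered pair of items shares a transaction."""
    counts = Counter()
    for items in transactions:
        # sorted(set(...)) drops repeats and gives every pair one canonical order
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts


def pair_support(transactions, min_support=0.5):
    """Return pairs whose share of all transactions is at least min_support."""
    n = len(transactions)
    return {
        pair: count / n
        for pair, count in pair_counts(transactions).items()
        if count / n >= min_support
    }


# Hypothetical market-basket data, for illustration only
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "jam"],
    ["bread", "milk", "jam"],
]
print(pair_support(baskets))
```

Sequence discovery could reuse the same counting idea on time-ordered pairs, which is not shown here.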
Data Mining Tasks/Algorithms (handout)
  Cluster analysis
  Linkage analysis
  Time series analysis
  Categorization analysis
  Visualization
  Algorithms/Technologies
    Neural networks
    Decision trees
    Time series (?)
    Genetic algorithms
    Hybrid approaches
    Fuzzy logic
    Statistics
Data Modeling (elaboration)
  Data abstraction
    Grouping, binning, categorization, histogramming of data useful for summarization
    1-D vs 2-D vs higher dimensional data
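A minimal sketch of binning and histogramming for summarization, in 1-D and 2-D, assuming fixed-width bins. The function names, the sample ages and the bin widths are hypothetical choices made for this example.

```python
from collections import Counter


def bin_index(value, low, width):
    """Index of the fixed-width bin that holds value (bin 0 starts at low)."""
    return int((value - low) // width)


def histogram(values, low, width):
    """1-D histogram: map (bin_start, bin_end) to the number of values inside."""
    counts = Counter(bin_index(v, low, width) for v in values)
    return {
        (low + i * width, low + (i + 1) * width): counts[i]
        for i in sorted(counts)
    }


def histogram_2d(points, low, width):
    """2-D histogram: map (x_bin, y_bin) index pairs to point counts."""
    return Counter(
        (bin_index(x, low[0], width[0]), bin_index(y, low[1], width[1]))
        for x, y in points
    )


# Hypothetical ages, summarized in ten-year bins starting at 20
ages = [23, 27, 31, 35, 38, 41, 59]
print(histogram(ages, low=20, width=10))
```

Higher-dimensional data can be binned the same way by extending the index tuple, at the cost of many sparsely filled bins.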
Data Modeling (elaboration)
  Descriptive data
    State-based knowledge
    Set of attributes used to describe discrete objects
    Declarative information - organization structures, credit reports, vendor profiles
  Transactional data
    Episodic info about time and place of events
    Links between object classes to represent traits or conditions
Problem Definition
  Knowledge representation using hierarchical frameworks
    Objects --> Relationships --> Networks --> Applications --> Systems
  Procedural vs declarative knowledge
    Episodic data tagged with temporal and spatial information (sequence, “knowing how”)
    Semantic data more commonly analyzed (factual, “knowing that”)
  Metaknowledge
Data Preparation & Analysis
  Define data mining goals
  Planning questions:
    Ready access to all data sources
    Data format
    Integration of data from multiple sources and data bases
    Visual or nonvisual analytical methods
    Visualization for display
    Important patterns
Sample Datasets
  http://www.kdnuggets.com/datasets.html (Fedstats, Statlog, UC Irvine, KDD)
  ftp://208.144.240.175/kddcup
  ftp://www.epsilon.com/kddcup (fund raising mailing response dataset)
Accessing and Preparing Data
  Capitalization, concatenation, representation format, augmentation, abstraction, unit conversion, exclusion
  Limiting scope - select appropriate dimensions to extract
  Structuring extractions - number of records and time
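A few of the preparation steps listed above (capitalization, concatenation, unit conversion, exclusion) can be sketched on a single record as follows. The field names `first`, `last` and `weight_lb`, and the rule that a record without a last name is excluded, are assumptions made only for this example.

```python
LB_TO_KG = 0.45359237  # exact definition of the avoirdupois pound in kilograms


def prepare(record):
    """Return a cleaned copy of one raw record, or None to exclude it."""
    if not record.get("last"):
        return None  # exclusion: no usable last name (hypothetical rule)
    first = record.get("first", "").strip().title()  # capitalization
    last = record["last"].strip().title()
    return {
        "full_name": f"{first} {last}".strip(),  # concatenation
        "weight_kg": round(record["weight_lb"] * LB_TO_KG, 1),  # unit conversion
    }


# Hypothetical raw records
raw = [
    {"first": " aLICE ", "last": "SMITH", "weight_lb": 150},
    {"first": "bob", "last": "", "weight_lb": 180},
]
# Limiting scope: keep only the prepared fields of records that pass exclusion
prepared = [row for row in map(prepare, raw) if row is not None]
print(prepared)
```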
Accessing and Preparing Data
  Extraction using data sampling vs report generation (examining the entire dataset)
  Maintaining consistency and integrity - keep track of processing history, data keys, query generation code
  Data sources and preprocessing - databases (SAP, Oracle, Peoplesoft, Access, FoxPro, LotusNotes, etc), word processors, spreadsheets
Accessing and Preparing Data
  Data integration
    Multisource
    Multiformat
    Multiplatform
    Multisecurity
    Multimedia
    Multiaccess
  Converting data
    Long and short data structures
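The slide does not define "long" and "short" data structures. The sketch below assumes that a long structure holds one row per (key, attribute, value) observation and a short structure holds one record per key with all its attributes, and shows a conversion in both directions. The call-record fields are hypothetical.

```python
def long_to_short(rows):
    """Collapse (key, attribute, value) rows into one record per key."""
    short = {}
    for key, attribute, value in rows:
        short.setdefault(key, {})[attribute] = value
    return short


def short_to_long(short):
    """Expand one-record-per-key data back into (key, attribute, value) rows."""
    return [
        (key, attribute, value)
        for key, attributes in short.items()
        for attribute, value in attributes.items()
    ]


# Hypothetical call records in long form
long_rows = [
    ("call-1", "caller", "555-0100"),
    ("call-1", "minutes", 12),
    ("call-2", "caller", "555-0101"),
    ("call-2", "minutes", 3),
]
short_records = long_to_short(long_rows)
print(short_records)
```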
Accessing and Preparing Data
  Data cleanup
    Up to 80% of time in data mining process
    Errors - data entry (mistyping, incomplete screens), missing data, incompatible formats, tampering/improper coding
    Disambiguation
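One cleanup step, reconciling incompatible date formats and reporting missing or unparseable entries, might look like the sketch below. The three accepted formats and the issue labels are assumptions for illustration. A real dataset would need its own list, and ambiguous day/month orders cannot be resolved this way.

```python
from datetime import datetime

# Formats assumed to occur in the source data (hypothetical list)
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d")


def clean_date(text):
    """Return (iso_date, issue); exactly one of the two is None."""
    if text is None or not text.strip():
        return None, "missing"
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # this format does not match, try the next one
        return parsed.date().isoformat(), None
    return None, "bad format"


for raw in ["10/07/1999", "1999-10-07", "19991007", "", "7th Oct"]:
    print(repr(raw), "->", clean_date(raw))
```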
Visual Methods for Analyzing Data
  Discover overall trends
  Discover smaller hidden patterns
  Make unbiased observations/descriptions about data
  Cognitive limitations
    Short term memory attentional limitation (absorbing multiple pages of tabular or text-based output)
    Long-term memory - reliance on associations not being presented
Cognitive Strengths
  Linkage analysis - e.g. telephone calling patterns
  Scheme-based visualization
  Positioning algorithms - reveal object clustering, hierarchical relationships, organizational networks, geopositional or landscape displays
Cognitive Strengths
  Manipulating display characteristics of objects or records
    Source data -> Data object -> Object attributes -> Visualization
    Attributes - color, shape, size, x-pos, y-pos, elevation, intensity, alignment, label, image, orientation, link
  Coding attribute information - up to 20 or more dimensions can be displayed
Analyzing Structural Features
  Out-of-bounds values - e.g. landscape display or scatter diagram of trauma patients
  Missing data - e.g. clustering of cellular communications data
  Anomalous data - e.g. unusual airline flights
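The slide treats these features visually. As a non-visual counterpart, the sketch below flags out-of-bounds and missing values per record so they could be highlighted in a display. The field names and the bounds are hypothetical, not clinical reference ranges, and anomalous-but-in-range data is not covered.

```python
# Hypothetical plausibility bounds per field, for illustration only
BOUNDS = {"age": (0, 120), "systolic_bp": (40, 300)}


def flag_record(record, bounds=BOUNDS):
    """Return {field: problem} for every missing or out-of-bounds field."""
    flags = {}
    for field, (low, high) in bounds.items():
        value = record.get(field)
        if value is None:
            flags[field] = "missing"
        elif not low <= value <= high:
            flags[field] = "out of bounds"
    return flags


# Hypothetical patient records
patients = [
    {"age": 34, "systolic_bp": 120},
    {"age": 340, "systolic_bp": 110},  # likely a data entry error
    {"age": 51},                       # blood pressure not recorded
]
for patient in patients:
    print(patient, "->", flag_record(patient))
```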
Analyzing Network Structures
  Object - link - network
  Interconnectivity
  Articulation points - data objects that connect two or more subnetworks, e.g. detecting bottlenecks
  Identification of subnetworks or discrete networks
  Missing connections - entities detached from main network
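Articulation points can also be found computationally. The sketch below uses the standard depth-first-search low-link method on an undirected graph without duplicate links. This is a common textbook algorithm chosen for illustration, not a method attributed to the book, and the example network is hypothetical.

```python
def build_graph(links):
    """Build an undirected adjacency map from (a, b) link pairs."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def articulation_points(graph):
    """Return the nodes whose removal would split their subnetwork."""
    discovered, low, result = {}, {}, set()
    counter = [0]

    def visit(node, parent):
        discovered[node] = low[node] = counter[0]
        counter[0] += 1
        children = 0
        for neighbour in graph[node]:
            if neighbour == parent:
                continue
            if neighbour in discovered:
                # back edge: node can reach an earlier-discovered object
                low[node] = min(low[node], discovered[neighbour])
            else:
                children += 1
                visit(neighbour, node)
                low[node] = min(low[node], low[neighbour])
                # the child's subtree cannot reach above node without node
                if parent is not None and low[neighbour] >= discovered[node]:
                    result.add(node)
        if parent is None and children > 1:
            result.add(node)  # a DFS root with several subtrees

    for node in graph:
        if node not in discovered:
            visit(node, None)
    return result


# Two triangles that share only object C (hypothetical network)
links = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("D", "E"), ("C", "E")]
print(articulation_points(build_graph(links)))
```

The recursive form is adequate for small networks. Very large ones would need an iterative traversal to stay within Python's recursion limit.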
Analyzing Network Structures
  Strong/Weak linkages - strength of relationships within the network
  Fan-Out frequency - degree of connectivity, good indicator of unusual behavior
  Pathway analysis - connectivity of objects across a series of linkages
  Commonality linkages - e.g. fraud detection, reducing marketing costs, minimizing transportation and delivery costs
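Two of the measures above lend themselves to a short sketch: fan-out as the number of distinct objects each source links to, and pathway analysis as a breadth-first search for the shortest chain of links between two objects. The call data and function names are hypothetical, and treating links as undirected for the path search is an assumption.

```python
from collections import deque


def fan_out(links):
    """Number of distinct targets each source links to (directed links)."""
    targets = {}
    for source, target in links:
        targets.setdefault(source, set()).add(target)
    return {source: len(found) for source, found in targets.items()}


def shortest_path(links, start, goal):
    """Breadth-first search over undirected links; a node list or None."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:  # walk back to the start
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


# Hypothetical telephone calls as (caller, callee) pairs
calls = [("A", "B"), ("A", "C"), ("A", "D"), ("D", "E"), ("E", "F")]
print(fan_out(calls))
print(shortest_path(calls, "B", "F"))
```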
Analyzing Network Structures
  Emergent patterns of connectivity
    Groups, liaisons, attached isolates, detached isolates
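The slide names these patterns without defining them. The sketch below assumes a detached isolate is an object with no links, an attached isolate is an object with exactly one link, and a group can be approximated by a connected subnetwork. These are working definitions adopted for the example only, and liaisons are not handled.

```python
def classify(nodes, links):
    """Return (subnetworks, attached_isolates, detached_isolates)."""
    graph = {node: set() for node in nodes}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    detached = {n for n, neighbours in graph.items() if not neighbours}
    attached = {n for n, neighbours in graph.items() if len(neighbours) == 1}

    subnetworks, seen = [], set()
    for node in graph:
        if node in seen or node in detached:
            continue
        component, stack = set(), [node]
        while stack:  # depth-first walk of one connected subnetwork
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(graph[current] - component)
        seen |= component
        subnetworks.append(component)
    return subnetworks, attached, detached


# Hypothetical network: a triangle, one object hanging off it, one unlinked
nodes = ["A", "B", "C", "D", "E"]
links = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
print(classify(nodes, links))
```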
Analyzing Temporal Patterns
  Trend
  Cycle
  Seasonal
  Irregular
  Absolute time cycle of events
  Contiguous time cycle event
  Visualizing temporal patterns
  Anacapa presentation methods
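The trend and seasonal components listed above can be separated with a simple additive sketch: a centered moving average estimates the trend, and the average deviation from that trend at each position in the period estimates the seasonal effect. The additive model, the odd window and the sample series are assumptions for illustration. Whatever is left after removing both would be the irregular component.

```python
def moving_average(series, window):
    """Centered moving average; window must be odd. The ends stay None."""
    half = window // 2
    trend = [None] * len(series)
    for i in range(half, len(series) - half):
        trend[i] = sum(series[i - half:i + half + 1]) / window
    return trend


def seasonal_offsets(series, trend, period):
    """Mean deviation from the trend at each position within the period."""
    sums = [0.0] * period
    counts = [0] * period
    for i, (value, level) in enumerate(zip(series, trend)):
        if level is not None:
            sums[i % period] += value - level
            counts[i % period] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Hypothetical series: a rising trend plus a repeating 3-step pattern
series = [1, 1, 1, 4, 4, 4, 7, 7, 7]
trend = moving_average(series, window=3)
print(trend)
print(seasonal_offsets(series, trend, period=3))
```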