data mining: efficiently extracting interpretable and ...taoli/taoli-talk-1.pdf · data mining:...

42
1 Data Mining: Efficiently Extracting Interpretable and Actionable Patterns Tao Li School of Computing and Information Sciences Florida International University

Upload: hoangdang

Post on 10-Apr-2018

236 views

Category:

Documents


4 download

TRANSCRIPT

Page 1: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

1

Data Mining Efficiently Extracting Interpretable and Actionable Patterns

Tao Li School of Computing and Information Sciences

Florida International University

2

Data Mining CharacteristicsWe are drowning in information

But starving for knowledge------John Naisbett

Data Mining

An Engineering Process

An Interdisciplinary Field

A Collection of Functionalities

A Combination of Theory and Application

bull Data mining (knowledge discovery from data) ndash Efficiently extracting interpretable and actionable

patternsknowledge from large amount of data collectionsndash Patterns are Interesting

bull non-trivial implicit previously unknown and potentially useful

3

Research Challengesbull Efficiently extracting interpretable and actionable patterns from large amount of data

collectionsndash Performance efficiency effectiveness and scalability

bull Novel algorithmsndash High dimensional large volumendash Efficient data structures

bull Parallel distributed and incremental mining methods

ndash Interpretable and actionable patternsbull Pattern evaluation Incorporation of background knowledge bull Knowledge fusion Integration of the discovered knowledge with existing onebull User interaction

ndash Expression and visualization of data mining resultsndash Interactive mining of knowledge at multiple levels of abstraction

bull Complex structure mining (eg graph relational data)bull Mining from heterogeneous information sourcesbull Mining in Social Network Setting

ndash Linked data between emails webpages blogs citations sequencesndash Static and dynamic structural behavior

bull Security Privacy and Data Integrity

bull New Applicationsndash Anti-terrorism Search advertising Recommendation systems Computing system

management Personalized Search

4

My Research DirectionsResearch Interests

Data Mining Machine Learning and Information Retrieval

Data Mining Machine Learning and Information Retrieval

Practical ProblemsPractical Problems

Generic ToolsGeneric ToolsTheoryTheory

Tools and LibrariesWorkflow and System

ManagementEvent MiningLog ParserVisualization Tools

Personal Music AssistantTools for Bioinformatics

Gene SelectionTissue Classification

Sequence analysisLiterature Assistant

Algorithm IssuesMatrix-based Learning

Framework

Semi-supervised LearningLearning from

heterogeneous data types

Learning from non-vectorial data

Stream data mining

Incremental learning

ApplicationsComputing System Management

Help Desk ApplicationsSelf-managing SystemsSecurity Management

Music Information RetrievalPersonalized Music Service

Adaptive Workflow ManagementProcess MiningWorkflow Optimization

BioinformaticsMicroarray data analysisSequence Analysis

5

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

6

Example of Call Center Cases

bull Adding more memory on XXXXXXbull Email (a wrong setting)bull Scanner Accountbull problem with encrypted telnetbull PDFs not opening in IE

ndash S What is your examplendash C An example httpexamplepdfndash S What is your environmentndash C Windows XP with Acrobat Standard ndash S Running a repair on Windowsndash C It doesnrsquot workndash S Uninstalling Acrobat Standard and Reader and then

reinstalling the Readerndash C It doesnrsquot work eitherndash S Install two Microsoft updates for IEndash C It works

7

Motivationbull Problems

ndash Symptoms must be manually described to help-deskndash Help-desk needs to determine solution - difficult to compare with all previous

problemsndash Difficult for one help-desk employee to gain from anothers experiencendash Difficult to determine if solution worked successfully - User feedback may not be

accurate

bull How can data miningmachine learning helpndash Malfunction diagnoses call distribution skill-based routing outbound targeting etcndash 90 of calls are due to the most frequent 10 of problemsndash Problem analysis based on known casesndash Augment known cases when a new case is solvedndash Significantly reducing the cost and shortening time needed to diagnose malfunction

issues improve customer satisfaction

8

Two Different Approaches

bull Semantic-based Case Retrievalndash Using Semantic features to retrieval the most

similar past cases

bull Problem Probingndash Find probequestion set which maximizes

utility (benefit ndash cost)

9

Semantic Feature ExtractionTo detect sentence relations

sentence 1

sentence 2SENNA

(NEC Parser)

frames of sentence 1

frames of sentence 2

finding word relations among words with the same semantic role in each frame pair using

WordNet

Features

Semantic Roles For certain verbs of motionArg0 causer of motion Arg3 start point Arg1 thing in motion Arg4 end pointArg2 distance moved Arg5 direction-- eg [Arg0 I] [ArgM-MOD can][ArgM-NEG not] [rel receive] [Arg1 emails]

Subtypes of ArgM modifier tagLOC location CAU causeEXT extent TMP timeDIS discourse connectives PNC purposeADV general-purpose MNR mannerNEG negation marker DIR directionMOD modal verb

Word Relations WordNet 30 (147306 words)20 word relations egidentical synonyms antonyms attribute hypernym and hyponym-- A is a hypernym of B if B is a kind of Ameronym and holonym-- A is a meronym of B if A is a part or member of B

verb

frame

FeaturesArg0_identicalArg0_synonymsArg0_hypernymArg0_meronymrel_identicalrel_synonymsrel_hypernymRel_meronymArg1_identical

hellip hellip

10

Case Study ndash Online Help Desk

Interface

Most related emails

15496txt 18697txt 11002txt3191txt

Can you switch these two computers

Slides from Dingding Wang

11

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

12

SQL-Query-Result NavigationWhat mutual funds to invest There are thousands of them

High return Low risk Low expense hellip

Information overloading problem users often can not form a query that returns answers exactly matching their needs

13

Existing Solutions ndashIterative Refinement

bull Iterative refinementndash Find all funds with 3 Year Return gt 10ndash Find all funds with 3 Year Return gt 10 and 1

Year Return gt 10ndash Find all funds with 3 Year and 1 Year Return gt

10 and expense lt 2ndash hellip

bull Problem too time consuming

14

Existing Solutions - Ranking

bull Rankingndash Similar to TFIDF (Agrawal et al CIDR 2003)ndash Probabilistic ranking (Chaudhuri et al VLDB 2004)ndash Ordering attributes (Das et al SIGMOD 2006)ndash Context-sensitive (Agrawal et al SIGMOD 2006)ndash Based on object relationships (Chakrabarti et al SIGMOD

2006)ndash A lot of work in IR field (eg Shen et al SIGIR 2005)

15

Existing Solutions - Ranking

bull Problem the same ranking function for all users (except the context sensitive work)ndash User 1 prefers high returnsndash User 2 prefers lower risks (often means lower

returns)ndash User 3 prefers a mix of high return funds and low

risk fundsndash One ranking function can not fit all

16

Existing Solutions -Categorization

Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)

The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)

User can navigate this tree and show records in a node

Categorization is complementary to ranking (rank records in a node)

Root(193)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

0 lt= 3 Yr return lt 10 (93)

17

Existing Solutions -Categorization

Problem differences between users are not consideredRoot(193)

0 lt= 3 Yr return lt 10 (93)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

10 lt=1 Yr return lt 20 (55)

10 lt= Std lt 20 (48)

5 lt= Std lt 10 (7)

0 lt= 1 Yr return lt 10 (63)

0 lt= Std lt 5 (38)

User 1 3

User 2 3 User 2 3

This tree does not help user 2 and 3 much

We try to address this issue for categorization

18

Address Diverse User Preferences (Chen amp Li SIGMOD 2007)

bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference

(eg high return funds low risk funds)

bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters

19

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

20

System Log Databull Performance Metric

ndash CPUMemorySwap utilization workload average response timebull Application-level Log

ndash Windows NT system log Websphere application log Oracle activitybull Failure data

ndash System crash dumps error reportsbull Network Traffic Data

ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces

bull Request Datandash Apache and Microsoft IIS log

bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures

bull Othersndash Network-based alert logs Probes

21

logs

logs

logs

HistoricalData Collection

Realtime Analysis

Offline Analysis

Knowledge Base

Real Time Management

Anomaly Detection

Fault Diagnosis

Problem Determination

CorrelationDependencyKnowledge

SummarizationVisualizationRule Construction

Temporal Pattern DiscoveryKnowledge Management

PlanningActions

The Integrated Framework on Data-Driven Computing System Management

Log Data Organization

Componentlogs Active Data Collection

Log Adapter

Situation Identification

and Categorization

22

Raw data-gtData table

Aggregationsummarization

Event Plots

Pattern MiningPattern Mining

OutagesTemporal patternsEvent bursts

Morning rebootingperiodic patterns

Weekly rebootingPeriodic patterns

Hosts

Knowledge validation completion and construction

23

A Case Study

State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State

24

Pattern Visualization in the Case Study

25

ERN in the Case Study

26

Other Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis

27

bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key

Performance Indicators) to support better decision-making

bull Focus leading indicator discovery

ndash Using operational metrics to adaptively manage amp improve workflow efficiency

bull Focus workflow exception detection and tracing

28

Examples of KPI Types

Leading Indicators(Value Drivers)

Lagging Indicators(Outcomes )

Number of clients that sales people meet face to face each week

Sales revenue

Complex repairs completed successfully during the first call or visit

Customer satisfaction

Number of parts for which orders exceed forecasts within 30 days of scheduled delivery

Per unit manufacturing costs

Number of customers who are delinquent paying their first bill

Customer churn

29

Methodology for Leading Indicator Learning

Selected Leading Indicators or Leading Indicator Combinations

clustering

Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality

hellip

hellip

hellipThe root KPI may be the key leading indicator

Time seriesproduction model(eg assumption data model)

Leading Indicator Selection based on Domain Knowledge

sequence KPIs

Reduced Time series

define KPIs

correlation

dimensionalityreduction

If no update modelIs Model Valid

New Business Concern(s)

If yes

Leading Indicators

Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events

30

Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 2: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

2

Data Mining CharacteristicsWe are drowning in information

But starving for knowledge------John Naisbett

Data Mining

An Engineering Process

An Interdisciplinary Field

A Collection of Functionalities

A Combination of Theory and Application

bull Data mining (knowledge discovery from data) ndash Efficiently extracting interpretable and actionable

patternsknowledge from large amount of data collectionsndash Patterns are Interesting

bull non-trivial implicit previously unknown and potentially useful

3

Research Challengesbull Efficiently extracting interpretable and actionable patterns from large amount of data

collectionsndash Performance efficiency effectiveness and scalability

bull Novel algorithmsndash High dimensional large volumendash Efficient data structures

bull Parallel distributed and incremental mining methods

ndash Interpretable and actionable patternsbull Pattern evaluation Incorporation of background knowledge bull Knowledge fusion Integration of the discovered knowledge with existing onebull User interaction

ndash Expression and visualization of data mining resultsndash Interactive mining of knowledge at multiple levels of abstraction

bull Complex structure mining (eg graph relational data)bull Mining from heterogeneous information sourcesbull Mining in Social Network Setting

ndash Linked data between emails webpages blogs citations sequencesndash Static and dynamic structural behavior

bull Security Privacy and Data Integrity

bull New Applicationsndash Anti-terrorism Search advertising Recommendation systems Computing system

management Personalized Search

4

My Research DirectionsResearch Interests

Data Mining Machine Learning and Information Retrieval

Data Mining Machine Learning and Information Retrieval

Practical ProblemsPractical Problems

Generic ToolsGeneric ToolsTheoryTheory

Tools and LibrariesWorkflow and System

ManagementEvent MiningLog ParserVisualization Tools

Personal Music AssistantTools for Bioinformatics

Gene SelectionTissue Classification

Sequence analysisLiterature Assistant

Algorithm IssuesMatrix-based Learning

Framework

Semi-supervised LearningLearning from

heterogeneous data types

Learning from non-vectorial data

Stream data mining

Incremental learning

ApplicationsComputing System Management

Help Desk ApplicationsSelf-managing SystemsSecurity Management

Music Information RetrievalPersonalized Music Service

Adaptive Workflow ManagementProcess MiningWorkflow Optimization

BioinformaticsMicroarray data analysisSequence Analysis

5

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

6

Example of Call Center Cases

bull Adding more memory on XXXXXXbull Email (a wrong setting)bull Scanner Accountbull problem with encrypted telnetbull PDFs not opening in IE

ndash S What is your examplendash C An example httpexamplepdfndash S What is your environmentndash C Windows XP with Acrobat Standard ndash S Running a repair on Windowsndash C It doesnrsquot workndash S Uninstalling Acrobat Standard and Reader and then

reinstalling the Readerndash C It doesnrsquot work eitherndash S Install two Microsoft updates for IEndash C It works

7

Motivationbull Problems

ndash Symptoms must be manually described to help-deskndash Help-desk needs to determine solution - difficult to compare with all previous

problemsndash Difficult for one help-desk employee to gain from anothers experiencendash Difficult to determine if solution worked successfully - User feedback may not be

accurate

bull How can data miningmachine learning helpndash Malfunction diagnoses call distribution skill-based routing outbound targeting etcndash 90 of calls are due to the most frequent 10 of problemsndash Problem analysis based on known casesndash Augment known cases when a new case is solvedndash Significantly reducing the cost and shortening time needed to diagnose malfunction

issues improve customer satisfaction

8

Two Different Approaches

bull Semantic-based Case Retrievalndash Using Semantic features to retrieval the most

similar past cases

bull Problem Probingndash Find probequestion set which maximizes

utility (benefit ndash cost)

9

Semantic Feature ExtractionTo detect sentence relations

sentence 1

sentence 2SENNA

(NEC Parser)

frames of sentence 1

frames of sentence 2

finding word relations among words with the same semantic role in each frame pair using

WordNet

Features

Semantic Roles For certain verbs of motionArg0 causer of motion Arg3 start point Arg1 thing in motion Arg4 end pointArg2 distance moved Arg5 direction-- eg [Arg0 I] [ArgM-MOD can][ArgM-NEG not] [rel receive] [Arg1 emails]

Subtypes of ArgM modifier tagLOC location CAU causeEXT extent TMP timeDIS discourse connectives PNC purposeADV general-purpose MNR mannerNEG negation marker DIR directionMOD modal verb

Word Relations WordNet 30 (147306 words)20 word relations egidentical synonyms antonyms attribute hypernym and hyponym-- A is a hypernym of B if B is a kind of Ameronym and holonym-- A is a meronym of B if A is a part or member of B

verb

frame

FeaturesArg0_identicalArg0_synonymsArg0_hypernymArg0_meronymrel_identicalrel_synonymsrel_hypernymRel_meronymArg1_identical

hellip hellip

10

Case Study ndash Online Help Desk

Interface

Most related emails

15496txt 18697txt 11002txt3191txt

Can you switch these two computers

Slides from Dingding Wang

11

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

12

SQL-Query-Result NavigationWhat mutual funds to invest There are thousands of them

High return Low risk Low expense hellip

Information overloading problem users often can not form a query that returns answers exactly matching their needs

13

Existing Solutions ndashIterative Refinement

bull Iterative refinementndash Find all funds with 3 Year Return gt 10ndash Find all funds with 3 Year Return gt 10 and 1

Year Return gt 10ndash Find all funds with 3 Year and 1 Year Return gt

10 and expense lt 2ndash hellip

bull Problem too time consuming

14

Existing Solutions - Ranking

bull Rankingndash Similar to TFIDF (Agrawal et al CIDR 2003)ndash Probabilistic ranking (Chaudhuri et al VLDB 2004)ndash Ordering attributes (Das et al SIGMOD 2006)ndash Context-sensitive (Agrawal et al SIGMOD 2006)ndash Based on object relationships (Chakrabarti et al SIGMOD

2006)ndash A lot of work in IR field (eg Shen et al SIGIR 2005)

15

Existing Solutions - Ranking

bull Problem the same ranking function for all users (except the context sensitive work)ndash User 1 prefers high returnsndash User 2 prefers lower risks (often means lower

returns)ndash User 3 prefers a mix of high return funds and low

risk fundsndash One ranking function can not fit all

16

Existing Solutions -Categorization

Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)

The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)

User can navigate this tree and show records in a node

Categorization is complementary to ranking (rank records in a node)

Root(193)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

0 lt= 3 Yr return lt 10 (93)

17

Existing Solutions -Categorization

Problem differences between users are not consideredRoot(193)

0 lt= 3 Yr return lt 10 (93)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

10 lt=1 Yr return lt 20 (55)

10 lt= Std lt 20 (48)

5 lt= Std lt 10 (7)

0 lt= 1 Yr return lt 10 (63)

0 lt= Std lt 5 (38)

User 1 3

User 2 3 User 2 3

This tree does not help user 2 and 3 much

We try to address this issue for categorization

18

Address Diverse User Preferences (Chen amp Li SIGMOD 2007)

bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference

(eg high return funds low risk funds)

bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters

19

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

20

System Log Databull Performance Metric

ndash CPUMemorySwap utilization workload average response timebull Application-level Log

ndash Windows NT system log Websphere application log Oracle activitybull Failure data

ndash System crash dumps error reportsbull Network Traffic Data

ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces

bull Request Datandash Apache and Microsoft IIS log

bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures

bull Othersndash Network-based alert logs Probes

21

logs

logs

logs

HistoricalData Collection

Realtime Analysis

Offline Analysis

Knowledge Base

Real Time Management

Anomaly Detection

Fault Diagnosis

Problem Determination

CorrelationDependencyKnowledge

SummarizationVisualizationRule Construction

Temporal Pattern DiscoveryKnowledge Management

PlanningActions

The Integrated Framework on Data-Driven Computing System Management

Log Data Organization

Componentlogs Active Data Collection

Log Adapter

Situation Identification

and Categorization

22

Raw data-gtData table

Aggregationsummarization

Event Plots

Pattern MiningPattern Mining

OutagesTemporal patternsEvent bursts

Morning rebootingperiodic patterns

Weekly rebootingPeriodic patterns

Hosts

Knowledge validation completion and construction

23

A Case Study

State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State

24

Pattern Visualization in the Case Study

25

ERN in the Case Study

26

Other Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis

27

bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key

Performance Indicators) to support better decision-making

bull Focus leading indicator discovery

ndash Using operational metrics to adaptively manage amp improve workflow efficiency

bull Focus workflow exception detection and tracing

28

Examples of KPI Types

Leading Indicators(Value Drivers)

Lagging Indicators(Outcomes )

Number of clients that sales people meet face to face each week

Sales revenue

Complex repairs completed successfully during the first call or visit

Customer satisfaction

Number of parts for which orders exceed forecasts within 30 days of scheduled delivery

Per unit manufacturing costs

Number of customers who are delinquent paying their first bill

Customer churn

29

Methodology for Leading Indicator Learning

Selected Leading Indicators or Leading Indicator Combinations

clustering

Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality

hellip

hellip

hellipThe root KPI may be the key leading indicator

Time seriesproduction model(eg assumption data model)

Leading Indicator Selection based on Domain Knowledge

sequence KPIs

Reduced Time series

define KPIs

correlation

dimensionalityreduction

If no update modelIs Model Valid

New Business Concern(s)

If yes

Leading Indicators

Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events

30

Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 3: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

3

Research Challengesbull Efficiently extracting interpretable and actionable patterns from large amount of data

collectionsndash Performance efficiency effectiveness and scalability

bull Novel algorithmsndash High dimensional large volumendash Efficient data structures

bull Parallel distributed and incremental mining methods

ndash Interpretable and actionable patternsbull Pattern evaluation Incorporation of background knowledge bull Knowledge fusion Integration of the discovered knowledge with existing onebull User interaction

ndash Expression and visualization of data mining resultsndash Interactive mining of knowledge at multiple levels of abstraction

bull Complex structure mining (eg graph relational data)bull Mining from heterogeneous information sourcesbull Mining in Social Network Setting

ndash Linked data between emails webpages blogs citations sequencesndash Static and dynamic structural behavior

bull Security Privacy and Data Integrity

bull New Applicationsndash Anti-terrorism Search advertising Recommendation systems Computing system

management Personalized Search

4

My Research DirectionsResearch Interests

Data Mining Machine Learning and Information Retrieval

Data Mining Machine Learning and Information Retrieval

Practical ProblemsPractical Problems

Generic ToolsGeneric ToolsTheoryTheory

Tools and LibrariesWorkflow and System

ManagementEvent MiningLog ParserVisualization Tools

Personal Music AssistantTools for Bioinformatics

Gene SelectionTissue Classification

Sequence analysisLiterature Assistant

Algorithm IssuesMatrix-based Learning

Framework

Semi-supervised LearningLearning from

heterogeneous data types

Learning from non-vectorial data

Stream data mining

Incremental learning

ApplicationsComputing System Management

Help Desk ApplicationsSelf-managing SystemsSecurity Management

Music Information RetrievalPersonalized Music Service

Adaptive Workflow ManagementProcess MiningWorkflow Optimization

BioinformaticsMicroarray data analysisSequence Analysis

5

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

6

Example of Call Center Cases

bull Adding more memory on XXXXXXbull Email (a wrong setting)bull Scanner Accountbull problem with encrypted telnetbull PDFs not opening in IE

ndash S What is your examplendash C An example httpexamplepdfndash S What is your environmentndash C Windows XP with Acrobat Standard ndash S Running a repair on Windowsndash C It doesnrsquot workndash S Uninstalling Acrobat Standard and Reader and then

reinstalling the Readerndash C It doesnrsquot work eitherndash S Install two Microsoft updates for IEndash C It works

7

Motivationbull Problems

ndash Symptoms must be manually described to help-deskndash Help-desk needs to determine solution - difficult to compare with all previous

problemsndash Difficult for one help-desk employee to gain from anothers experiencendash Difficult to determine if solution worked successfully - User feedback may not be

accurate

bull How can data miningmachine learning helpndash Malfunction diagnoses call distribution skill-based routing outbound targeting etcndash 90 of calls are due to the most frequent 10 of problemsndash Problem analysis based on known casesndash Augment known cases when a new case is solvedndash Significantly reducing the cost and shortening time needed to diagnose malfunction

issues improve customer satisfaction

8

Two Different Approaches

bull Semantic-based Case Retrievalndash Using Semantic features to retrieval the most

similar past cases

bull Problem Probingndash Find probequestion set which maximizes

utility (benefit ndash cost)

9

Semantic Feature ExtractionTo detect sentence relations

sentence 1

sentence 2SENNA

(NEC Parser)

frames of sentence 1

frames of sentence 2

finding word relations among words with the same semantic role in each frame pair using

WordNet

Features

Semantic Roles For certain verbs of motionArg0 causer of motion Arg3 start point Arg1 thing in motion Arg4 end pointArg2 distance moved Arg5 direction-- eg [Arg0 I] [ArgM-MOD can][ArgM-NEG not] [rel receive] [Arg1 emails]

Subtypes of ArgM modifier tagLOC location CAU causeEXT extent TMP timeDIS discourse connectives PNC purposeADV general-purpose MNR mannerNEG negation marker DIR directionMOD modal verb

Word Relations WordNet 30 (147306 words)20 word relations egidentical synonyms antonyms attribute hypernym and hyponym-- A is a hypernym of B if B is a kind of Ameronym and holonym-- A is a meronym of B if A is a part or member of B

verb

frame

FeaturesArg0_identicalArg0_synonymsArg0_hypernymArg0_meronymrel_identicalrel_synonymsrel_hypernymRel_meronymArg1_identical

hellip hellip

10

Case Study ndash Online Help Desk

Interface

Most related emails

15496txt 18697txt 11002txt3191txt

Can you switch these two computers

Slides from Dingding Wang

11

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

12

SQL-Query-Result NavigationWhat mutual funds to invest There are thousands of them

High return Low risk Low expense hellip

Information overloading problem users often can not form a query that returns answers exactly matching their needs

13

Existing Solutions ndashIterative Refinement

bull Iterative refinementndash Find all funds with 3 Year Return gt 10ndash Find all funds with 3 Year Return gt 10 and 1

Year Return gt 10ndash Find all funds with 3 Year and 1 Year Return gt

10 and expense lt 2ndash hellip

bull Problem too time consuming

14

Existing Solutions - Ranking

bull Rankingndash Similar to TFIDF (Agrawal et al CIDR 2003)ndash Probabilistic ranking (Chaudhuri et al VLDB 2004)ndash Ordering attributes (Das et al SIGMOD 2006)ndash Context-sensitive (Agrawal et al SIGMOD 2006)ndash Based on object relationships (Chakrabarti et al SIGMOD

2006)ndash A lot of work in IR field (eg Shen et al SIGIR 2005)

15

Existing Solutions - Ranking

bull Problem the same ranking function for all users (except the context sensitive work)ndash User 1 prefers high returnsndash User 2 prefers lower risks (often means lower

returns)ndash User 3 prefers a mix of high return funds and low

risk fundsndash One ranking function can not fit all

16

Existing Solutions -Categorization

Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)

The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)

User can navigate this tree and show records in a node

Categorization is complementary to ranking (rank records in a node)

Root(193)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

0 lt= 3 Yr return lt 10 (93)

17

Existing Solutions -Categorization

Problem differences between users are not consideredRoot(193)

0 lt= 3 Yr return lt 10 (93)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

10 lt=1 Yr return lt 20 (55)

10 lt= Std lt 20 (48)

5 lt= Std lt 10 (7)

0 lt= 1 Yr return lt 10 (63)

0 lt= Std lt 5 (38)

User 1 3

User 2 3 User 2 3

This tree does not help user 2 and 3 much

We try to address this issue for categorization

18

Address Diverse User Preferences (Chen amp Li SIGMOD 2007)

bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference

(eg high return funds low risk funds)

bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters

19

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

20

System Log Databull Performance Metric

ndash CPUMemorySwap utilization workload average response timebull Application-level Log

ndash Windows NT system log Websphere application log Oracle activitybull Failure data

ndash System crash dumps error reportsbull Network Traffic Data

ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces

bull Request Datandash Apache and Microsoft IIS log

bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures

bull Othersndash Network-based alert logs Probes

21

logs

logs

logs

HistoricalData Collection

Realtime Analysis

Offline Analysis

Knowledge Base

Real Time Management

Anomaly Detection

Fault Diagnosis

Problem Determination

CorrelationDependencyKnowledge

SummarizationVisualizationRule Construction

Temporal Pattern DiscoveryKnowledge Management

PlanningActions

The Integrated Framework on Data-Driven Computing System Management

Log Data Organization

Componentlogs Active Data Collection

Log Adapter

Situation Identification

and Categorization

22

Raw data-gtData table

Aggregationsummarization

Event Plots

Pattern MiningPattern Mining

OutagesTemporal patternsEvent bursts

Morning rebootingperiodic patterns

Weekly rebootingPeriodic patterns

Hosts

Knowledge validation completion and construction

23

A Case Study

State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State

24

Pattern Visualization in the Case Study

25

ERN in the Case Study

26

Other Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis

27

bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key

Performance Indicators) to support better decision-making

bull Focus leading indicator discovery

ndash Using operational metrics to adaptively manage amp improve workflow efficiency

bull Focus workflow exception detection and tracing

28

Examples of KPI Types

Leading Indicators(Value Drivers)

Lagging Indicators(Outcomes )

Number of clients that sales people meet face to face each week

Sales revenue

Complex repairs completed successfully during the first call or visit

Customer satisfaction

Number of parts for which orders exceed forecasts within 30 days of scheduled delivery

Per unit manufacturing costs

Number of customers who are delinquent paying their first bill

Customer churn

29

Methodology for Leading Indicator Learning

Selected Leading Indicators or Leading Indicator Combinations

clustering

Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality

hellip

hellip

hellipThe root KPI may be the key leading indicator

Time seriesproduction model(eg assumption data model)

Leading Indicator Selection based on Domain Knowledge

sequence KPIs

Reduced Time series

define KPIs

correlation

dimensionalityreduction

If no update modelIs Model Valid

New Business Concern(s)

If yes

Leading Indicators

Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events

30

Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 4: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

4

My Research DirectionsResearch Interests

Data Mining Machine Learning and Information Retrieval

Data Mining Machine Learning and Information Retrieval

Practical ProblemsPractical Problems

Generic ToolsGeneric ToolsTheoryTheory

Tools and LibrariesWorkflow and System

ManagementEvent MiningLog ParserVisualization Tools

Personal Music AssistantTools for Bioinformatics

Gene SelectionTissue Classification

Sequence analysisLiterature Assistant

Algorithm IssuesMatrix-based Learning

Framework

Semi-supervised LearningLearning from

heterogeneous data types

Learning from non-vectorial data

Stream data mining

Incremental learning

ApplicationsComputing System Management

Help Desk ApplicationsSelf-managing SystemsSecurity Management

Music Information RetrievalPersonalized Music Service

Adaptive Workflow ManagementProcess MiningWorkflow Optimization

BioinformaticsMicroarray data analysisSequence Analysis

5

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

6

Example of Call Center Cases

bull Adding more memory on XXXXXXbull Email (a wrong setting)bull Scanner Accountbull problem with encrypted telnetbull PDFs not opening in IE

ndash S What is your examplendash C An example httpexamplepdfndash S What is your environmentndash C Windows XP with Acrobat Standard ndash S Running a repair on Windowsndash C It doesnrsquot workndash S Uninstalling Acrobat Standard and Reader and then

reinstalling the Readerndash C It doesnrsquot work eitherndash S Install two Microsoft updates for IEndash C It works

7

Motivationbull Problems

ndash Symptoms must be manually described to help-deskndash Help-desk needs to determine solution - difficult to compare with all previous

problemsndash Difficult for one help-desk employee to gain from anothers experiencendash Difficult to determine if solution worked successfully - User feedback may not be

accurate

bull How can data miningmachine learning helpndash Malfunction diagnoses call distribution skill-based routing outbound targeting etcndash 90 of calls are due to the most frequent 10 of problemsndash Problem analysis based on known casesndash Augment known cases when a new case is solvedndash Significantly reducing the cost and shortening time needed to diagnose malfunction

issues improve customer satisfaction

8

Two Different Approaches

bull Semantic-based Case Retrievalndash Using Semantic features to retrieval the most

similar past cases

bull Problem Probingndash Find probequestion set which maximizes

utility (benefit ndash cost)

9

Semantic Feature ExtractionTo detect sentence relations

sentence 1

sentence 2SENNA

(NEC Parser)

frames of sentence 1

frames of sentence 2

finding word relations among words with the same semantic role in each frame pair using

WordNet

Features

Semantic Roles For certain verbs of motionArg0 causer of motion Arg3 start point Arg1 thing in motion Arg4 end pointArg2 distance moved Arg5 direction-- eg [Arg0 I] [ArgM-MOD can][ArgM-NEG not] [rel receive] [Arg1 emails]

Subtypes of ArgM modifier tagLOC location CAU causeEXT extent TMP timeDIS discourse connectives PNC purposeADV general-purpose MNR mannerNEG negation marker DIR directionMOD modal verb

Word Relations WordNet 30 (147306 words)20 word relations egidentical synonyms antonyms attribute hypernym and hyponym-- A is a hypernym of B if B is a kind of Ameronym and holonym-- A is a meronym of B if A is a part or member of B

verb

frame

FeaturesArg0_identicalArg0_synonymsArg0_hypernymArg0_meronymrel_identicalrel_synonymsrel_hypernymRel_meronymArg1_identical

hellip hellip

10

Case Study ndash Online Help Desk

Interface

Most related emails

15496txt 18697txt 11002txt3191txt

Can you switch these two computers

Slides from Dingding Wang

11

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

12

SQL-Query-Result NavigationWhat mutual funds to invest There are thousands of them

High return Low risk Low expense hellip

Information overloading problem users often can not form a query that returns answers exactly matching their needs

13

Existing Solutions ndashIterative Refinement

bull Iterative refinementndash Find all funds with 3 Year Return gt 10ndash Find all funds with 3 Year Return gt 10 and 1

Year Return gt 10ndash Find all funds with 3 Year and 1 Year Return gt

10 and expense lt 2ndash hellip

bull Problem too time consuming

14

Existing Solutions - Ranking

bull Rankingndash Similar to TFIDF (Agrawal et al CIDR 2003)ndash Probabilistic ranking (Chaudhuri et al VLDB 2004)ndash Ordering attributes (Das et al SIGMOD 2006)ndash Context-sensitive (Agrawal et al SIGMOD 2006)ndash Based on object relationships (Chakrabarti et al SIGMOD

2006)ndash A lot of work in IR field (eg Shen et al SIGIR 2005)

15

Existing Solutions - Ranking

bull Problem the same ranking function for all users (except the context sensitive work)ndash User 1 prefers high returnsndash User 2 prefers lower risks (often means lower

returns)ndash User 3 prefers a mix of high return funds and low

risk fundsndash One ranking function can not fit all

16

Existing Solutions -Categorization

Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)

The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)

User can navigate this tree and show records in a node

Categorization is complementary to ranking (rank records in a node)

Root(193)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

0 lt= 3 Yr return lt 10 (93)

17

Existing Solutions -Categorization

Problem differences between users are not consideredRoot(193)

0 lt= 3 Yr return lt 10 (93)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

10 lt=1 Yr return lt 20 (55)

10 lt= Std lt 20 (48)

5 lt= Std lt 10 (7)

0 lt= 1 Yr return lt 10 (63)

0 lt= Std lt 5 (38)

User 1 3

User 2 3 User 2 3

This tree does not help user 2 and 3 much

We try to address this issue for categorization

18

Address Diverse User Preferences (Chen amp Li SIGMOD 2007)

bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference

(eg high return funds low risk funds)

bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters

19

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

20

System Log Databull Performance Metric

ndash CPUMemorySwap utilization workload average response timebull Application-level Log

ndash Windows NT system log Websphere application log Oracle activitybull Failure data

ndash System crash dumps error reportsbull Network Traffic Data

ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces

bull Request Datandash Apache and Microsoft IIS log

bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures

bull Othersndash Network-based alert logs Probes

21

logs

logs

logs

HistoricalData Collection

Realtime Analysis

Offline Analysis

Knowledge Base

Real Time Management

Anomaly Detection

Fault Diagnosis

Problem Determination

CorrelationDependencyKnowledge

SummarizationVisualizationRule Construction

Temporal Pattern DiscoveryKnowledge Management

PlanningActions

The Integrated Framework on Data-Driven Computing System Management

Log Data Organization

Componentlogs Active Data Collection

Log Adapter

Situation Identification

and Categorization

22

Raw data-gtData table

Aggregationsummarization

Event Plots

Pattern MiningPattern Mining

OutagesTemporal patternsEvent bursts

Morning rebootingperiodic patterns

Weekly rebootingPeriodic patterns

Hosts

Knowledge validation completion and construction

23

A Case Study

State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State

24

Pattern Visualization in the Case Study

25

ERN in the Case Study

26

Other Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis

27

bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key

Performance Indicators) to support better decision-making

bull Focus leading indicator discovery

ndash Using operational metrics to adaptively manage amp improve workflow efficiency

bull Focus workflow exception detection and tracing

28

Examples of KPI Types

Leading Indicators(Value Drivers)

Lagging Indicators(Outcomes )

Number of clients that sales people meet face to face each week

Sales revenue

Complex repairs completed successfully during the first call or visit

Customer satisfaction

Number of parts for which orders exceed forecasts within 30 days of scheduled delivery

Per unit manufacturing costs

Number of customers who are delinquent paying their first bill

Customer churn

29

Methodology for Leading Indicator Learning

Selected Leading Indicators or Leading Indicator Combinations

clustering

Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality

hellip

hellip

hellipThe root KPI may be the key leading indicator

Time seriesproduction model(eg assumption data model)

Leading Indicator Selection based on Domain Knowledge

sequence KPIs

Reduced Time series

define KPIs

correlation

dimensionalityreduction

If no update modelIs Model Valid

New Business Concern(s)

If yes

Leading Indicators

Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events

30

Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 5: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

5

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

6

Example of Call Center Cases

bull Adding more memory on XXXXXXbull Email (a wrong setting)bull Scanner Accountbull problem with encrypted telnetbull PDFs not opening in IE

ndash S What is your examplendash C An example httpexamplepdfndash S What is your environmentndash C Windows XP with Acrobat Standard ndash S Running a repair on Windowsndash C It doesnrsquot workndash S Uninstalling Acrobat Standard and Reader and then

reinstalling the Readerndash C It doesnrsquot work eitherndash S Install two Microsoft updates for IEndash C It works

7

Motivationbull Problems

ndash Symptoms must be manually described to help-deskndash Help-desk needs to determine solution - difficult to compare with all previous

problemsndash Difficult for one help-desk employee to gain from anothers experiencendash Difficult to determine if solution worked successfully - User feedback may not be

accurate

bull How can data miningmachine learning helpndash Malfunction diagnoses call distribution skill-based routing outbound targeting etcndash 90 of calls are due to the most frequent 10 of problemsndash Problem analysis based on known casesndash Augment known cases when a new case is solvedndash Significantly reducing the cost and shortening time needed to diagnose malfunction

issues improve customer satisfaction

8

Two Different Approaches

bull Semantic-based Case Retrievalndash Using Semantic features to retrieval the most

similar past cases

bull Problem Probingndash Find probequestion set which maximizes

utility (benefit ndash cost)

9

Semantic Feature ExtractionTo detect sentence relations

sentence 1

sentence 2SENNA

(NEC Parser)

frames of sentence 1

frames of sentence 2

finding word relations among words with the same semantic role in each frame pair using

WordNet

Features

Semantic Roles For certain verbs of motionArg0 causer of motion Arg3 start point Arg1 thing in motion Arg4 end pointArg2 distance moved Arg5 direction-- eg [Arg0 I] [ArgM-MOD can][ArgM-NEG not] [rel receive] [Arg1 emails]

Subtypes of ArgM modifier tagLOC location CAU causeEXT extent TMP timeDIS discourse connectives PNC purposeADV general-purpose MNR mannerNEG negation marker DIR directionMOD modal verb

Word Relations WordNet 30 (147306 words)20 word relations egidentical synonyms antonyms attribute hypernym and hyponym-- A is a hypernym of B if B is a kind of Ameronym and holonym-- A is a meronym of B if A is a part or member of B

verb

frame

FeaturesArg0_identicalArg0_synonymsArg0_hypernymArg0_meronymrel_identicalrel_synonymsrel_hypernymRel_meronymArg1_identical

hellip hellip

10

Case Study ndash Online Help Desk

Interface

Most related emails

15496txt 18697txt 11002txt3191txt

Can you switch these two computers

Slides from Dingding Wang

11

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

12

SQL-Query-Result NavigationWhat mutual funds to invest There are thousands of them

High return Low risk Low expense hellip

Information overloading problem users often can not form a query that returns answers exactly matching their needs

13

Existing Solutions ndashIterative Refinement

bull Iterative refinementndash Find all funds with 3 Year Return gt 10ndash Find all funds with 3 Year Return gt 10 and 1

Year Return gt 10ndash Find all funds with 3 Year and 1 Year Return gt

10 and expense lt 2ndash hellip

bull Problem too time consuming

14

Existing Solutions - Ranking

bull Rankingndash Similar to TFIDF (Agrawal et al CIDR 2003)ndash Probabilistic ranking (Chaudhuri et al VLDB 2004)ndash Ordering attributes (Das et al SIGMOD 2006)ndash Context-sensitive (Agrawal et al SIGMOD 2006)ndash Based on object relationships (Chakrabarti et al SIGMOD

2006)ndash A lot of work in IR field (eg Shen et al SIGIR 2005)

15

Existing Solutions - Ranking

bull Problem the same ranking function for all users (except the context sensitive work)ndash User 1 prefers high returnsndash User 2 prefers lower risks (often means lower

returns)ndash User 3 prefers a mix of high return funds and low

risk fundsndash One ranking function can not fit all

16

Existing Solutions -Categorization

Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)

The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)

User can navigate this tree and show records in a node

Categorization is complementary to ranking (rank records in a node)

Root(193)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

0 lt= 3 Yr return lt 10 (93)

17

Existing Solutions -Categorization

Problem differences between users are not consideredRoot(193)

0 lt= 3 Yr return lt 10 (93)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

10 lt=1 Yr return lt 20 (55)

10 lt= Std lt 20 (48)

5 lt= Std lt 10 (7)

0 lt= 1 Yr return lt 10 (63)

0 lt= Std lt 5 (38)

User 1 3

User 2 3 User 2 3

This tree does not help user 2 and 3 much

We try to address this issue for categorization

18

Address Diverse User Preferences (Chen amp Li SIGMOD 2007)

bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference

(eg high return funds low risk funds)

bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters

19

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

20

System Log Databull Performance Metric

ndash CPUMemorySwap utilization workload average response timebull Application-level Log

ndash Windows NT system log Websphere application log Oracle activitybull Failure data

ndash System crash dumps error reportsbull Network Traffic Data

ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces

bull Request Datandash Apache and Microsoft IIS log

bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures

bull Othersndash Network-based alert logs Probes

21

logs

logs

logs

HistoricalData Collection

Realtime Analysis

Offline Analysis

Knowledge Base

Real Time Management

Anomaly Detection

Fault Diagnosis

Problem Determination

CorrelationDependencyKnowledge

SummarizationVisualizationRule Construction

Temporal Pattern DiscoveryKnowledge Management

PlanningActions

The Integrated Framework on Data-Driven Computing System Management

Log Data Organization

Componentlogs Active Data Collection

Log Adapter

Situation Identification

and Categorization

22

Raw data-gtData table

Aggregationsummarization

Event Plots

Pattern MiningPattern Mining

OutagesTemporal patternsEvent bursts

Morning rebootingperiodic patterns

Weekly rebootingPeriodic patterns

Hosts

Knowledge validation completion and construction

23

A Case Study

State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State

24

Pattern Visualization in the Case Study

25

ERN in the Case Study

26

Other Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis

27

bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key

Performance Indicators) to support better decision-making

bull Focus leading indicator discovery

ndash Using operational metrics to adaptively manage amp improve workflow efficiency

bull Focus workflow exception detection and tracing

28

Examples of KPI Types

Leading Indicators(Value Drivers)

Lagging Indicators(Outcomes )

Number of clients that sales people meet face to face each week

Sales revenue

Complex repairs completed successfully during the first call or visit

Customer satisfaction

Number of parts for which orders exceed forecasts within 30 days of scheduled delivery

Per unit manufacturing costs

Number of customers who are delinquent paying their first bill

Customer churn

29

Methodology for Leading Indicator Learning

Selected Leading Indicators or Leading Indicator Combinations

clustering

Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality

hellip

hellip

hellipThe root KPI may be the key leading indicator

Time seriesproduction model(eg assumption data model)

Leading Indicator Selection based on Domain Knowledge

sequence KPIs

Reduced Time series

define KPIs

correlation

dimensionalityreduction

If no update modelIs Model Valid

New Business Concern(s)

If yes

Leading Indicators

Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events

30

Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 6: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

6

Example of Call Center Cases

bull Adding more memory on XXXXXXbull Email (a wrong setting)bull Scanner Accountbull problem with encrypted telnetbull PDFs not opening in IE

ndash S What is your examplendash C An example httpexamplepdfndash S What is your environmentndash C Windows XP with Acrobat Standard ndash S Running a repair on Windowsndash C It doesnrsquot workndash S Uninstalling Acrobat Standard and Reader and then

reinstalling the Readerndash C It doesnrsquot work eitherndash S Install two Microsoft updates for IEndash C It works

7

Motivationbull Problems

ndash Symptoms must be manually described to help-deskndash Help-desk needs to determine solution - difficult to compare with all previous

problemsndash Difficult for one help-desk employee to gain from anothers experiencendash Difficult to determine if solution worked successfully - User feedback may not be

accurate

bull How can data miningmachine learning helpndash Malfunction diagnoses call distribution skill-based routing outbound targeting etcndash 90 of calls are due to the most frequent 10 of problemsndash Problem analysis based on known casesndash Augment known cases when a new case is solvedndash Significantly reducing the cost and shortening time needed to diagnose malfunction

issues improve customer satisfaction

8

Two Different Approaches

bull Semantic-based Case Retrievalndash Using Semantic features to retrieval the most

similar past cases

bull Problem Probingndash Find probequestion set which maximizes

utility (benefit ndash cost)

9

Semantic Feature ExtractionTo detect sentence relations

sentence 1

sentence 2SENNA

(NEC Parser)

frames of sentence 1

frames of sentence 2

finding word relations among words with the same semantic role in each frame pair using

WordNet

Features

Semantic Roles For certain verbs of motionArg0 causer of motion Arg3 start point Arg1 thing in motion Arg4 end pointArg2 distance moved Arg5 direction-- eg [Arg0 I] [ArgM-MOD can][ArgM-NEG not] [rel receive] [Arg1 emails]

Subtypes of ArgM modifier tagLOC location CAU causeEXT extent TMP timeDIS discourse connectives PNC purposeADV general-purpose MNR mannerNEG negation marker DIR directionMOD modal verb

Word Relations WordNet 30 (147306 words)20 word relations egidentical synonyms antonyms attribute hypernym and hyponym-- A is a hypernym of B if B is a kind of Ameronym and holonym-- A is a meronym of B if A is a part or member of B

verb

frame

FeaturesArg0_identicalArg0_synonymsArg0_hypernymArg0_meronymrel_identicalrel_synonymsrel_hypernymRel_meronymArg1_identical

hellip hellip

10

Case Study ndash Online Help Desk

Interface

Most related emails

15496txt 18697txt 11002txt3191txt

Can you switch these two computers

Slides from Dingding Wang

11

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

12

SQL-Query-Result NavigationWhat mutual funds to invest There are thousands of them

High return Low risk Low expense hellip

Information overloading problem users often can not form a query that returns answers exactly matching their needs

13

Existing Solutions ndashIterative Refinement

bull Iterative refinementndash Find all funds with 3 Year Return gt 10ndash Find all funds with 3 Year Return gt 10 and 1

Year Return gt 10ndash Find all funds with 3 Year and 1 Year Return gt

10 and expense lt 2ndash hellip

bull Problem too time consuming

14

Existing Solutions - Ranking

bull Rankingndash Similar to TFIDF (Agrawal et al CIDR 2003)ndash Probabilistic ranking (Chaudhuri et al VLDB 2004)ndash Ordering attributes (Das et al SIGMOD 2006)ndash Context-sensitive (Agrawal et al SIGMOD 2006)ndash Based on object relationships (Chakrabarti et al SIGMOD

2006)ndash A lot of work in IR field (eg Shen et al SIGIR 2005)

15

Existing Solutions - Ranking

bull Problem the same ranking function for all users (except the context sensitive work)ndash User 1 prefers high returnsndash User 2 prefers lower risks (often means lower

returns)ndash User 3 prefers a mix of high return funds and low

risk fundsndash One ranking function can not fit all

16

Existing Solutions -Categorization

Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)

The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)

User can navigate this tree and show records in a node

Categorization is complementary to ranking (rank records in a node)

Root(193)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

0 lt= 3 Yr return lt 10 (93)

17

Existing Solutions -Categorization

Problem differences between users are not consideredRoot(193)

0 lt= 3 Yr return lt 10 (93)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

10 lt=1 Yr return lt 20 (55)

10 lt= Std lt 20 (48)

5 lt= Std lt 10 (7)

0 lt= 1 Yr return lt 10 (63)

0 lt= Std lt 5 (38)

User 1 3

User 2 3 User 2 3

This tree does not help user 2 and 3 much

We try to address this issue for categorization

18

Address Diverse User Preferences (Chen amp Li SIGMOD 2007)

bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference

(eg high return funds low risk funds)

bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters

19

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

20

System Log Databull Performance Metric

ndash CPUMemorySwap utilization workload average response timebull Application-level Log

ndash Windows NT system log Websphere application log Oracle activitybull Failure data

ndash System crash dumps error reportsbull Network Traffic Data

ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces

bull Request Datandash Apache and Microsoft IIS log

bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures

bull Othersndash Network-based alert logs Probes

21

logs

logs

logs

HistoricalData Collection

Realtime Analysis

Offline Analysis

Knowledge Base

Real Time Management

Anomaly Detection

Fault Diagnosis

Problem Determination

CorrelationDependencyKnowledge

SummarizationVisualizationRule Construction

Temporal Pattern DiscoveryKnowledge Management

PlanningActions

The Integrated Framework on Data-Driven Computing System Management

Log Data Organization

Componentlogs Active Data Collection

Log Adapter

Situation Identification

and Categorization

22

Raw data-gtData table

Aggregationsummarization

Event Plots

Pattern MiningPattern Mining

OutagesTemporal patternsEvent bursts

Morning rebootingperiodic patterns

Weekly rebootingPeriodic patterns

Hosts

Knowledge validation completion and construction

23

A Case Study

State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State

24

Pattern Visualization in the Case Study

25

ERN in the Case Study

26

Other Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis

27

bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key

Performance Indicators) to support better decision-making

bull Focus leading indicator discovery

ndash Using operational metrics to adaptively manage amp improve workflow efficiency

bull Focus workflow exception detection and tracing

28

Examples of KPI Types

Leading Indicators(Value Drivers)

Lagging Indicators(Outcomes )

Number of clients that sales people meet face to face each week

Sales revenue

Complex repairs completed successfully during the first call or visit

Customer satisfaction

Number of parts for which orders exceed forecasts within 30 days of scheduled delivery

Per unit manufacturing costs

Number of customers who are delinquent paying their first bill

Customer churn

29

Methodology for Leading Indicator Learning

Selected Leading Indicators or Leading Indicator Combinations

clustering

Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality

hellip

hellip

hellipThe root KPI may be the key leading indicator

Time seriesproduction model(eg assumption data model)

Leading Indicator Selection based on Domain Knowledge

sequence KPIs

Reduced Time series

define KPIs

correlation

dimensionalityreduction

If no update modelIs Model Valid

New Business Concern(s)

If yes

Leading Indicators

Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events

30

Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 7: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

7

Motivationbull Problems

ndash Symptoms must be manually described to help-deskndash Help-desk needs to determine solution - difficult to compare with all previous

problemsndash Difficult for one help-desk employee to gain from anothers experiencendash Difficult to determine if solution worked successfully - User feedback may not be

accurate

bull How can data miningmachine learning helpndash Malfunction diagnoses call distribution skill-based routing outbound targeting etcndash 90 of calls are due to the most frequent 10 of problemsndash Problem analysis based on known casesndash Augment known cases when a new case is solvedndash Significantly reducing the cost and shortening time needed to diagnose malfunction

issues improve customer satisfaction

8

Two Different Approaches

bull Semantic-based Case Retrievalndash Using Semantic features to retrieval the most

similar past cases

bull Problem Probingndash Find probequestion set which maximizes

utility (benefit ndash cost)

9

Semantic Feature ExtractionTo detect sentence relations

sentence 1

sentence 2SENNA

(NEC Parser)

frames of sentence 1

frames of sentence 2

finding word relations among words with the same semantic role in each frame pair using

WordNet

Features

Semantic Roles For certain verbs of motionArg0 causer of motion Arg3 start point Arg1 thing in motion Arg4 end pointArg2 distance moved Arg5 direction-- eg [Arg0 I] [ArgM-MOD can][ArgM-NEG not] [rel receive] [Arg1 emails]

Subtypes of ArgM modifier tagLOC location CAU causeEXT extent TMP timeDIS discourse connectives PNC purposeADV general-purpose MNR mannerNEG negation marker DIR directionMOD modal verb

Word Relations WordNet 30 (147306 words)20 word relations egidentical synonyms antonyms attribute hypernym and hyponym-- A is a hypernym of B if B is a kind of Ameronym and holonym-- A is a meronym of B if A is a part or member of B

verb

frame

FeaturesArg0_identicalArg0_synonymsArg0_hypernymArg0_meronymrel_identicalrel_synonymsrel_hypernymRel_meronymArg1_identical

hellip hellip

10

Case Study ndash Online Help Desk

Interface

Most related emails

15496txt 18697txt 11002txt3191txt

Can you switch these two computers

Slides from Dingding Wang

11

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

12

SQL-Query-Result NavigationWhat mutual funds to invest There are thousands of them

High return Low risk Low expense hellip

Information overloading problem users often can not form a query that returns answers exactly matching their needs

13

Existing Solutions ndashIterative Refinement

bull Iterative refinementndash Find all funds with 3 Year Return gt 10ndash Find all funds with 3 Year Return gt 10 and 1

Year Return gt 10ndash Find all funds with 3 Year and 1 Year Return gt

10 and expense lt 2ndash hellip

bull Problem too time consuming

14

Existing Solutions - Ranking

bull Rankingndash Similar to TFIDF (Agrawal et al CIDR 2003)ndash Probabilistic ranking (Chaudhuri et al VLDB 2004)ndash Ordering attributes (Das et al SIGMOD 2006)ndash Context-sensitive (Agrawal et al SIGMOD 2006)ndash Based on object relationships (Chakrabarti et al SIGMOD

2006)ndash A lot of work in IR field (eg Shen et al SIGIR 2005)

15

Existing Solutions - Ranking

bull Problem the same ranking function for all users (except the context sensitive work)ndash User 1 prefers high returnsndash User 2 prefers lower risks (often means lower

returns)ndash User 3 prefers a mix of high return funds and low

risk fundsndash One ranking function can not fit all

16

Existing Solutions -Categorization

Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)

The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)

User can navigate this tree and show records in a node

Categorization is complementary to ranking (rank records in a node)

Root(193)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

0 lt= 3 Yr return lt 10 (93)

17

Existing Solutions -Categorization

Problem differences between users are not consideredRoot(193)

0 lt= 3 Yr return lt 10 (93)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

10 lt=1 Yr return lt 20 (55)

10 lt= Std lt 20 (48)

5 lt= Std lt 10 (7)

0 lt= 1 Yr return lt 10 (63)

0 lt= Std lt 5 (38)

User 1 3

User 2 3 User 2 3

This tree does not help user 2 and 3 much

We try to address this issue for categorization

18

Address Diverse User Preferences (Chen amp Li SIGMOD 2007)

bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference

(eg high return funds low risk funds)

bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters

19

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

20

System Log Databull Performance Metric

ndash CPUMemorySwap utilization workload average response timebull Application-level Log

ndash Windows NT system log Websphere application log Oracle activitybull Failure data

ndash System crash dumps error reportsbull Network Traffic Data

ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces

bull Request Datandash Apache and Microsoft IIS log

bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures

bull Othersndash Network-based alert logs Probes

21

logs

logs

logs

HistoricalData Collection

Realtime Analysis

Offline Analysis

Knowledge Base

Real Time Management

Anomaly Detection

Fault Diagnosis

Problem Determination

CorrelationDependencyKnowledge

SummarizationVisualizationRule Construction

Temporal Pattern DiscoveryKnowledge Management

PlanningActions

The Integrated Framework on Data-Driven Computing System Management

Log Data Organization

Componentlogs Active Data Collection

Log Adapter

Situation Identification

and Categorization

22

Raw data-gtData table

Aggregationsummarization

Event Plots

Pattern MiningPattern Mining

OutagesTemporal patternsEvent bursts

Morning rebootingperiodic patterns

Weekly rebootingPeriodic patterns

Hosts

Knowledge validation completion and construction

23

A Case Study

State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State

24

Pattern Visualization in the Case Study

25

ERN in the Case Study

26

Other Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis

27

bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key

Performance Indicators) to support better decision-making

bull Focus leading indicator discovery

ndash Using operational metrics to adaptively manage amp improve workflow efficiency

bull Focus workflow exception detection and tracing

28

Examples of KPI Types

Leading Indicators(Value Drivers)

Lagging Indicators(Outcomes )

Number of clients that sales people meet face to face each week

Sales revenue

Complex repairs completed successfully during the first call or visit

Customer satisfaction

Number of parts for which orders exceed forecasts within 30 days of scheduled delivery

Per unit manufacturing costs

Number of customers who are delinquent paying their first bill

Customer churn

29

Methodology for Leading Indicator Learning

Selected Leading Indicators or Leading Indicator Combinations

clustering

Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality

hellip

hellip

hellipThe root KPI may be the key leading indicator

Time seriesproduction model(eg assumption data model)

Leading Indicator Selection based on Domain Knowledge

sequence KPIs

Reduced Time series

define KPIs

correlation

dimensionalityreduction

If no update modelIs Model Valid

New Business Concern(s)

If yes

Leading Indicators

Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events

30

Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 8: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

8

Two Different Approaches

bull Semantic-based Case Retrievalndash Using Semantic features to retrieval the most

similar past cases

bull Problem Probingndash Find probequestion set which maximizes

utility (benefit ndash cost)

9

Semantic Feature ExtractionTo detect sentence relations

sentence 1

sentence 2SENNA

(NEC Parser)

frames of sentence 1

frames of sentence 2

finding word relations among words with the same semantic role in each frame pair using

WordNet

Features

Semantic Roles For certain verbs of motionArg0 causer of motion Arg3 start point Arg1 thing in motion Arg4 end pointArg2 distance moved Arg5 direction-- eg [Arg0 I] [ArgM-MOD can][ArgM-NEG not] [rel receive] [Arg1 emails]

Subtypes of ArgM modifier tagLOC location CAU causeEXT extent TMP timeDIS discourse connectives PNC purposeADV general-purpose MNR mannerNEG negation marker DIR directionMOD modal verb

Word Relations WordNet 30 (147306 words)20 word relations egidentical synonyms antonyms attribute hypernym and hyponym-- A is a hypernym of B if B is a kind of Ameronym and holonym-- A is a meronym of B if A is a part or member of B

verb

frame

FeaturesArg0_identicalArg0_synonymsArg0_hypernymArg0_meronymrel_identicalrel_synonymsrel_hypernymRel_meronymArg1_identical

hellip hellip

10

Case Study ndash Online Help Desk

Interface

Most related emails

15496txt 18697txt 11002txt3191txt

Can you switch these two computers

Slides from Dingding Wang

11

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

12

SQL-Query-Result NavigationWhat mutual funds to invest There are thousands of them

High return Low risk Low expense hellip

Information overloading problem users often can not form a query that returns answers exactly matching their needs

13

Existing Solutions ndashIterative Refinement

bull Iterative refinementndash Find all funds with 3 Year Return gt 10ndash Find all funds with 3 Year Return gt 10 and 1

Year Return gt 10ndash Find all funds with 3 Year and 1 Year Return gt

10 and expense lt 2ndash hellip

bull Problem too time consuming

14

Existing Solutions - Ranking

bull Rankingndash Similar to TFIDF (Agrawal et al CIDR 2003)ndash Probabilistic ranking (Chaudhuri et al VLDB 2004)ndash Ordering attributes (Das et al SIGMOD 2006)ndash Context-sensitive (Agrawal et al SIGMOD 2006)ndash Based on object relationships (Chakrabarti et al SIGMOD

2006)ndash A lot of work in IR field (eg Shen et al SIGIR 2005)

15

Existing Solutions - Ranking

bull Problem the same ranking function for all users (except the context sensitive work)ndash User 1 prefers high returnsndash User 2 prefers lower risks (often means lower

returns)ndash User 3 prefers a mix of high return funds and low

risk fundsndash One ranking function can not fit all

16

Existing Solutions -Categorization

Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)

The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)

User can navigate this tree and show records in a node

Categorization is complementary to ranking (rank records in a node)

Root(193)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

0 lt= 3 Yr return lt 10 (93)

17

Existing Solutions -Categorization

Problem differences between users are not consideredRoot(193)

0 lt= 3 Yr return lt 10 (93)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

10 lt=1 Yr return lt 20 (55)

10 lt= Std lt 20 (48)

5 lt= Std lt 10 (7)

0 lt= 1 Yr return lt 10 (63)

0 lt= Std lt 5 (38)

User 1 3

User 2 3 User 2 3

This tree does not help user 2 and 3 much

We try to address this issue for categorization

18

Address Diverse User Preferences (Chen amp Li SIGMOD 2007)

bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference

(eg high return funds low risk funds)

bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters

19

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

20

System Log Databull Performance Metric

ndash CPUMemorySwap utilization workload average response timebull Application-level Log

ndash Windows NT system log Websphere application log Oracle activitybull Failure data

ndash System crash dumps error reportsbull Network Traffic Data

ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces

bull Request Datandash Apache and Microsoft IIS log

bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures

bull Othersndash Network-based alert logs Probes

21

logs

logs

logs

HistoricalData Collection

Realtime Analysis

Offline Analysis

Knowledge Base

Real Time Management

Anomaly Detection

Fault Diagnosis

Problem Determination

CorrelationDependencyKnowledge

SummarizationVisualizationRule Construction

Temporal Pattern DiscoveryKnowledge Management

PlanningActions

The Integrated Framework on Data-Driven Computing System Management

Log Data Organization

Componentlogs Active Data Collection

Log Adapter

Situation Identification

and Categorization

22

Raw data-gtData table

Aggregationsummarization

Event Plots

Pattern MiningPattern Mining

OutagesTemporal patternsEvent bursts

Morning rebootingperiodic patterns

Weekly rebootingPeriodic patterns

Hosts

Knowledge validation completion and construction

23

A Case Study

State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State

24

Pattern Visualization in the Case Study

25

ERN in the Case Study

26

Other Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis

27

bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key

Performance Indicators) to support better decision-making

bull Focus leading indicator discovery

ndash Using operational metrics to adaptively manage amp improve workflow efficiency

bull Focus workflow exception detection and tracing

28

Examples of KPI Types

Leading Indicators(Value Drivers)

Lagging Indicators(Outcomes )

Number of clients that sales people meet face to face each week

Sales revenue

Complex repairs completed successfully during the first call or visit

Customer satisfaction

Number of parts for which orders exceed forecasts within 30 days of scheduled delivery

Per unit manufacturing costs

Number of customers who are delinquent paying their first bill

Customer churn

29

Methodology for Leading Indicator Learning

Selected Leading Indicators or Leading Indicator Combinations

clustering

Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality

hellip

hellip

hellipThe root KPI may be the key leading indicator

Time seriesproduction model(eg assumption data model)

Leading Indicator Selection based on Domain Knowledge

sequence KPIs

Reduced Time series

define KPIs

correlation

dimensionalityreduction

If no update modelIs Model Valid

New Business Concern(s)

If yes

Leading Indicators

Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events

30

Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 9: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

9

Semantic Feature ExtractionTo detect sentence relations

sentence 1

sentence 2SENNA

(NEC Parser)

frames of sentence 1

frames of sentence 2

finding word relations among words with the same semantic role in each frame pair using

WordNet

Features

Semantic Roles For certain verbs of motionArg0 causer of motion Arg3 start point Arg1 thing in motion Arg4 end pointArg2 distance moved Arg5 direction-- eg [Arg0 I] [ArgM-MOD can][ArgM-NEG not] [rel receive] [Arg1 emails]

Subtypes of ArgM modifier tagLOC location CAU causeEXT extent TMP timeDIS discourse connectives PNC purposeADV general-purpose MNR mannerNEG negation marker DIR directionMOD modal verb

Word Relations WordNet 30 (147306 words)20 word relations egidentical synonyms antonyms attribute hypernym and hyponym-- A is a hypernym of B if B is a kind of Ameronym and holonym-- A is a meronym of B if A is a part or member of B

verb

frame

FeaturesArg0_identicalArg0_synonymsArg0_hypernymArg0_meronymrel_identicalrel_synonymsrel_hypernymRel_meronymArg1_identical

hellip hellip

10

Case Study ndash Online Help Desk

Interface

Most related emails

15496txt 18697txt 11002txt3191txt

Can you switch these two computers

Slides from Dingding Wang

11

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

12

SQL-Query-Result NavigationWhat mutual funds to invest There are thousands of them

High return Low risk Low expense hellip

Information overloading problem users often can not form a query that returns answers exactly matching their needs

13

Existing Solutions ndashIterative Refinement

bull Iterative refinementndash Find all funds with 3 Year Return gt 10ndash Find all funds with 3 Year Return gt 10 and 1

Year Return gt 10ndash Find all funds with 3 Year and 1 Year Return gt

10 and expense lt 2ndash hellip

bull Problem too time consuming

14

Existing Solutions - Ranking

bull Rankingndash Similar to TFIDF (Agrawal et al CIDR 2003)ndash Probabilistic ranking (Chaudhuri et al VLDB 2004)ndash Ordering attributes (Das et al SIGMOD 2006)ndash Context-sensitive (Agrawal et al SIGMOD 2006)ndash Based on object relationships (Chakrabarti et al SIGMOD

2006)ndash A lot of work in IR field (eg Shen et al SIGIR 2005)

15

Existing Solutions - Ranking

bull Problem the same ranking function for all users (except the context sensitive work)ndash User 1 prefers high returnsndash User 2 prefers lower risks (often means lower

returns)ndash User 3 prefers a mix of high return funds and low

risk fundsndash One ranking function can not fit all

16

Existing Solutions -Categorization

Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)

The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)

User can navigate this tree and show records in a node

Categorization is complementary to ranking (rank records in a node)

Root(193)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

0 lt= 3 Yr return lt 10 (93)

17

Existing Solutions -Categorization

Problem differences between users are not consideredRoot(193)

0 lt= 3 Yr return lt 10 (93)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

10 lt=1 Yr return lt 20 (55)

10 lt= Std lt 20 (48)

5 lt= Std lt 10 (7)

0 lt= 1 Yr return lt 10 (63)

0 lt= Std lt 5 (38)

User 1 3

User 2 3 User 2 3

This tree does not help user 2 and 3 much

We try to address this issue for categorization

18

Address Diverse User Preferences (Chen amp Li SIGMOD 2007)

bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference

(eg high return funds low risk funds)

bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters

19

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

20

System Log Databull Performance Metric

ndash CPUMemorySwap utilization workload average response timebull Application-level Log

ndash Windows NT system log Websphere application log Oracle activitybull Failure data

ndash System crash dumps error reportsbull Network Traffic Data

ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces

bull Request Datandash Apache and Microsoft IIS log

bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures

bull Othersndash Network-based alert logs Probes

21

logs

logs

logs

HistoricalData Collection

Realtime Analysis

Offline Analysis

Knowledge Base

Real Time Management

Anomaly Detection

Fault Diagnosis

Problem Determination

CorrelationDependencyKnowledge

SummarizationVisualizationRule Construction

Temporal Pattern DiscoveryKnowledge Management

PlanningActions

The Integrated Framework on Data-Driven Computing System Management

Log Data Organization

Componentlogs Active Data Collection

Log Adapter

Situation Identification

and Categorization

22

Raw data-gtData table

Aggregationsummarization

Event Plots

Pattern MiningPattern Mining

OutagesTemporal patternsEvent bursts

Morning rebootingperiodic patterns

Weekly rebootingPeriodic patterns

Hosts

Knowledge validation completion and construction

23

A Case Study

State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State

24

Pattern Visualization in the Case Study

25

ERN in the Case Study

26

Other Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis

27

bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key

Performance Indicators) to support better decision-making

bull Focus leading indicator discovery

ndash Using operational metrics to adaptively manage amp improve workflow efficiency

bull Focus workflow exception detection and tracing

28

Examples of KPI Types

Leading Indicators(Value Drivers)

Lagging Indicators(Outcomes )

Number of clients that sales people meet face to face each week

Sales revenue

Complex repairs completed successfully during the first call or visit

Customer satisfaction

Number of parts for which orders exceed forecasts within 30 days of scheduled delivery

Per unit manufacturing costs

Number of customers who are delinquent paying their first bill

Customer churn

29

Methodology for Leading Indicator Learning

Selected Leading Indicators or Leading Indicator Combinations

clustering

Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality

hellip

hellip

hellipThe root KPI may be the key leading indicator

Time seriesproduction model(eg assumption data model)

Leading Indicator Selection based on Domain Knowledge

sequence KPIs

Reduced Time series

define KPIs

correlation

dimensionalityreduction

If no update modelIs Model Valid

New Business Concern(s)

If yes

Leading Indicators

Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events

30

Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 10: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

10

Case Study ndash Online Help Desk

Interface

Most related emails

15496txt 18697txt 11002txt3191txt

Can you switch these two computers

Slides from Dingding Wang

11

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

12

SQL-Query-Result NavigationWhat mutual funds to invest There are thousands of them

High return Low risk Low expense hellip

Information overloading problem users often can not form a query that returns answers exactly matching their needs

13

Existing Solutions ndashIterative Refinement

bull Iterative refinementndash Find all funds with 3 Year Return gt 10ndash Find all funds with 3 Year Return gt 10 and 1

Year Return gt 10ndash Find all funds with 3 Year and 1 Year Return gt

10 and expense lt 2ndash hellip

bull Problem too time consuming

14

Existing Solutions - Ranking

bull Rankingndash Similar to TFIDF (Agrawal et al CIDR 2003)ndash Probabilistic ranking (Chaudhuri et al VLDB 2004)ndash Ordering attributes (Das et al SIGMOD 2006)ndash Context-sensitive (Agrawal et al SIGMOD 2006)ndash Based on object relationships (Chakrabarti et al SIGMOD

2006)ndash A lot of work in IR field (eg Shen et al SIGIR 2005)

15

Existing Solutions - Ranking

bull Problem the same ranking function for all users (except the context sensitive work)ndash User 1 prefers high returnsndash User 2 prefers lower risks (often means lower

returns)ndash User 3 prefers a mix of high return funds and low

risk fundsndash One ranking function can not fit all

16

Existing Solutions -Categorization

Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)

The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)

User can navigate this tree and show records in a node

Categorization is complementary to ranking (rank records in a node)

Root(193)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

0 lt= 3 Yr return lt 10 (93)

17

Existing Solutions -Categorization

Problem differences between users are not consideredRoot(193)

0 lt= 3 Yr return lt 10 (93)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

10 lt=1 Yr return lt 20 (55)

10 lt= Std lt 20 (48)

5 lt= Std lt 10 (7)

0 lt= 1 Yr return lt 10 (63)

0 lt= Std lt 5 (38)

User 1 3

User 2 3 User 2 3

This tree does not help user 2 and 3 much

We try to address this issue for categorization

18

Address Diverse User Preferences (Chen amp Li SIGMOD 2007)

bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference

(eg high return funds low risk funds)

bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters

19

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

20

System Log Databull Performance Metric

ndash CPUMemorySwap utilization workload average response timebull Application-level Log

ndash Windows NT system log Websphere application log Oracle activitybull Failure data

ndash System crash dumps error reportsbull Network Traffic Data

ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces

bull Request Datandash Apache and Microsoft IIS log

bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures

bull Othersndash Network-based alert logs Probes

21

logs

logs

logs

HistoricalData Collection

Realtime Analysis

Offline Analysis

Knowledge Base

Real Time Management

Anomaly Detection

Fault Diagnosis

Problem Determination

CorrelationDependencyKnowledge

SummarizationVisualizationRule Construction

Temporal Pattern DiscoveryKnowledge Management

PlanningActions

The Integrated Framework on Data-Driven Computing System Management

Log Data Organization

Componentlogs Active Data Collection

Log Adapter

Situation Identification

and Categorization

22

Raw data-gtData table

Aggregationsummarization

Event Plots

Pattern MiningPattern Mining

OutagesTemporal patternsEvent bursts

Morning rebootingperiodic patterns

Weekly rebootingPeriodic patterns

Hosts

Knowledge validation completion and construction

23

A Case Study

State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State

24

Pattern Visualization in the Case Study

25

ERN in the Case Study

26

Other Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis

27

bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key

Performance Indicators) to support better decision-making

bull Focus leading indicator discovery

ndash Using operational metrics to adaptively manage amp improve workflow efficiency

bull Focus workflow exception detection and tracing

28

Examples of KPI Types

Leading Indicators(Value Drivers)

Lagging Indicators(Outcomes )

Number of clients that sales people meet face to face each week

Sales revenue

Complex repairs completed successfully during the first call or visit

Customer satisfaction

Number of parts for which orders exceed forecasts within 30 days of scheduled delivery

Per unit manufacturing costs

Number of customers who are delinquent paying their first bill

Customer churn

29

Methodology for Leading Indicator Learning

Selected Leading Indicators or Leading Indicator Combinations

clustering

Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality

hellip

hellip

hellipThe root KPI may be the key leading indicator

Time seriesproduction model(eg assumption data model)

Leading Indicator Selection based on Domain Knowledge

sequence KPIs

Reduced Time series

define KPIs

correlation

dimensionalityreduction

If no update modelIs Model Valid

New Business Concern(s)

If yes

Leading Indicators

Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events

30

Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 11: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

11

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

12

SQL-Query-Result NavigationWhat mutual funds to invest There are thousands of them

High return Low risk Low expense hellip

Information overloading problem users often can not form a query that returns answers exactly matching their needs

13

Existing Solutions ndashIterative Refinement

bull Iterative refinementndash Find all funds with 3 Year Return gt 10ndash Find all funds with 3 Year Return gt 10 and 1

Year Return gt 10ndash Find all funds with 3 Year and 1 Year Return gt

10 and expense lt 2ndash hellip

bull Problem too time consuming

14

Existing Solutions - Ranking

bull Rankingndash Similar to TFIDF (Agrawal et al CIDR 2003)ndash Probabilistic ranking (Chaudhuri et al VLDB 2004)ndash Ordering attributes (Das et al SIGMOD 2006)ndash Context-sensitive (Agrawal et al SIGMOD 2006)ndash Based on object relationships (Chakrabarti et al SIGMOD

2006)ndash A lot of work in IR field (eg Shen et al SIGIR 2005)

15

Existing Solutions - Ranking

bull Problem the same ranking function for all users (except the context sensitive work)ndash User 1 prefers high returnsndash User 2 prefers lower risks (often means lower

returns)ndash User 3 prefers a mix of high return funds and low

risk fundsndash One ranking function can not fit all

16

Existing Solutions -Categorization

Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)

The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)

User can navigate this tree and show records in a node

Categorization is complementary to ranking (rank records in a node)

Root(193)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

0 lt= 3 Yr return lt 10 (93)

17

Existing Solutions -Categorization

Problem differences between users are not consideredRoot(193)

0 lt= 3 Yr return lt 10 (93)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

10 lt=1 Yr return lt 20 (55)

10 lt= Std lt 20 (48)

5 lt= Std lt 10 (7)

0 lt= 1 Yr return lt 10 (63)

0 lt= Std lt 5 (38)

User 1 3

User 2 3 User 2 3

This tree does not help user 2 and 3 much

We try to address this issue for categorization

18

Address Diverse User Preferences (Chen amp Li SIGMOD 2007)

bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference

(eg high return funds low risk funds)

bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters

19

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

20

System Log Databull Performance Metric

ndash CPUMemorySwap utilization workload average response timebull Application-level Log

ndash Windows NT system log Websphere application log Oracle activitybull Failure data

ndash System crash dumps error reportsbull Network Traffic Data

ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces

bull Request Datandash Apache and Microsoft IIS log

bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures

bull Othersndash Network-based alert logs Probes

21

logs

logs

logs

HistoricalData Collection

Realtime Analysis

Offline Analysis

Knowledge Base

Real Time Management

Anomaly Detection

Fault Diagnosis

Problem Determination

CorrelationDependencyKnowledge

SummarizationVisualizationRule Construction

Temporal Pattern DiscoveryKnowledge Management

PlanningActions

The Integrated Framework on Data-Driven Computing System Management

Log Data Organization

Componentlogs Active Data Collection

Log Adapter

Situation Identification

and Categorization

22

Raw data-gtData table

Aggregationsummarization

Event Plots

Pattern MiningPattern Mining

OutagesTemporal patternsEvent bursts

Morning rebootingperiodic patterns

Weekly rebootingPeriodic patterns

Hosts

Knowledge validation completion and construction

23

A Case Study

State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State

24

Pattern Visualization in the Case Study

25

ERN in the Case Study

26

Other Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis

27

bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key

Performance Indicators) to support better decision-making

bull Focus leading indicator discovery

ndash Using operational metrics to adaptively manage amp improve workflow efficiency

bull Focus workflow exception detection and tracing

28

Examples of KPI Types

Leading Indicators(Value Drivers)

Lagging Indicators(Outcomes )

Number of clients that sales people meet face to face each week

Sales revenue

Complex repairs completed successfully during the first call or visit

Customer satisfaction

Number of parts for which orders exceed forecasts within 30 days of scheduled delivery

Per unit manufacturing costs

Number of customers who are delinquent paying their first bill

Customer churn

29

Methodology for Leading Indicator Learning

Selected Leading Indicators or Leading Indicator Combinations

clustering

Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality

hellip

hellip

hellipThe root KPI may be the key leading indicator

Time seriesproduction model(eg assumption data model)

Leading Indicator Selection based on Domain Knowledge

sequence KPIs

Reduced Time series

define KPIs

correlation

dimensionalityreduction

If no update modelIs Model Valid

New Business Concern(s)

If yes

Leading Indicators

Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events

30

Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 12: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

12

SQL-Query-Result NavigationWhat mutual funds to invest There are thousands of them

High return Low risk Low expense hellip

Information overloading problem users often can not form a query that returns answers exactly matching their needs

13

Existing Solutions ndashIterative Refinement

bull Iterative refinementndash Find all funds with 3 Year Return gt 10ndash Find all funds with 3 Year Return gt 10 and 1

Year Return gt 10ndash Find all funds with 3 Year and 1 Year Return gt

10 and expense lt 2ndash hellip

bull Problem too time consuming

14

Existing Solutions - Ranking

bull Rankingndash Similar to TFIDF (Agrawal et al CIDR 2003)ndash Probabilistic ranking (Chaudhuri et al VLDB 2004)ndash Ordering attributes (Das et al SIGMOD 2006)ndash Context-sensitive (Agrawal et al SIGMOD 2006)ndash Based on object relationships (Chakrabarti et al SIGMOD

2006)ndash A lot of work in IR field (eg Shen et al SIGIR 2005)

15

Existing Solutions - Ranking

bull Problem the same ranking function for all users (except the context sensitive work)ndash User 1 prefers high returnsndash User 2 prefers lower risks (often means lower

returns)ndash User 3 prefers a mix of high return funds and low

risk fundsndash One ranking function can not fit all

16

Existing Solutions -Categorization

Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)

The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)

User can navigate this tree and show records in a node

Categorization is complementary to ranking (rank records in a node)

Root(193)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

0 lt= 3 Yr return lt 10 (93)

17

Existing Solutions -Categorization

Problem differences between users are not consideredRoot(193)

0 lt= 3 Yr return lt 10 (93)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

10 lt=1 Yr return lt 20 (55)

10 lt= Std lt 20 (48)

5 lt= Std lt 10 (7)

0 lt= 1 Yr return lt 10 (63)

0 lt= Std lt 5 (38)

User 1 3

User 2 3 User 2 3

This tree does not help user 2 and 3 much

We try to address this issue for categorization

18

Address Diverse User Preferences (Chen amp Li SIGMOD 2007)

bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference

(eg high return funds low risk funds)

bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters

19

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

20

System Log Databull Performance Metric

ndash CPUMemorySwap utilization workload average response timebull Application-level Log

ndash Windows NT system log Websphere application log Oracle activitybull Failure data

ndash System crash dumps error reportsbull Network Traffic Data

ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces

bull Request Datandash Apache and Microsoft IIS log

bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures

bull Othersndash Network-based alert logs Probes

21

logs

logs

logs

HistoricalData Collection

Realtime Analysis

Offline Analysis

Knowledge Base

Real Time Management

Anomaly Detection

Fault Diagnosis

Problem Determination

CorrelationDependencyKnowledge

SummarizationVisualizationRule Construction

Temporal Pattern DiscoveryKnowledge Management

PlanningActions

The Integrated Framework on Data-Driven Computing System Management

Log Data Organization

Componentlogs Active Data Collection

Log Adapter

Situation Identification

and Categorization

22

Raw data-gtData table

Aggregationsummarization

Event Plots

Pattern MiningPattern Mining

OutagesTemporal patternsEvent bursts

Morning rebootingperiodic patterns

Weekly rebootingPeriodic patterns

Hosts

Knowledge validation completion and construction

23

A Case Study

State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State

24

Pattern Visualization in the Case Study

25

ERN in the Case Study

26

Other Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis

27

bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key

Performance Indicators) to support better decision-making

bull Focus leading indicator discovery

ndash Using operational metrics to adaptively manage amp improve workflow efficiency

bull Focus workflow exception detection and tracing

28

Examples of KPI Types

Leading Indicators(Value Drivers)

Lagging Indicators(Outcomes )

Number of clients that sales people meet face to face each week

Sales revenue

Complex repairs completed successfully during the first call or visit

Customer satisfaction

Number of parts for which orders exceed forecasts within 30 days of scheduled delivery

Per unit manufacturing costs

Number of customers who are delinquent paying their first bill

Customer churn

29

Methodology for Leading Indicator Learning

Selected Leading Indicators or Leading Indicator Combinations

clustering

Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality

hellip

hellip

hellipThe root KPI may be the key leading indicator

Time seriesproduction model(eg assumption data model)

Leading Indicator Selection based on Domain Knowledge

sequence KPIs

Reduced Time series

define KPIs

correlation

dimensionalityreduction

If no update modelIs Model Valid

New Business Concern(s)

If yes

Leading Indicators

Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events

30

Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 13: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

13

Existing Solutions ndashIterative Refinement

bull Iterative refinementndash Find all funds with 3 Year Return gt 10ndash Find all funds with 3 Year Return gt 10 and 1

Year Return gt 10ndash Find all funds with 3 Year and 1 Year Return gt

10 and expense lt 2ndash hellip

bull Problem too time consuming

14

Existing Solutions - Ranking

bull Rankingndash Similar to TFIDF (Agrawal et al CIDR 2003)ndash Probabilistic ranking (Chaudhuri et al VLDB 2004)ndash Ordering attributes (Das et al SIGMOD 2006)ndash Context-sensitive (Agrawal et al SIGMOD 2006)ndash Based on object relationships (Chakrabarti et al SIGMOD

2006)ndash A lot of work in IR field (eg Shen et al SIGIR 2005)

15

Existing Solutions - Ranking

bull Problem the same ranking function for all users (except the context sensitive work)ndash User 1 prefers high returnsndash User 2 prefers lower risks (often means lower

returns)ndash User 3 prefers a mix of high return funds and low

risk fundsndash One ranking function can not fit all

16

Existing Solutions -Categorization

Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)

The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)

User can navigate this tree and show records in a node

Categorization is complementary to ranking (rank records in a node)

Root(193)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

0 lt= 3 Yr return lt 10 (93)

17

Existing Solutions -Categorization

Problem differences between users are not consideredRoot(193)

0 lt= 3 Yr return lt 10 (93)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

10 lt=1 Yr return lt 20 (55)

10 lt= Std lt 20 (48)

5 lt= Std lt 10 (7)

0 lt= 1 Yr return lt 10 (63)

0 lt= Std lt 5 (38)

User 1 3

User 2 3 User 2 3

This tree does not help user 2 and 3 much

We try to address this issue for categorization

18

Address Diverse User Preferences (Chen amp Li SIGMOD 2007)

bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference

(eg high return funds low risk funds)

bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters

19

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

20

System Log Databull Performance Metric

ndash CPUMemorySwap utilization workload average response timebull Application-level Log

ndash Windows NT system log Websphere application log Oracle activitybull Failure data

ndash System crash dumps error reportsbull Network Traffic Data

ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces

bull Request Datandash Apache and Microsoft IIS log

bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures

bull Othersndash Network-based alert logs Probes

21

logs

logs

logs

HistoricalData Collection

Realtime Analysis

Offline Analysis

Knowledge Base

Real Time Management

Anomaly Detection

Fault Diagnosis

Problem Determination

CorrelationDependencyKnowledge

SummarizationVisualizationRule Construction

Temporal Pattern DiscoveryKnowledge Management

PlanningActions

The Integrated Framework on Data-Driven Computing System Management

Log Data Organization

Componentlogs Active Data Collection

Log Adapter

Situation Identification

and Categorization

22

Raw data-gtData table

Aggregationsummarization

Event Plots

Pattern MiningPattern Mining

OutagesTemporal patternsEvent bursts

Morning rebootingperiodic patterns

Weekly rebootingPeriodic patterns

Hosts

Knowledge validation completion and construction

23

A Case Study

State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State

24

Pattern Visualization in the Case Study

25

ERN in the Case Study

26

Other Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis

27

bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key

Performance Indicators) to support better decision-making

bull Focus leading indicator discovery

ndash Using operational metrics to adaptively manage amp improve workflow efficiency

bull Focus workflow exception detection and tracing

28

Examples of KPI Types

Leading Indicators(Value Drivers)

Lagging Indicators(Outcomes )

Number of clients that sales people meet face to face each week

Sales revenue

Complex repairs completed successfully during the first call or visit

Customer satisfaction

Number of parts for which orders exceed forecasts within 30 days of scheduled delivery

Per unit manufacturing costs

Number of customers who are delinquent paying their first bill

Customer churn

29

Methodology for Leading Indicator Learning

Selected Leading Indicators or Leading Indicator Combinations

clustering

Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality

hellip

hellip

hellipThe root KPI may be the key leading indicator

Time seriesproduction model(eg assumption data model)

Leading Indicator Selection based on Domain Knowledge

sequence KPIs

Reduced Time series

define KPIs

correlation

dimensionalityreduction

If no update modelIs Model Valid

New Business Concern(s)

If yes

Leading Indicators

Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events

30

Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 14: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

14

Existing Solutions - Ranking

bull Rankingndash Similar to TFIDF (Agrawal et al CIDR 2003)ndash Probabilistic ranking (Chaudhuri et al VLDB 2004)ndash Ordering attributes (Das et al SIGMOD 2006)ndash Context-sensitive (Agrawal et al SIGMOD 2006)ndash Based on object relationships (Chakrabarti et al SIGMOD

2006)ndash A lot of work in IR field (eg Shen et al SIGIR 2005)

15

Existing Solutions - Ranking

bull Problem the same ranking function for all users (except the context sensitive work)ndash User 1 prefers high returnsndash User 2 prefers lower risks (often means lower

returns)ndash User 3 prefers a mix of high return funds and low

risk fundsndash One ranking function can not fit all

16

Existing Solutions -Categorization

Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)

The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)

User can navigate this tree and show records in a node

Categorization is complementary to ranking (rank records in a node)

Root(193)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

0 lt= 3 Yr return lt 10 (93)

17

Existing Solutions -Categorization

Problem differences between users are not consideredRoot(193)

0 lt= 3 Yr return lt 10 (93)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

10 lt=1 Yr return lt 20 (55)

10 lt= Std lt 20 (48)

5 lt= Std lt 10 (7)

0 lt= 1 Yr return lt 10 (63)

0 lt= Std lt 5 (38)

User 1 3

User 2 3 User 2 3

This tree does not help user 2 and 3 much

We try to address this issue for categorization

18

Address Diverse User Preferences (Chen amp Li SIGMOD 2007)

bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference

(eg high return funds low risk funds)

bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters

19

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

20

System Log Databull Performance Metric

ndash CPUMemorySwap utilization workload average response timebull Application-level Log

ndash Windows NT system log Websphere application log Oracle activitybull Failure data

ndash System crash dumps error reportsbull Network Traffic Data

ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces

bull Request Datandash Apache and Microsoft IIS log

bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures

bull Othersndash Network-based alert logs Probes

21

logs

logs

logs

HistoricalData Collection

Realtime Analysis

Offline Analysis

Knowledge Base

Real Time Management

Anomaly Detection

Fault Diagnosis

Problem Determination

CorrelationDependencyKnowledge

SummarizationVisualizationRule Construction

Temporal Pattern DiscoveryKnowledge Management

PlanningActions

The Integrated Framework on Data-Driven Computing System Management

Log Data Organization

Componentlogs Active Data Collection

Log Adapter

Situation Identification

and Categorization

22

Raw data-gtData table

Aggregationsummarization

Event Plots

Pattern MiningPattern Mining

OutagesTemporal patternsEvent bursts

Morning rebootingperiodic patterns

Weekly rebootingPeriodic patterns

Hosts

Knowledge validation completion and construction

23

A Case Study

State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State

24

Pattern Visualization in the Case Study

25

ERN in the Case Study

26

Other Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis

27

bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key

Performance Indicators) to support better decision-making

bull Focus leading indicator discovery

ndash Using operational metrics to adaptively manage amp improve workflow efficiency

bull Focus workflow exception detection and tracing

28

Examples of KPI Types

Leading Indicators(Value Drivers)

Lagging Indicators(Outcomes )

Number of clients that sales people meet face to face each week

Sales revenue

Complex repairs completed successfully during the first call or visit

Customer satisfaction

Number of parts for which orders exceed forecasts within 30 days of scheduled delivery

Per unit manufacturing costs

Number of customers who are delinquent paying their first bill

Customer churn

29

Methodology for Leading Indicator Learning

Selected Leading Indicators or Leading Indicator Combinations

clustering

Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality

hellip

hellip

hellipThe root KPI may be the key leading indicator

Time seriesproduction model(eg assumption data model)

Leading Indicator Selection based on Domain Knowledge

sequence KPIs

Reduced Time series

define KPIs

correlation

dimensionalityreduction

If no update modelIs Model Valid

New Business Concern(s)

If yes

Leading Indicators

Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events

30

Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 15: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

15

Existing Solutions - Ranking

bull Problem the same ranking function for all users (except the context sensitive work)ndash User 1 prefers high returnsndash User 2 prefers lower risks (often means lower

returns)ndash User 3 prefers a mix of high return funds and low

risk fundsndash One ranking function can not fit all

16

Existing Solutions -Categorization

Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)

The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)

User can navigate this tree and show records in a node

Categorization is complementary to ranking (rank records in a node)

Root(193)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

0 lt= 3 Yr return lt 10 (93)

17

Existing Solutions -Categorization

Problem differences between users are not consideredRoot(193)

0 lt= 3 Yr return lt 10 (93)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

10 lt=1 Yr return lt 20 (55)

10 lt= Std lt 20 (48)

5 lt= Std lt 10 (7)

0 lt= 1 Yr return lt 10 (63)

0 lt= Std lt 5 (38)

User 1 3

User 2 3 User 2 3

This tree does not help user 2 and 3 much

We try to address this issue for categorization

18

Address Diverse User Preferences (Chen amp Li SIGMOD 2007)

bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference

(eg high return funds low risk funds)

bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters

19

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

20

System Log Databull Performance Metric

ndash CPUMemorySwap utilization workload average response timebull Application-level Log

ndash Windows NT system log Websphere application log Oracle activitybull Failure data

ndash System crash dumps error reportsbull Network Traffic Data

ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces

bull Request Datandash Apache and Microsoft IIS log

bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures

bull Othersndash Network-based alert logs Probes

21

logs

logs

logs

HistoricalData Collection

Realtime Analysis

Offline Analysis

Knowledge Base

Real Time Management

Anomaly Detection

Fault Diagnosis

Problem Determination

CorrelationDependencyKnowledge

SummarizationVisualizationRule Construction

Temporal Pattern DiscoveryKnowledge Management

PlanningActions

The Integrated Framework on Data-Driven Computing System Management

Log Data Organization

Componentlogs Active Data Collection

Log Adapter

Situation Identification

and Categorization

22

Raw data-gtData table

Aggregationsummarization

Event Plots

Pattern MiningPattern Mining

OutagesTemporal patternsEvent bursts

Morning rebootingperiodic patterns

Weekly rebootingPeriodic patterns

Hosts

Knowledge validation completion and construction

23

A Case Study

State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State

24

Pattern Visualization in the Case Study

25

ERN in the Case Study

26

Other Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis

27

bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key

Performance Indicators) to support better decision-making

bull Focus leading indicator discovery

ndash Using operational metrics to adaptively manage amp improve workflow efficiency

bull Focus workflow exception detection and tracing

28

Examples of KPI Types

Leading Indicators(Value Drivers)

Lagging Indicators(Outcomes )

Number of clients that sales people meet face to face each week

Sales revenue

Complex repairs completed successfully during the first call or visit

Customer satisfaction

Number of parts for which orders exceed forecasts within 30 days of scheduled delivery

Per unit manufacturing costs

Number of customers who are delinquent paying their first bill

Customer churn

29

Methodology for Leading Indicator Learning

Selected Leading Indicators or Leading Indicator Combinations

clustering

Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality

hellip

hellip

hellipThe root KPI may be the key leading indicator

Time seriesproduction model(eg assumption data model)

Leading Indicator Selection based on Domain Knowledge

sequence KPIs

Reduced Time series

define KPIs

correlation

dimensionalityreduction

If no update modelIs Model Valid

New Business Concern(s)

If yes

Leading Indicators

Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events

30

Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 16: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

16

Existing Solutions -Categorization

Chakrabarti et al SIGMOD 2004Divide query results recursively into categories (a tree structure)

The splitting attributes are selected based on query history (more frequently asked attributes are typically put closer to root)

User can navigate this tree and show records in a node

Categorization is complementary to ranking (rank records in a node)

Root(193)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

0 lt= 3 Yr return lt 10 (93)

17

Existing Solutions -Categorization

Problem differences between users are not consideredRoot(193)

0 lt= 3 Yr return lt 10 (93)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

10 lt=1 Yr return lt 20 (55)

10 lt= Std lt 20 (48)

5 lt= Std lt 10 (7)

0 lt= 1 Yr return lt 10 (63)

0 lt= Std lt 5 (38)

User 1 3

User 2 3 User 2 3

This tree does not help user 2 and 3 much

We try to address this issue for categorization

18

Address Diverse User Preferences (Chen amp Li SIGMOD 2007)

bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference

(eg high return funds low risk funds)

bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters

19

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

20

System Log Databull Performance Metric

ndash CPUMemorySwap utilization workload average response timebull Application-level Log

ndash Windows NT system log Websphere application log Oracle activitybull Failure data

ndash System crash dumps error reportsbull Network Traffic Data

ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces

bull Request Datandash Apache and Microsoft IIS log

bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures

bull Othersndash Network-based alert logs Probes

21

logs

logs

logs

HistoricalData Collection

Realtime Analysis

Offline Analysis

Knowledge Base

Real Time Management

Anomaly Detection

Fault Diagnosis

Problem Determination

CorrelationDependencyKnowledge

SummarizationVisualizationRule Construction

Temporal Pattern DiscoveryKnowledge Management

PlanningActions

The Integrated Framework on Data-Driven Computing System Management

Log Data Organization

Componentlogs Active Data Collection

Log Adapter

Situation Identification

and Categorization

22

Raw data-gtData table

Aggregationsummarization

Event Plots

Pattern MiningPattern Mining

OutagesTemporal patternsEvent bursts

Morning rebootingperiodic patterns

Weekly rebootingPeriodic patterns

Hosts

Knowledge validation completion and construction

23

A Case Study

State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State

24

Pattern Visualization in the Case Study

25

ERN in the Case Study

26

Other Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis

27

bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key

Performance Indicators) to support better decision-making

bull Focus leading indicator discovery

ndash Using operational metrics to adaptively manage amp improve workflow efficiency

bull Focus workflow exception detection and tracing

28

Examples of KPI Types

Leading Indicators(Value Drivers)

Lagging Indicators(Outcomes )

Number of clients that sales people meet face to face each week

Sales revenue

Complex repairs completed successfully during the first call or visit

Customer satisfaction

Number of parts for which orders exceed forecasts within 30 days of scheduled delivery

Per unit manufacturing costs

Number of customers who are delinquent paying their first bill

Customer churn

29

Methodology for Leading Indicator Learning

Selected Leading Indicators or Leading Indicator Combinations

clustering

Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality

hellip

hellip

hellipThe root KPI may be the key leading indicator

Time seriesproduction model(eg assumption data model)

Leading Indicator Selection based on Domain Knowledge

sequence KPIs

Reduced Time series

define KPIs

correlation

dimensionalityreduction

If no update modelIs Model Valid

New Business Concern(s)

If yes

Leading Indicators

Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events

30

Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 17: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

17

Existing Solutions -Categorization

Problem differences between users are not consideredRoot(193)

0 lt= 3 Yr return lt 10 (93)

10 lt= 3 Yr return lt 20(71)

3 Yr return gt= 20 (29)

10 lt=1 Yr return lt 20 (55)

10 lt= Std lt 20 (48)

5 lt= Std lt 10 (7)

0 lt= 1 Yr return lt 10 (63)

0 lt= Std lt 5 (38)

User 1 3

User 2 3 User 2 3

This tree does not help user 2 and 3 much

We try to address this issue for categorization

18

Address Diverse User Preferences (Chen amp Li SIGMOD 2007)

bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference

(eg high return funds low risk funds)

bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters

19

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

20

System Log Databull Performance Metric

ndash CPUMemorySwap utilization workload average response timebull Application-level Log

ndash Windows NT system log Websphere application log Oracle activitybull Failure data

ndash System crash dumps error reportsbull Network Traffic Data

ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces

bull Request Datandash Apache and Microsoft IIS log

bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures

bull Othersndash Network-based alert logs Probes

21

logs

logs

logs

HistoricalData Collection

Realtime Analysis

Offline Analysis

Knowledge Base

Real Time Management

Anomaly Detection

Fault Diagnosis

Problem Determination

CorrelationDependencyKnowledge

SummarizationVisualizationRule Construction

Temporal Pattern DiscoveryKnowledge Management

PlanningActions

The Integrated Framework on Data-Driven Computing System Management

Log Data Organization

Componentlogs Active Data Collection

Log Adapter

Situation Identification

and Categorization

22

Raw data-gtData table

Aggregationsummarization

Event Plots

Pattern MiningPattern Mining

OutagesTemporal patternsEvent bursts

Morning rebootingperiodic patterns

Weekly rebootingPeriodic patterns

Hosts

Knowledge validation completion and construction

23

A Case Study

State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State

24

Pattern Visualization in the Case Study

25

ERN in the Case Study

26

Other Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis

27

bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key

Performance Indicators) to support better decision-making

bull Focus leading indicator discovery

ndash Using operational metrics to adaptively manage amp improve workflow efficiency

bull Focus workflow exception detection and tracing

28

Examples of KPI Types

Leading Indicators(Value Drivers)

Lagging Indicators(Outcomes )

Number of clients that sales people meet face to face each week

Sales revenue

Complex repairs completed successfully during the first call or visit

Customer satisfaction

Number of parts for which orders exceed forecasts within 30 days of scheduled delivery

Per unit manufacturing costs

Number of customers who are delinquent paying their first bill

Customer churn

29

Methodology for Leading Indicator Learning

Selected Leading Indicators or Leading Indicator Combinations

clustering

Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality

hellip

hellip

hellipThe root KPI may be the key leading indicator

Time seriesproduction model(eg assumption data model)

Leading Indicator Selection based on Domain Knowledge

sequence KPIs

Reduced Time series

define KPIs

correlation

dimensionalityreduction

If no update modelIs Model Valid

New Business Concern(s)

If yes

Leading Indicators

Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events

30

Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 18: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

18

Address Diverse User Preferences (Chen amp Li SIGMOD 2007)

bull Offline infer a set of typical types of user preferencesndash Data are clustered using query history of all existing usersndash Each cluster represents a typical type of user preference

(eg high return funds low risk funds)

bull Runtime ndash A navigational tree is constructed over clusters in the query resultsndash This tree highlights the differences between these clustersndash User navigates this tree to select clusters that he is interested inndash User can then rank or browse records in those clusters

19

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

20

System Log Databull Performance Metric

ndash CPUMemorySwap utilization workload average response timebull Application-level Log

ndash Windows NT system log Websphere application log Oracle activitybull Failure data

ndash System crash dumps error reportsbull Network Traffic Data

ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces

bull Request Datandash Apache and Microsoft IIS log

bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures

bull Othersndash Network-based alert logs Probes

21

logs

logs

logs

HistoricalData Collection

Realtime Analysis

Offline Analysis

Knowledge Base

Real Time Management

Anomaly Detection

Fault Diagnosis

Problem Determination

CorrelationDependencyKnowledge

SummarizationVisualizationRule Construction

Temporal Pattern DiscoveryKnowledge Management

PlanningActions

The Integrated Framework on Data-Driven Computing System Management

Log Data Organization

Componentlogs Active Data Collection

Log Adapter

Situation Identification

and Categorization

22

Raw data-gtData table

Aggregationsummarization

Event Plots

Pattern MiningPattern Mining

OutagesTemporal patternsEvent bursts

Morning rebootingperiodic patterns

Weekly rebootingPeriodic patterns

Hosts

Knowledge validation completion and construction

23

A Case Study

State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State

24

Pattern Visualization in the Case Study

25

ERN in the Case Study

26

Other Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis

27

bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key

Performance Indicators) to support better decision-making

bull Focus leading indicator discovery

ndash Using operational metrics to adaptively manage amp improve workflow efficiency

bull Focus workflow exception detection and tracing

28

Examples of KPI Types

Leading Indicators(Value Drivers)

Lagging Indicators(Outcomes )

Number of clients that sales people meet face to face each week

Sales revenue

Complex repairs completed successfully during the first call or visit

Customer satisfaction

Number of parts for which orders exceed forecasts within 30 days of scheduled delivery

Per unit manufacturing costs

Number of customers who are delinquent paying their first bill

Customer churn

29

Methodology for Leading Indicator Learning

Selected Leading Indicators or Leading Indicator Combinations

clustering

Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality

hellip

hellip

hellipThe root KPI may be the key leading indicator

Time seriesproduction model(eg assumption data model)

Leading Indicator Selection based on Domain Knowledge

sequence KPIs

Reduced Time series

define KPIs

correlation

dimensionalityreduction

If no update modelIs Model Valid

New Business Concern(s)

If yes

Leading Indicators

Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events

30

Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 19: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

19

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

20

System Log Databull Performance Metric

ndash CPUMemorySwap utilization workload average response timebull Application-level Log

ndash Windows NT system log Websphere application log Oracle activitybull Failure data

ndash System crash dumps error reportsbull Network Traffic Data

ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces

bull Request Datandash Apache and Microsoft IIS log

bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures

bull Othersndash Network-based alert logs Probes

21

logs

logs

logs

HistoricalData Collection

Realtime Analysis

Offline Analysis

Knowledge Base

Real Time Management

Anomaly Detection

Fault Diagnosis

Problem Determination

CorrelationDependencyKnowledge

SummarizationVisualizationRule Construction

Temporal Pattern DiscoveryKnowledge Management

PlanningActions

The Integrated Framework on Data-Driven Computing System Management

Log Data Organization

Componentlogs Active Data Collection

Log Adapter

Situation Identification

and Categorization

22

Raw data-gtData table

Aggregationsummarization

Event Plots

Pattern MiningPattern Mining

OutagesTemporal patternsEvent bursts

Morning rebootingperiodic patterns

Weekly rebootingPeriodic patterns

Hosts

Knowledge validation completion and construction

23

A Case Study

State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State

24

Pattern Visualization in the Case Study

25

ERN in the Case Study

26

Other Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis

27

bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key

Performance Indicators) to support better decision-making

bull Focus leading indicator discovery

ndash Using operational metrics to adaptively manage amp improve workflow efficiency

bull Focus workflow exception detection and tracing

28

Examples of KPI Types

Leading Indicators(Value Drivers)

Lagging Indicators(Outcomes )

Number of clients that sales people meet face to face each week

Sales revenue

Complex repairs completed successfully during the first call or visit

Customer satisfaction

Number of parts for which orders exceed forecasts within 30 days of scheduled delivery

Per unit manufacturing costs

Number of customers who are delinquent paying their first bill

Customer churn

29

Methodology for Leading Indicator Learning

Selected Leading Indicators or Leading Indicator Combinations

clustering

Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality

hellip

hellip

hellipThe root KPI may be the key leading indicator

Time seriesproduction model(eg assumption data model)

Leading Indicator Selection based on Domain Knowledge

sequence KPIs

Reduced Time series

define KPIs

correlation

dimensionalityreduction

If no update modelIs Model Valid

New Business Concern(s)

If yes

Leading Indicators

Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events

30

Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 20: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

20

System Log Databull Performance Metric

ndash CPUMemorySwap utilization workload average response timebull Application-level Log

ndash Windows NT system log Websphere application log Oracle activitybull Failure data

ndash System crash dumps error reportsbull Network Traffic Data

ndash SNMP Link-level byte counts Netflow Link-level flow statistics PMA link-level packet header traces TCPdump host-side packet traces

bull Request Datandash Apache and Microsoft IIS log

bull Reports from Operatorsndash Trouble ticket data problems reports possible causes and symptoms for failures

bull Othersndash Network-based alert logs Probes

21

logs

logs

logs

HistoricalData Collection

Realtime Analysis

Offline Analysis

Knowledge Base

Real Time Management

Anomaly Detection

Fault Diagnosis

Problem Determination

CorrelationDependencyKnowledge

SummarizationVisualizationRule Construction

Temporal Pattern DiscoveryKnowledge Management

PlanningActions

The Integrated Framework on Data-Driven Computing System Management

Log Data Organization

Componentlogs Active Data Collection

Log Adapter

Situation Identification

and Categorization

22

Raw data-gtData table

Aggregationsummarization

Event Plots

Pattern MiningPattern Mining

OutagesTemporal patternsEvent bursts

Morning rebootingperiodic patterns

Weekly rebootingPeriodic patterns

Hosts

Knowledge validation completion and construction

23

A Case Study

State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State

24

Pattern Visualization in the Case Study

25

ERN in the Case Study

26

Other Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis

27

bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key

Performance Indicators) to support better decision-making

bull Focus leading indicator discovery

ndash Using operational metrics to adaptively manage amp improve workflow efficiency

bull Focus workflow exception detection and tracing

28

Examples of KPI Types

Leading Indicators(Value Drivers)

Lagging Indicators(Outcomes )

Number of clients that sales people meet face to face each week

Sales revenue

Complex repairs completed successfully during the first call or visit

Customer satisfaction

Number of parts for which orders exceed forecasts within 30 days of scheduled delivery

Per unit manufacturing costs

Number of customers who are delinquent paying their first bill

Customer churn

29

Methodology for Leading Indicator Learning

Selected Leading Indicators or Leading Indicator Combinations

clustering

Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality

hellip

hellip

hellipThe root KPI may be the key leading indicator

Time seriesproduction model(eg assumption data model)

Leading Indicator Selection based on Domain Knowledge

sequence KPIs

Reduced Time series

define KPIs

correlation

dimensionalityreduction

If no update modelIs Model Valid

New Business Concern(s)

If yes

Leading Indicators

Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events

30

Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 21: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

21

logs

logs

logs

HistoricalData Collection

Realtime Analysis

Offline Analysis

Knowledge Base

Real Time Management

Anomaly Detection

Fault Diagnosis

Problem Determination

CorrelationDependencyKnowledge

SummarizationVisualizationRule Construction

Temporal Pattern DiscoveryKnowledge Management

PlanningActions

The Integrated Framework on Data-Driven Computing System Management

Log Data Organization

Componentlogs Active Data Collection

Log Adapter

Situation Identification

and Categorization

22

Raw data-gtData table

Aggregationsummarization

Event Plots

Pattern MiningPattern Mining

OutagesTemporal patternsEvent bursts

Morning rebootingperiodic patterns

Weekly rebootingPeriodic patterns

Hosts

Knowledge validation completion and construction

23

A Case Study

State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State

24

Pattern Visualization in the Case Study

25

ERN in the Case Study

26

Other Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis

27

bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key

Performance Indicators) to support better decision-making

bull Focus leading indicator discovery

ndash Using operational metrics to adaptively manage amp improve workflow efficiency

bull Focus workflow exception detection and tracing

28

Examples of KPI Types

Leading Indicators(Value Drivers)

Lagging Indicators(Outcomes )

Number of clients that sales people meet face to face each week

Sales revenue

Complex repairs completed successfully during the first call or visit

Customer satisfaction

Number of parts for which orders exceed forecasts within 30 days of scheduled delivery

Per unit manufacturing costs

Number of customers who are delinquent paying their first bill

Customer churn

29

Methodology for Leading Indicator Learning

Selected Leading Indicators or Leading Indicator Combinations

clustering

Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality

hellip

hellip

hellipThe root KPI may be the key leading indicator

Time seriesproduction model(eg assumption data model)

Leading Indicator Selection based on Domain Knowledge

sequence KPIs

Reduced Time series

define KPIs

correlation

dimensionalityreduction

If no update modelIs Model Valid

New Business Concern(s)

If yes

Leading Indicators

Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events

30

Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 22: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

22

Raw data-gtData table

Aggregationsummarization

Event Plots

Pattern MiningPattern Mining

OutagesTemporal patternsEvent bursts

Morning rebootingperiodic patterns

Weekly rebootingPeriodic patterns

Hosts

Knowledge validation completion and construction

23

A Case Study

State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State

24

Pattern Visualization in the Case Study

25

ERN in the Case Study

26

Other Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis

27

bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key

Performance Indicators) to support better decision-making

bull Focus leading indicator discovery

ndash Using operational metrics to adaptively manage amp improve workflow efficiency

bull Focus workflow exception detection and tracing

28

Examples of KPI Types

Leading Indicators(Value Drivers)

Lagging Indicators(Outcomes )

Number of clients that sales people meet face to face each week

Sales revenue

Complex repairs completed successfully during the first call or visit

Customer satisfaction

Number of parts for which orders exceed forecasts within 30 days of scheduled delivery

Per unit manufacturing costs

Number of customers who are delinquent paying their first bill

Customer churn

29

Methodology for Leading Indicator Learning

Selected Leading Indicators or Leading Indicator Combinations

clustering

Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality

hellip

hellip

hellipThe root KPI may be the key leading indicator

Time seriesproduction model(eg assumption data model)

Leading Indicator Selection based on Domain Knowledge

sequence KPIs

Reduced Time series

define KPIs

correlation

dimensionalityreduction

If no update modelIs Model Valid

New Business Concern(s)

If yes

Leading Indicators

Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events

30

Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 23: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

23

A Case Study

State start stop dependency create connection report request configuration otherPlotX‐axis Time Y‐axis State

24

Pattern Visualization in the Case Study

25

ERN in the Case Study

26

Other Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis

27

bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key

Performance Indicators) to support better decision-making

bull Focus leading indicator discovery

ndash Using operational metrics to adaptively manage amp improve workflow efficiency

bull Focus workflow exception detection and tracing

28

Examples of KPI Types

Leading Indicators(Value Drivers)

Lagging Indicators(Outcomes )

Number of clients that sales people meet face to face each week

Sales revenue

Complex repairs completed successfully during the first call or visit

Customer satisfaction

Number of parts for which orders exceed forecasts within 30 days of scheduled delivery

Per unit manufacturing costs

Number of customers who are delinquent paying their first bill

Customer churn

29

Methodology for Leading Indicator Learning

Selected Leading Indicators or Leading Indicator Combinations

clustering

Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality

hellip

hellip

hellipThe root KPI may be the key leading indicator

Time seriesproduction model(eg assumption data model)

Leading Indicator Selection based on Domain Knowledge

sequence KPIs

Reduced Time series

define KPIs

correlation

dimensionalityreduction

If no update modelIs Model Valid

New Business Concern(s)

If yes

Leading Indicators

Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events

30

Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 24: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

24

Pattern Visualization in the Case Study

25

ERN in the Case Study

26

Other Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis

27

bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key

Performance Indicators) to support better decision-making

bull Focus leading indicator discovery

ndash Using operational metrics to adaptively manage amp improve workflow efficiency

bull Focus workflow exception detection and tracing

28

Examples of KPI Types

Leading Indicators(Value Drivers)

Lagging Indicators(Outcomes )

Number of clients that sales people meet face to face each week

Sales revenue

Complex repairs completed successfully during the first call or visit

Customer satisfaction

Number of parts for which orders exceed forecasts within 30 days of scheduled delivery

Per unit manufacturing costs

Number of customers who are delinquent paying their first bill

Customer churn

29

Methodology for Leading Indicator Learning

Selected Leading Indicators or Leading Indicator Combinations

clustering

Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality

hellip

hellip

hellipThe root KPI may be the key leading indicator

Time seriesproduction model(eg assumption data model)

Leading Indicator Selection based on Domain Knowledge

sequence KPIs

Reduced Time series

define KPIs

correlation

dimensionalityreduction

If no update modelIs Model Valid

New Business Concern(s)

If yes

Leading Indicators

Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events

30

Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 25: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

25

ERN in the Case Study

26

Other Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis

27

bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key

Performance Indicators) to support better decision-making

bull Focus leading indicator discovery

ndash Using operational metrics to adaptively manage amp improve workflow efficiency

bull Focus workflow exception detection and tracing

28

Examples of KPI Types

Leading Indicators(Value Drivers)

Lagging Indicators(Outcomes )

Number of clients that sales people meet face to face each week

Sales revenue

Complex repairs completed successfully during the first call or visit

Customer satisfaction

Number of parts for which orders exceed forecasts within 30 days of scheduled delivery

Per unit manufacturing costs

Number of customers who are delinquent paying their first bill

Customer churn

29

Methodology for Leading Indicator Learning

Selected Leading Indicators or Leading Indicator Combinations

clustering

Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality

hellip

hellip

hellipThe root KPI may be the key leading indicator

Time seriesproduction model(eg assumption data model)

Leading Indicator Selection based on Domain Knowledge

sequence KPIs

Reduced Time series

define KPIs

correlation

dimensionalityreduction

If no update modelIs Model Valid

New Business Concern(s)

If yes

Leading Indicators

Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events

30

Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 26: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

26

Other Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysis

27

bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key

Performance Indicators) to support better decision-making

bull Focus leading indicator discovery

ndash Using operational metrics to adaptively manage amp improve workflow efficiency

bull Focus workflow exception detection and tracing

28

Examples of KPI Types

Leading Indicators(Value Drivers)

Lagging Indicators(Outcomes )

Number of clients that sales people meet face to face each week

Sales revenue

Complex repairs completed successfully during the first call or visit

Customer satisfaction

Number of parts for which orders exceed forecasts within 30 days of scheduled delivery

Per unit manufacturing costs

Number of customers who are delinquent paying their first bill

Customer churn

29

Methodology for Leading Indicator Learning

Selected Leading Indicators or Leading Indicator Combinations

clustering

Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality

hellip

hellip

hellipThe root KPI may be the key leading indicator

Time seriesproduction model(eg assumption data model)

Leading Indicator Selection based on Domain Knowledge

sequence KPIs

Reduced Time series

define KPIs

correlation

dimensionalityreduction

If no update modelIs Model Valid

New Business Concern(s)

If yes

Leading Indicators

Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events

30

Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 27: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

27

bull Operationalbusiness Intelligencendash Collecting and analyzing metrics (Key

Performance Indicators) to support better decision-making

bull Focus leading indicator discovery

ndash Using operational metrics to adaptively manage amp improve workflow efficiency

bull Focus workflow exception detection and tracing

28

Examples of KPI Types

Leading Indicators(Value Drivers)

Lagging Indicators(Outcomes )

Number of clients that sales people meet face to face each week

Sales revenue

Complex repairs completed successfully during the first call or visit

Customer satisfaction

Number of parts for which orders exceed forecasts within 30 days of scheduled delivery

Per unit manufacturing costs

Number of customers who are delinquent paying their first bill

Customer churn

29

Methodology for Leading Indicator Learning

Selected Leading Indicators or Leading Indicator Combinations

clustering

Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality

hellip

hellip

hellipThe root KPI may be the key leading indicator

Time seriesproduction model(eg assumption data model)

Leading Indicator Selection based on Domain Knowledge

sequence KPIs

Reduced Time series

define KPIs

correlation

dimensionalityreduction

If no update modelIs Model Valid

New Business Concern(s)

If yes

Leading Indicators

Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events

30

Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 28: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

28

Examples of KPI Types

Leading Indicators(Value Drivers)

Lagging Indicators(Outcomes )

Number of clients that sales people meet face to face each week

Sales revenue

Complex repairs completed successfully during the first call or visit

Customer satisfaction

Number of parts for which orders exceed forecasts within 30 days of scheduled delivery

Per unit manufacturing costs

Number of customers who are delinquent paying their first bill

Customer churn

29

Methodology for Leading Indicator Learning

Selected Leading Indicators or Leading Indicator Combinations

clustering

Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality

hellip

hellip

hellipThe root KPI may be the key leading indicator

Time seriesproduction model(eg assumption data model)

Leading Indicator Selection based on Domain Knowledge

sequence KPIs

Reduced Time series

define KPIs

correlation

dimensionalityreduction

If no update modelIs Model Valid

New Business Concern(s)

If yes

Leading Indicators

Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events

30

Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 29: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

29

Methodology for Leading Indicator Learning

Selected Leading Indicators or Leading Indicator Combinations

clustering

Similarity matrix (or Distance matrix) using non-metric distance or Granger Causality

hellip

hellip

hellipThe root KPI may be the key leading indicator

Time seriesproduction model(eg assumption data model)

Leading Indicator Selection based on Domain Knowledge

sequence KPIs

Reduced Time series

define KPIs

correlation

dimensionalityreduction

If no update modelIs Model Valid

New Business Concern(s)

If yes

Leading Indicators

Raw databullProduction metricsbullWorkflow activity logsbullJobDeviceDocument events

30

Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 30: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

30

Some Projects in My Groupbull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Social Network Analysisbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 31: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

31

Biological Data SourcesGene Expression Text LiteratureSequence Data

Clustering Protein ExpressionProtein-protein Interactions

Transcription FactorBinding SitesGene Ontology

Gene Clusters

Function Co-regulation

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 32: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

32

Problem We are Studyingbull Marker Gene Identification

ndash httpwwwcsfiuedu~yzhan004geneselhtmlndash 15 gene selection algorithms

bull ReliefF F-statistic A-optimality D-Optimality mRNR ReliefF-mRNR GSNR

bull RankGene 11

bull How to perform clustering using data from multiple diverse sources

bull How to perform clustering using partial prior knowledge

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 33: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

33

Some Projects in My Group

bull Intelligent Help Deskbull Database Navigationbull Music Information Retrieval bull Computing System Managementbull Business Intelligence

ndash Leading KPI Discovery bull Bioinformatics bull Social Network Analysisbull Matrix-Based Algorithms for Data Mining

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 34: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

34

Business Continuity Information Network (BCIN)

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 35: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

35

BCIN

bull Central questionndash How to help businesses recover and resume

operations quicker after a major hurricane (or other) disaster

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 36: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

3620082008--44--66 Florida International UniversityFlorida International University 3636

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 37: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

3720082008--44--66 Florida International UniversityFlorida International University 3737

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 38: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

3820082008--44--66 Florida International UniversityFlorida International University 3838

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 39: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

3920082008--44--66 Florida International UniversityFlorida International University 3939

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 40: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

40Florida International UniversityFlorida International University

Situation Analysis

John Smith Asst Manager Store 2345

Rcmd Non-Functional 10107 1001am R CarrRequest Roofer assessment amp repair Windowdoor replacementComment Parking lot may drain in 24hrs

Status EmailActions Report Contact JRNLRecommendation Summaries | SA OverallStaff Inadequate 10107 1232pm J SmithFacility Inadequate 10107 1001am R CarrGoods Inadequate 10107 1058am B FallsTrans Functional 10107 1145am L Staff

Facility Non-Func 10107 1001am R Carr

FACILITY Building amp GroundsRoof Damaged 10107 715am R GarciaWindows Damaged 10107 758am R GarciaDoor Damaged 10107 650am R GarciaSecurity NAWater NAParking Damaged 10107 640 am R Garcia

33

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 41: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

41

Acknowledgementsbull Funding

ndash NSF Career Awardndash NIH ndash Faculty Research Awards from IBM Xerox and NECndash Equipment Donation from IBM

bull Collaboratorsndash Dr Chris Ding (University of Texas at Arlington)ndash Dr Michael I Jordan (UC Berkeley)ndash Dr Charles Perng (IBM Research)ndash Dr Tong Sun (Xerox Research)ndash Dr Shenghuo Zhu (NEC Research)

bull PhD Studentsndash Dr David Kaiserndash Dr Wei Pengndash Bo Shaondash Dingding Wangndash Haifeng Wang ndash Yi Zhang

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question
Page 42: Data Mining: Efficiently Extracting Interpretable and ...taoli/TaoLi-Talk-1.pdf · Data Mining: Efficiently Extracting Interpretable and Actionable Patterns ... – Efficiently extracting

42

Questionbull Email taolicsfiuedubull httpwwwcsfiuedu~taoli

  • Data Mining Efficiently Extracting Interpretable and Actionable Patterns
  • Research Challenges
  • My Research Directions
  • Some Projects in My Group
  • Example of Call Center Cases
  • Motivation
  • Two Different Approaches
  • Semantic Feature Extraction
  • Case Study ndash Online Help Desk
  • Some Projects in My Group
  • SQL-Query-Result Navigation
  • Existing Solutions ndash Iterative Refinement
  • Existing Solutions - Ranking
  • Existing Solutions - Ranking
  • Existing Solutions - Categorization
  • Existing Solutions - Categorization
  • Address Diverse User Preferences (Chen amp Li SIGMOD 2007)
  • Some Projects in My Group
  • System Log Data
  • Knowledge validation completion and construction
  • A Case Study
  • Pattern Visualization in the Case Study
  • ERN in the Case Study
  • Other Projects in My Group
  • Examples of KPI Types
  • Methodology for Leading Indicator Learning
  • Some Projects in My Group
  • Biological Data Sources
  • Problem We are Studying
  • Some Projects in My Group
  • Business Continuity Information Network (BCIN)
  • BCIN
  • Acknowledgements
  • Question