data mining-current status and research directions
Post on 12-Jan-2015
April 10, 2023 Data Mining: Status and Directions 1
Data Mining: Current Status and Research Directions
Jiawei Han
Intelligent Database Systems Research Lab
School of Computing Science
Simon Fraser University, Canada
http://www.cs.sfu.ca/~han
Why Is Data Mining Hot?
  Data mining (knowledge discovery in databases)
    Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information (knowledge) or patterns from data in large databases or other information repositories
  Necessity is the mother of invention
    Data is everywhere, so data mining should be everywhere, too!
    Understand and use data: an imminent task!
Data, Data, Everywhere!!
  Relational database: a commodity of every enterprise
  Huge data warehouses are under construction
  POS (point of sale): transactional DBs in terabytes
  Object-relational databases, distributed, heterogeneous, and legacy databases
  Spatial databases (GIS), remote sensing databases (EOS), and scientific/engineering databases
  Time-series data (e.g., stock trading) and temporal data
  Text (documents, emails) and multimedia databases
  WWW: a huge, hyper-linked, dynamic, global information system
Data Mining Is Everywhere, Too! A Multi-Dimensional View of Data Mining
  Databases to be mined
    Relational, transactional, object-relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW, etc.
  Knowledge to be mined
    Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.
  Techniques utilized
    Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.
  Applications adapted
    Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.
Data Mining: Confluence of Multiple Disciplines
  [Figure: Data Mining drawing on Database Technology, Statistics, Machine Learning (AI), Visualization, Information Science, and Other Disciplines]
Data Mining: One Can Trace It Back to Early Civilization
  Most scientific discoveries involve “data mining”
    Kepler’s laws, Newton’s laws, the periodic table of chemical elements, …, from the “big bang” to DNA
  Statistics: a discipline dedicated to data analysis
  Then why data mining? What are the differences?
    Huge amounts of data: gigabytes to terabytes
    Fast computers: quick response, interactive analysis
    Multi-dimensional, powerful, thorough analysis
    High-level, “declarative”: user’s ease and control
    Automated or semi-automated: mining functions hidden or built into many systems
A Brief History of Data Mining Activities
  1989 IJCAI Workshop on Knowledge Discovery in Databases
    Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
  1991-1994 Workshops on Knowledge Discovery in Databases
    Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
  1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
    Journal of Data Mining and Knowledge Discovery (1997)
  1998 ACM SIGKDD, SIGKDD’1999-2001 conferences, and SIGKDD Explorations
  More conferences on data mining
    PAKDD, PKDD, SIAM-Data Mining, (IEEE) ICDM, DaWaK, SPIE-DM, etc.
Research Progress in the Last Decade
  Multi-dimensional data analysis: data warehouse and OLAP (on-line analytical processing)
  Association, correlation, and causality analysis
  Classification: scalability and new approaches
  Clustering and outlier analysis
  Sequential patterns and time-series analysis
  Similarity analysis: curves, trends, images, texts, etc.
  Text mining, Web mining and Weblog analysis
  Spatial, multimedia, scientific data analysis
  Data preprocessing and database compression
  Data visualization and visual data mining
  Many others, e.g., collaborative filtering
Multi-Dimensional Data Analysis
  Data warehousing: integration from heterogeneous or semi-structured databases
  Multi-dimensional modeling of data: star & snowflake schemas
  Efficient and scalable computation of data cubes or iceberg cubes
  OLAP (on-line analytical processing): drilling, dicing, slicing, etc.
  Discovery-driven exploration of data cubes
  From OLAP to OLAM: a multi-dimensional view for on-line analytical mining
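As a rough illustration of the data cube and iceberg cube ideas on this slide, the sketch below aggregates a measure over every subset of the dimensions and then keeps only the cells that reach a threshold. The function names (`compute_cube`, `iceberg`), the toy `sales` rows, and the threshold are assumptions made for this example. It is a naive enumeration, not one of the efficient cube algorithms the slide alludes to.

```python
from collections import Counter
from itertools import combinations

def compute_cube(rows, dims, measure):
    """Sum `measure` for every subset of `dims` (hypothetical helper)."""
    cube = {}
    for k in range(len(dims) + 1):
        for group in combinations(dims, k):
            totals = Counter()
            for row in rows:
                # A cell is identified by the row's values on the grouped dimensions.
                key = tuple(row[d] for d in group)
                totals[key] += row[measure]
            cube[group] = dict(totals)
    return cube

def iceberg(cube, threshold):
    """Keep only cells whose aggregate is at least `threshold`."""
    return {
        group: {cell: v for cell, v in cells.items() if v >= threshold}
        for group, cells in cube.items()
    }

# Toy data, invented for illustration only.
sales = [
    {"city": "Vancouver", "item": "milk", "amount": 10},
    {"city": "Vancouver", "item": "bread", "amount": 5},
    {"city": "Toronto", "item": "milk", "amount": 7},
]
cube = compute_cube(sales, ("city", "item"), "amount")
berg = iceberg(cube, 10)
```

Here `cube[("city",)]` is the roll-up over items, `cube[("city", "item")]` is the most detailed cuboid, and `cube[()]` holds the grand total.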
Association and Frequent Pattern Analysis
  Efficient mining of frequent patterns and association rules: Apriori and FP-growth algorithms
  Multi-level, multi-dimensional, quantitative association mining
  From association to correlation, sequential patterns, partial periodicity, cyclic rules, ratio rules, etc.
  Query and constraint-based association analysis
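The level-wise idea behind Apriori can be sketched as follows: count single items, then repeatedly join frequent (k-1)-itemsets into k-item candidates, prune any candidate that has an infrequent subset, and count the survivors. This is a minimal in-memory sketch. The function name `apriori`, the absolute `min_support` count, and the toy baskets are assumptions for illustration, and it omits the optimizations of a production miner.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets occurring in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        candidates = set()
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets.
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(frozenset(sub) in frequent for sub in combinations(union, k - 1)):
                candidates.add(union)
        # Count step: one pass over the transactions per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result

# Toy baskets, invented for illustration only.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
]
frequent_itemsets = apriori(baskets, min_support=2)
```

With these baskets every single item and every pair is frequent, while the three-item set appears only once and is dropped.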
Classification: Scalable Methods and Handling of Complex Types of Data
  Classification has been an essential theme in machine learning and statistics research
    Decision trees, Bayesian classification, neural networks, k-nearest neighbors, etc.
    Tree-pruning, boosting, bagging techniques
  Efficient and scalable classification methods
    Exploration of attribute-class pairs
    SLIQ, SPRINT, RainForest, BOAT, etc.
  Classification of semi-structured and non-structured data
    Classification by clustering association rules (ARCS)
    Association-based classification
    Web document classification
Clustering and Outlier Analysis
  Partitioning methods
    k-means, k-medoids, CLARANS
  Hierarchical methods: micro-clusters
    BIRCH, CURE, Chameleon
  Density-based methods: DBSCAN, OPTICS, and DENCLUE
  Grid-based methods
    STING, CLIQUE, WaveCluster
  Outlier analysis: statistics-based, distance-based, deviation-based
  Constraint-based clustering
    COD (Clustering with Obstructed Distance)
    User-specified constraints
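Of the partitioning methods listed, k-means is the simplest to sketch: alternate between assigning each point to its nearest centroid and moving each centroid to the mean of its members. The version below is a minimal sketch with assumptions made for this example: it seeds the centroids with the first k points so the run is deterministic, it uses squared Euclidean distance, and the name `kmeans` and the sample points are invented.

```python
def kmeans(points, k, iterations=10):
    """Plain k-means on tuples of floats; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment

# Toy points, invented for illustration only.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0)]
centroids, assignment = kmeans(points, k=2)
```

Real implementations usually pick better seeds and stop once the assignment no longer changes. The fixed iteration count keeps the sketch short.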
Sequential Patterns and Time-Series Analysis
  Trend analysis
    Trend movement vs. cyclic variations, seasonal variations and random fluctuations
  Similarity search in time-series databases
    Handling gaps, scaling, etc.
    Indexing methods and query languages for time-series
  Sequential pattern mining
    Various kinds of sequences, various methods
    From GSP to PrefixSpan
  Periodicity analysis
    Full periodicity, partial periodicity, cyclic association rules
Similarity Search: Similar Curves, Trends, Images, and Texts
  Various kinds of data, various similarity mining methods
  Discovery of similar trends in time-series data
    Data transformation & high-dimensional structures
  Finding similar images based on color, texture, etc.
    Content-based vs. keyword-based retrieval
    Color histogram-based signature
    Multi-feature composed signature
  Finding documents with similar texts
    Similar keywords (synonymy & polysemy)
    Term frequency matrix
    Latent semantic indexing
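A term frequency matrix can be sketched as one row of word counts per document, with documents compared by the cosine of the angle between their rows. The example below is a minimal sketch under assumptions of its own: it tokenizes on whitespace, applies no stemming or weighting, and the helper names and sample documents are invented. It does not cover latent semantic indexing, which would further factor this matrix.

```python
import math
from collections import Counter

def term_frequency_matrix(docs):
    """Return (vocabulary, rows), where rows[i][j] counts vocabulary[j] in docs[i]."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    rows = []
    for d in docs:
        counts = Counter(d.lower().split())
        rows.append([counts[w] for w in vocab])
    return vocab, rows

def cosine(u, v):
    """Cosine similarity of two count vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy documents, invented for illustration only.
docs = [
    "data mining finds patterns",
    "mining data finds patterns",
    "web pages link together",
]
vocab, tf = term_frequency_matrix(docs)
same_words = cosine(tf[0], tf[1])    # same bag of words, different order
no_overlap = cosine(tf[0], tf[2])    # no shared terms
```

Because the representation ignores word order, the first two documents are indistinguishable, which is one reason synonymy and polysemy need extra handling.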
Spatial, Multimedia, Scientific Data Analysis
  Multi-dimensional analysis of spatial, multimedia and scientific data
    Geo-spatial data cube and spatial OLAP
    The curse of dimensionality problem
  Association analysis
    A progressive refinement methodology
    Micro-clustering can be used for preprocessing in the analysis of complex types of data
  Classification
    Association-based for handling high-dimensionality and sparse data
Data Mining Industry and Applications
  From research prototypes to data mining products, languages, and standards
    IBM Intelligent Miner, SAS Enterprise Miner, SGI MineSet, Clementine, MS/SQLServer 2000, DBMiner, BlueMartini, MineIt, DigiMine, etc.
    A few data mining languages and standards (esp. MS OLEDB for Data Mining)
  Application achievements in many domains
    Market analysis, trend analysis, fraud detection, outlier analysis, Web mining, etc.
Web Mining: A Fast Expanding Frontier in Data Mining
  Mine what Web search engine finds
  Automatic classification of Web documents
  Discovery of authoritative Web pages, Web structures and Web communities
  Meta-Web warehousing: Web yellow page service
  Web usage mining
Mine What Web Search Engine Finds
  Current Web search engines: a convenient source for mining
    Keyword-based, return too many, often low-quality answers, still missing a lot, not customized, etc.
  Data mining will help:
    Coverage: “enlarge and then shrink,” using synonyms and conceptual hierarchies
    Better search primitives: user preferences/hints
    Linkage analysis: authoritative pages and clusters
    Web-based languages: XML + WebSQL + WebML
    Customization: home page + Weblog + user profiles
Discovery of Authoritative Pages in WWW
  PageRank method (Brin and Page, 1998):
    Rank the "importance" of Web pages, based on a model of a "random browser."
  Hub/authority method (Kleinberg, 1998):
    Prominent authorities often do not endorse one another directly on the Web.
    Hub pages have a large number of links to many relevant authorities.
    Thus hubs and authorities exhibit a mutually reinforcing relationship: a good hub points to many good authorities, and a good authority is pointed to by many good hubs.
  Both the PageRank and hub/authority methodologies have been shown to provide qualitatively good search results for broad query topics on the WWW.
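The mutual reinforcement between hubs and authorities can be sketched as an iteration: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to, with both vectors rescaled each round. The code below is a minimal sketch of that idea. The function name `hits`, the fixed iteration count, and the tiny link graph are assumptions for illustration, and it omits the query-specific subgraph construction of the full method.

```python
def hits(links, iterations=50):
    """links maps each page to the list of pages it links to (hypothetical input format)."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: add up the hub scores of the pages that link in.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub update: add up the authority scores of the pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Rescale both vectors to unit length so the scores stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Toy link graph, invented for illustration: three hubs and two authorities
# that never link to each other directly.
web = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(web)
```

In this graph `a1` ends up with the highest authority score because all three hubs point to it, and `h1` and `h2` outrank `h3` as hubs because they point to both authorities.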
Automatic Classification of Web Documents
  Web document classification:
    Good human classification: Yahoo!, CS term hierarchies
    These classifications can be used as training sets to build up learning models
  Keyword-based classification is different from multi-dimensional classification
    Association or clustering-based classification is often more effective
    Multi-level classification is important
Web Usage (Click-Stream) Mining
  Weblog provides rich information about Web dynamics
  Multidimensional Weblog analysis:
    Disclose potential customers, users, markets, etc.
  Plan mining (mining general Web accessing regularities):
    Web linkage adjustment, performance improvements
  Web accessing association/sequential pattern analysis:
    Web caching, prefetching, swapping
  Trend analysis:
    Dynamics of the Web: what has been changing?
  Customized to individual users
Querying and Mining: An Integrated Information Analysis Environment
  Data mining as a component of DBMS, data warehouse, or Web information system
    Integrated information processing environment
      MS/SQLServer-2000 (Analysis service)
      IBM IntelligentMiner on DB2
      SAS EnterpriseMiner: data warehousing + mining
  Query-based mining
    Querying database/DW/Web knowledge
    Efficiency and flexibility: preprocessing, on-line processing, optimization, integration, etc.
Basic Mining Operations and Mining Query Optimization
  Relational databases: there is a set of basic relational operations and a standard query language, SQL
    E.g., selection, projection, join, set difference, intersection, Cartesian product, etc.
  Is there a set of standard data mining operations on which optimizations can be done?
    Difficulty: different definitions of operations
    Importance: optimization can be performed on them systematically; standardization facilitates information exchange and system interoperability
“Vertical” Data Mining
  Generic data mining tools? Too simple to match domain-specific, sophisticated applications
    Expert knowledge and business logic represent many years of work in their own fields!
    Data mining + business logic + domain experts
  A multi-dimensional view of data miners
    Complexity of data: Web, sequence, spatial, multimedia, …
    Complexity of domains: DNA, astronomy, market, telecom, …
  Domain-specific data mining tools
    Provide concrete, killer solutions to specific problems
    Feedback to build more powerful tools
One Picture May Be Worth 1000 Words!
  Visual data mining
    Visualization of data
    Visualization of data mining results
    Visualization of data mining processes
    Interactive data mining: visual classification
  One melody may be worth 1000 words too!
    Audio data mining: turn data into music and melody!
    Uses audio signals to indicate the patterns of data or the features of data mining results
Visualization of data mining results in SAS Enterprise Miner: scatter plots
Visualization of association rules in MineSet 3.0
Visualization of a decision tree in MineSet 3.0
Visualization of Data Mining Processes by Clementine
Interactive Visual Mining by Perception-Based Classification (PBC)
Constraint-Based Mining
  What kinds of constraints can be used in mining?
    Knowledge type constraint: classification, association, etc.
    Data constraint: SQL-like queries
      Find products sold together in Vancouver in Feb.’01.
    Dimension/level constraints:
      In relevance to region, price, brand, customer category.
    Rule constraints:
      Small sales (price < $10) trigger big sales (sum > $200).
    Interestingness constraints:
      E.g., strong rules (min_support ≥ 3%, min_confidence ≥ 60%, min_lift > 3.0).
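The interestingness thresholds in the last bullet can be made concrete by computing support, confidence, and lift for a candidate rule and checking them against the stated minimums. The sketch below uses the standard definitions: support is the fraction of transactions containing both sides, confidence is that count divided by the count for the antecedent, and lift is confidence divided by the consequent's overall frequency. The helper names, the toy transactions, and the reading of the support and confidence thresholds as inclusive lower bounds are assumptions for illustration.

```python
def rule_measures(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    a = frozenset(antecedent)
    c = frozenset(consequent)
    count_a = sum(1 for t in transactions if a <= t)
    count_c = sum(1 for t in transactions if c <= t)
    count_ac = sum(1 for t in transactions if (a | c) <= t)
    support = count_ac / n
    confidence = count_ac / count_a if count_a else 0.0
    lift = confidence / (count_c / n) if count_c else 0.0
    return support, confidence, lift

def is_strong(measures, min_support=0.03, min_confidence=0.60, min_lift=3.0):
    """Apply the slide's example thresholds (assumed: >= for the first two, > for lift)."""
    support, confidence, lift = measures
    return support >= min_support and confidence >= min_confidence and lift > min_lift

# Toy transactions, invented for illustration only.
transactions = (
    [{"pen", "printer"}] * 2
    + [{"pen"}]
    + [{"paper"}] * 7
)
measures = rule_measures(transactions, {"pen"}, {"printer"})
strong = is_strong(measures)
```

For this data the rule pen -> printer has support 0.2, confidence 2/3, and lift 10/3, so it passes all three example thresholds.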
Conclusions
  Data mining: a promising research frontier
  Data mining research has been striding forward greatly in the last decade
  However, data mining, as an industry, has not been flying as high as expected
  Much research and application exploration are needed
    Web mining
    Towards integrated data mining environments and tools
    Towards intelligent, efficient, and scalable data mining methods
http://www.cs.sfu.ca/~han http://db.cs.sfu.ca
Thank you!!!
References
J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
J. Han, L. V. S. Lakshmanan, and R. T. Ng, "Constraint-Based, Multidimensional Data Mining", COMPUTER (special issues on Data Mining), 32(8): 46-50, 1999.