Institut für Softwarewissenschaften – Universität Wien, P. Brezany
Data Mining - Introduction
Peter Brezany
Institut für Scientific Computing
Universität Wien
Tel. 4277 39425
Office hours: Tue, 13:00-14:00
Outline
• Business Intelligence and its components
• Knowledge discovery in databases
• Data mining techniques
  - associative and sequence rules
  - classification
  - prediction
  - clustering
  - neural networks
• Data warehousing
• Data webhousing
• Advanced topics: parallel and distributed data analysis
Literature
• Mark and Mary Whitehorn: Business Intelligence: The IBM Solution. Springer-Verlag, 2000.
• R. Kimball: The Data Warehouse Toolkit. John Wiley, 1996.
• J. Han, M. Kamber: Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2000.
• M. Ester, J. Sander: Knowledge Discovery in Databases. Springer-Verlag, 2000.
• I.H. Witten, E. Frank: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers, 2000.
Business Intelligence
Definition:
Business Intelligence is an umbrella term, broadly covering the processes involved in extracting valuable business information and knowledge from the mass of data that exists within a typical enterprise.
Business Intelligence Tools
• Data warehouses
• OLAP (On-Line Analytical Processing) tools
• Data mining tools
• Text mining tools
• Web mining tools
• Data joiners (integrators)
• Business Intelligence portals, etc.
[Slide callout: the focus of our lectures]
Business Intelligence Tools (cont.)
• Data warehouse: a repository of multiple heterogeneous data sources, organized under a unified schema at a single site in order to facilitate management decision making.
• OLAP: analysis techniques with functionalities such as summarization, consolidation, and aggregation, as well as the ability to view information from different angles.
• Data mining: extracting or "mining" knowledge from large data sets.
• Text mining: "mining" large textual (document) databases.
• Web mining: discovering knowledge from hypertext data.
• Data joiner: working with data from disparate, heterogeneous data sources.
• Business Intelligence portal: a Web site designed to be the first point of entry for visitors to information about a company. With the help of the portal's personalising functions, users can choose the information sources they need for performing a specific task.
DATA MINING
Introduction
• This lecture covers the field that has come to be known as data mining and knowledge discovery in large databases, data warehouses, and other massive information repositories.
• Data mining emerged during the late 1980s, made great strides during the late 1990s, and is expected to continue to flourish in the near future.
• We introduce interesting data mining techniques and systems, and discuss applications and research directions.
• Data mining can be viewed as a result of the natural evolution of information technology, including database technology, artificial intelligence, machine learning, neural networks, statistics, pattern recognition, knowledge-based systems, high-performance computing, and data visualization.
What Motivated Data Mining? Why Is It Important?
• Huge amounts of data are widely available, and there is an imminent need to turn such data into useful information and knowledge.
• Applications range from business management, production control, and market analysis to engineering design and science exploration.
Motivation
[Figure: business, medicine, scientific experiments, simulations, and Earth observations, shown together with a "data and data exploration cloud"]
CERN's Challenge
• Starting point
  - New detector LHC
    » Large Hadron Collider, 14 TeV
    » Goals: search for the Higgs boson and the graviton (and others)
  - Start 2006
• Challenges
  - Data are accessed worldwide
    » CERN and Regional Centers (Europe, Asia, America)
    » 2000 users
  - Huge data volumes
  - Data semantics
  - Performance and throughput
The LHC Detectors
[Figure: the CMS, ATLAS, and LHCb detectors]
Multi-Tier Model
The Evolution of Database Technology
• Data Collection and Database Creation (1960s and earlier)
  - Primitive file processing
• Database Management Systems (1970s-early 1980s)
  - Hierarchical, network, and relational DB systems
  - Query languages (SQL, etc.), query optimization
  - Transaction management, concurrency control, recovery
  - Data modeling tools
• Advanced Database Systems (mid-1980s-present)
  - Object-oriented, object-relational, spatial, multimedia, ...
• Web-based Database Systems (1990s-present)
  - XML-based DB systems
  - Web mining
• Data Warehousing and Data Mining (late 1980s-present)
  - Data warehouse and OLAP technology
  - Data mining and knowledge discovery
Database Querying and Data Mining
Query languages like SQL are standardized and powerful, but they are too difficult for unskilled users.
OLAP tools allow flexible multidimensional queries. Their methods are query-centric.
[Figure: query languages like SQL, OLAP tools, and data mining tools, all shown accessing a data warehouse]
We Are Data Rich, But Information Poor
So, What Is Data Mining?
Data mining: searching for knowledge (interesting patterns) in your data.
Data Mining As a Step in the Process of Knowledge Discovery
• Many people treat data mining as a synonym for the term Knowledge Discovery in Databases, or KDD.
• Alternative view: data mining as a step in KDD:
  - 1. Data cleaning (to remove noise and inconsistent data)
  - 2. Data integration (where multiple data sources may be combined)
  - 3. Data selection (where data relevant to the analysis task are retrieved from the database)
  - 4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance)
  - 5. Data mining (an essential process where intelligent methods are applied in order to extract patterns)
  - 6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures)
  - 7. Knowledge presentation to the user
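The seven KDD steps can be sketched as a small pipeline. This is a minimal illustration, not a real mining system: the records, field names (cust_id, age, item), and the minimum-count threshold are all hypothetical, and the "mining" step is just frequency counting.

```python
from collections import Counter

# Hypothetical raw records from two sources; all field names are illustrative.
source_a = [
    {"cust_id": 1, "age": 25, "item": "CD player"},
    {"cust_id": 2, "age": None, "item": "computer"},  # missing age: noisy record
]
source_b = [
    {"cust_id": 3, "age": 27, "item": "CD player"},
    {"cust_id": 4, "age": 41, "item": "software"},
    {"cust_id": 5, "age": 22, "item": "CD player"},
]

def clean(records):
    """Step 1: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Step 2: combine several data sources into one list."""
    return [r for source in sources for r in source]

def select(records):
    """Step 3: keep only the attributes relevant to the analysis task."""
    return [{"age": r["age"], "item": r["item"]} for r in records]

def transform(records):
    """Step 4: consolidate exact ages into ten-year ranges."""
    def age_range(age):
        low = (age // 10) * 10
        return f"{low}..{low + 9}"
    return [(age_range(r["age"]), r["item"]) for r in records]

def mine(pairs):
    """Step 5: count how often each (age range, item) pattern occurs."""
    return Counter(pairs)

def evaluate(patterns, min_count):
    """Step 6: keep only patterns that reach a minimum count."""
    return {p: c for p, c in patterns.items() if c >= min_count}

def run_kdd(min_count=2):
    data = integrate(clean(source_a), clean(source_b))
    return evaluate(mine(transform(select(data))), min_count)

if __name__ == "__main__":
    # Step 7: present the surviving patterns to the user.
    for (age_range, item), count in run_kdd().items():
        print(f"age {age_range} buys {item}: {count} customers")
```

With the sample records, only the pattern ("20..29", "CD player") reaches the threshold, because it occurs three times.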
Data Mining in Knowledge Discovery
Architecture of a Data Mining System
[Figure: layered architecture, from top to bottom: graphical user interface; pattern evaluation; data mining engine; database or data warehouse server; data cleaning, data integration, and filtering; database and data warehouse. A knowledge base is shown alongside.]
Architecture of a Data Mining System (2)
Database, data warehouse, or other information repository: one or a set of databases, data warehouses, spreadsheets, etc.
Database or data warehouse server: responsible for fetching the relevant data, based on the user's data mining request.
Knowledge base: domain knowledge that is used to guide the search or to evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attribute values into different levels of abstraction.
Data mining engine: essential to the data mining system; ideally consists of a set of functional modules for tasks such as characterization, association, classification, cluster analysis, and evolution and deviation analysis.
Architecture of a Data Mining System (3)
Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search towards interesting patterns. It may use interestingness thresholds to filter out discovered patterns.
Graphical user interface: This module communicates between users and the data mining system, allowing the user to
• specify a data mining query or task
• provide information to help focus the search
• perform exploratory data mining based on the intermediate data mining results
• browse database and data warehouse schemas or data structures
• evaluate mined patterns
• visualize the patterns in different forms.
Stages of a Data Exploration Project

Stage                              Time to complete     Importance to success
                                   (percent of total)   (percent of total)
1. Exploring the problem                  10                    15
2. Exploring the solution                  9                    14
3. Implementation specification            1                    51
   Subtotal, stages 1-3                   20                    80
4. Knowledge discovery
   a. Data preparation                    60                    15
   b. Data surveying                      15                     3
   c. Data modeling                        5                     2
   Subtotal, stage 4                      80                    20

Based on: Data Preparation for Data Mining, by Dorian Pyle, Morgan Kaufmann
Relational Database
• A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data.
• A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows). Each tuple represents an object identified by a unique key.
• Relational data can be accessed by database queries written in a relational query language, such as SQL.
• Using data mining, one can search for trends or data patterns in relational databases.
Relational Databases – Example
The AllElectronics company is described by the following relational tables: customer, item, employee, and branch. Fragments of these tables are shown on the next slide; the attribute that represents the key or a composite key component is underlined.
• The relation customer consists of a set of attributes, including a unique customer identity number (cust_ID), and so on.
• Tables can also be used to represent the relationships between or among multiple relational tables. E.g., these include purchases (a customer purchases items, creating a sales transaction that is handled by an employee), items_sold (lists the items sold in the given transaction), and works_at (an employee works at a branch of AllElectronics).
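As a rough sketch of how such relations and relationship tables can be queried with SQL, the following uses Python's built-in sqlite3 module. Only the table names and cust_ID come from the slides; every other column name and all rows are assumptions made for illustration, and employee, branch, and works_at are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    -- Column names other than cust_ID are hypothetical.
    CREATE TABLE customer   (cust_ID INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE item       (item_ID TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE purchases  (trans_ID TEXT PRIMARY KEY,
                             cust_ID INTEGER REFERENCES customer(cust_ID));
    CREATE TABLE items_sold (trans_ID TEXT REFERENCES purchases(trans_ID),
                             item_ID TEXT REFERENCES item(item_ID),
                             PRIMARY KEY (trans_ID, item_ID));
""")
conn.executemany("INSERT INTO customer VALUES (?, ?)",
                 [(1, "Smith"), (2, "Jones")])
conn.executemany("INSERT INTO item VALUES (?, ?)",
                 [("I1", "computer"), ("I3", "software"), ("I8", "CD player")])
conn.executemany("INSERT INTO purchases VALUES (?, ?)",
                 [("T100", 1), ("T200", 2)])
conn.executemany("INSERT INTO items_sold VALUES (?, ?)",
                 [("T100", "I1"), ("T100", "I3"), ("T200", "I8")])

def items_bought_by(cust_id):
    """Follow the relationship tables from a customer to the item names."""
    rows = conn.execute(
        """
        SELECT item.name
        FROM purchases
        JOIN items_sold USING (trans_ID)
        JOIN item USING (item_ID)
        WHERE purchases.cust_ID = ?
        ORDER BY item.name
        """,
        (cust_id,),
    ).fetchall()
    return [name for (name,) in rows]

print(items_bought_by(1))
```

The query answers a retrieval question ("what did customer 1 buy?"); finding trends or patterns across all customers is the part left to data mining.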
Fragments of Relations from AllElectronics DB
Data Warehouses
A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.
Data warehouses are constructed via a process of data cleaning, data transformation, data integration, data loading, and periodic data refreshing.
The figure on the next slide shows the basic architecture of a data warehouse for AllElectronics.
In order to facilitate decision making, the data in a data warehouse are organized around major subjects, such as customer, item, supplier, and activity. The data are stored from a historical perspective and are typically summarized.
Architecture of a Data Warehouse
[Figure: data sources in Chicago, New York, Toronto, and Vancouver are cleaned, transformed, integrated, and loaded into the data warehouse, which clients access through query and analysis tools. Load = periodic data refreshing.]
Modeling a Data Warehouse
A data warehouse is usually modeled by a multidimensional database structure, where each dimension corresponds to an attribute in the schema and each cell stores the value of some aggregate measure, such as count or sales_amount.
The actual physical structure of a data warehouse may be a relational data store or a multidimensional data cube. It provides a multidimensional view of data and allows the precomputation and fast accessing of summarized data.
Example: A data cube for summarized sales data of AllElectronics is presented on the next slide.
A Multidimensional Data Cube
Modeling a Data Warehouse (2)
Data warehouse vs. data mart: A data warehouse collects information about subjects that span an entire organization, and thus its scope is enterprise-wide. A data mart, in contrast, is department-wide in scope.
Data warehouse systems are well suited for On-Line Analytical Processing, or OLAP.
OLAP operations allow the presentation of data at different levels of abstraction.
Examples of OLAP operations include drill-down and roll-up, which allow the user to view the data at different degrees of summarization, as illustrated on the previous slide.
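Roll-up and drill-down can be illustrated on a tiny data cube. The sketch below is an assumption-laden toy: the dimensions (quarter, item_type, city), the sales figures, and the helper name aggregate are invented, and roll-up is shown only in its simplest form, dropping a dimension and summing the sales_amount measure.

```python
from collections import defaultdict

# Hypothetical fact records: (quarter, item_type, city, sales_amount).
facts = [
    ("Q1", "computer", "Vancouver", 400),
    ("Q1", "computer", "Toronto", 300),
    ("Q1", "phone", "Vancouver", 100),
    ("Q2", "computer", "Vancouver", 500),
    ("Q2", "phone", "Toronto", 200),
]
DIMENSIONS = ("quarter", "item_type", "city")

def aggregate(facts, keep):
    """Sum sales_amount over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in facts:
        cell = tuple(row[i] for i in positions)
        totals[cell] += row[-1]  # last field is the measure
    return dict(totals)

# Most detailed view: one cell per (quarter, item_type, city).
detailed = aggregate(facts, ("quarter", "item_type", "city"))
# Roll-up: drop the city dimension to get a more summarized view.
by_quarter_item = aggregate(facts, ("quarter", "item_type"))
# Roll up further to totals per quarter.
by_quarter = aggregate(facts, ("quarter",))

print(by_quarter)  # {('Q1',): 800, ('Q2',): 700}
```

Drill-down is the reverse direction: going from by_quarter back to by_quarter_item or detailed exposes the finer-grained cells behind each total.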
Transactional Databases
A transactional database consists of a file where each record represents a transaction.
A transaction includes a unique transaction identity number (trans_id) and a list of the items making up the transaction (such as items purchased in a store).
The transactional database may have additional tables associated with it, which contain other information regarding the sale, such as the date of the transaction, the customer ID number, the ID number of the salesperson, etc.
Example: Transactions can be stored in a table, with one record per transaction. A fragment of a transactional database for AllElectronics is shown on the next slide.
Transactional Databases (2)
trans_id   list of item_IDs
T100       I1, I3, I8, I16
...        ...

The transactional database is usually either stored in a flat file in a format similar to that of the above table, or unfolded into a standard relation in a format similar to that of the items_sold table on the slide "Fragments of Relations from AllElectronics DB".
A regular data retrieval system is not able to answer queries like "Which items sold well together?"
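A question such as "Which items sold well together?" can be approached by counting how often item pairs co-occur in transactions. The sketch below is only a brute-force illustration; transaction T100 is taken from the table above, while T200 to T400 are hypothetical.

```python
from collections import Counter
from itertools import combinations

transactions = {
    "T100": ["I1", "I3", "I8", "I16"],  # from the slide
    "T200": ["I1", "I3"],               # hypothetical
    "T300": ["I3", "I8"],               # hypothetical
    "T400": ["I1", "I3", "I16"],        # hypothetical
}

def pair_counts(transactions):
    """Count, for every pair of items, the transactions containing both."""
    counts = Counter()
    for items in transactions.values():
        # sorted(set(...)) gives each pair one canonical order, counted once.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(transactions)
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)  # ('I1', 'I3') 3
```

Enumerating all pairs grows quadratically with transaction size, which is one reason dedicated association-mining algorithms exist.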
Advanced Database Systems and Database Applications
Relational DB systems have been widely used in business applications.
The new database applications include handling
• spatial data (e.g., maps)
• engineering design data (e.g., the design of buildings or integrated circuits)
• hypertext and multimedia data (text, image, video, audio data)
• time-related data (e.g., stock exchange data)
• World Wide Web (a huge, widely distributed information repository made available by the Internet)
Data Mining Tasks
Data Mining Functionalities - What Kinds of Patterns Can Be Mined?
• Data mining functionalities are used to specify the kind of patterns that can be found in data mining tasks.
• Data mining tasks can be classified into 2 categories:
  - Descriptive: they characterize the general properties of the data in the database.
  - Predictive: they perform inference on the current data in order to make predictions.
• In some cases, users may have no idea which kinds of patterns may be interesting, so a data mining system should be able to search for several different kinds of patterns in parallel.
• Data mining systems should be able to discover patterns at various granularities (abstraction levels).
• Users should be able to specify hints to guide or focus the search.
Association Analysis
• Association analysis is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data.
• The association rule X => Y is interpreted as "database tuples that satisfy the conditions in X are also likely to satisfy the conditions in Y."
• Example: A data mining system may find the following rule in the AllElectronics data: age(X, "20..29") and income(X, "20K..29K") => buys(X, "CD player") [support = 2%, confidence = 60%]
• X is a variable representing a customer. The rule indicates that of the customers under study, 2% are 20 to 29 years of age with an income of 20K to 29K and have purchased a CD player. There is a 60% probability that a customer in this age and income group will purchase a CD player.
Association Analysis (Cont.)
• We would like to determine which items are frequently purchased together within the same transactions. E.g., contains(T, "computer") => contains(T, "software") [support = 1%, confidence = 50%]
• Explanation: if a transaction, T, contains "computer", there is a 50% chance that it contains "software" as well, and 1% of all of the transactions contain both.
• This rule involves a single attribute or predicate (i.e., contains), so it is a single-dimensional association rule. It can be written simply as "computer => software [1%, 50%]". Remark: the rule on the previous slide involves several predicates (age, income, buys) and is therefore a multidimensional association rule.
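Support and confidence follow directly from their definitions: the support of X => Y is the fraction of transactions containing both X and Y, and the confidence is that fraction divided by the fraction containing X. A minimal sketch, with four made-up transactions and hypothetical function names:

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(X and Y) / support(X) for the rule X => Y.

    Raises ZeroDivisionError if the antecedent never occurs.
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical transactions, chosen so the numbers are easy to check by hand.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"software"},
]

rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
print(rule_support, rule_confidence)  # 0.25 0.5
```

Here one of four transactions contains both items (support 25%), and one of the two transactions containing "computer" also contains "software" (confidence 50%).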
Classification and Prediction
• Classification is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known).
• "How is the derived model presented?"
  - Classification (IF-THEN) rules
  - Mathematical formulae
  - Decision tree: a flow-chart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and the tree leaves represent classes or class distributions.
  - Neural networks: a collection of neuron-like processing units with weighted connections between the units.
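The decision tree structure described above (nodes test an attribute, branches are outcomes, leaves are classes) maps naturally onto a nested data structure. The tree below is a hypothetical illustration with invented attribute names and values; it is not taken from the slides.

```python
# Internal node: (attribute, {outcome: subtree}); leaf: a class label string.
# The attributes and class labels here are purely illustrative.
tree = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rainy": ("windy", {"true": "no", "false": "yes"}),
})

def classify(node, instance):
    """Walk from the root to a leaf, following the branch for each test."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[instance[attribute]]  # outcome of the test picks a branch
    return node

example = {"outlook": "sunny", "humidity": "high", "windy": "false"}
print(classify(tree, example))  # no
```

Each root-to-leaf path corresponds to one IF-THEN rule, which is why trees and rule sets are often interchangeable presentations of the same model.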
Classification and Prediction (Cont.)
• Prediction: in many applications, users may wish to predict some missing or unavailable data values rather than class labels. The predicted values are usually numerical data.
A Sample Data Set
Fictional data set that describes the weather conditions for playing some unspecified game.

outlook    temperature  humidity  windy  play
sunny      hot          high      false  no
sunny      hot          high      true   no
overcast   hot          high      false  yes
rainy      mild         high      false  yes
rainy      cool         normal    false  yes
rainy      cool         normal    true   no
overcast   cool         normal    true   yes
sunny      mild         high      false  no
sunny      cool         normal    false  yes
rainy      mild         normal    false  yes
sunny      mild         normal    true   yes
overcast   mild         high      true   yes
overcast   hot          normal    false  yes
rainy      mild         high      true   no
Learning Rules
Example of a set of rules learned from the example data set:
if outlook = sunny and humidity = high then play = no
if outlook = rainy and windy = true then play = no
if outlook = overcast then play = yes
if humidity = normal then play = yes
if none of the above then play = yes
These are classification rules that assign an output class (play or not) to each instance (single example in a data set).
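The rules above can be checked against the sample data set by applying them in the order listed, so that the first matching rule decides the class. The sketch below assumes that ordered interpretation, takes the fourth attribute to be windy (the name the rules use), and encodes it as a Python boolean.

```python
# The 14 instances: (outlook, temperature, humidity, windy, play).
DATA = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]

def classify(outlook, temperature, humidity, windy):
    """Apply the five rules in order; the first rule that fires wins."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # none of the above

correct = sum(classify(*row[:4]) == row[4] for row in DATA)
print(f"{correct} of {len(DATA)} instances classified correctly")
```

Read in this order, the rules classify all 14 instances correctly. Note that temperature is never tested, and that the order matters: applying "humidity = normal" before the rainy-and-windy rule would misclassify the windy, rainy, normal-humidity instance.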
Learning Decision Trees
Classification learning
• Training set: a set of examples, where each example is a feature vector (i.e., a set of (attribute, value) pairs) with its associated class. The model is built on this set.
• Test set: a set of examples disjoint from the training set, used for testing the accuracy of a model.
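A minimal sketch of the training-set and test-set idea follows. The examples, the 30% test fraction, the fixed seed, and the deliberately trivial "majority class" model are all assumptions chosen to keep the illustration short; any real classifier could take the model's place.

```python
import random
from collections import Counter

def split(examples, test_fraction=0.3, seed=0):
    """Shuffle a copy of the examples and cut it into disjoint train/test sets."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def train_majority(training_set):
    """Build a trivial model that always predicts the most common class."""
    labels = Counter(label for _, label in training_set)
    majority = labels.most_common(1)[0][0]
    return lambda features: majority

def accuracy(model, test_set):
    """Fraction of test examples whose predicted class equals the true class."""
    correct = sum(1 for features, label in test_set if model(features) == label)
    return correct / len(test_set)

# Hypothetical examples: (feature vector, class label).
labels = ["yes", "yes", "no", "yes", "no", "yes", "yes", "no", "yes", "yes"]
examples = [({"id": i}, label) for i, label in enumerate(labels)]

training_set, test_set = split(examples)
model = train_majority(training_set)
print(f"accuracy on the test set: {accuracy(model, test_set):.2f}")
```

Measuring accuracy on examples the model has never seen is the point of keeping the two sets disjoint; accuracy on the training set alone tends to be overly optimistic.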
[Two slides with no extracted text. Slide author: J. Han]
Cluster Analysis
• Clustering analyzes data objects without consulting a known class label.
• Clustering can be used to generate such labels.
• The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.
• Each cluster can be viewed as a class of objects, from which rules can be derived.
• Example: Cluster analysis can be performed on AllElectronics customer data in order to identify homogeneous subpopulations of customers. These clusters may represent individual target groups for marketing. (The figure on the next slide shows a 2-D plot of customers with respect to customer locations in a city.)
Cluster Analysis - Example
A 2-D plot of customer data with respect to customer locations in a city, showing 3 data clusters. Each cluster "center" is marked with a "+".
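One common way to obtain clusters with a "center" each is k-means; this is an assumption made for illustration, since the slide does not name the algorithm behind the plot. The customer locations and starting centers below are invented, and the fixed iteration count stands in for a proper convergence test.

```python
import math

def kmeans(points, centers, iterations=10):
    """Plain k-means on 2-D points, starting from the given centers."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else old
            for c, old in zip(clusters, centers)
        ]
    return centers, clusters

# Hypothetical customer locations forming three visible groups.
points = [(1, 1), (1, 2), (2, 1),
          (8, 8), (9, 8), (8, 9),
          (1, 8), (2, 9), (1, 9)]
initial_centers = [(0, 0), (10, 10), (0, 10)]

centers, clusters = kmeans(points, initial_centers)
for center, members in zip(centers, clusters):
    print(f"center ({center[0]:.2f}, {center[1]:.2f}): {members}")
```

The result depends on the starting centers; with poorly chosen ones, k-means can settle on a worse grouping, so implementations often try several starts.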
Outlier Analysis
• A database may contain data objects that do not comply with the general behaviour or model of the data. These data objects are outliers.
• Most data mining methods discard outliers as noise or exceptions.
• In some applications, such as fraud detection, the rare events can be more interesting than the more regularly occurring ones.
• Example: Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of extremely large amounts for a given account number in comparison to regular charges incurred by the same account.
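The credit card example can be sketched with a very simple rule: flag any charge that is many times larger than the account's typical (median) charge. The factor of 10, the account numbers, and the amounts are arbitrary assumptions; real fraud detection relies on far richer models.

```python
from statistics import median

def large_purchases(charges, factor=10.0):
    """Return charges exceeding `factor` times the account's median charge."""
    typical = median(charges)
    return [c for c in charges if c > factor * typical]

# Hypothetical charges per account number.
accounts = {
    "4711": [25.0, 40.0, 31.5, 18.0, 2500.0, 36.0],
    "4712": [120.0, 95.0, 110.0, 130.0],
}

suspicious = {acct: large_purchases(charges)
              for acct, charges in accounts.items()}
print(suspicious)  # {'4711': [2500.0], '4712': []}
```

The median is used instead of the mean because a single huge charge shifts the mean but barely moves the median, so the comparison stays anchored to the account's regular charges.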
On-Line Demo on Clustering
• http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletKM.html