02 related concepts

Chapter 2Related Concepts

2.1 Database/OLTP Systems• Unlike a simple set, data in a database are usually viewed to have a particular

structure or schema which it is associated with.• Unlike a file, a database is independent of the physical method used to store it.• Data model is used to describe the data, attributes, and relationships among them. A

common data model is the ER (entity-relationship) data model. It can be viewed as a documentation and communication tool to convey type and structure of the actual data. A data model is independent of the particular the DBMS used.

• Basic database queries are well defined with precise results. Data mining applications conversely are often vaguely defined with imprecise results. A data mining query outputs a KDD object.

• A KDD object is either a rule, a classification, or a cluster, which do not exist before executing the query, and are not part of the database being queried.

2.2 Fuzzy Sets and Fuzzy Logic• A fuzzy set is a set, , in which the set membership function, f, is a real valued (as opposed to

Boolean) function with output in the range . An element is said to belong to with probability and simultaneously to be in with probability

• Membership function is not Boolean so the results of this query are fuzzy. Classification problem is solved by assigning a set membership function to each record for each class. The record is then assigned to the class that has the highest membership function value.

• Association rules are generated given a confidence value that indicates the degree to which it holds in the entire database. This can be thought of as a membership function.

• Fuzzy logic uses rules and membership functions to estimate a continuous function. Fuzzy logic is a valuable tool to develop control systems for such things as elevators, trains, and heating systems.

2.3 Information RetrievalThe effectiveness of the IR system in processing the query is often measured by precision and recall:

The inverse document frequency (IDF) is often used by similarity measures. Given a keyword, , and documents, IDF can be defined as:

Concept hierarchies (tree or DAG (directed acyclic graph) ) can be used in spatial data mining.

2.5 Data Warehousing• Data warehouse is a set of data that supports DSS and is subject-

oriented, integrated, time-variant, nonvolatile.• DW contains contains informational data, which are used to support

other functions such as planning and forecasting.• OLAP retrieval tools facilitate quick query response at all granularities.

2.6 Dimensional Modeling

Multidimensional star schema Multidimensional constellation schema

• A dimension is a collection of logically related attributes and is viewed as an axis for modeling the data.

• The specific data stored are called facts and usually are numeric data. In a relational system each dimension is a table and facts are stored in a fact table.

Multidimensional Data Cube

2.7 Online Analytic Processing (OLAP)OLAP supports as hoc querying of the data warehouse. OLAP requires a multidimensional view of the data and involves some analysis.Operations supported: Slice, Dice, Roll up, Drill down, Visualization

2.8 Statistics• Such simple concepts as determining a data distribution and calculating a mean, a variance can be

viewed as data mining techniques in their own, a descriptive model for the data under consideration.

• When a model is generated, the goal is to fit it to the entire data, not just a sample searched. Assumptions often made about independence of data may be incorrect, thus leading to errors in the resulting model. Any model should be statistically significant, meaningful, and valid.

• An often used tool in data mining and machine learning is one of sampling. Here a subset of the total population is examined, and a generalization (model) about the entire population is made from this subset.

• The term exploratory data analysis describes the fact that the data can actually drive the creation of the model and any statistical characteristics.

• Some data mining applications determine correlations among data. These relationships, however, are not casual in nature. Care must be taken when assigning significance to such relationships.

2.9 Machine Learning• Data mining involves not only modeling but also the development of effective and efficient

algorithms and data structures to perform the modeling on large data sets.• Machine learning is the area of AI that examines how to write programs that can learn. In data

mining, machine learning is often used for prediction or classification. • Predictive modeling is done in two phases. During the training phase, historical or sampled data

are used to create a model that represents those data. It is assumed to hold for the whole database and its future states. The testing phase then applies this model to the remaining and future data.

• With supervised learning a sample of the database is used to train the system to properly perform the desired task. The quality of the training data determines how well the program learns. With unsupervised learning there is no knowledge of the correct answers of applying the model to the data.

• The objective for data mining is to uncover useful information and provide it to humans, while machine learning research is focused more on the learning portion.

References:Dunham, Margaret H. “Data Mining: Introductory and Advanced Topics”. Pearson Education, Inc., 2003.

02 related concepts

Data & Analytics