working with data mining rafal lukawiecki strategic consultant, project botticelli ltd...

54
Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd [email protected]

Post on 21-Dec-2015

213 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

Working with Data MiningRafal LukawieckiStrategic Consultant, Project Botticelli [email protected]

Page 2: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

2

Objectives

• Learn the Data Mining Process• Well understand key terminology

The information herein is for informational purposes only and represents the opinions and views of Project Botticelli and/or Rafal Lukawiecki. The material presented is not certain and may vary based on several factors. Microsoft makes no warranties, express, implied or statutory, as to the information in this presentation.

© 2007 Project Botticelli Ltd & Microsoft Corp. Some slides contain quotations from copyrighted materials by other authors, as individually attributed. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Project Botticelli Ltd as of the date of this presentation. Because Project Botticelli & Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft and Project Botticelli cannot guarantee the accuracy of any information provided after the date of this presentation. Project Botticelli makes no warranties, express, implied or statutory, as to the information in this presentation. E&OE.

This seminar is partly based on “Data Mining” book by ZhaoHui Tang and Jamie MacLennan, and also on Jamie’s presentations. Thank you to Jamie and to Donald Farmer for helping me in preparing this session. Thank you to Roni Karassik for a slide. Thank you to Mike Tsalidis, Olga Londer, and Marin Bezic for all the support. Thank you to Maciej Pilecki for assistance with demos.

Page 3: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

3

Agenda

• Technical Platform• Data Mining Process Overview• Key Concepts and Terminology• Data Mining Process in Detail Using DMX

Page 4: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

4

The Technical Platform

Page 5: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

5

Data acquisition and integration from multiple sources

Data transformation and synthesis

Knowledge and pattern detection through Data Mining

Data enrichment with logic rules and hierarchical views

Data presentation and distribution

Data publishing for mass recipients

Integrate Analyze Report

SQL Server 2005 We Need More Than Just Database Engine

Page 6: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

6

DM – Part of Microsoft SQL Server

Page 7: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

7

Analysis ServicesServer

Mining Model

Data Mining Algorithm DataSource

Server Mining Architecture

Excel/Visio/SSRS/Your App

OLE DB/ADOMD/XMLA

Deploy

BIDSExcelVisioSSMS

AppData

Page 8: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

8

SQL Server Analysis Services

OLEDB

ADOMD .NET

AMO IIS

TCP

HTTP

XMLA

ADOMD

Client Apps

BIDS

SSMS

Profiler

Excel

Protocols

Page 9: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

9

Analysis Services Considerations• DM components included in every edition of SQL

Server 2005 except Express• SSAS 2008 and 2005 can exist side-by-side• SSAS do not have to be on the same server as

SQL Database Engine• Performance Needs• Security Reasons

• SSAS can analyse data from sources other than SQL Server

Page 10: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

10

SSAS Security

• Follow “SQL Books Online” guide to securing SSAS

• User privilege needs (oversimplification):• Standard SQL DB user can:

• View & use models• Deploy temporary session models (if option

enabled)• Database Administrators also can:

• Deploy and change permanent models using SSMS (Management Studio) or BIDS online mode

• Server Administrators also can:• Use offline BIDS mode and batch deploy models

Page 11: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

11

SSAS Security Recommendation• For busy development environments in a large

corporation:• Two SSAS servers: production and

development/staging• Developers are server admins on the development

but not on production servers• This allows full offline mode of BIDS

• Configure SQL Server Replication between dev and production servers• You are in charge of replication

• End-users are database users of the deployed models (and use it in Excel, Visio etc.)

Page 12: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

12

Data Mining Process Overview

Page 13: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

13

Mining Model Mining ModelMining Model

Mining Process

DM EngineDM Engine

Training data

Data to be predictedMining Model

With predictions

Page 14: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

14

Steps for Building a DM Model1. Model Creation

• Define columns for cases: visually (BIDS), using DMX, or from PMML

2. Model Training• Feed lots of data from a real database, or from a

system log• Congratulations! We now have a model

3. Model Testing• Does it make sense? Test on sample data to check

predictions.• Testing data must be different from training• If we get nonsense, adjust the algorithm, its

parameters, model design, or even data

4. Model Use (Exploration and Prediction)• Use the model on new data to predict outcomes

Page 15: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

15

Many Approaches

• Work the way you like:• Database experts and SQL veterans:

• Write queries in DMX (similar to T-SQL)• Import/export DM entities using PMML

• Everyone else:• Use Business Intelligence Development Studio

(BIDS) – rich GUI included with SSAS• Hosted in Visual Studio (included!)• You don’t have to program – click-click instead

• Use Excel 2007 with Data Mining Add-Ins• The “Data Mining” tab has everything you need• “Table Analysis” tab is easier but simplified

Page 16: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

16

Please Note

• It is easier to explain the key terminology in PowerPoint slides by showing DMX

• It is easier to explain the key terminology in demos by showing BIDS and click-clicking

• We will do both!1. First, the DMX and slides2. Later (especially this afternoon), mostly demos

Page 17: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

17

Key Concepts and Terminology

Page 18: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

18

Mining Structure

• Describes data to be mined• Columns from a data source and their:

• Data Type• Content Type

• Contains Mining Models• Often we build several different models in one

structure• Holds training data, known as Cases (if required)• Holds testing data, known as Holdout (in SQL

2008)

Page 19: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

19

Data Mining Model

• Container of patterns discovered by a Data Mining Algorithm amongst the training Cases• A table containing patterns

• Expressed by visualisers• Specifies usage of columns already defined in

the Mining Structure

Page 20: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

20

Cases: The Things We Study

• Case – set of columns (attributes) you want to analyse• Age, Gender, Region, Annual Spending

• Case Key – unique ID of a case

• A column has:• Data Type• Content Type• And optionally:

• Distribution• Discretization• Related Columns• Flags (e.g. NOT NULL)

Page 21: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

21

Column Data Types

• We don’t care about detailed low-level types• DM only uses:

• Text• Long• Boolean• Double• Date• and by some 3rd party algorithms:

• Time, and Sequence

Page 22: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

22

Column Content TypesThey Guide Algorithms

• Common:• DISCRETE

• Red, Blue, Green• CONTINOUS

• €6511.49• DISCRETIZED

• 1-5, 6-20, 21+

• Denotes a key:• KEY

• For special purposes:• KEY SEQUENCE• KEY TIME• ORDERED• CYCLICAL

Page 23: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

23

Column Usage

• Some algorithms interpret this in subtly different ways, but in general, columns are for:• Input

• For predicting another column• PREDICT

• These columns are both predicted and act as inputs for predicting others

• PREDICT_ONLY• Not used as input

• You can have all columns as input or predictable

Page 24: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

24

Discretization

• Very important technique• When you don’t need to analyse full continuous

range• DM can automatically convert data into buckets

• By default, into 5• Techniques:

• AUTOMATIC• CLUSTERS• EQUAL_AREAS• THRESHOLDS

Page 25: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

25

Column Distributions

• If you know the distribution of your data (you should), indicate it:• NORMAL

• Typical Gaussian bell-curve• LOG NORMAL

• Most values at the “beginning” of the scale• UNIFORM

• Flat line – equally likely or perfectly random• Other distributions can exist, but you cannot

indicate them – algorithm will work fine

Page 26: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

26

And Finally

• Nested Case – case containing a table column• Purchases of a Customer

• Used for analysing patterns in a relationship

• It has a Nested Key• Not a “relational” foreign key!• Normally, the Nested Key is a column you want to

analyse• E.g.: Product Name or Model

Page 27: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

27

Data Mining Process in DetailUsing DMX

Page 28: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

28

Data Mining ExtensionsDMX• “T-SQL” for Data Mining• Easy! Like scripting for IT Pros

• Two types of statements:• Data Definition

• CREATE, ALTER, EXPORT, IMPORT, DROP• Data Manipulation

• INSERT INTO, SELECT, DELETE

Page 29: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

29

DMX – Just Like T-SQLCREATE MINING MODEL CreditRisk

(CustID LONG KEY,

Gender TEXT DISCRETE,

Income LONG CONTINUOUS,

Profession TEXT DISCRETE,

Risk TEXT DISCRETE PREDICT)

USING Microsoft_Decision_Trees

INSERT INTO CreditRisk

(CustId, Gender, Income, Profession, Risk)

Select

CustomerID, Gender, Income, Profession,Risk

From Customers

Select NewCustomers.CustomerID, CreditRisk.Risk, PredictProbability(CreditRisk.Risk)

FROM CreditRisk PREDICTION JOIN NewCustomers

ON CreditRisk.Gender=NewCustomer.Gender

AND CreditRisk.Income=NewCustomer.Income

AND CreditRisk.Profession=NewCustomer.Profession

Page 30: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

30

CREATE MINING MODEL

CREATE MINING MODEL <name>(< column definitions>

) USING <algorithm>[(<parameters>)][WITH DRILLTHROUGH]

Page 31: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

31

CREATE MINING MODEL

CREATE MINING MODEL MyModel(< column definitions>

) USING Microsoft_Decision_Trees

Page 32: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

32

CREATE MINING MODEL

CREATE MINING MODEL MyModel([CustID] TEXT KEY,

) USING Microsoft_Decision_Trees

Name Data TypeText

Long

Double

Boolean

Date

Page 33: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

33

CREATE MINING MODEL

CREATE MINING MODEL MyModel([CustID] TEXT KEY,

) USING Microsoft_Decision_Trees

Name Data TypeContent TypeKey

Key Time

Discrete

Continuous

Discretized

Page 34: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

34

CREATE MINING MODEL

CREATE MINING MODEL MyModel([Viewer] TEXT KEY,

) USING Microsoft_Decision_Trees

Page 35: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

35

CREATE MINING MODEL

CREATE MINING MODEL MyModel([Viewer] TEXT KEY,[Gender] TEXT DISCRETE,

) USING Microsoft_Decision_Trees

Page 36: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

36

CREATE MINING MODEL

CREATE MINING MODEL MyModel([Viewer] TEXT KEY,[Gender] TEXT DISCRETE,[Marital Status] TEXT DISCRETE,

) USING Microsoft_Decision_Trees

Page 37: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

37

CREATE MINING MODEL

CREATE MINING MODEL MyModel([Viewer] TEXT KEY,[Gender] TEXT DISCRETE,[Marital Status] TEXT DISCRETE,[Education] TEXT DISCRETE,

) USING Microsoft_Decision_Trees

Page 38: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

38

CREATE MINING MODEL

CREATE MINING MODEL MyModel([Viewer] TEXT KEY,[Gender] TEXT DISCRETE,[Marital Status] TEXT DISCRETE,[Education] TEXT DISCRETE,[Home Ownership] TEXT DISCRETE PREDICT,

) USING Microsoft_Decision_Trees

Usage

Predict

Predict Only

Page 39: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

39

CREATE MINING MODEL

CREATE MINING MODEL MyModel([CustID] LONG KEY,[Gender] TEXT DISCRETE,[Marital Status] TEXT DISCRETE,[Education] TEXT DISCRETE,[Home Ownership] TEXT DISCRETE PREDICT,[Age] LONG CONTINUOUS,[Income] DOUBLE CONTINUOUS

) USING Microsoft_Decision_Trees

Page 40: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

40

Nested Tables

CustID Gender Marital Status

Education Home Ownership

980001 Male Married Bachelors Rent

980002 Male Married Bachelors Own

980003 Female Single Masters Own

980004 Male Single Some College Own

980005 Female Married Bachelors Rent

980006 Female Married Masters Rent

Furniture

Sofa

TV

Ladder

Boiler

Sofa

Lazygirl Recliner

Boiler

TV

DVD Player

Bedstead

TV

Bookstand

Yoga Mat

Vase

Page 41: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

41

CREATE MINING MODELNestedCREATE MINING MODEL MyModel

(

[CustID] LONG KEY,

[Gender] TEXT DISCRETE,

[Marital Status] TEXT DISCRETE,

[Education] TEXT DISCRETE,

[Home Ownership] TEXT DISCRETE PREDICT,

[Age] LONG CONTINUOUS,

[Income] DOUBLE CONTINUOUS,

[Products] TABLE

(

[Product Name] TEXT KEY

)

) USING Microsoft_Decision_Trees

Page 42: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

42

Training

• Use INSERT INTO statement• This feeds cases to the engine

• Use SHAPE syntax to create nested input rowsets

• Important:• Use only training data (about 70% is common)• Leave some testing data aside

Page 43: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

43

How Much Training?

• No hard rules on number of cases• Impossible to over-train by providing too many

cases• Overtraining possible for models badly

parameterised• Too detailed, not generic enough models

• Using representative samples?• No need for much training data

• Enough training when model validates correctly (see later)

Page 44: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

44

Splitting Data for TestingHoldout• With SQL Server 2005

• Easily done using SSIS (Integration Services) Percentage Sampling task

• Several ways to do it in T-SQL• With SQL Server 2008

• Done for you!• Specify WITH HOLDOUT in CREATE MINING

STRUCTURE• Or select the amount in the GUI wizard/property

window of BIDS

Page 45: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

Demo1. Using SQL Server Integration Services to Split

Data Into Training and Testing Sets2. Using T-SQL for the Above

Page 46: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

46

INSERT INTO

• Source Data can be:• Data Query• DMX Query• MDX Query• Stored Procedure Call• Rowset Parameter

INSERT INTO [MINING MODEL | MINING STRUCTURE]<model or structure name>[( <column list> )]<source-data>

Page 47: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

47

Test and Validate

• Check that your model makes sense• Accuracy

• Does it correlate and predict correctly?• Reliability

• Does it work similarly for different test data?• Usefulness

• Does it provide insight or only obvious trivialities?

Page 48: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

48

Model Validation

• Typical approaches:• Accuracy

• Lift and Profit Charts• Scatter Plots• Classification Matrix

• Reliability• Cross-validation• External data predictions

• Usefulness• Requires human eye of a domain expert

• But simple attribute correlation checks help

Page 49: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

49

Automated Testing

• Great feature of SQL Server DM• Clicking on “Mining Accuracy” tab executes the

test automatically and quickly:1. Testing data is used to predict values2. Results of that prediction are compared to the

known value (in holdout)3. Comparison is expressed as:

• Lift Chart, Profit Chart, Scatter Plot, Classification Matrix, Cross-Validation Statistics

Page 50: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

50

Prediction!

• Apply the model to predict the unknown

• Use SELECT semantics• New PREDICTION JOIN

• Returned values can contain tables• Sub-select from nested tables

Page 51: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

51

PREDICTION JOIN

SELECT [TOP <count> ]<expression-list> FROM <model>[[NATURAL] PREDICTION JOIN <source data> AS <alias>[ ON <column-mapping> ][ WHERE <filter expression> ][ ORDER BY <expression> ]]

Page 52: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

52

Don’t Forget to Explore and Analyse• Several excellent visualizers for patters from

Microsoft• Available in: BIDS, SSMS, SSRS, Excel, Visio and

for your applications• You will see a lot of this today afternoon!

• You can also query the Mining Model directly for the patterns• SQL Books Online show many examples, or get

the book “Data Mining with SQL Server 2005” by Jamie McLennan and ZhaoHui Tang

Page 53: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

53

Summary

• Building a mining model starts with understanding the data: cases

• You should carefully define the column data and content types and their usage

• Build the model and train it using a separate data set

• Test and validate before continuing• Visually explore the patterns – and enjoy

Page 54: Working with Data Mining Rafal Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk

54

© 2007 Microsoft Corporation & Project Botticelli Ltd. All rights reserved.

The information herein is for informational purposes only and represents the opinions and views of Project Botticelli and/or Rafal Lukawiecki. The material presented is not certain and may vary based on several factors. Microsoft makes no warranties, express, implied or statutory, as to the information in this presentation.

© 2007 Project Botticelli Ltd & Microsoft Corp. Some slides contain quotations from copyrighted materials by other authors, as individually attributed. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Project Botticelli Ltd as of the date of this presentation. Because Project Botticelli & Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft and Project Botticelli cannot guarantee the accuracy of any information provided after the date of this presentation. Project Botticelli makes no warranties, express, implied or statutory, as to the information in this presentation. E&OE.