Machine Learning based on Attribute Interactions



Page 1: Machine Learning  based on Attribute Interactions

Machine Learning based on Attribute Interactions

Aleks Jakulin

Advised by Acad. Prof. Dr. Ivan Bratko

2003-2005

Page 2: Machine Learning  based on Attribute Interactions

Learning = Modelling

Utility = -Loss

[Diagram: a MODEL generated by a Learning Algorithm from the Data and a Hypothesis Space; B ⊆ A: “A bounds B”.]

The fixed data sample restricts the model to be consistent with it.

- data shapes the model

- model is made of possible hypotheses

- model is generated by an algorithm

- utility is the goal of a model

Page 3: Machine Learning  based on Attribute Interactions

Our Assumptions about Models

• Probabilistic Utility: logarithmic loss (alternatives: classification accuracy, Brier score, RMSE)

• Probabilistic Hypotheses: multinomial distribution, mixture of Gaussians (alternatives: classification trees, linear models)

• Algorithm: maximum likelihood (greedy), Bayesian integration (exhaustive)

• Data: instances + attributes

Page 4: Machine Learning  based on Attribute Interactions

Expected Minimum Loss = Entropy

[Information diagram: a visualization of the probabilistic model P(A,C).]

• H(C): entropy given C’s empirical probability distribution (p = [0.2, 0.8]).
• H(A): information which came with the knowledge of A.
• I(A;C) = H(A) + H(C) - H(A,C): mutual information or information gain. How much do A and C have in common?
• H(C|A) = H(C) - I(A;C): conditional entropy, the remaining uncertainty in C after learning A.
• H(A,C): joint entropy.
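A minimal sketch of these quantities for two discrete attributes, computed from a hypothetical contingency table (the counts and variable names below are illustrative, not from the thesis):

```python
# Illustrative sketch: entropy, joint entropy, mutual information and
# conditional entropy estimated from a made-up contingency table of A and C.
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector (zeros ignored)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical joint counts of A (rows) and C (columns).
counts = np.array([[30, 10],
                   [ 5, 55]], dtype=float)
p_ac = counts / counts.sum()          # joint distribution P(A,C)
p_a = p_ac.sum(axis=1)                # marginal P(A)
p_c = p_ac.sum(axis=0)                # marginal P(C)

H_A, H_C = entropy(p_a), entropy(p_c)
H_AC = entropy(p_ac.ravel())          # joint entropy H(A,C)
I_AC = H_A + H_C - H_AC               # mutual information I(A;C)
H_C_given_A = H_C - I_AC              # conditional entropy H(C|A)
print(f"H(A)={H_A:.3f}  H(C)={H_C:.3f}  I(A;C)={I_AC:.3f}  H(C|A)={H_C_given_A:.3f}")
```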

Page 5: Machine Learning  based on Attribute Interactions

2-Way Interactions

• Probabilistic models take the form of P(A,B)
• We have two models:

– Interaction allowed: PY(a,b) := F(a,b)

– Interaction disallowed: PN(a,b) := P(a)P(b) = F(a)G(b)

• The error that PN makes when approximating PY:

D(PY || PN) := E_{x~PY}[L(x, PN)] = I(A;B)   (mutual information)

• Also applies for predictive models:

  D(P(Y|A) || P(Y)) = I(A;Y)

• Also applies for Pearson’s correlation coefficient, where P is a bivariate Gaussian obtained via maximum likelihood.
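A small sketch of this identity (with an illustrative joint distribution, not from the slides): the divergence between the joint model and the independence model equals the mutual information, and the well-known bivariate-Gaussian special case ties it to Pearson’s correlation:

```python
# Illustrative sketch: D(P_Y || P_N) for P_N(a,b) = P(a)P(b) equals I(A;B).
import numpy as np

p_ab = np.array([[0.30, 0.10],
                 [0.05, 0.55]])                        # hypothetical joint P_Y(a,b)
p_n = np.outer(p_ab.sum(axis=1), p_ab.sum(axis=0))    # independence model P(a)P(b)

mask = p_ab > 0
kl = np.sum(p_ab[mask] * np.log2(p_ab[mask] / p_n[mask]))  # D(P_Y || P_N)
print(f"D(P_Y||P_N) = I(A;B) = {kl:.3f} bits")

# For a bivariate Gaussian fitted by maximum likelihood, the same quantity
# is a function of Pearson's correlation: I = -1/2 * log2(1 - rho^2).
rho = 0.8
print(f"Gaussian case, rho={rho}: I = {-0.5 * np.log2(1 - rho**2):.3f} bits")
```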

Page 6: Machine Learning  based on Attribute Interactions

Rajski’s Distance

• The attributes that have more in common can be visualized as closer in some imaginary Euclidean space.

• How to avoid the influence of many/few-valued attributes? (Complex attributes seem to have more in common.)

• Rajski’s distance: d(A,B) = 1 - I(A;B) / H(A,B)

• This is a metric (it satisfies, e.g., the triangle inequality).
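A minimal sketch of this distance for two discrete attributes, assuming an illustrative joint distribution (the numbers below are made up):

```python
# Illustrative sketch: Rajski's distance d(A,B) = 1 - I(A;B)/H(A,B).
# It is normalized to [0, 1], so many-valued attributes are not favoured.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def rajski_distance(p_ab):
    h_a = entropy(p_ab.sum(axis=1))
    h_b = entropy(p_ab.sum(axis=0))
    h_ab = entropy(p_ab.ravel())          # joint entropy H(A,B)
    i_ab = h_a + h_b - h_ab               # mutual information I(A;B)
    return 1.0 - i_ab / h_ab if h_ab > 0 else 0.0

p_ab = np.array([[0.30, 0.10],
                 [0.05, 0.55]])
print(f"d(A,B) = {rajski_distance(p_ab):.3f}")   # 0 = identical, 1 = independent
```

Pairwise distances of this kind are what the interaction matrices and dendrograms on the following slides can be built from, e.g., via hierarchical clustering.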

Page 7: Machine Learning  based on Attribute Interactions

Interactions between US Senators

[Interaction matrix: senators grouped into Democrats and Republicans.]

dark: strong interaction, high mutual information
light: weak interaction, low mutual information

Page 8: Machine Learning  based on Attribute Interactions

A Taxonomy of Machine Learning Algorithms

[Interaction dendrogram on the CMC data set.]

Page 9: Machine Learning  based on Attribute Interactions

3-Way Interactions

[Diagram: the label C and the attributes A and B form a triangle. The edges are the 2-way interactions: the importance of attribute A (A-C), the importance of attribute B (B-C), and the attribute correlation (A-B).]

3-Way Interaction: what is common to A, B and C together, and cannot be inferred from any subset of the attributes.

Page 10: Machine Learning  based on Attribute Interactions

Interaction Information

I(A;B;C) := I(AB;C) - I(A;C) - I(B;C)
          = I(B;C|A) - I(B;C)
          = I(A;C|B) - I(A;C)
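A minimal sketch of this definition for three discrete variables (the XOR example is mine, purely illustrative); positive values indicate synergy, negative values redundancy:

```python
# Illustrative sketch: interaction information I(A;B;C) from a joint p[a,b,c].
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_info(p_xy):
    return entropy(p_xy.sum(axis=1)) + entropy(p_xy.sum(axis=0)) - entropy(p_xy.ravel())

def interaction_info(p_abc):
    p_ab_c = p_abc.reshape(-1, p_abc.shape[2])   # joint (A,B) against C
    p_ac = p_abc.sum(axis=1)                     # marginalize out B
    p_bc = p_abc.sum(axis=0)                     # marginalize out A
    return mutual_info(p_ab_c) - mutual_info(p_ac) - mutual_info(p_bc)

# XOR-like example: C = A xor B, a purely synergistic (positive) interaction.
p = np.zeros((2, 2, 2))
for a in (0, 1):
    for b in (0, 1):
        p[a, b, a ^ b] = 0.25
print(f"I(A;B;C) = {interaction_info(p):+.3f} bits")    # +1.000
```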

(Partial) history of independent reinventions:
– Quastler ’53 (Info. Theory in Biology) - measure of specificity
– McGill ’54 (Psychometrika) - interaction information
– Han ’80 (Information & Control) - multiple mutual information
– Yeung ’91 (IEEE Trans. on Inf. Theory) - mutual information
– Grabisch & Roubens ’99 (I. J. of Game Theory) - Banzhaf interaction index
– Matsuda ’00 (Physical Review E) - higher-order mutual inf.
– Brenner et al. ’00 (Neural Computation) - average synergy
– Demšar ’02 (a thesis in machine learning) - relative information gain
– Bell ’03 (NIPS02, ICA2003) - co-information
– Jakulin ’02 - interaction gain

How informative are A and B together?

Page 11: Machine Learning  based on Attribute Interactions

Interaction Dendrogram

[Interaction dendrogram: attribute clusters labelled farming, soil, and vegetation; useful attributes separated from useless attributes.]

In classification tasks we are only interested in those interactions that involve the label.

Page 12: Machine Learning  based on Attribute Interactions

Interaction Graph

• The Titanic data set
  – Label: survived?
  – Attributes: describe the passenger or crew member
• 2-way interactions:
  – Sex, then Class; Age not as important
• 3-way interactions:
  – negative: the ‘Crew’ dummy is wholly contained within ‘Class’; ‘Sex’ largely explains the death rate among the crew.
  – positive:
    • Children from the first and second class were prioritized.
    • Men from the second class mostly died (third-class men and the crew were better off).
    • Female crew members had good odds of survival.

blue: redundancy, negative interaction; red: synergy, positive interaction

Page 13: Machine Learning  based on Attribute Interactions

An Interaction Drilled

• Data for ~600 people.
• What is the loss assuming no interaction between eye and hair color?
• Area corresponds to probability:
  – black square: actual probability
  – colored square: predicted probability
• Colors encode the type of error; the more saturated the color, the more “significant” the error. Codes:
  – blue: overestimate
  – red: underestimate
  – white: correct estimate

KL-d: 0.178

Page 14: Machine Learning  based on Attribute Interactions

Rules = Constraints

• Rule 1: Blonde hair is connected with blue or green eyes. (KL-d: 0.045)
• Rule 2: Black hair is connected with brown eyes. (KL-d: 0.134)
• Both rules: KL-d: 0.022
• No interaction: KL-d: 0.178

Page 15: Machine Learning  based on Attribute Interactions

Attribute Value Taxonomies

[Example on the ADULT/CENSUS data set.]

Interactions can also be computed between pairs of attribute (or label) values. This way, we can structure attributes with many values (e.g., Cartesian products ☺).

Page 16: Machine Learning  based on Attribute Interactions

Attribute Selection with Interactions

• 2-way interactions I(A;Y) are the staple of attribute selection
  – Examples: information gain, Gini ratio, etc.
  – Myopia! We ignore both positive and negative interactions.
• Compare this with controlled 2-way interactions: I(A;Y | B,C,D,E,…)
  – Examples: Relief, regression coefficients
  – We have to build a model on all attributes anyway, making many assumptions… What does it buy us?
  – We add another attribute, and the usefulness of a previous attribute is reduced?

Page 17: Machine Learning  based on Attribute Interactions

Attribute Subset Selection with NBC

The calibration of the classifier (expected likelihood of an instance’s label) first improves then deteriorates as we add attributes. The optimal number is ~8 attributes. The first few attributes are important, the rest is noise?

Page 18: Machine Learning  based on Attribute Interactions

Attribute Subset Selection with NBC

NO! We sorted the attributes from the worst to the best. It is some of the best attributes that ruin the performance! Why? NBC gets confused by redundancies.

Page 19: Machine Learning  based on Attribute Interactions

Accounting for Redundancies

At each step, we pick the next best attribute, accounting for the attributes already in the model:
– Fleuret’s procedure (sketched in code below)
– Our procedure
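A rough sketch of Fleuret-style greedy selection as the criterion is usually stated: pick the attribute A that maximizes the minimum conditional mutual information I(A;Y|B) over the already-selected attributes B. The code and its count-based entropy estimators are an illustrative reconstruction, not the thesis implementation, and the thesis’s own procedure is not reproduced here.

```python
# Illustrative sketch of Fleuret-style greedy attribute selection for small,
# discrete data: redundant attributes are penalized via I(A;Y|B).
import numpy as np
from collections import Counter

def H(*cols):
    """Joint entropy (bits) of one or more discrete columns."""
    counts = Counter(zip(*cols))
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return -np.sum(p * np.log2(p))

def I(a, y):                       # mutual information I(A;Y)
    return H(a) + H(y) - H(a, y)

def I_cond(a, y, b):               # conditional mutual information I(A;Y|B)
    return H(a, b) + H(y, b) - H(a, y, b) - H(b)

def select(X, y, k):
    """Greedily select up to k column indices of X."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(min(k, len(remaining))):
        def score(j):
            if not selected:
                return I(X[:, j], y)
            return min(I_cond(X[:, j], y, X[:, b]) for b in selected)
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```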

Page 20: Machine Learning  based on Attribute Interactions

Example: the naïve Bayesian Classifier

[Plot comparing attribute subset selection: myopic (→) vs. interaction-proof (↑).]

Page 21: Machine Learning  based on Attribute Interactions

Predicting with Interactions

• Interactions are meaningful, self-contained views of the data.
• Can we use these views for prediction?
• It is easy if the views do not overlap: we just multiply them together and normalize: P(a,b)P(c)P(d,e,f).
• If they do overlap, e.g. two views P(x1,y) and P(x2,y) sharing y:

  P(y | x1, x2) ∝ P(x1, y) P(x2, y) / P(y)

• In a general overlap situation, the Kikuchi approximation efficiently handles the intersections between interactions, and intersections-of-intersections.
• Algorithm: select interactions, use the Kikuchi approximation to fuse them into a joint prediction, and use this to classify.
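A minimal sketch of this two-view fusion: the joint tables P(x1,y) and P(x2,y) below are made up and share the class marginal P(y). This is the simple overlapping special case, not the full Kikuchi machinery.

```python
# Illustrative sketch: fuse two overlapping views into a class prediction,
# P(y|x1,x2) proportional to P(x1,y) * P(x2,y) / P(y).
import numpy as np

p_x1y = np.array([[0.20, 0.05],     # rows: values of x1, cols: classes y
                  [0.10, 0.65]])
p_x2y = np.array([[0.15, 0.30],     # rows: values of x2, cols: classes y
                  [0.15, 0.40]])
p_y = p_x1y.sum(axis=0)             # class prior P(y), shared by both views

def predict(x1, x2):
    unnorm = p_x1y[x1] * p_x2y[x2] / p_y
    return unnorm / unnorm.sum()    # normalize over classes

print(predict(x1=0, x2=1))
```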

Page 22: Machine Learning  based on Attribute Interactions

Interaction Models

• Transparent and intuitive
• Efficient
• Quick
• Can be improved by replacing Kikuchi with conditional MaxEnt, and the Cartesian product with something better.

Page 23: Machine Learning  based on Attribute Interactions

Summary of the Talk

• Interactions are a good metaphor for understanding models and data. They can be a part of the hypothesis space, but they do not have to be.

• Probability is crucial for real-world problems.
• Watch your assumptions (utility, model, algorithm, data).
• Information theory provides solid notation.
• The Bayesian approach to modelling is very robust (naïve Bayes and Bayes nets are not Bayesian approaches).

Page 24: Machine Learning  based on Attribute Interactions

Summary of Contributions

Practice
• A number of novel visualization methods.
• A heuristic for efficient non-myopic attribute selection.
• An interaction-centered machine learning method, Kikuchi-Bayes.
• A family of Bayesian priors for consistent modelling with interactions.

Theory
• A meta-model of machine learning.
• A formal definition of a k-way interaction, independent of the utility and hypothesis space.
• A thorough historic overview of related work.
• A novel view on interaction significance tests.