Machine Learning based on Attribute Interactions
Aleks Jakulin
Advised by Acad. Prof. Dr. Ivan Bratko
2003-2005
Learning = Modelling
Utility = -Loss
[Diagram: Data and a Hypothesis Space feed a Learning Algorithm, which produces the MODEL. Notation: B { A means “A bounds B”.]
The fixed data sample restricts the model to be consistent with it.
- data shapes the model
- model is made of possible hypotheses
- model is generated by an algorithm
- utility is the goal of a model
Our Assumptions about Models
• Probabilistic Utility: logarithmic loss (alternatives: classification accuracy, Brier score, RMSE)
• Probabilistic Hypotheses: multinomial distribution, mixture of Gaussians (alternatives: classification trees, linear models)
• Algorithm: maximum likelihood (greedy), Bayesian integration (exhaustive)
• Data: instances + attributes
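As a concrete reading of the utility assumption, here is logarithmic loss of probabilistic predictions (a minimal sketch; the prediction table and labels are hypothetical):

```python
import numpy as np

def log_loss(pred, labels):
    """Logarithmic loss: -log2 P(correct label), averaged over instances."""
    return -np.mean(np.log2(pred[np.arange(len(labels)), labels]))

# Hypothetical predictions P(label | instance) for three instances, two classes.
pred = np.array([[0.8, 0.2],
                 [0.3, 0.7],
                 [0.5, 0.5]])
labels = np.array([0, 1, 0])
print(log_loss(pred, labels))  # bits per instance; lower is better
```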
Expected Minimum Loss = Entropy
[Venn diagram of the entropies of attributes A and C:]
• H(C): entropy given C’s empirical probability distribution (e.g., p = [0.2, 0.8]).
• H(A): information which came with the knowledge of A.
• I(A;C) = H(A) + H(C) - H(AC): mutual information or information gain; how much do A and C have in common?
• H(C|A) = H(C) - I(A;C): conditional entropy, the remaining uncertainty in C after learning A.
• H(AC): joint entropy.
The diagram is a visualization of a probabilistic model P(A,C).
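A minimal numeric sketch of these quantities (plain NumPy; the joint table is hypothetical, chosen so that C’s marginal matches the slide’s p = [0.2, 0.8]):

```python
import numpy as np

def H(p):
    """Entropy in bits of a probability vector (zeros are skipped)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# Hypothetical joint model P(A,C) over two binary attributes.
P_AC = np.array([[0.10, 0.10],
                 [0.10, 0.70]])
P_A, P_C = P_AC.sum(axis=1), P_AC.sum(axis=0)   # P_C == [0.2, 0.8]

H_A, H_C, H_AC = H(P_A), H(P_C), H(P_AC)
I_AC = H_A + H_C - H_AC        # mutual information I(A;C)
H_C_given_A = H_C - I_AC       # conditional entropy H(C|A)
```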
2-Way Interactions
• Probabilistic models take the form of P(A,B)
• We have two models:
– Interaction allowed: P_Y(a,b) := F(a,b)
– Interaction disallowed: P_N(a,b) := P(a)P(b) = F(a)G(b)
• The error that P_N makes when approximating P_Y:
D(P_Y || P_N) := E_{x~P_Y}[L(x, P_N)] = I(A;B) (mutual information)
• Also applies to predictive models:
E_A[D(P(Y|A) || P(Y))] = I(A;Y)
• Also applies to Pearson’s correlation coefficient:
I(A;B) = -½ log(1 - r²), where P is a bivariate Gaussian obtained via maximum likelihood.
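A quick numeric check that the divergence from the product of marginals equals mutual information (a sketch with a hypothetical joint table):

```python
import numpy as np

# Hypothetical joint model with the interaction allowed.
P_Y = np.array([[0.10, 0.10],
                [0.10, 0.70]])
# Interaction-disallowed approximation: product of the marginals.
P_N = np.outer(P_Y.sum(axis=1), P_Y.sum(axis=0))

# KL divergence D(P_Y || P_N), in bits ...
mask = P_Y > 0
D = (P_Y[mask] * np.log2(P_Y[mask] / P_N[mask])).sum()

# ... equals the mutual information I(A;B) = H(A) + H(B) - H(AB).
H = lambda p: -(p[p > 0] * np.log2(p[p > 0])).sum()
I = H(P_Y.sum(axis=1)) + H(P_Y.sum(axis=0)) - H(P_Y.ravel())
assert np.isclose(D, I)
```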
Rajski’s Distance
• Attributes that have more in common can be visualized as closer in some imaginary Euclidean space.
• How do we avoid the influence of many-/few-valued attributes? (Complex attributes seem to have more in common.)
• Rajski’s distance: d(A,B) = 1 - I(A;B)/H(A,B)
• This is a metric (it satisfies, e.g., the triangle inequality).
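A sketch of the distance, assuming the d(A,B) = 1 - I(A;B)/H(A,B) normalization given above:

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def rajski_distance(P_AB):
    """d(A,B) = 1 - I(A;B)/H(A,B): 0 for perfectly dependent attributes,
    1 for independent ones; normalizing by the joint entropy removes the
    advantage that many-valued attributes would otherwise enjoy."""
    I_AB = H(P_AB.sum(axis=1)) + H(P_AB.sum(axis=0)) - H(P_AB)
    return 1.0 - I_AB / H(P_AB)

# Hypothetical joint table of two binary attributes:
P = np.array([[0.4, 0.1],
              [0.1, 0.4]])
print(rajski_distance(P))  # strictly between 0 and 1
```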
Interactions between US Senators
[Interaction matrix over Democrats and Republicans.]
dark: strong interaction, high mutual information
light: weak interaction, low mutual information
A Taxonomy of Machine Learning Algorithms
CMC dataset
Interaction dendrogram
3-Way Interactions
[Diagram: attributes A and B, and the label C. The 2-way interactions are the importance of attribute A, the importance of attribute B, and the attribute correlation.]
3-Way Interaction: what is common to A, B and C together, and cannot be inferred from any subset of the attributes.
Interaction Information
I(A;B;C) := I(AB;C) - I(B;C) - I(A;C)
          = I(B;C|A) - I(B;C)
          = I(A;C|B) - I(A;C)
(I(AB;C) asks: how informative are A and B together?)
(Partial) history of independent reinventions:
• Quastler ’53 (Info. Theory in Biology): measure of specificity
• McGill ’54 (Psychometrika): interaction information
• Han ’80 (Information & Control): multiple mutual information
• Yeung ’91 (IEEE Trans. on Inf. Theory): mutual information
• Grabisch & Roubens ’99 (Int. J. of Game Theory): Banzhaf interaction index
• Matsuda ’00 (Physical Review E): higher-order mutual information
• Brenner et al. ’00 (Neural Computation): average synergy
• Demšar ’02 (a thesis in machine learning): relative information gain
• Bell ’03 (NIPS02, ICA2003): co-information
• Jakulin ’02: interaction gain
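Interaction information can be computed from joint entropies alone. A minimal sketch (plain NumPy; the XOR data is a hypothetical example of a purely synergistic interaction):

```python
import numpy as np

def joint_entropy(*cols):
    """Entropy (in bits) of the empirical joint distribution of the columns."""
    _, counts = np.unique(np.column_stack(cols), axis=0, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def interaction_information(a, b, c):
    """I(A;B;C) = I(AB;C) - I(B;C) - I(A;C), rewritten via joint entropies.
    Positive values indicate synergy, negative values redundancy."""
    return (joint_entropy(a, b) + joint_entropy(a, c) + joint_entropy(b, c)
            - joint_entropy(a) - joint_entropy(b) - joint_entropy(c)
            - joint_entropy(a, b, c))

# Hypothetical example: C = XOR(A, B) is invisible to any 2-way view.
rng = np.random.default_rng(0)
a = rng.integers(0, 2, 10_000)
b = rng.integers(0, 2, 10_000)
c = a ^ b
print(interaction_information(a, b, c))  # close to +1 bit
```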
Interaction Dendrogram
[Interaction dendrogram with clusters of useful and useless attributes; labels include farming, soil, vegetation.]
In classification tasks we are only interested in those interactions that involve the label.
Interaction Graph
• The Titanic data set
– Label: survived?
– Attributes: describe the passenger or crew member
• 2-way interactions:
– Sex then Class; Age not as important
• 3-way interactions:
– negative: the ‘Crew’ dummy is wholly contained within ‘Class’; ‘Sex’ largely explains the death rate among the crew.
– positive:
• Children from the first and second class were prioritized.
• Men from the second class mostly died (third-class men and the crew were better off).
• Female crew members had good odds of survival.
blue: redundancy, negative interaction; red: synergy, positive interaction
An Interaction Drilled
Data for ~600 people.
What’s the loss assuming no interaction between eye and hair color?
Area corresponds to probability:
• black square: actual probability
• colored square: predicted probability
Colors encode the type of error. The more saturated the color, the more “significant” the error. Codes:
• blue: overestimate
• red: underestimate
• white: correct estimate
KL-d: 0.178
Rules = Constraints
• Rule 1: Blonde hair is connected with blue or green eyes. (KL-d: 0.045)
• Rule 2: Black hair is connected with brown eyes. (KL-d: 0.134)
• Both rules: KL-d: 0.022
• No interaction: KL-d: 0.178
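One way to make “rules as constraints” operational is iterative proportional fitting: it yields the maximum-entropy table satisfying the chosen constraints, and the KL-divergence to the data measures the remaining loss. A hedged sketch; the contingency table here is hypothetical, not the talk’s ~600-person data:

```python
import numpy as np

# Hypothetical hair (rows) x eye-colour (columns) counts.
# Columns: blue, green, brown (an assumed ordering).
counts = np.array([[60., 25.,  5.],    # blonde
                   [ 5., 10., 70.],    # black
                   [35., 45., 45.]])   # brown hair
P = counts / counts.sum()

def row_mask(i):
    m = np.zeros(P.shape, dtype=bool); m[i, :] = True; return m

def col_mask(j):
    m = np.zeros(P.shape, dtype=bool); m[:, j] = True; return m

def fit(P, partitions, n_iter=200):
    """Iterative proportional fitting: starting from the uniform table,
    rescale each block of each partition to match its mass under P."""
    Q = np.full_like(P, 1.0 / P.size)
    for _ in range(n_iter):
        for blocks in partitions:
            for m in blocks:
                if Q[m].sum() > 0:
                    Q[m] *= P[m].sum() / Q[m].sum()
    return Q

marginals = [[row_mask(i) for i in range(3)],
             [col_mask(j) for j in range(3)]]

# Rule 1 as a constraint: total mass of "blonde & (blue or green)".
rule1 = np.zeros(P.shape, dtype=bool); rule1[0, :2] = True

Q_indep = fit(P, marginals)                        # no interaction
Q_rule1 = fit(P, marginals + [[rule1, ~rule1]])    # marginals + Rule 1

kl = lambda p, q: (p[p > 0] * np.log2(p[p > 0] / q[p > 0])).sum()
print(kl(P, Q_indep), kl(P, Q_rule1))  # the rule reduces the divergence
```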
ADULT/CENSUS
Attribute Value Taxonomies
Interactions can also be computed between pairs of attribute (or label) values. This way, we can structure attributes with many values (e.g., Cartesian products ☺).
Attribute Selection with Interactions
• 2-way interactions I(A;Y) are the staple of attribute selection
– Examples: information gain, Gini ratio, etc.
– Myopia! We ignore both positive and negative interactions.
• Compare this with controlled 2-way interactions: I(A;Y | B,C,D,E,…)
– Examples: Relief, regression coefficients
– We have to build a model on all attributes anyway, making many assumptions… What does it buy us?
– We add another attribute, and the usefulness of a previous attribute is reduced?
Attribute Subset Selection with NBC
The calibration of the classifier (the expected likelihood of an instance’s label) first improves, then deteriorates as we add attributes. The optimal number is ~8 attributes. Are the first few attributes important, and the rest just noise?
Attribute Subset Selection with NBC
NO! We sorted the attributes from the worst to the best. It is some of the best attributes that ruin the performance! Why? NBC gets confused by redundancies.
Accounting for Redundancies
At each step, we pick the next best attribute, accounting for the set S of attributes already in the model:
– Fleuret’s procedure: choose the attribute A that maximizes min_{B∈S} I(A;Y|B)
– Our procedure:
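A minimal sketch of Fleuret’s criterion (pick the attribute with the largest worst-case conditional mutual information with the label; the eager score updates below are a simplification of his lazy algorithm):

```python
import numpy as np

def joint_entropy(*cols):
    _, counts = np.unique(np.column_stack(cols), axis=0, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def cond_mi(a, y, b):
    """Conditional mutual information I(A;Y|B) in bits."""
    return (joint_entropy(a, b) + joint_entropy(y, b)
            - joint_entropy(a, y, b) - joint_entropy(b))

def cmim(X, y, k):
    """Greedily select k columns of X: each step takes the attribute whose
    worst-case I(A;Y|B), over the already-selected B, is largest."""
    const = np.zeros(len(y), dtype=int)     # conditioning on nothing: I(A;Y)
    score = {j: cond_mi(X[:, j], y, const) for j in range(X.shape[1])}
    selected = []
    while len(selected) < k and score:
        best = max(score, key=score.get)
        selected.append(best)
        del score[best]
        for j in score:                     # penalize redundancy with the pick
            score[j] = min(score[j], cond_mi(X[:, j], y, X[:, best]))
    return selected
```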
Example: the naïve Bayesian Classifier
[Plot labels: myopic →, ↑ interaction-proof]
Predicting with Interactions
• Interactions are meaningful, self-contained views of the data.
• Can we use these views for prediction?
• It’s easy if the views do not overlap: we just multiply them together and normalize: P(a,b)P(c)P(d,e,f).
• If they do overlap, e.g. two views sharing the label y:
P(y|x1,x2) ∝ P(x1,y) P(x2,y) / P(y)
• In a general overlap situation, the Kikuchi approximation efficiently handles the intersections between interactions, and the intersections-of-intersections.
• Algorithm: select interactions, use the Kikuchi approximation to fuse them into a joint prediction, and use this to classify.
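For the two-view overlap above, the fusion formula can be written out directly (a sketch with hypothetical view tables; the full Kikuchi machinery is not shown):

```python
import numpy as np

def fuse_predict(P_x1y, P_x2y, P_y, x1, x2):
    """P(y|x1,x2) ∝ P(x1,y) · P(x2,y) / P(y): fuse two overlapping
    views that share only the label y, then normalize."""
    scores = P_x1y[x1, :] * P_x2y[x2, :] / P_y
    return scores / scores.sum()

# Hypothetical 2x2 view tables P(x, y) and the matching label prior P(y).
P_x1y = np.array([[0.30, 0.10],
                  [0.20, 0.40]])
P_x2y = np.array([[0.25, 0.35],
                  [0.25, 0.15]])
P_y = np.array([0.5, 0.5])
print(fuse_predict(P_x1y, P_x2y, P_y, x1=0, x2=1))
```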
Interaction Models
• Transparent and intuitive
• Efficient
• Quick
• Can be improved by replacing Kikuchi with conditional MaxEnt, and the Cartesian product with something better.
Summary of the Talk
• Interactions are a good metaphor for understanding models and data. They can be a part of the hypothesis space, but do not have to be.
• Probability is crucial for real-world problems.
• Watch your assumptions (utility, model, algorithm, data).
• Information theory provides solid notation.
• The Bayesian approach to modelling is very robust (naïve Bayes and Bayes nets are not Bayesian approaches).
Summary of Contributions
Practice
• A number of novel visualization methods.
• A heuristic for efficient non-myopic attribute selection.
• An interaction-centered machine learning method, Kikuchi-Bayes.
• A family of Bayesian priors for consistent modelling with interactions.
Theory
• A meta-model of machine learning.
• A formal definition of a k-way interaction, independent of the utility and hypothesis space.
• A thorough historic overview of related work.
• A novel view on interaction significance tests.