
Page 1:

[email protected]

School of Education University of Tampere, Finland

Introduction to Discrete Bayesian Methods

Petri Nokelainen

Page 2:

Outline

• Overview
• Introduction to Bayesian Modeling
• Bayesian Classification Modeling
• Bayesian Dependency Modeling
• Bayesian Unsupervised Model-based Visualization

Page 3:

(Nokelainen, 2008.)

[Figure: associated software: SPSS, AMOS, SPSS Extension, MPlus.]

Overview

Page 4:

B-Course

BayMiner

BDM = Bayesian Dependency Modeling

BCM = Bayesian Classification Modeling

BUMV = Bayesian Unsupervised Model-based Visualization

(Nokelainen & Ruohotie, 2009.)

(Nokelainen, Silander, Ruohotie & Tirri, 2007.)

Overview

Page 5:

COMMON FACTORS: PUB_T, CC_PR, CC_HEPA, C_SHO, C_FAIL, CC_AB, CC_ES

The classification accuracy of the best model found is 83.48% (58.57%).

Bayesian Classification Modeling

http://b-course.cs.helsinki.fi

Page 6:

Bayesian Dependency Modeling

http://b-course.cs.helsinki.fi

Page 7:

Bayesian Unsupervised Model-based Visualization

http://www.bayminer.com

Page 8:

Outline

• Overview
• Introduction to Bayesian Modeling
• Bayesian Classification Modeling
• Bayesian Dependency Modeling
• Bayesian Unsupervised Model-based Visualization

Page 9:

Introduction to Bayesian Modeling

• From the social science researcher's point of view, the requirements of traditional frequentistic statistical analysis are very challenging.

• For example, the assumption of normality of both the phenomenon under investigation and the data is a prerequisite for traditional parametric frequentistic calculations.

Continuous (0 … ∞): age, income, temperature, …

Discrete (0, 1, 2, …): FSIQ in the WAIS-III, Likert scale, favourite colors, gender, …

Page 10:

Introduction to Bayesian Modeling

• In situations where
– a latent construct cannot be appropriately represented as a continuous variable,
– ordinal or discrete indicators do not reflect underlying continuous variables,
– the latent variables cannot be assumed to be normally distributed,
traditional Gaussian modeling is clearly not appropriate.

• In addition, normal distribution analysis sets minimum requirements for the number of observations, and the measurement level of the variables should be continuous.

Page 11:

Introduction to Bayesian Modeling

• Frequentistic parametric statistical techniques are designed for normally distributed (both theoretically and empirically) indicators that have linear dependencies.
– Univariate normality
– Multivariate normality
– Bivariate linearity

Page 12:

(Nokelainen, 2008, p. 119)

Page 13:

• The upper part of the figure contains two sections, namely “parametric” and “non-parametric”, divided into eight sub-sections (“D, N, IO, ML, MD, O, C, S”; see the legend below).

• The parametric approach is viable only if
– 1) both the phenomenon modeled and the sample follow a normal distribution;
– 2) the sample size is large enough (at least 30 observations);
– 3) continuous indicators are used;
– 4) the dependencies between the observed variables are linear.

• Otherwise, non-parametric techniques should be applied.

D = Design (ce = controlled experiment, co = correlational study)
N = Sample size
IO = Independent observations
ML = Measurement level (c = continuous, d = discrete, n = nominal)
MD = Multivariate distribution (n = normal, similar)
O = Outliers
C = Correlations
S = Statistical dependencies (l = linear, nl = non-linear)

Page 14:

Introduction to Bayesian Modeling

N = 11 500

Page 15:

Introduction to Bayesian Modeling

The Bayesian method
(1) is parameter-free: user input is not required; instead, the prior distributions of the model offer a theoretically justifiable method for affecting the model construction;
(2) works with probabilities and can hence be expected to produce robust results with discrete data containing nominal and ordinal attributes;
(3) has no limit for minimum sample size;
(4) is able to analyze both linear and non-linear dependencies;
(5) assumes no multivariate normal model;
(6) allows prediction.

Page 16:

Introduction to Bayesian Modeling

• Probability is a mathematical construct that behaves in accordance with certain rules and can be used to represent uncertainty.
– Classical statistical inference is based on a frequency interpretation of probability; Bayesian inference is based on a ”subjective” or ”degree of belief” interpretation.

• Bayesian inference uses conditional probabilities to represent uncertainty.

• P(H | E,I) - the probability of unknown things or ”hypothesis” (H), given the evidence (E) and background information (I).

Page 17:

Introduction to Bayesian Modeling

• The essence of Bayesian inference is in the rule, known as Bayes' theorem, that tells us how to update our initial probabilities P(H) if we see evidence E, in order to find out P(H|E).

P(H|E) = P(E|H) · P(H) / [ P(E|H) · P(H) + P(E|~H) · P(~H) ]

• A priori probability
• Conditional probability
• Posterior probability
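The update above is straightforward to express in code. Below is a minimal Python sketch of the two-hypothesis form of the theorem; the function and argument names are mine, for illustration only:

```python
def posterior(prior_h, p_e_given_h, p_e_given_not_h):
    """Bayes' theorem: P(H|E) for a hypothesis H and its complement ~H."""
    p_h = prior_h                                               # a priori P(H)
    evidence = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)  # P(E)
    return p_e_given_h * p_h / evidence                         # posterior P(H|E)
```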

Page 18:

Introduction to Bayesian Modeling

• The theorem was invented by the English reverend Thomas Bayes (1701-1761) and published posthumously (1763).

Page 19:

Introduction to Bayesian Modeling

• Bayesian inference comprises the following three principal steps:
(1) Obtain the initial probabilities P(H) for the unknown things (prior distribution).
(2) Calculate the probabilities of the evidence E (data) given different values for the unknown things, i.e., P(E | H) (likelihood or conditional distribution).
(3) Calculate the probability distribution of interest P(H | E) using Bayes' theorem (posterior distribution).

• Bayes' theorem can be used sequentially.

Page 20:

Introduction to Bayesian Modeling

– If we first receive some evidence E (data), and calculate the posterior P(H | E), and at some later point in time receive more data E', the calculated posterior can be used in the role of prior to calculate a new posterior P(H | E,E') and so on.

– The posterior P(H | E) expresses all the necessary information to perform predictions.

– The more evidence we get, the more certain we will become of the unknowns, until all but one value combination for the unknowns have probabilities so close to zero that they can be neglected.
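Sequential use is then just feeding each posterior back in as the next prior. A small illustration with the posterior() sketch from slide 17 (the likelihood values are invented for the illustration):

```python
p = 0.01                         # initial prior P(H)
for _ in range(3):               # evidence E, E', E'' arriving in sequence
    p = posterior(p, 0.9, 0.1)   # previous posterior acts as the new prior
    print(round(p, 3))           # 0.083, 0.45, 0.88 - certainty accumulates
```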

Page 21:

C_Example 1: Applying Bayes’ Theorem

• Company A is employing workers for short-term jobs that are well paid.

• The job sets certain prerequisites for applicants' linguistic abilities.

• Earlier, all the applicants were interviewed, but nowadays that has become an impossible task, as both the number of open vacancies and the number of applicants have increased enormously.

• The personnel department of the company was ordered to develop a questionnaire to preselect the most suitable applicants for the interview.

Page 22:

C_Example 1: Applying Bayes’ Theorem

• The psychometrician who developed the instrument estimates that it works out right for 90 out of 100 applicants, if they are honest.

• We know on the basis of earlier interviews that the terms (linguistic abilities) are valid for one person per 100 in the target population.

• The question is: If an applicant gets enough points to participate in the interview, is he or she hired for the job (after an interview)?

Page 23:

C_Example 1: Applying Bayes’ Theorem

• The a priori probability P(H) is described by the number of those people in the target population who really are able to meet the requirements of the task (1 out of 100 = .01).

• The counter-assumption of the a priori is P(~H), which equals 1 − P(H) = .99.

• The psychometrician's belief about how the instrument works is called the conditional probability, P(E|H) = .9.

• The instrument's failure to indicate non-valid applicants, i.e., those who are not able to succeed in the following interview, is stated as P(E|~H), which equals .1.
– These values need not sum to one!

Page 24:

P(H|E) = P(E|H) · P(H) / [ P(E|H) · P(H) + P(E|~H) · P(~H) ]
       = (.9 · .01) / [ (.9 · .01) + (.1 · .99) ] ≈ .08

• A priori probability
• Conditional probability
• Posterior probability

C_Example 1: Applying Bayes’ Theorem

Page 25:

C_Example 1: Applying Bayes’ Theorem

Page 26:

C_Example 1: Applying Bayes’ Theorem

• What if the measurement error of the psychometrician's instrument had been 20 per cent?
– P(E|H) = 0.8, P(E|~H) = 0.2

Page 27:

C_Example 1: Applying Bayes’ Theorem

Page 28:

C_Example 1: Applying Bayes’ Theorem

• What if the measurement error of the psychometrician's instrument had been only one per cent?
– P(E|H) = 0.99, P(E|~H) = 0.01
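All three scenarios can be checked numerically with the posterior() sketch from slide 17 (prior P(H) = .01 throughout; the surrounding figure slides are assumed to show the same computation):

```python
for p_e_h, p_e_not_h in [(0.9, 0.1), (0.8, 0.2), (0.99, 0.01)]:
    print(round(posterior(0.01, p_e_h, p_e_not_h), 3))
# 0.083 (10% error), 0.039 (20% error), 0.5 (1% error):
# even a near-perfect instrument leaves P(H|E) at only 50%
# because the base rate P(H) = .01 is so low.
```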

Page 29:

C_Example 1: Applying Bayes’ Theorem

Page 30:

• Quite often people estimate probabilities to be too high or too low, as they are not able to update their beliefs even in simple decision-making tasks when situations change dynamically (Anderson, 1995).

C_Example 1: Applying Bayes’ Theorem

Page 31:

C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

• One of the most important rules educational science journals apply to judge the scientific merits of any submitted manuscript is that all reported results should be based on the so-called ‘null hypothesis significance testing procedure’ (NHSTP) and its featured product, the p-value.

• Gigerenzer, Krauss and Vitouch (2004, p. 392) describe ‘the null ritual’ as follows:
– 1) Set up a statistical null hypothesis of “no mean difference” or “zero correlation.” Don’t specify the predictions of your research or of any alternative substantive hypotheses;
– 2) Use 5 per cent as a convention for rejecting the null. If significant, accept your research hypothesis;
– 3) Always perform this procedure.

Page 32:

C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

– A p-value is the probability of the observed data (or of more extreme data points), given that the null hypothesis H0 is true, P(D|H0) (id.).

• The first common misunderstanding is that the p-value of, say, a t-test would describe how probable it is to obtain the same result if the study were repeated many times (Thompson, 1994).

• Gerd Gigerenzer and his colleagues (id., p. 393) call this replication fallacy as “P(D|H0) is confused with 1—P(D).”

Page 33:

C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

• The second misunderstanding, shared by both applied statistics teachers and students, is that the p-value would prove or disprove H0. However, a significance test can only provide probabilities, not prove or disprove the null hypothesis.

• Gigerenzer (id., p. 393) calls this fallacy an illusion of certainty: “Despite wishful thinking, p(D|H0) is not the same as P(H0|D), and a significance test does not and cannot provide a probability for a hypothesis.”

– Bayesian statistics provides a way of calculating the probability of a hypothesis (discussed later in this section).

Page 34:

C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

• My statistics course grades (Autumn 2006, n = 12) ranged from one to five as follows: 1) n = 3; 2) n = 2; 3) n = 4; 4) n = 2; 5) n = 1, showing that the frequency of the lowest grade (”1”) from the course is three (25.0%).
– Previous data from the same course (2000-2005) shows that only five students out of 107 (4.7%) had the lowest grade.

• Next, I will use the classical statistical approach (the likelihood principle) and Bayesian statistics to calculate whether the number of the lowest course grades is exceptionally high in my latest course compared to my earlier stat courses.

Page 35:

C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

• There are numerous possible reasons behind such a development; for example, I may have become more critical in my assessment, or the students may be less motivated to learn quantitative techniques.

• However, I believe that the most important difference between the last and the preceding courses is that the assessment was based on a computer exercise with statistical computations.
– The preceding courses were assessed only with essay answers.

Page 36:

C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

• I assume that the 12 students earned their grades independently of each other (independent observations), as the computer exercise was conducted under my or my assistant’s supervision.

• I further assume that the chance of getting the lowest grade, θ, is the same for each student.
– Therefore X, the number of lowest grades (”1” on the scale from 1 to 5) among the 12 students in the latest stat course, has a binomial (12, θ) distribution: X ~ Bin(12, θ).

– For any integer r between 0 and 12,

P(r | θ, n = 12) = C(12, r) · θ^r · (1 − θ)^(12 − r)

Page 37:

C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

• The expected number of lowest grades is 12 · (5/107) ≈ 0.561.

• Theta (θ) is obtained by dividing the expected number of lowest grades by the number of students: 0.561 / 12 ≈ 0.05.

• The null hypothesis is formulated as follows: H0: θ = 0.05, stating that the rate of the lowest grades in the current stat course is nothing exceptional and is comparable to the previous courses’ rates.

Page 38:

C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

• Three alternative hypotheses are formulated to address the concern of the increased number of lowest grades (6, 7 and 8 per cent, respectively): H1: θ = 0.06; H2: θ = 0.07; H3: θ = 0.08.
– H1: 12/(107/6) ≈ .67 -> .67/12 ≈ .056 ≈ .06
– H2: 12/(107/7) ≈ .79 -> .79/12 ≈ .065 ≈ .07
– H3: 12/(107/8) ≈ .90 -> .90/12 ≈ .075 ≈ .08

Page 39:

• To compare the hypotheses, we calculate binomial distributions for each value of θ.

• For example, the null hypothesis (H0) calculation yields

P(r = 3 | θ = .05, n = 12) = [12! / (3! (12 − 3)!)] · (.05)^3 · (1 − .05)^(12 − 3)
= (479001600 / 2177280) · (.05)^3 · (.95)^9 ≈ .017

C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

Page 40:

• The results for the alternative hypotheses are as follows:
– P_H1(3 | .06, 12) ≈ .027;
– P_H2(3 | .07, 12) ≈ .039;
– P_H3(3 | .08, 12) ≈ .053.

• The ratio of the hypotheses is roughly 1:2:2:3 and could be verbally interpreted with statements like “the second and third hypotheses explain the data about equally well”, or “the fourth hypothesis explains the data about three times as well as the first hypothesis”.

C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach
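The binomial likelihoods above are easy to reproduce with Python's standard library; a minimal sketch (the helper name binom_pmf is mine):

```python
from math import comb

def binom_pmf(r, n, theta):
    """P(r | theta, n) for X ~ Bin(n, theta)."""
    return comb(n, r) * theta ** r * (1 - theta) ** (n - r)

for theta in (0.05, 0.06, 0.07, 0.08):
    print(theta, round(binom_pmf(3, 12, theta), 3))
# 0.05 -> 0.017, 0.06 -> 0.027, 0.07 -> 0.039, 0.08 -> 0.053
```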

Page 41:

• Lavine (1999) reminds us that P(r | θ, n), as a function of r = 3 and θ ∈ {.05, .06, .07, .08}, describes only how well each hypothesis explains the data; no value of r other than 3 is relevant.
– For example, P(4 | .05, 12) is irrelevant, as it does not describe how well any hypothesis explains the data.
– This likelihood principle, that is, basing statistical inference only on the observed data and not on data that might have been observed, is an essential feature of the Bayesian approach.

C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

Page 42:

• The Fisherian, so-called ‘classical approach’ to testing the null hypothesis (H0: θ = .05) against the alternative hypothesis (H1: θ > .05) is to calculate the p-value, which defines the probability under H0 of observing an outcome at least as extreme as the outcome actually observed:

p = P(r = 3 | .05) + P(r = 4 | .05) + … + P(r = 12 | .05)

C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

Page 43:

• As an example, the first part of the formula is solved as follows:

P(3 | .05) = [n! / (r! (n − r)!)] · θ^r · (1 − θ)^(n − r)
= [12! / (3! (12 − 3)!)] · (.05)^3 · (1 − .05)^(12 − 3) ≈ .017

C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

Page 44:

• After calculations, the p-value of .02 would suggest H0 rejection, if the rejection level of significance is set at 5 per cent.
– Calculation of the p-value violates the likelihood principle by using P(r | θ, n) for values of r other than the observed value r = 3 (Lavine, 1999):

• The summands P(4 | .05, 12), P(5 | .05, 12), …, P(12 | .05, 12) do not describe how well any hypothesis explains the observed data.

C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach
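The p-value computation is a one-liner reusing the binom_pmf() helper sketched on slide 40 (my own check, not B-Course output):

```python
p_value = sum(binom_pmf(r, 12, 0.05) for r in range(3, 13))
print(round(p_value, 2))    # 0.02 -> H0 rejected at the 5% level

# Had only two lowest grades been observed (see slide 51):
p_value2 = sum(binom_pmf(r, 12, 0.05) for r in range(2, 13))
print(round(p_value2, 2))   # 0.12 -> H0 would not be rejected
```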

Page 45:

• A Bayesian approach continues from the same point as the classical approach, namely the probabilities given by the binomial distributions, but also makes use of other relevant sources of a priori information.
– In this domain, it is plausible to think that the computer test (“SPSS exam”) would make the number of total failures more probable than in previous times, when the evaluation was based solely on the essays.
– On the other hand, the computer test has only a 40 per cent weight in the equation that defines the final stat course grade: [.3(Essay_1) + .3(Essay_2) + .4(Computer test)]/3 = Final grade.

C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

Page 46:

– Another aspect is to consider the nature of the aforementioned tasks, as the essays are distance work assignments while the computer test is performed under observation.

– Perhaps the course grades of my earlier stat courses have a narrower dispersion due to violation of the independent observations assumption?

• For example, some students may have copy-pasted text from other sources or collaborated without permission.

– As we see, there are many sources of a priori information, which I judge to be inconclusive; thus, I define the null hypothesis to be as likely to be true as false.

C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

Page 47:

• This a priori judgment is expressed mathematically as P(H0) = 1/2 = P(H1) + P(H2) + P(H3).

• I further assume that the alternative hypotheses H1, H2 and H3 share the same likelihood: P(H1) = P(H2) = P(H3) = 1/6.

• These prior distributions summarize the knowledge about θ prior to incorporating the information from my course grades.

C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

Page 48:

• An application of Bayes' theorem yields

P(H0 | r = 3) = [P(r = 3 | H0) · P(H0)] / [P(r = 3 | H0) · P(H0) + P(r = 3 | H1) · P(H1) + P(r = 3 | H2) · P(H2) + P(r = 3 | H3) · P(H3)]

= (.017 · 1/2) / [(.017 · 1/2) + (.027 · 1/6) + (.039 · 1/6) + (.053 · 1/6)] ≈ 0.30

C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach
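The whole posterior update can be checked in a few lines, reusing binom_pmf() from slide 40 (variable names are mine; small differences from the slide come from rounding the likelihoods):

```python
priors = [1/2, 1/6, 1/6, 1/6]     # P(H0), P(H1), P(H2), P(H3)
likelihoods = [binom_pmf(3, 12, t) for t in (0.05, 0.06, 0.07, 0.08)]
joint = [l * p for l, p in zip(likelihoods, priors)]
posteriors = [j / sum(joint) for j in joint]
print([round(p, 2) for p in posteriors])   # [0.3, 0.16, 0.23, 0.31]
```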

Page 49:

• Similar calculations for the alternative hypotheses yield P(H1 | r = 3) ≈ .16; P(H2 | r = 3) ≈ .23; P(H3 | r = 3) ≈ .31.

• These posterior distributions summarize the knowledge about θ after incorporating the grade information.

• The four hypotheses seem to be about equally likely (.30 vs. .16, .23, .31).
– The odds are about 2 to 1 (.30 vs. .70) that the latest stat course had a higher rate of lowest grades than 0.05.

C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

Page 50:

• The difference between classical and Bayesian statistics would be only philosophical (probability vs. inverse probability) if they always led to similar conclusions.
– In this case the p-value would suggest rejection of H0 (p = .02).
– The Bayesian analysis would also suggest evidence against θ = .05 (.30 vs. .70, ratio of .43).

C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

Page 51:

• What if the number of the lowest grades in the last course had been two?
– The classical approach would no longer suggest H0 rejection (p = .12).
– The Bayesian result would still say that there is more evidence against than for H0 (.39 vs. .61, ratio of .64).

C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

Page 52:

Outline

• Overview
• Introduction to Bayesian Modeling
• Bayesian Classification Modeling
• Bayesian Dependency Modeling
• Bayesian Unsupervised Model-based Visualization

Page 53:

B-Course

BCM = Bayesian Classification Modeling

BDM = Bayesian Dependency Modeling

BUMV = Bayesian Unsupervised Model-based Visualization

Page 54:

Bayesian Classification Modeling

• Bayesian Classification Modeling (BCM) is implemented in the B-Course software, which is based on discrete Bayesian methods.
– This also applies to Bayesian Dependency Modeling, which is discussed later.

• ”Quantitative” indicators with a high measurement level (continuous, interval) lose more information in the discretization process than ”qualitative” indicators (ordinal, nominal), as all indicators are treated in the analysis as nominal (discrete).

Page 55:

Bayesian Classification Modeling

• For example, variable ”gender” may include numerical values ”1” (Female) or ”2” (Male) or text values ”Female” and ”Male” in discrete Bayesian analysis.

• This will inevitably lead to a loss of power (Cohen, 1988; Murphy & Myors, 1998); however, ensuring that the sample size is large enough is a simple way to address this problem.

Page 56:

Sample size estimation

• N – Population size.
• n – Estimated sample size.
• Sampling error (e) – Difference between the true (unknown) value and the observed values, if the survey were repeated (= sample collected) numerous times.
• Confidence interval – Spread of the observed values that would be seen if the survey were repeated numerous times.
• Confidence level – How often the observed values would be within sampling error of the true value if the survey were repeated numerous times.

(Murphy & Myors, 1998.)

Page 57:

Bayesian Classification Modeling

• The aim of BCM is to select the variables that are the best predictors for different class memberships (e.g., gender, job title, level of giftedness).

• In the classification process, the automatic search is looking for the best set of variables to predict the class variable for each data item.

Page 58:

Bayesian Classification Modeling

• The search procedure resembles traditional linear discriminant analysis (LDA, see Huberty, 1994), but the implementation is totally different.
– For example, the variable selection problem that is addressed with forward, backward or stepwise selection procedures in LDA is replaced with a genetic algorithm approach (e.g., Hilario, Kalousisa, Pradosa & Binzb, 2004; Hsu, 2004) in Bayesian classification modeling.

Page 59:

Bayesian Classification Modeling

• The genetic algorithm approach means that variable selection is not limited to one (or two or three) specific approach; instead, many approaches and their combinations are exploited.
– One possible approach is to begin with the presumption that models (i.e., possible predictor variable combinations) that resemble each other a lot (i.e., have almost the same variables and discretizations) are likely to be almost equally good.
– This leads to a search strategy in which models that resemble the current best model are selected for comparison, instead of picking models randomly.

Page 60:

Bayesian Classification Modeling

– Another approach is to abandon the habit of always rejecting the weakest model and instead collect a set of relatively good models.
– The next step is to combine the best parts of these models so that the resulting combined model is better than any of the original models.

• B-Course is capable of mobilizing many more viable approaches, for example, rejecting the better model (algorithms like hill climbing and simulated annealing) or trying to avoid picking a similar model twice (tabu search).

Page 61:

Bayesian Classification Modeling

Nokelainen, P., Ruohotie, P., & Tirri, H. (1999).

Page 62:

For an example of practical use of BCM, see Nokelainen, Tirri, Campbell and Walberg (2007).

Page 63:

The results of Bayesian classification modeling showed that the estimated classification accuracy of the best model found was 60%. The left-hand side of Figure 3 shows that only three variables, Olympians' Conducive Home Atmosphere (SA), Olympians' School Shortcomings (C_SHO), and Computer Literacy Composite (COMP), were successful predictors of A or C group membership. All the other variables that were not accepted in the model are to be considered connective factors between the two groups. The middle section of Figure 3 shows that the two strongest predictors were Olympians' Conducive Home Atmosphere (20.9%) and Olympians' School Shortcomings (22.6%). The confusion matrix shows that most of the A (25 correct out of 39) and the C (29 out of 47) group members were correctly classified. The matrix also shows that nine participants of group A were incorrectly classified into group C and vice versa.

Page 64:

Page 65:

Figure 4 presents predictive modeling of the A and C groups (‘‘A_C’’, A or C group membership) by Olympians' Conducive Home Atmosphere (SA), Olympians' School Shortcomings (C_SHO), and Computer Literacy Composite (COMP). The left-hand side of the figure presents the initial model with no values fixed. The model in the middle presents a scenario where all the A group members are selected. When we compare this model to the one on the right-hand side (i.e., presenting a situation where all the C group members are selected), we notice, for example, that the conditional distribution of Olympians' Conducive Home Atmosphere (SA) has changed. It shows that the highly productive Olympians reported a more conducive home atmosphere (54.0%) than the members of the low-productivity group C (23.0%).

Page 66:

Page 67:

Modeling of Vocational Excellence in Air Traffic Control

•This paper aims to describe the characteristics and predictors that explain air traffic controllers' (ATCO) vocational expertise and excellence.

•The study analyzes the role of natural abilities, self-regulative abilities and environmental conditions in ATCOs' vocational development.


(Pylväs, Nokelainen & Roisko, in press.)

Page 68:

Modeling of Vocational Excellence in Air Traffic Control

•The target population of the study consisted of ATCOs in Finland (N=300), of whom 28, representing four different airports, were interviewed.

•The research data also included the interviewees' aptitude test scores, study records and employee assessments.


Page 69:

Modeling of Vocational Excellence in Air Traffic Control

•The research questions were examined by using theoretical concept analysis.

•The qualitative data analysis was conducted with content analysis and Bayesian classification modeling.


Page 70:

Modeling of Vocational Excellence in Air Traffic Control


Page 71:

Modeling of Vocational Excellence in Air Traffic Control

(RQ1a)

What are the differences in characteristics between the air traffic controllers representing vocational expertise and vocational excellence?


Page 72:

Modeling of Vocational Excellence in Air Traffic Control

"…the natural ambition of wanting to be good. Air traffic controllers have perhaps generally a strong professional pride."

”Interesting and rewarding work, that is the basis of wanting to stay in this work until retiring.”


Page 73:

Modeling of Vocational Excellence in Air Traffic Control

•"I read all the regulations and instructions carefully and precisely, and try to think …the majority wave aside of them. It reflects on work."

"…but still I consider myself more precise than the majority […]a bad air traffic controller have delays, good air traffic controllers do not have delays which is something that also pilots appreciate because of the strict time limits.”


Page 74:

Modeling of Vocational Excellence in Air Traffic Control


Page 75:

Modeling of Vocational Excellence in Air Traffic Control


Page 76:

Modeling of Vocational Excellence in Air Traffic Control


Page 77:

Modeling of Vocational Excellence in Air Traffic Control


Page 78:

Modeling of Vocational Excellence in Air Traffic Control


Page 79:

Classification accuracy 89%.

Page 80:

Modeling of Vocational Excellence in Air Traffic Control


Page 81:

Modeling of Vocational Excellence in Air Traffic Control


Page 82:

Outline

• Research Overview
• Introduction to Bayesian Modeling
• Investigating Non-linearities with Bayesian Networks
• Bayesian Classification Modeling
• Bayesian Dependency Modeling
• Bayesian Unsupervised Model-based Visualization

Page 83:

B-Course

BCM = Bayesian Classification Modeling

BDM = Bayesian Dependency Modeling

BUMV = Bayesian Unsupervised Model-based Visualization

Page 84:

Bayesian Dependency Modeling

• Bayesian dependency modeling (BDM) is applied to examine dependencies between variables through both their visual representation and the probability ratio of each dependency.

• Graphical visualization of a Bayesian network contains two components:
– 1) Observed variables, visualized as ellipses.
– 2) Dependencies, visualized as lines (arcs) between the nodes.

[Figure: example network with nodes Var 1, Var 2 and Var 3.]

Page 85:

C_Example 4: Calculation of Bayesian Score

• Bayesian score (BS), that is, the probability of the model P(M|D), allows the comparison of different models.

Figure 9. An Example of Two Competing Bayesian Network Structures

(Nokelainen, 2008, p. 121.)

Page 86:

• Let us assume that we have the following data:

x1 x2
1  1
1  1
2  2
1  2
1  1

• Model 1 (M1) represents the two variables, x1 and x2, without a statistical dependency, and Model 2 (M2) represents the two variables with a dependency (i.e., with a connecting arc).
– The binomial data might be the result of an experiment where the five participants have drunk a nice cup of tea before (x1) and after (x2) a test of geographic knowledge.

C_Example 4: Calculation of Bayesian Score

Page 87:

• In order to calculate P(M1,2 | D), we need to solve P(D | M1,2) for the two models M1 and M2.
– The probability of the data given the model is solved by using the following marginal likelihood equation (Congdon, 2001, p. 473; Myllymäki, Silander, Tirri, & Uronen, 2001; Myllymäki & Tirri, 1998, p. 63):

P(D | M) = ∏ (i = 1..n) ∏ (j = 1..qi) [ Γ(N'ij) / Γ(Nij + N'ij) ] · ∏ (k = 1..ri) [ Γ(Nijk + N'ijk) / Γ(N'ijk) ]

C_Example 4: Calculation of Bayesian Score

Page 88:

• In Equation 4, the following symbols are used:
– n is the number of variables (i indexes the variables from 1 to n);
– ri is the number of values of the i:th variable (k indexes these values from 1 to ri);
– qi is the number of possible configurations of the parents of the i:th variable;

• The marginal likelihood equation produces a Bayesian Dirichlet score that allows model comparison (Heckerman et al., 1995; Tirri, 1997; Neapolitan & Morris, 2004).


– Nij describes the number of rows in the data that have the j:th configuration for the parents of the i:th variable;
– Nijk describes how many rows that have the k:th value for the i:th variable also have the j:th configuration for the parents of the i:th variable;
– N' is the equivalent sample size, set to be the average number of values divided by two.

C_Example 4: Calculation of Bayesian Score
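The equation translates quite directly into code. Below is a minimal Python sketch of the marginal likelihood, working in log space with math.lgamma for numerical stability; it assumes the conventions stated above (N' = average number of values / 2, N'ij = N'/qi, N'ijk = N'/(qi · ri)), and all names are mine rather than B-Course's API:

```python
from math import lgamma, exp
from collections import Counter
from itertools import product

def log_marginal_likelihood(data, parents, cards):
    """data: list of tuples, values coded 1..r_i per variable;
    parents: dict {variable index: list of parent indexes};
    cards: list of r_i (number of values) for each variable."""
    n_prime = (sum(cards) / len(cards)) / 2.0       # N'
    total = 0.0
    for i, r_i in enumerate(cards):
        pa = parents.get(i, [])
        q_i = 1
        for p in pa:
            q_i *= cards[p]                         # q_i: parent configurations
        n_ij_prime = n_prime / q_i                  # N'_ij
        n_ijk_prime = n_prime / (q_i * r_i)         # N'_ijk
        for cfg in product(*[range(1, cards[p] + 1) for p in pa]):
            rows = [r for r in data
                    if all(r[p] == v for p, v in zip(pa, cfg))]
            counts = Counter(r[i] for r in rows)    # N_ijk for each value k
            total += lgamma(n_ij_prime) - lgamma(len(rows) + n_ij_prime)
            for k in range(1, r_i + 1):
                total += (lgamma(counts.get(k, 0) + n_ijk_prime)
                          - lgamma(n_ijk_prime))
    return total
```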

Page 89:

• First, P(D|M1) is calculated given the values of variable x1:

P(D | M1)x1 = [Γ(N'ij) / Γ(Nij + N'ij)] · ∏k [Γ(Nijk + N'ijk) / Γ(N'ijk)]
= [Γ(1.00) / Γ(1.00 + 5)] · [Γ(0.50 + 4) / Γ(0.50)] · [Γ(0.50 + 1) / Γ(0.50)]
≈ 0.008 · 6.563 · 0.500 ≈ 0.027

(Here N'ij = (2/2)/1 = 1.00 and N'ijk = (2/2)/(2·1) = 0.50.)

C_Example 4: Calculation of Bayesian Score

Page 90:

• Second, the values for x2 are calculated:

P(D | M1)x2 = [Γ(1.00) / Γ(1.00 + 5)] · [Γ(0.50 + 3) / Γ(0.50)] · [Γ(0.50 + 2) / Γ(0.50)]
≈ 0.008 · 1.875 · 0.750 ≈ 0.012

C_Example 4: Calculation of Bayesian Score

Page 91:

• The BS, the probability of the first model P(M1 | D), is 0.027 · 0.012 ≈ 0.000324.

C_Example 4: Calculation of Bayesian Score

Page 92:

• Third, P(D|M2) is calculated given the values of variable x1:

P(D | M2)x1 = [Γ(1.00) / Γ(1.00 + 5)] · [Γ(0.50 + 4) / Γ(0.50)] · [Γ(0.50 + 1) / Γ(0.50)]
≈ 0.008 · 6.563 · 0.500 ≈ 0.027

C_Example 4: Calculation of Bayesian Score

Page 93:

• Fourth, the values of x2 for the first parent configuration (x1 = 1) are calculated:

[Γ(0.50) / Γ(0.50 + 4)] · [Γ(0.25 + 3) / Γ(0.25)] · [Γ(0.25 + 1) / Γ(0.25)]
≈ 0.152 · 0.703 · 0.250 ≈ 0.027

C_Example 4: Calculation of Bayesian Score

Page 94:

• Fifth, the values of x2 for the second parent configuration (x1 = 2) are calculated:

[Γ(0.50) / Γ(0.50 + 1)] · [Γ(0.25 + 0) / Γ(0.25)] · [Γ(0.25 + 1) / Γ(0.25)]
≈ 2.000 · 1.000 · 0.250 = 0.500

C_Example 4: Calculation of Bayesian Score

Page 95:

• The BS, the probability of the second model P(M2 | D), is 0.027 · 0.027 · 0.500 ≈ 0.000365.

C_Example 4: Calculation of Bayesian Score

Page 96:

• Bayes' theorem enables the calculation of the ratio of the two models, M1 and M2.
– As both models share the same a priori probability, P(M1) = P(M2), the prior probabilities cancel out.
– Also, the probability of the data P(D) cancels out in the following equation, as it appears in both formulas in the same position:

P(M1 | D) / P(M2 | D) = [P(D | M1) · P(M1) / P(D)] / [P(D | M2) · P(M2) / P(D)]
= 0.000324 / 0.000365 ≈ 0.88

C_Example 4: Calculation of Bayesian Score
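With the log_marginal_likelihood() sketch from slide 88, the whole comparison above can be reproduced (the data coding is mine):

```python
data = [(1, 1), (1, 1), (2, 2), (1, 2), (1, 1)]   # the five tea-test rows
cards = [2, 2]

log_m1 = log_marginal_likelihood(data, {}, cards)       # M1: no arc
log_m2 = log_marginal_likelihood(data, {1: [0]}, cards) # M2: arc x1 -> x2

print(exp(log_m1))             # ~0.00032, i.e. 0.027 * 0.012
print(exp(log_m2))             # ~0.000366, i.e. 0.027 * 0.027 * 0.500
print(exp(log_m1 - log_m2))    # ratio P(M1|D)/P(M2|D) ~ 0.88
```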

Page 97:

• The result of the model comparison shows that since the ratio is less than 1, M2 is more probable than M1.

• This result becomes explicit when we investigate the sample data more closely.

• Even a sample this small (n = 5) shows that there is a clear tendency between the values of x1 and x2 (four out of five value pairs are identical).

x1 x2
1  1
1  1
2  2
1  2
1  1

C_Example 4: Calculation of Bayesian Score

Page 98:

• How many models are there?

2^(n(n−1)/2)
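A quick check of how fast this count grows (the formula counts the possible pairwise arcs among n variables):

```python
for n in (2, 3, 5, 10):
    print(n, 2 ** (n * (n - 1) // 2))
# 2 -> 2, 3 -> 8, 5 -> 1024, 10 -> 35184372088832
```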

Page 99:

For an example of practical use of BDM, see Nokelainen and Tirri (2010).

Page 100:

Our hypothesis regarding the first research question was that intrinsic goal orientation (INT) is positively related to moral judgment (Batson & Thompson, 2001; Kunda & Schwartz, 1983). It was also hypothesized, based on Blasi’s (1999) argumentation that emotions cannot be predictors of moral action, that fear of failure (affective motivational section) is not related to moral judgment. Research evidence showed support for both hypotheses: firstly, only intrinsic motivation was directly (positively) related to moral judgment, and secondly, affective motivational section was not present in the predictive model.

(Nokelainen & Tirri, 2010.)

Page 101:

Conditioning on the three levels of moral judgment showed that there is a positive statistical relationship between moral judgment and intrinsic goal orientation. The probability of belonging to the most highly intrinsically motivated group three (M = 3.7 – 5.0) increases from 15 per cent to 90 per cent alongside the moral judgment abilities. There is also a similar but less steep increase in extrinsic goal orientation (from 5% to 12%), but we believe that it is mostly tied to the increase in intrinsic goal orientation.

(Nokelainen & Tirri, 2010.)

Page 102:

For an example of practical use of BDM, see Nokelainen and Tirri (2007).

Page 103:

(Nokelainen & Tirri, 2007.)

Page 104:

In conflict situations, my superior is able to draw out all parties and understand the differing perspectives.

My superior sees other people in positive rather than in negative light.

My superior has an optimistic "glass half full" outlook.

(Nokelainen & Tirri, 2007.)

Page 105:

EL_iv_17_49 “In conflict situations, my superior is able to draw out all parties and understand the differing perspectives.”EL_ii_09_26 “My superior sees other people in positive rather than in negative light.” EL_ii_09_25 “My superior has an optimistic "glass half full" outlook.”

21% vs. 78%

2% vs. 90%

(Nokelainen & Tirri, 2007.)

Page 106:

EL_iv_17_49 “In conflict situations, my superior is able to draw out all parties and understand the differing perspectives.”EL_ii_09_26 “My superior sees other people in positive rather than in negative light.” EL_ii_09_25 “My superior has an optimistic "glass half full" outlook.”

66%

69%

(Nokelainen & Tirri, 2007.)

Page 107:

EL_iv_17_49 “In conflict situations, my superior is able to draw out all parties and understand the differing perspectives.”EL_ii_09_26 “My superior sees other people in positive rather than in negative light.” EL_ii_09_25 “My superior has an optimistic "glass half full" outlook.”

85%

95%

(Nokelainen & Tirri, 2007.)

Page 108:

Outline

• Overview
• Introduction to Bayesian Modeling
• Bayesian Classification Modeling
• Bayesian Dependency Modeling
• Bayesian Unsupervised Model-based Visualization

Page 109:

BayMiner

BCM = Bayesian Classification Modeling

BDM = Bayesian Dependency Modeling

BUMV = Bayesian Unsupervised Model-based Visualization

Page 110:

Bayesian Unsupervised Model-based Visualization

[Figure: taxonomy of analysis techniques. SUPERVISED: LDA, BSMV. UNSUPERVISED: visualization techniques, cluster analysis, EFA, discrete multivariate analysis; under visualization techniques, projection techniques split into non-reducing and reducing, the latter into linear (PCA, projection pursuit) and non-linear (MDS, neural networks, SOM, principal curves, ICA, BUMV).]

Page 111:

Bayesian Unsupervised Model-based Visualization

• Supervised techniques, for example, linear discriminant analysis (LDA) and supervised Bayesian networks (BSMV, see Kontkanen, Lahtinen, Myllymäki, Silander & Tirri, 2000) assume a given structure (Venables & Ripley, 2002, p. 301).

• Unsupervised techniques, for example exploratory factor analysis (EFA), discover variable structure from the evidence of the data matrix.

• Unsupervised techniques are further divided into four sub-categories: 1) visualization techniques; 2) cluster analysis; 3) factor analysis; 4) discrete multivariate analysis.

Page 112:

Bayesian Unsupervised Model-based Visualization

[Figure: top level of the taxonomy. SUPERVISED: LDA, BSMV. UNSUPERVISED: visualization techniques, cluster analysis, EFA, discrete multivariate analysis.]

Page 113:

• According to Venables and Ripley (id.), visualization techniques are often more effective than clustering techniques at discovering interesting groupings in the data, and they avoid the danger of over-interpretation of the results, as the researcher is not allowed to input the number of expected latent dimensions.

• In cluster analysis, the centroids that represent the clusters are still high-dimensional, and some additional illustration techniques are needed for visualization (Kaski, 1997), for example MDS (Kim, Kwon & Cook, 2000).

Bayesian Unsupervised Model-based Visualization

Page 114:

• Several graphical means have been proposed for visualizing high-dimensional data items directly, by letting each dimension govern some aspect of the visualization and then integrating the results into one figure.

• These techniques can be used to visualize any kinds of high-dimensional data vectors, either the data items themselves or vectors formed of some descriptors of the data set like the five-number summaries (Tukey, 1977).

Bayesian Unsupervised Model-based Visualization

Page 115:

• The simplest technique to visualize a data set is to plot a “profile” of each item, that is, a two-dimensional graph in which the dimensions are enumerated on the x-axis and the corresponding values are shown on the y-axis.

• Other alternatives are scatter plots and pie diagrams.

Bayesian Unsupervised Model-based Visualization

Page 116:

• The major drawback that applies to all these techniques is that they do not reduce the amount of data.
– If the data set is large, a display portraying all the data items separately will be incomprehensible. (Kaski, 1997.)

• Techniques reducing the dimensionality of the data items are called projection techniques.

Bayesian Unsupervised Model-based Visualization

Page 117:

Bayesian Unsupervised Model-based Visualization

[Figure: the taxonomy with projection techniques added under visualization techniques, divided into non-reducing and reducing.]

Page 118:

• The goal of the projection is to represent the input data items in a lower-dimensional space in such a way that certain properties of the structure of the data set are preserved as faithfully as possible.
– The projection can be used to visualize the data set if a sufficiently small output dimensionality is chosen. (id.)

• Projection techniques are divided into two major groups, linear and non-linear projection techniques.

Bayesian Unsupervised Model-based Visualization

Page 119:

Bayesian Unsupervised Model-based Visualization

[Figure: the taxonomy with reducing projection techniques divided into linear and non-linear.]

Page 120:

• Linear projection techniques consist of principal component analysis (PCA) and projection pursuit.
– In exploratory projection pursuit (Friedman, 1987) the data is projected linearly, but a projection is sought that reveals as much of the non-normally distributed structure of the data set as possible.
– This is done by assigning a numerical “interestingness” index to each possible projection and maximizing that index (see the sketch below).
– The definition of interestingness is based on how much the projected data deviates from normally distributed data in the main body of its distribution.
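• A toy sketch of this idea (not Friedman's actual index): score random one-dimensional projections by their deviation from normality, here measured by the absolute excess kurtosis of the projected data, and keep the most “interesting” direction.

    import numpy as np

    rng = np.random.default_rng(1)
    # Made-up data: a normal cloud plus a shifted cluster, so some
    # projection directions are clearly non-normal.
    X = np.vstack([rng.normal(size=(200, 5)),
                   rng.normal(loc=4.0, size=(50, 5))])

    def interestingness(z):
        z = (z - z.mean()) / z.std()
        return abs((z ** 4).mean() - 3.0)  # |excess kurtosis|

    best_dir, best_score = None, -np.inf
    for _ in range(1000):
        w = rng.normal(size=X.shape[1])
        w /= np.linalg.norm(w)             # random unit direction
        score = interestingness(X @ w)
        if score > best_score:
            best_dir, best_score = w, score

    print(best_score, best_dir)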




• Non-linear unsupervised projection techniques consist of multidimensional scaling, principal curves and various other techniques including SOM, neural networks and Bayesian unsupervised networks (Kontkanen, Lahtinen, Myllymäki & Tirri, 2000).


[Figure: the complete taxonomy of visualization techniques. Projection techniques divide into linear (PCA, projection pursuit) and non-linear (MDS, SOM, principal curves, ICA, neural networks) approaches; LDA and BSMV occupy the supervised side, and BUMV is placed among the non-linear unsupervised projection techniques.]


• The aforementioned PCA technique, despite its popularity, cannot take into account non-linear structures, structures consisting of arbitrarily shaped clusters, or curved manifolds, since it describes the data in terms of a linear subspace.

• Projection pursuit tries to express some non-linearities, but if the data set is high-dimensional and highly non-linear it may be difficult to visualize it with linear projections onto a low-dimensional display even if the “projection angle” is chosen carefully (Friedman, 1987).


• Several approaches have been proposed for reproducing non-linear higher-dimensional structures on a lower-dimensional display.

• The most common techniques allocate a representation for each data point in the lower-dimensional space and try to optimize these representations so that the distances between them would be as similar as possible to the original distances of the corresponding data items.

• The techniques differ in how the different distances are weighted and how the representations are optimized. (Kaski, 1997.)


• Multidimensional scaling (MDS) is not one specific tool; rather, it refers to a group of techniques that is widely used, especially in the behavioral, econometric, and social sciences, to analyze subjective evaluations of pairwise similarities of entities.

• The starting point of MDS is a matrix consisting of the pairwise dissimilarities of the entities.

• The basic idea of the MDS technique is to approximate the original set of distances with distances corresponding to a configuration of points in a Euclidean space.
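• A minimal sketch of this idea, using scikit-learn and a made-up dissimilarity matrix:

    import numpy as np
    from sklearn.manifold import MDS

    # Hypothetical pairwise dissimilarities between four entities
    # (symmetric, zero diagonal), e.g. subjective ratings.
    D = np.array([[0.0, 1.0, 2.0, 3.0],
                  [1.0, 0.0, 1.5, 2.5],
                  [2.0, 1.5, 0.0, 1.0],
                  [3.0, 2.5, 1.0, 0.0]])

    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    coords = mds.fit_transform(D)  # one 2-D point per entity;
    print(coords)                  # Euclidean distances approximate D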


• MDS can be considered to be an alternative to factor analysis.

• In general, the goal of the analysis is to detect meaningful underlying dimensions that allow the researcher to explain observed similarities or dissimilarities (distances) between the investigated objects.

• In factor analysis, the similarities between objects (e.g., variables) are expressed in the correlation matrix.


• With MDS we may analyze any kind of similarity or dissimilarity matrix, not only correlation matrices, specifying the number of dimensions in which we want to reproduce the distances.

• After the matrix has been formed, MDS attempts to arrange the “objects” (e.g., factors of the growth-oriented atmosphere) in a space with the specified number of dimensions so as to reproduce the observed distances.

• As a result, the distances are explained in terms of underlying dimensions.


• MDS based on Euclidean distance does not generally reflect the properties of complex problem domains properly.

• In real-world situations the similarity of two vectors is not a universal property; from different points of view the same two vectors may in the end appear quite dissimilar (Kontkanen, Lahtinen, Myllymäki, Silander & Tirri, 2000).

• Another problem with the MDS techniques is that they are computationally very intensive for large data sets.


• Bayesian unsupervised model-based visualization (BUMV) is based on Bayesian Networks (BN).

• A BN is a representation of a probability distribution over a set of random variables. It consists of a directed acyclic graph (DAG), where the nodes correspond to domain variables and the arcs define a set of independence assumptions which allow the joint probability distribution for a data vector to be factorized as a product of simple conditional probabilities.
• Two vectors are considered similar if they lead to similar predictions when given as input to the same Bayesian network model (Kontkanen, Lahtinen, Myllymäki, Silander & Tirri, 2000); a toy sketch follows below.
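• A toy Python sketch of both points, with assumed probabilities: a three-node DAG C -> X1, C -> X2 factorizes the joint as P(C, X1, X2) = P(C)P(X1|C)P(X2|C), and two data vectors can be compared through the predictions P(C | X1, X2) they lead to.

    # Assumed (made-up) conditional probability tables.
    p_c = {0: 0.6, 1: 0.4}                              # P(C)
    p_x1 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}   # P(X1 | C)
    p_x2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}   # P(X2 | C)

    def joint(c, x1, x2):
        # DAG factorization: product of simple conditional probabilities.
        return p_c[c] * p_x1[c][x1] * p_x2[c][x2]

    def predict_c(x1, x2):
        # Posterior P(C | X1=x1, X2=x2) from the factorized joint.
        scores = {c: joint(c, x1, x2) for c in p_c}
        z = sum(scores.values())
        return {c: s / z for c, s in scores.items()}

    # Vectors that lead to similar predictions count as similar.
    print(predict_c(0, 0))  # {0: 0.9, 1: 0.1}
    print(predict_c(1, 1))  # {0: 0.067, 1: 0.933} (approximately)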


• Naturally, there are numerous viable alternatives to BUMV, such as the Self-Organizing Map (SOM) and Independent Component Analysis (ICA).

• SOM is a neural network algorithm that has been used for a wide variety of applications, mostly for engineering problems but also for data analysis (Kohonen, 1995).
– SOM is based on a neighborhood-preserving topological map tuned according to the geometric properties of the sample vectors.

• ICA minimizes the statistical dependence of the components, trying to find a transformation in which the components are as statistically independent as possible (Hyvärinen & Oja, 2000).
– The usage of ICA is comparable to PCA, where the aim is to present the data in a manner that facilitates further analysis (see the sketch below).
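• A minimal ICA sketch with made-up signals, using scikit-learn's FastICA: two sources are linearly mixed and then unmixed so that the recovered components are as statistically independent as possible.

    import numpy as np
    from sklearn.decomposition import FastICA

    rng = np.random.default_rng(2)
    t = np.linspace(0, 8, 2000)
    s1 = np.sin(2 * t)                        # source 1: sinusoid
    s2 = np.sign(np.sin(3 * t))               # source 2: square wave
    S = np.c_[s1, s2] + 0.05 * rng.normal(size=(2000, 2))

    A = np.array([[1.0, 0.5], [0.5, 1.0]])    # mixing matrix
    X = S @ A.T                               # observed mixtures

    ica = FastICA(n_components=2, random_state=0)
    S_hat = ica.fit_transform(X)              # estimated sources
    print(S_hat.shape)                        # (2000, 2)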


• The first major difference between the Bayesian and neural network approaches, from an educational science researcher's point of view, is that the former operates on a familiar symmetrical probability range from 0 to 1, while the upper limit of the asymmetrical probability scale in the latter approach is unknown.

• The second fundamental difference between the two types of networks is that a perceptron in the hidden layers of neural networks does not in itself have an interpretation in the domain of the system, whereas all the nodes of a Bayesian network represent concepts that are well defined with respect to the domain (Jensen, 1995).


• The meaning of a node and its probability table can be subject to discussion, regardless of their function in the network, but it does not make any sense to discuss the meaning of the nodes and the weights in a neural network: Perceptrons in the hidden layers only have a meaning in the context of the functionality of the network.

• Construction of a Bayesian network requires detailed knowledge of the domain in question.
– If such knowledge can only be obtained through a series of examples (i.e., a database of cases), neural networks seem to be an easier approach. This might be true in cases such as the reading of handwritten letters, face recognition, and other areas where the activity is a ‘craftsman-like’ skill based solely on experience.


(Jensen, 1995.)


• It is often criticized that in order to construct a Bayesian network you have to ‘know’ too many probabilities.
– However, there is not a considerable difference between this number and the number of weights and thresholds that have to be ‘known’ in order to build a neural network, and these can only be learnt by training.

• A weakness of neural networks is that you are unable to utilize the knowledge you might have in advance.

• Probabilities, on the other hand, can be assessed using a combination of theoretical insight, empiric studies independent of the constructed system, training, and various more or less subjective estimates.


(Jensen, 1995.)


• In the construction of a neural network, it is decided in advance about which relations information is gathered, and which relations the system is expected to compute (the route of inference is fixed).

• Bayesian networks are much more flexible in that respect.


(Jensen, 1995.)


For an example of practical use of BUMV, see Nokelainen and Ruohotie (2009).


Results showed that managers and teachers had higher growth motivation and a higher level of commitment to work than other personnel, including job titles such as cleaner, caretaker, accountant and computer support. Employees across all job titles in the organization who had temporary or part-time contracts had higher self-reported growth motivation and commitment to work and organization than their established colleagues.


Links

• B-Course http://b-course.cs.helsinki.fi

• BayMiner http://www.bayminer.com


References

• Anderson, J. (1995). Cognitive Psychology and its Implications. New York: Freeman.

• Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society, 53, 370-418.

• Bernardo, J., & Smith, A. (2000). Bayesian theory. New York: Wiley.

• Congdon, P. (2001). Bayesian Statistical Modelling. Chichester: John Wiley & Sons.

• Friedman, J. (1987). Exploratory Projection Pursuit. Journal of American Statistical Association, 82, 249-266.

• Gigerenzer, G. (2000). Adaptive thinking. New York: Oxford University Press.

• Gigerenzer, G., Krauss, S., & Vitouch, O. (2004). The null ritual: What you always wanted to know about significance testing but were afraid to ask. In D. Kaplan (Ed.), The SAGE handbook of quantitative methodology for the social sciences (pp. 391-408). Thousand Oaks: Sage.


• Gill, J. (2002). Bayesian methods. A Social and Behavioral Sciences Approach. Boca Raton: Chapman & Hall/CRC.

• Heckerman, D., Geiger, D., & Chickering, D. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3), 197-243.

• Hilario, M., Kalousisa, A., Pradosa, J., & Binzb, P.-A. (2004). Data mining for mass-spectra based diagnosis and biomarker discovery. Drug Discovery Today: BIOSILICO, 2(5), 214-222.

• Huberty, C. (1994). Applied Discriminant Analysis. New York: John Wiley & Sons.

• Hyvärinen, A., & Oja, E. (2000). Independent Component Analysis: Algorithms and Applications. Neural Networks, 13(4-5), 411-430.

• Jensen, F. V. (1995). Paradigms of Expert Systems. HUGIN Lite 7.4 User Manual.


• Kaski, S. (1997). Data exploration using self-organizing maps. Doctoral dissertation. Acta Polytechnica Scandinavica, Mathematics, Computing and Management in Engineering Series No. 82. Espoo: Finnish Academy of Technology.

• Kim, S., Kwon, S., & Cook, D. (2000). Interactive Visualization of Hierarchical Clusters Using MDS and MST. Metrika, 51(1), 39–51.

• Kohonen, T. (1995). Self-Organizing Maps. Berlin: Springer.

• Kontkanen, P., Lahtinen, J., Myllymäki, P., Silander, T., & Tirri, H. (2000). Supervised Model-based Visualization of High-dimensional Data. Intelligent Data Analysis, 4, 213-227.

• Kontkanen, P., Lahtinen, J., Myllymäki, P., & Tirri, H. (2000). Unsupervised Bayesian Visualization of High-Dimensional Data. In R. Ramakrishnan, S. Stolfo, R. Bayardo, & I. Parsa (Eds.), Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (pp. 325-329). New York, NY: The Association for Computing Machinery.


• Lavine, M. L. (1999). What is Bayesian Statistics and Why Everything Else is Wrong. The Journal of Undergraduate Mathematics and Its Applications, 20, 165-174.

• Lindley, D. V. (1971). Making Decisions. London: Wiley.

• Lindley, D. V. (2001). Harold Jeffreys. In C. C. Heyde & E. Seneta (Eds.), Statisticians of the Centuries (pp. 402-405). New York: Springer.

• Murphy, K. R., & Myors, B. (1998). Statistical Power Analysis. A Simple and General Model for Traditional and Modern Hypothesis Tests. Mahwah, NJ: Lawrence Erlbaum Associates.

• Myllymäki, P., Silander, T., Tirri, H., & Uronen, P. (2002). B-Course: A Web-Based Tool for Bayesian and Causal Data Analysis. International Journal on Artificial Intelligence Tools, 11(3), 369-387.

• Myllymäki, P., & Tirri, H. (1998). Bayes-verkkojen mahdollisuudet [Possibilities of Bayesian Networks]. Teknologiakatsaus 58/98. Helsinki: TEKES.


• Neapolitan, R. E., & Morris, S. (2004). Probabilistic Modeling Using Bayesian Networks. In D. Kaplan (Ed.), The SAGE handbook of quantitative methodology for the social sciences (pp. 371-390). Thousand Oaks, CA: Sage.

• Nokelainen, P. (2008). Modeling of Professional Growth and Learning: Bayesian Approach. Tampere: Tampere University Press.

• Nokelainen, P., & Ruohotie, P. (2009). Investigating Growth Prerequisites in a Finnish Polytechnic for Higher Education. Journal of Workplace Learning, 21(1), 36-57.

• Nokelainen, P., Silander, T., Ruohotie, P., & Tirri, H. (2007). Investigating the Number of Non-linear and Multi-modal Relationships Between Observed Variables Measuring a Growth-oriented Atmosphere. Quality & Quantity, 41(6), 869-890.

• Nokelainen, P., & Tirri, K. (2007). Empirical Investigation of Finnish School Principals' Emotional Leadership Competencies. In S. Saari & T. Varis (Eds.), Professional Growth (pp. 424-438). Hämeenlinna: RCVE.


• Nokelainen, P., Ruohotie, P., & Tirri, H. (1999). Professional Growth Determinants - Comparing Bayesian and Linear Approaches to Classification. In P. Ruohotie, H. Tirri, P. Nokelainen, & T. Silander (Eds.), Modern Modeling of Professional Growth, vol. 1 (pp. 85-120). Hämeenlinna: RCVE.

• Nokelainen, P., & Tirri, K. (2010). Role of Motivation in the Moral and Religious Judgment of Mathematically Gifted Adolescents. High Ability Studies, 21(2), 101-116.

• Nokelainen, P., Tirri, K., Campbell, J. R., & Walberg, H. (2004). Cross-cultural Factors that Account for Adult Productivity. In J. R. Campbell, K. Tirri, P. Ruohotie, & H. Walberg (Eds.), Cross-cultural Research: Basic Issues, Dilemmas, and Strategies (pp. 119-139). Hämeenlinna: RCVE.

• Nokelainen, P., Tirri, K., & Merenti-Välimäki, H.-L. (2007). Investigating the Influence of Attribution Styles on the Development of Mathematical Talent. Gifted Child Quarterly, 51(1), 64-81.

• Pylväs, L., Nokelainen, P., & Roisko, H. (in press). Modeling of Vocational Excellence in Air Traffic Control. Submitted for review.


• Tirri, H. (1997). Plausible Prediction by Bayesian Inference. Department of Computer Science. Series of Publications A. Report A-1997-1. Helsinki: University of Helsinki.

• Tukey, J. (1977). Exploratory Data Analysis. Reading, MA: Addison-Wesley.

• Venables, W. N., & Ripley, B. D. (2002). Modern Applied Statistics with S. Fourth edition. New York: Springer.
