Tree Analysis – A Method for Constructing Edit Groups

Work Session on Statistical Data Editing
Oslo, Norway, 24-26 September 2012
By Anders Norberg, Statistics Sweden

Said about Tree Analysis

• Trees do not supersede other modeling techniques

• Different techniques do better with different data and in the hands of different analysts

• However, the winning technique is generally not known until all the contenders get a chance

• Trees are easy

A Tree

Trees provide a series of if-then rules.

Each rule assigns an observation to one segment of the tree, at which point another if-then rule is applied.

The initial segment, containing the entire data set, is the root node of the tree. The final nodes are called leaves. An intermediate node together with all its successors forms a branch of the tree.
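
The vocabulary above (root, if-then rules, leaves) can be sketched as a small data structure. This is a minimal illustration, not the tool used in the presentation: the `Node` class, the `assign` and `leaves` helpers and the field name `occup` are hypothetical, and the two rules are borrowed from the occupation example on the following slides.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    label: str
    # rule is None for a leaf; otherwise it sends an observation to `yes` or `no`.
    rule: Optional[Callable[[dict], bool]] = None
    yes: Optional["Node"] = None
    no: Optional["Node"] = None


def assign(node: Node, obs: dict) -> str:
    """Apply if-then rules from the root until a leaf is reached."""
    while node.rule is not None:
        node = node.yes if node.rule(obs) else node.no
    return node.label


def leaves(node: Node) -> list:
    """Collect the labels of all final nodes below (and including) `node`."""
    if node.rule is None:
        return [node.label]
    return leaves(node.yes) + leaves(node.no)


# Root node holds the whole data set; each rule uses one explanatory variable.
root = Node(
    "All",
    rule=lambda o: 1 <= o["occup"] <= 3,
    yes=Node(
        "Occup 1-3",
        rule=lambda o: o["occup"] == 1,
        yes=Node("Occup 1"),
        no=Node("Occup 2-3"),
    ),
    no=Node("Occup 4-9"),
)

print(assign(root, {"occup": 2}))
print(leaves(root))
```

Here the branch starting at "Occup 1-3" consists of that node plus its two successors, and the three nodes without a rule are the leaves.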

The Root

We have a dataset, preferably large; here, a sample of white-collar workers from 2008.

One variable is considered dependent; here, salary per hour.

All (N = 684 366; Average = 196,21 SEK)

First Split

The dataset is split into two by a simple rule containing one auxiliary (explanatory) variable:

If 1<=Occup<=3 then Node=2; else Node=1;

All (N = 684 366; Ave. = 196,21; Expl = 10,5%)
• Occup = 1-3 (N = 487 812; Ave. = 219,67)
• Occup = 4-9 (N = 196 554; Ave. = 138,00)
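
The first split can be checked with a few lines of arithmetic. The sketch below renders the rule in Python (the function name `first_split` is hypothetical) and verifies that the sizes and averages of the two child nodes reproduce the root node; the figures are copied from the slide.

```python
def first_split(occup: int) -> int:
    """Python rendering of: If 1<=Occup<=3 then Node=2; else Node=1;"""
    return 2 if 1 <= occup <= 3 else 1


# Node id -> (N, average hourly salary in SEK), as given on the slide.
children = {2: (487_812, 219.67), 1: (196_554, 138.00)}

# The children partition the root, so their sizes add up to the root size
# and their size-weighted average equals the root average.
n_total = sum(n for n, _ in children.values())
avg_total = sum(n * avg for n, avg in children.values()) / n_total

print(first_split(2), first_split(7))  # node ids for Occup = 2 and Occup = 7
print(n_total, round(avg_total, 2))    # compare with the root: 684 366 and 196,21
```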

Second Split

One of the two new datasets is split by the same method:

If 1<=Occup<=3 then do; if Occup=1 then Node=4; else Node=3; end;

All (N = 684 366; Ave. = 196,21; Expl = 10,5%)
• Occup = 1-3 (N = 487 812; Ave. = 219,67; Expl = 5,9%)
  • Occup = 1 (N = 72 957; Ave. = 298,00)
  • Occup = 2-3 (N = 414 855; Ave. = 205,90)
• Occup = 4-9 (N = 196 554; Ave. = 138,00)
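
A short sketch of how the two Expl figures relate, under one assumption that the slides do not state explicitly: that Expl is the between-group sum of squares created by a split, expressed as a share of the total sum of squares in the data. If so, the two Expl values should stand in the same ratio as the two between-group sums of squares. The function names are hypothetical, and the rule keeps Node=1 for Occup 4-9 from the first split.

```python
def second_split(occup: int) -> int:
    """Nested rule: nodes 4 and 3 inside Occup 1-3; node 1 kept from the first split."""
    if 1 <= occup <= 3:
        return 4 if occup == 1 else 3
    return 1


def between_ss(parent_avg, children):
    """Between-group sum of squares: sum of n * (child mean - parent mean)^2."""
    return sum(n * (avg - parent_avg) ** 2 for n, avg in children)


# First split of the root (figures from the slides).
b1 = between_ss(196.21, [(487_812, 219.67), (196_554, 138.00)])

# Second split of Occup 1-3. The size of Occup 2-3 is taken here as the
# remainder of its parent node, 487 812 - 72 957 (an assumption of this sketch).
n_occ1 = 72_957
n_occ23 = 487_812 - n_occ1
b2 = between_ss(219.67, [(n_occ1, 298.00), (n_occ23, 205.90)])

# Scale the first split's Expl (10,5 %) by the ratio of the two sums of squares.
implied_expl2 = 10.5 * b2 / b1
print(round(implied_expl2, 1))  # compare with Expl = 5,9 % on the slide
```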

Do it again and again…

All (N = 684 366; Ave. = 196,21; Expl = 10,5%)
• Occup = 1-3 (N = 487 812; Ave. = 219,67; Expl = 5,9%)
  • Occup = 1 (N = 72 957; Ave. = 298,00; Expl = 2,9%)
    • Occup = 121 (N = 3 574; Ave. = 562,37; Expl = 0,5%)
      • NUTS = 1 (N = 1 332; Ave. = 700,31)
      • NUTS > 1 (N = 2 242; Ave. = 480,42)
    • Occup = 'rest' (N = 69 383; Ave. = 284,38; Expl = 1,0%)
      • SNI1 = G, A, O, E, I, S, R, H, Q, P, N (N = 26 864; Ave. = 239,50)
      • SNI1 = K, J, B, C, F, M, D, L (N = 42 519; Ave. = 312,73)
  • Occup = 2-3 (N = 414 855; Ave. = 205,90; Expl = 1,2%)
    • Occup = 2 (N = 182 402; Ave. = 223,98; Expl = 1,1%)
      • Occup = 223, 224, 231-235, 243-246 (N = 38 044; Ave. = 178,85)
      • Occup = 'rest' (N = 144 358; Ave. = 235,87)
    • Occup = 3 (N = 232 453; Ave. = 191,71; Expl = 0,8%)
      • Gender = Women (N = 96 427; Ave. = 170,47)
      • Gender = Men (N = 136 026; Ave. = 206,76)
• Occup = 4-9 (N = 196 554; Ave. = 138,00)

…and again

All (N = 684 366; Ave. = 196,21; Expl = 10,5%)
• Occup = 1-3 (N = 487 812; Ave. = 219,67; Expl = 5,9%)
  • Occup = 1 (N = 72 957; Ave. = 298,00; Expl = 2,9%)
    • Occup = 121 (N = 3 574; Ave. = 562,37; Expl = 0,5%)
      • NUTS = 1 (N = 1 332; Ave. = 700,31; Expl = 0,4%)
        • SNI1 not 'K' (N = 1 210; Ave. = 646,34; Expl = 0,1%)
        • SNI1 = 'K' (N = 122; Ave. = 1 235,59; Expl = 0,1%)
      • NUTS > 1 (N = 2 242; Ave. = 480,42; Expl = 0,1%)
    • Occup = 'rest' (N = 69 383; Ave. = 284,38; Expl = 1,0%)
      • SNI1 = G, A, O, E, I, S, R, H, Q, P, N (N = 26 864; Ave. = 239,50; Expl = 0,3%)
        • NUTS > 1 (N = 16 532; Ave. = 216,85; Expl = 0,1%)
        • NUTS = 1 (N = 10 332; Ave. = 275,73; Expl = 0,2%)
      • SNI1 = K, J, B, C, F, M, D, L (N = 42 519; Ave. = 312,73; Expl = 0,8%)
        • NUTS > 1 (N = 30 198; Ave. = 286,17; Expl = 0,2%)
          • Occup = 123 (N = 15 920; Ave. = 306,14; Expl = 0,1%)
          • Occup = 'rest' (N = 14 278; Ave. = 263,91; Expl = 0,1%)
        • NUTS = 1 (N = 12 321; Ave. = 377,84; Expl = 0,3%)
          • Age = 18-39 (N = 3 358; Ave. = 303,17; Expl = 0,0%)
          • Age = 40-65 (N = 8 963; Ave. = 405,81; Expl = 0,2%)
  • Occup = 2-3 (N = 414 855; Ave. = 205,90; Expl = 1,2%)
    • Occup = 2 (N = 182 402; Ave. = 223,98; Expl = 1,1%)
      • Occup = 223, 224, 231-235, 243-246 (N = 38 044; Ave. = 178,85; Expl = 0,1%)
      • Occup = 'rest' (N = 144 358; Ave. = 235,87; Expl = 1,0%)
        • Age = 18-29 (N = 17 285; Ave. = 169,58; Expl = 0,0%)
        • Age = 30-65 (N = 127 073; Ave. = 244,89; Expl = 0,6%)
          • NUTS > 1 (N = 71 878; Ave. = 226,65; Expl = 0,2%)
          • NUTS = 1 (N = 55 195; Ave. = 268,64; Expl = 0,2%)
    • Occup = 3 (N = 232 453; Ave. = 191,71; Expl = 0,8%)
      • Gender = Women (N = 96 427; Ave. = 170,47; Expl = 0,2%)
      • Gender = Men (N = 136 026; Ave. = 206,76; Expl = 1,1%)
        • SNI1 not 'K' (N = 119 114; Ave. = 196,77; Expl = 0,4%)
          • Age = 18-29 (N = 13 578; Ave. = 150,42; Expl = 0,1%)
          • Age = 30-65 (N = 105 536; Ave. = 202,73; Expl = 0,3%)
        • SNI1 = 'K' (N = 16 912; Ave. = 277,14; Expl = 0,6%)
          • NUTS > 1 (N = 8 603; Ave. = 222,34; Expl = 0,1%)
          • NUTS = 1 (N = 8 309; Ave. = 333,89; Expl = 0,4%)
• Occup = 4-9 (N = 196 554; Ave. = 138,00; Expl = 0,2%)
  • Age = 18-29 (N = 46 965; Ave. = 120,39; Expl = 0,0%)
  • Age = 30-65 (N = 149 589; Ave. = 143,53; Expl = 0,2%)

This Tree

• The criterion for the best split is to minimize the within-group sum of squares around the group means

• Has 20 leaves

• Explains 30% of the total sum of squares in the data

• Leaves can be used as edit groups
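
The split criterion can be sketched as follows. This is a minimal illustration under stated assumptions, not the software behind the slides: it handles only one ordered variable with splits of the form "x <= cut", the function names are hypothetical, and the ten observations are made-up toy values rather than the salary data.

```python
def within_ss(values):
    """Sum of squared deviations around the group mean (0 for an empty group)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_threshold_split(x, y):
    """Try every cut point on x and keep the one with the smallest
    within-group sum of squares. Returns (cut, explained share of total SS)."""
    total = within_ss(y)
    best_cut, best_ss = None, total
    for cut in sorted(set(x))[:-1]:  # the largest value would leave one side empty
        left = [yi for xi, yi in zip(x, y) if xi <= cut]
        right = [yi for xi, yi in zip(x, y) if xi > cut]
        ss = within_ss(left) + within_ss(right)
        if ss < best_ss:
            best_cut, best_ss = cut, ss
    share = (total - best_ss) / total if total > 0 else 0.0
    return best_cut, share


# Toy data: occupation code (1-5) and hourly salary, invented for illustration.
occup = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
salary = [230, 220, 225, 215, 220, 210, 140, 135, 139, 138]

cut, share = best_threshold_split(occup, salary)
print(cut, round(share, 3))
```

Minimizing the within-group sum of squares is the same as maximizing the part of the total sum of squares that the split explains, which is why the share can be reported directly.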

Method

• Auxiliary variables are of four scales:
  – Nominal
  – Ordinal
  – Bivariate
  – Interval

• Splitting should be stopped when the analysis detects that no further gain can be made, or some pre-set stopping rules are met.

• Alternatively, the data are split as much as possible and then the tree is later pruned.

• Manual intervention is possible
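
Pre-set stopping rules can be sketched as a recursive procedure. The example below is an assumption-laden illustration rather than the method used in the presentation: it treats every variable as ordered (nominal variables and later pruning are not covered), the parameter names `min_size` and `min_share` are hypothetical, and the data are invented toy values.

```python
def ss(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, target, variables, min_size):
    """Best binary split 'var <= cut' by reduction in within-group SS, or None."""
    parent = ss([r[target] for r in rows])
    best = None
    for var in variables:
        for cut in sorted({r[var] for r in rows})[:-1]:
            left = [r[target] for r in rows if r[var] <= cut]
            right = [r[target] for r in rows if r[var] > cut]
            if min(len(left), len(right)) < min_size:
                continue  # stopping rule: both children must be large enough
            gain = parent - ss(left) - ss(right)
            if best is None or gain > best[0]:
                best = (gain, var, cut)
    return best


def grow(rows, target, variables, total_ss, min_size=2, min_share=0.05, path=()):
    """Split recursively; return the leaves as (rule path, rows) pairs."""
    split = best_split(rows, target, variables, min_size)
    # Stopping rule: no admissible split, or the gain is too small a share of total SS.
    if split is None or total_ss == 0 or split[0] / total_ss < min_share:
        return [(path, rows)]
    _, var, cut = split
    left = [r for r in rows if r[var] <= cut]
    right = [r for r in rows if r[var] > cut]
    return (
        grow(left, target, variables, total_ss, min_size, min_share,
             path + (f"{var} <= {cut}",))
        + grow(right, target, variables, total_ss, min_size, min_share,
               path + (f"{var} > {cut}",))
    )


# Toy data (occupation code, age, hourly salary), invented for illustration.
data = [(1, 45, 230), (1, 28, 220), (2, 50, 225), (2, 25, 215), (3, 40, 220),
        (3, 27, 210), (4, 52, 140), (4, 24, 135), (5, 47, 139), (5, 26, 138)]
rows = [{"occup": o, "age": a, "salary": s} for o, a, s in data]
total = ss([r["salary"] for r in rows])

leaves = grow(rows, "salary", ["occup", "age"], total)
for path, members in leaves:
    print(" and ".join(path), len(members))
```

Setting `min_share=0.0` approximates the alternative strategy of splitting as far as the size rule allows; the resulting larger tree would then be the input to a pruning step, which this sketch does not implement.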

1963