
Optimal allocation algorithm for a multi-way stratification design

P.D. Falorsi, P. Righi, Italian National Statistical Institute

NTTS 2011 Conference 22 – 24 February 2011, Brussels

Outline

Overview

Multi-way Sampling Design

Multi-way optimal allocation procedure

Monte Carlo simulation


Overview

Large scale surveys in Official Statistics usually produce estimates for a set of parameters by a huge number of highly detailed estimation domains

These domains generally define non-nested partitions of the target population

When the domain indicator variables are available at sampling frame level, we may plan a sample covering each domain


Overview

Why fix a sample size in each domain:

It allows direct estimators to be applied

When planning the sample, an evaluation of the sampling errors on the main estimates is possible

When the direct estimator is not reliable (small area problem), having units in the domains allows one to:

bound the bias of small area indirect estimators;

use models with specific small area effects.


Overview

Standard solution for fixing the sample sizes in domains belonging to two or more partitions:

Stratify the sample with strata given by the cross-classification of the variables defining the different partitions (cross-classification or one-way stratified design)


Overview

Main drawbacks:

Too detailed stratification
Risk of sample size explosion
Inefficient sample allocation
Risk of statistical burden


Overview

Some examples (1):

Inefficient sample allocation


Overview

Some examples (2): statistical burden

Strata distribution by number of enterprises in the Small and Medium Enterprises Survey (2003)

Enterprises    Absolute     Cumulative   %           % Cumulative
per stratum    frequency    frequency    Frequency   frequency
1                  4,700        4,700       18.7          18.7
2                  2,512        7,212       10.0          28.7
3-5                3,816       11,028       15.2          43.9
6-10               2,815       13,843       11.2          55.1
>10               11,286       25,129       44.9         100.0

Overview

Some examples (3):

Italian Graduates’ Career Survey. 2010 sample size: about 90,000 units

Number of domains in the non-nested partitions and number of cross-classified strata

                     Type of degree
                     3 years    Long
First partition          448     425
Second partition          94     198
Strata                 2,981   4,778

Sample size explosion

Inefficient sample allocation


Multi-way Sampling Design

Multi-way (or incomplete) stratification design (MWD) satisfies the sample allocation at domain level without cross-classification

The sizes of the combined (cross-classified) strata are random variables

Main problem of MWD: defining a random selection procedure



Use of the Cube Method (Deville and Tillé, 2004)

The method selects balanced samples in the model-assisted framework

A sample s is balanced on a set of auxiliary variables z (balancing variables) if the Horvitz-Thompson estimates of the totals of z equal the known population totals:

$$\hat{t}_{\mathbf{z},ht} = \sum_{k \in s} \frac{\mathbf{z}_k}{\pi_k} = \sum_{k \in U} \mathbf{z}_k = t_{\mathbf{z}}$$

MWD is a special case of balanced sampling

The method works well with a large population and a lot of domains
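The balancing condition can be checked numerically. The sketch below is only an illustration on invented toy data (8 units, one partition with two domains); the helper names `ht_totals` and `is_balanced` are hypothetical. It assumes the usual choice of balancing variables for fixed domain sizes, z_k = π_k δ_k, where δ_k is the vector of domain indicators, so that a balanced sample has exactly the expected number of units in each domain.

```python
import numpy as np

def ht_totals(z, pi, sample):
    """Horvitz-Thompson estimates of the column totals of z from a sample."""
    idx = np.asarray(sample)
    return (z[idx] / pi[idx][:, None]).sum(axis=0)

def is_balanced(z, pi, sample, tol=1e-9):
    """True if the HT estimates of the z totals equal the population totals."""
    return bool(np.allclose(ht_totals(z, pi, sample), z.sum(axis=0), atol=tol))

# Toy population (assumed data): units 0-3 in domain A, units 4-7 in domain B.
pi = np.array([0.5, 0.5, 0.5, 0.5, 0.25, 0.25, 0.25, 0.25])
delta = np.zeros((8, 2))
delta[:4, 0] = 1.0   # indicator of domain A
delta[4:, 1] = 1.0   # indicator of domain B

# Balancing variables for fixed domain sizes: z_k = pi_k * delta_k.
z = pi[:, None] * delta

print(z.sum(axis=0))                 # expected domain sample sizes
print(is_balanced(z, pi, [0, 2, 5])) # 2 units in A, 1 in B
print(is_balanced(z, pi, [0, 1, 2])) # 3 units in A, 0 in B
```

With these balancing variables the check reduces to counting sampled units per domain, which is why a multi-way design can be treated as a balanced sample.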


Multi-way optimal allocation procedure

o The aim of the work is to define a procedure for determining the optimal allocation, together with a selection method suitable for large scale surveys

o The procedure is based on three main steps

1. Sample allocation (optimization step, vector π)

o minimizes the overall sample size n while guaranteeing that the sampling variances are lower than prefixed precision thresholds

o deals with a multivariate-multidomain problem

2. Definition of the final inclusion probabilities (calibration step)

3. Sample selection (balancing step)



Notation and essential terms:

Domain b partition d:

Domain indicator variable:

Parameter of interest and estimator

Balancing variables



Variance approximation of balanced sampling:

With



1. Theoretical Constrained Optimization problem (optimization step):

Constraints



Equivalent problem:

Given

With

The inequality constraints are equivalent to

with



Issues of the theoretical optimization problem:

Solution by means of a modified Chromy algorithm that takes the constraints into account

Iterative procedure, because the unknown terms appear on both the left and the right side of the inequality constraints



Optimal allocation algorithm:

Give values to the unknown terms on the right side of the inequality (initialization values, or the values obtained in the previous iteration)

Keep these values fixed and use the modified Chromy algorithm to obtain the values on the left side

Iterate the modified Chromy algorithm until the convergence criterion is satisfied, using the left-side values of the previous iteration for the right side
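The iteration scheme above can be sketched as a fixed-point loop. The code below is a toy illustration, not the authors' algorithm: `inner_solve` is a closed-form stand-in for the modified Chromy step (it handles a single variance constraint only), `residuals` is an invented rule that makes the right-side terms depend on π, and the data and the threshold `v_max` are arbitrary assumptions.

```python
import numpy as np

def residuals(y, pi):
    """Toy stand-in for the unknown right-side terms: deviations of y from a
    mean weighted by 1/pi, so they change whenever pi changes."""
    w = 1.0 / pi
    return y - np.sum(w * y) / np.sum(w)

def inner_solve(eta, v_max):
    """Stand-in for the modified Chromy step (NOT Chromy's algorithm):
    closed-form minimiser of sum(pi) subject to the single constraint
    sum((1/pi - 1) * eta**2) <= v_max, with eta held fixed."""
    a = np.abs(eta)
    pi = a * a.sum() / (v_max + np.sum(eta ** 2))
    return np.clip(pi, 1e-6, 1.0)

def allocate(y, v_max, tol=1e-8, max_iter=100):
    pi = np.full(y.shape, 0.5)            # initialization values
    for it in range(1, max_iter + 1):
        eta = residuals(y, pi)            # right side: values from previous iteration
        new_pi = inner_solve(eta, v_max)  # left side: solved with eta kept fixed
        converged = np.max(np.abs(new_pi - pi)) < tol
        pi = new_pi
        if converged:                     # convergence criterion on pi
            break
    return pi, eta, it

# Arbitrary toy data and variance threshold (assumptions for illustration).
y = np.linspace(0.0, 10.0, 50) ** 2 / 10.0
pi, eta, n_iter = allocate(y, v_max=2000.0)
print(n_iter, pi.sum())
```

The loop stops either when π no longer changes or when `max_iter` is reached; the returned `eta` is the set of right-side values that the final π was solved against.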



Predicted Constrained Optimization problem:

In practice we do not know the term and must use a prediction

Given a superpopulation model, express

To take the uncertainty of the model into account, we replace the variance with the Anticipated Variance (AV)

An upward approximation is given by

being obtained by means of the predicted value



Remark: the cross-classification stratified design assumes a fixed superpopulation model defined in each stratum

$$E(y_{r,k}) = \mu_{rh} \ (k \in \text{stratum } h), \qquad Var(y_{r,k}) = \sigma^2_{rh}, \qquad Cov(y_{r,k}, y_{r,l}) = 0$$



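2. Definition of the final inclusion probabilities (calibration step):

Given the vector π, calculate by means of a calibration procedure the vector of final inclusion probabilities

Such that each expected domain sample size (the sum of the final inclusion probabilities over the domain) is an integer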

3. Sample selection (balancing step) with cube method
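The idea behind the calibration step (step 2) can be illustrated with a deliberately simplified sketch. It assumes a single partition and a proportional adjustment that rounds each expected domain size up to the next integer; the function name `calibrate_to_integer_sizes` and the data are hypothetical, and this is not the calibration procedure used by the authors. With two or more non-nested partitions a joint adjustment (for example, iterative raking towards consistent integer margins) would be needed instead.

```python
import numpy as np

def calibrate_to_integer_sizes(pi, domain):
    """Rescale pi within each domain so that the domain sum of inclusion
    probabilities becomes the next integer above the original sum."""
    pi = np.asarray(pi, dtype=float)
    domain = np.asarray(domain)
    out = pi.copy()
    for d in np.unique(domain):
        mask = domain == d
        expected = pi[mask].sum()
        # Small offset guards against sums like 2.0000000001 being rounded to 3.
        target = np.ceil(expected - 1e-9)
        out[mask] = pi[mask] * target / expected
    if np.any(out > 1.0):
        raise ValueError("rescaled probability above 1; a capped adjustment is needed")
    return out

# Toy input (assumed data): two domains with expected sizes 1.7 and 0.85.
pi = np.array([0.6, 0.5, 0.6, 0.3, 0.25, 0.3])
domain = np.array([0, 0, 0, 1, 1, 1])

pi_cal = calibrate_to_integer_sizes(pi, domain)
print([pi_cal[domain == d].sum() for d in (0, 1)])
```

Because every domain sum is rounded upwards, the calibrated overall sample size can only be equal to or larger than the one produced by the optimization step.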


Monte Carlo Simulation

Objectives of simulation:

Test the convergence of the optimization algorithm (optimization step)

Verify the expected AV against the Monte Carlo empirical AV

Compare with the standard cross-classified stratified design



Data: subpopulation of the Istat Italian Graduates’ Career Survey (3,427 units)

Driving allocation variables: employed status (yes/no), $y_1$; actively seeking work (yes/no), $y_2$

We generate the values of the two variables by means of a logistic additive model (prediction model)

Explanatory variables: degree mark, sex, age class and aggregation of subject area degree (different for $y_1$ and $y_2$)

The parameters are estimated with the data from the previous survey


Survey target estimates: 8 types of estimation domains

Two partitions define the most disaggregated domains:

First partition: university by subject area degree (9 classes)

Second partition: degree by sex

Domains: 448+94; strata: 2,981 (university, degree, sex)

In the simulation: domains 20+15; strata 91



Error thresholds fixed in terms of CV(%)

Results, assuming the y values as known:

Iterations of the modified Chromy algorithm: 6

Optimal sample size 171, after calibration 182

Results, assuming predicted y values:

Iterations of the modified Chromy algorithm: 3

Optimal sample size 699, after calibration 707



Analysis of the allocation with the predicted values

The sample allocation procedure uses an approximation of the AV

Average of Expected Anticipated CV(%)

Partition     y1      y2
1            8.1    17.8
2            9.2    19.1

Average of Empirical (10,000 Monte Carlo simulations) Anticipated CV(%)

Partition     y1      y2
1            6.7    14.7
2            7.4    15.5

The simulation confirms that the input AV is an upward approximation of the real AV
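How an empirical CV is obtained from replicated samples can be sketched as follows. The example rests on assumptions throughout: it uses Poisson sampling as a simple stand-in for the cube method, an invented yes/no variable, and equal inclusion probabilities, and it computes a purely design-based CV rather than the model-averaged anticipated CV reported above. The aim is only to show the mechanics of comparing an expected CV with its Monte Carlo counterpart.

```python
import numpy as np

rng = np.random.default_rng(12345)

# Toy population (assumed data): a yes/no variable on N units.
N = 200
y = rng.binomial(1, 0.6, size=N).astype(float)
pi = np.full(N, 0.3)          # equal inclusion probabilities, for illustration

# Expected CV(%) of the Horvitz-Thompson total under Poisson sampling,
# whose variance is sum((1/pi - 1) * y**2).
total = y.sum()
expected_var = np.sum((1.0 / pi - 1.0) * y ** 2)
expected_cv = 100.0 * np.sqrt(expected_var) / total

# Empirical CV(%) over R replicated Poisson samples.
R = 10_000
take = rng.random((R, N)) < pi                # one row per replicated sample
estimates = (take * (y / pi)).sum(axis=1)     # HT estimate in each replicate
empirical_cv = 100.0 * estimates.std(ddof=1) / estimates.mean()

print(round(expected_cv, 2), round(empirical_cv, 2))
```

In this stand-in the variance formula is exact, so the two figures agree up to simulation noise; in the slides the expected value is an upward approximation, so the empirical CV falls below it.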



Comparison with the standard approach:

The implicit model is similar to the model used in our approach; the allocation differences depend on the constraint of a minimum number of units (2) in each stratum.

The sample size is 751 units (+7.4%)

Taking into account the domains with small population strata (<10 units on average per stratum), the standard approach produces a +14.4% sample size

