
Oversampling the capital cities in the EU SAfety SUrvey (EU-SASU)

Task Force on Victimization
Eurostat, 17-18 February 2010

Guillaume Osier
Service Central de la Statistique et des Etudes Economiques (STATEC)

Social Statistics [email protected]

Outline

I. Some theory
1. Definitions and concepts
2. How to over-sample?
3. Why over-sample?
4. Impact on national accuracy

II. Over-sampling the capital cities in the EU-SASU
1. Is this proposal (statistically) relevant?
2. How to determine the over-sampling rates?
3. Impact on the national accuracy

III. Specific issues in relation to over-sampling

Definitions and concepts

(i) A sub-group (d) in the population is said to be over-sampled (or over-represented) when the proportion of units from the sub-group is, on average, higher in the sample than in the reference population:

$$E\left(\frac{n_d}{n}\right) > \frac{N_d}{N}$$

where $N_d / N$ is the proportion of units from (d) in the population and $E(n_d / n)$ is the average proportion of units from (d) in the sample.

(ii) Conversely, a sub-group is said to be under-sampled (or under-represented) when the proportion of units from the sub-group is, on average, lower in the sample than in the reference population:

$$E\left(\frac{n_d}{n}\right) < \frac{N_d}{N}$$

(iii) When a sub-group is neither over-sampled nor under-sampled, it is said to be well-sampled (or well-represented).

How to over-sample?

Over-sampling can only be implemented if the units in the sub-group can be identified in advance of sampling (this is an issue with telephone surveys).

Two main techniques to over-sample:

• Stratification using unequal sampling fractions in the strata

• More generally, « proportional-to-size » sampling (πps, pps…)

Over-sampling rate for (d):

$$OR_d = \frac{E(n_d)}{E(n) \cdot N_d / N}$$

The numerator is the expected sample size in (d). The denominator is the expected sample size in (d) under no over-sampling (i.e. under Simple Random Sampling).
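The over-sampling rate can be computed directly from its definition. The sketch below is only an illustration: the function name and all figures are hypothetical, and it assumes the rate is the expected sub-group sample size divided by the size expected under simple random sampling.

```python
def oversampling_rate(expected_n_d, expected_n, N_d, N):
    """Over-sampling rate OR_d = E(n_d) / (E(n) * N_d / N)."""
    # Expected sub-group sample size if every unit had the same chance of selection
    expected_under_srs = expected_n * N_d / N
    return expected_n_d / expected_under_srs


# Hypothetical figures: a sub-group of 50,000 in a population of 1,000,000,
# and a design expected to yield 300 sub-group units in a sample of 2,000.
rate = oversampling_rate(expected_n_d=300, expected_n=2000, N_d=50_000, N=1_000_000)
print(f"OR_d = {rate:.2f}")
```

A rate above 1 corresponds to an over-sampled sub-group, a rate below 1 to an under-sampled one, and a rate equal to 1 to a well-sampled one.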

Why over-sample? 1/2

By selecting more people from certain groups than would typically be done if everyone in the sample had an equal chance of being selected, over-sampling leads to more accurate estimates for those groups.

The technique has proven particularly suitable for:
• Small sub-populations;
• Sub-populations having severe non-response problems;
• Sub-populations with large internal variability on the key variables (e.g., household wealth)

Why over-sample? 2/2

More generally, one can resort to over-sampling whenever the sample size is too small to reach the precision targets specified for certain sub-populations.

Besides, in cross-national surveys (like the EU-SASU), over-sampling is essential for precision and hypothesis testing in cross-country comparisons.

The choice of the sub-groups to over-sample is policy-driven (political matter)

Impact on national accuracy 1/3

Optimal (Neyman) allocation: in order to maximize the precision of the national sample under stratified simple random sampling, the sample size in stratum h depends on both the stratum population size $N_h$ and the standard deviation $S_h$ of the study variable:

[Diagram: the total population aged 16+ is partitioned into H strata; stratum h (h = 1, …, H) has size $N_h$ and standard deviation $S_h$.]

$$n_h^{opt} = n \cdot \frac{N_h S_h}{\sum_{k=1}^{H} N_k S_k}$$


According to the previous formula, a larger sample should be taken if:
* the stratum is larger
* the stratum is more variable internally
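The Neyman allocation can be sketched in a few lines. This is an illustration only: the function name, the three strata and their standard deviations are hypothetical, and the resulting sizes are left unrounded.

```python
def neyman_allocation(n, sizes, std_devs):
    """Allocate a total sample n proportionally to N_h * S_h in each stratum."""
    weights = [N_h * S_h for N_h, S_h in zip(sizes, std_devs)]
    total = sum(weights)
    return [n * w / total for w in weights]


# Hypothetical strata: population sizes N_h and standard deviations S_h
sizes = [500_000, 300_000, 200_000]
std_devs = [0.3, 0.4, 0.5]

allocation = neyman_allocation(1000, sizes, std_devs)
for h, n_h in enumerate(allocation, start=1):
    print(f"stratum {h}: n_opt = {n_h:.1f}")
```

In this hypothetical case the largest stratum still receives the largest sample, but the smaller strata receive more than their population share because they are more variable.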

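These national considerations may conflict with more “local” considerations: as noted above, from a local point of view over-sampling often focuses on small sub-populations, while national considerations lead to taking larger samples from the largest strata. Nevertheless, the loss in national accuracy is often limited:

$$\frac{\sigma}{\sigma_{opt}} \le \sqrt{1 + g^2}, \qquad g = \max_{1 \le h \le H} \frac{\left| n_h - n_h^{opt} \right|}{n_h}$$

where $\sigma$ is the standard error under the actual allocation $(n_h)$, $\sigma_{opt}$ is the standard error under the optimal allocation, and g is the largest relative deviation from the optimal stratum sample sizes.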


Thus, if g = 20%, we have $\sigma / \sigma_{opt} \approx 1.02$, i.e. the standard error increases by about 2% relative to the optimum. Similarly, if g = 30%, we have $\sigma / \sigma_{opt} \approx 1.04$, an increase of about 4%. In this sense the optimum can be described as flat.
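The flatness of the optimum can be checked numerically. The sketch below assumes stratified simple random sampling with the finite-population correction ignored, in which case the ratio of standard errors equals sqrt((1/n) Σ (n_h^opt)² / n_h) and is bounded by sqrt(1 + g²), which reproduces the 1.02 and 1.04 figures quoted above. The function names and the two allocations are hypothetical.

```python
import math


def se_ratio(actual, optimal):
    """sigma / sigma_opt for two allocations with the same total sample size."""
    n = sum(actual)
    return math.sqrt(sum(o * o / a for a, o in zip(actual, optimal)) / n)


def max_relative_deviation(actual, optimal):
    """g = max over strata of |n_h - n_h_opt| / n_h."""
    return max(abs(a - o) / a for a, o in zip(actual, optimal))


# Hypothetical allocations, both summing to n = 1000
optimal = [405.0, 325.0, 270.0]
actual = [350.0, 330.0, 320.0]

g = max_relative_deviation(actual, optimal)
ratio = se_ratio(actual, optimal)
bound = math.sqrt(1 + g ** 2)
print(f"g = {g:.3f}, sigma/sigma_opt = {ratio:.4f}, bound = {bound:.4f}")

# Bound for the two values of g discussed in the text
for g_value in (0.20, 0.30):
    print(f"g = {g_value:.0%}: bound = {math.sqrt(1 + g_value ** 2):.3f}")
```

Even with the third stratum receiving about 50 units more than optimal, the standard error in this hypothetical case stays within roughly 1% of the optimum.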

As a result, the impact of over-sampling on national accuracy should be limited, provided the sample sizes are not “extremely” different from the optimal ones. The impact is all the more limited given that the national sample sizes are generally large (thousands of units). Besides, by using powerful auxiliary information at national level, one may hope to increase sample precision a posteriori.

Over-sampling the capital cities in the EU-SASU: is this proposal relevant?

Capital city = most populated city of the country

Always the same as the political capital (except for Switzerland)

Is the proposal (statistically) relevant?
• Sample size of individuals over the capital cities: is it enough to draw reliable conclusions?
• Victimization rates in the capital cities: are they generally higher than those for the rest of the country?
• Higher non-response in the capital cities? (often correct)

Minimum sample sizes for the capital cities

[Figure: horizontal bar chart with one bar per country; horizontal axis from 0 to 3000.]

Countries shown: France, Germany, Switzerland, Italy, Poland, Netherlands, Portugal, Slovakia, Denmark, Greece, Spain, Sweden, Finland, Norway, Romania, Ireland, Belgium, Slovenia, Czech Republic, Luxembourg, Lithuania, United Kingdom, Bulgaria, Hungary, Austria, Estonia, Cyprus, Latvia, Malta

Values shown: 276, 329, 341, 351, 355, 364, 402, 474, 558, 572, 574, 594, 641, 684, 712, 725, 769, 804, 916, 921, 966, 992, 1025, 1345, 1375, 1453, 1462, 1902, 2600

NONCALIBCALIB EVarYVar

Victimization rates in capital cities

[Figure: chart comparing, for each country, the national victimization rate (%) with the victimization rate (%) in the capital city; vertical axis from 0 to 70. Source: International Crime and Victimization Survey (ICVS), 2005]

Victimization rates are higher in the capital cities than in the rest of each country.

How to determine the over-sampling rates? 1/4

Step 1: set a precision target for every capital city
Step 2: determine the minimum sample size needed to achieve the level of precision specified at Step 1

Precision target (1): under simple random sampling, a relative margin of error of α% in each capital city for any victimization rate higher than P%

$$n_{min} = \left(\frac{1.96}{\alpha}\right)^2 \left(\frac{1}{P} - 1\right)$$

[Figure, two panels: minimum sample size $n_{min}$ plotted against P (in %, from 0 to 60) for α = 10%; and $n_{min}$ plotted against α (in %, from 0 to 60) for P = 20%.]

Precision target (2): under simple random sampling, an absolute margin of error of α% points in each capital city for any victimization rate higher than P%

$$n_{min} = \left(\frac{1.96}{\alpha}\right)^2 P \left(1 - P\right)$$
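Both precision targets translate into a minimum sample size. The sketch below assumes the usual normal-approximation formulas under simple random sampling, n_min = (1.96/α)² (1/P − 1) for the relative target and n_min = (1.96/α)² P (1 − P) for the absolute target, with α and P expressed as proportions. The function names and the chosen values of α and P are hypothetical.

```python
import math

Z = 1.96  # normal quantile for a 95% confidence level


def n_min_relative(alpha, p):
    """Minimum n for a relative margin of error alpha on a rate p."""
    return (Z / alpha) ** 2 * (1 / p - 1)


def n_min_absolute(alpha, p):
    """Minimum n for an absolute margin of error alpha on a rate p."""
    return (Z / alpha) ** 2 * p * (1 - p)


# Hypothetical targets for a victimization rate of 20%
n_rel = math.ceil(n_min_relative(alpha=0.10, p=0.20))  # 10% relative margin
n_abs = math.ceil(n_min_absolute(alpha=0.03, p=0.20))  # 3 percentage points
print(n_rel, n_abs)
```

Under the relative target the required sample size grows quickly as P gets small, since the same relative error then corresponds to a much smaller absolute error.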


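Impact on the national accuracy 1/8

Consider the national victimization rate for the 10 main crimes as used in the International Crime and Victimization Survey (ICVS):

$$\tilde{P} = \frac{N_C}{N}\,\tilde{P}_C + \frac{N_{NC}}{N}\,\tilde{P}_{NC}$$

where $\tilde{P}_C$ is the victimization rate in the capital city and $\tilde{P}_{NC}$ is the victimization rate in the rest of the country.

Variance:

$$V = \left(\frac{N_C}{N}\right)^2 \frac{\tilde{P}_C \left(1 - \tilde{P}_C\right)}{n_C} + \left(\frac{N_{NC}}{N}\right)^2 \frac{\tilde{P}_{NC} \left(1 - \tilde{P}_{NC}\right)}{n_{NC}}$$

where $n_C$ and $n_{NC}$ are the sample sizes in the capital city and in the rest of the country.

Relative margin of error:

$$RME = \frac{1.96}{\tilde{P}} \sqrt{\left(\frac{N_C}{N}\right)^2 \frac{\tilde{P}_C \left(1 - \tilde{P}_C\right)}{n_C} + \left(\frac{N_{NC}}{N}\right)^2 \frac{\tilde{P}_{NC} \left(1 - \tilde{P}_{NC}\right)}{n_{NC}}}$$

Absolute margin of error:

$$AME = 1.96 \sqrt{\left(\frac{N_C}{N}\right)^2 \frac{\tilde{P}_C \left(1 - \tilde{P}_C\right)}{n_C} + \left(\frac{N_{NC}}{N}\right)^2 \frac{\tilde{P}_{NC} \left(1 - \tilde{P}_{NC}\right)}{n_{NC}}}$$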
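The national rate, its variance and the two margins of error can be evaluated for any split of the sample between the capital city and the rest of the country. The sketch below assumes a two-stratum design with simple random sampling in each stratum and no finite-population correction. The function names and all figures are hypothetical and are not taken from the ICVS.

```python
import math


def national_rate(N_C, N_NC, p_C, p_NC):
    """Population-weighted combination of the two stratum rates."""
    N = N_C + N_NC
    return N_C / N * p_C + N_NC / N * p_NC


def national_variance(N_C, N_NC, p_C, p_NC, n_C, n_NC):
    """Variance of the stratified estimator of the national rate."""
    N = N_C + N_NC
    return ((N_C / N) ** 2 * p_C * (1 - p_C) / n_C
            + (N_NC / N) ** 2 * p_NC * (1 - p_NC) / n_NC)


# Hypothetical country: 10% of the population lives in the capital city
N_C, N_NC = 1_000_000, 9_000_000
p_C, p_NC = 0.30, 0.15
n_C, n_NC = 800, 7_200  # proportional allocation of n = 8,000

p = national_rate(N_C, N_NC, p_C, p_NC)
v = national_variance(N_C, N_NC, p_C, p_NC, n_C, n_NC)
ame = 1.96 * math.sqrt(v)  # absolute margin of error
rme = ame / p              # relative margin of error
print(f"P = {p:.3f}, AME = {100 * ame:.2f} points, RME = {100 * rme:.1f}%")
```

Changing `n_C` and `n_NC` while keeping the other inputs fixed shows how a given over-sampling of the capital city moves the national margins of error.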

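Case 1: fixed national sample size

$$n_C = \mathrm{Min}\left\{ \left(\frac{1.96}{\alpha}\right)^2 \left(\frac{1}{P} - 1\right),\; n \right\}, \qquad n_{NC} = n - n_C$$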
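Case 1 can be sketched as follows, assuming the capital city receives the minimum sample size required by the relative precision target (capped at the national sample size n) and the rest of the country receives what is left. The function names and the figures are hypothetical, and sample sizes are left unrounded.

```python
def n_min_relative(alpha, p, z=1.96):
    """Minimum n for a relative margin of error alpha on a rate p."""
    return (z / alpha) ** 2 * (1 / p - 1)


def allocate_fixed_total(n, alpha, p):
    """Case 1: the national sample size n is fixed."""
    n_C = min(n_min_relative(alpha, p), n)  # capital city
    n_NC = n - n_C                          # rest of the country
    return n_C, n_NC


# Hypothetical national sample of 8,000 with alpha = 10% and P = 20%
n_C, n_NC = allocate_fixed_total(8000, alpha=0.10, p=0.20)
print(f"n_C = {n_C:.0f}, n_NC = {n_NC:.0f}")
```

Because the total is fixed, every unit added to the capital city is taken away from the rest of the country.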

Impact on the national accuracy 4/8

Table 3: Relative margin of error (%) for the national victimization rate – fixed sample size at national level (Case 1)

Country           No over-sampling   Over-sampling
                                     P=0.1   P=0.2   P=0.3   P=0.4   P=0.5
France            7.5                6.3     6.0     5.9     5.9     5.9
Germany           7.0                5.9     5.7     5.6     5.6     5.6
Switzerland       6.6                5.3     5.1     5.0     5.0     5.0
Italy             7.2                6.1     5.8     5.8     5.7     5.8
Poland            6.5                5.5     5.3     5.2     5.2     5.2
Netherlands       5.5                4.6     4.4     4.4     4.4     4.4
Portugal          8.0                6.8     6.5     6.4     6.4     6.4
Denmark           7.1                5.4     5.2     5.2     5.3     5.2
Greece            7.1                6.0     5.8     5.8     5.9     5.8
Spain             8.1                7.0     6.8     6.9     7.1     6.9
Sweden            6.6                5.4     5.2     5.3     5.4     5.3
Finland           8.4                6.6     6.4     6.5     6.8     6.5
Norway            7.4                5.8     5.7     5.8     6.0     5.7
Ireland           6.2                4.8     4.7     4.7     4.9     4.7
Belgium           5.5                4.7     4.7     4.7     4.9     4.7
United Kingdom    4.5                3.9     4.0     4.1     4.4     3.9
Hungary           6.9                6.4     6.9     7.6     8.5     6.5
Austria           6.6                6.2     6.8     7.6     8.7     6.3
Estonia           5.5                4.9     5.6     6.5     7.7     4.9


Table 4: Absolute margin of error (% points) for the national victimization rate – fixed sample size at national level (Case 1)

France            0.9   0.8   0.7   0.7   0.7   0.7
Germany           0.9   0.8   0.7   0.7   0.7   0.7
Switzerland       1.2   1.0   0.9   0.9   0.9   0.9
Italy             0.9   0.8   0.7   0.7   0.7   0.7
Poland            1.0   0.8   0.8   0.8   0.8   0.8
Netherlands       1.1   0.9   0.9   0.9   0.9   0.9
Portugal          0.8   0.7   0.7   0.7   0.7   0.7
Denmark           1.3   1.0   1.0   1.0   1.0   1.0
Greece            0.9   0.7   0.7   0.7   0.7   0.7
Spain             0.7   0.6   0.6   0.6   0.6   0.6
Sweden            1.1   0.9   0.8   0.8   0.9   0.8
Finland           1.1   0.8   0.8   0.8   0.9   0.8
Norway            1.2   0.9   0.9   0.9   1.0   0.9
Ireland           1.4   1.1   1.0   1.0   1.1   1.0
Belgium           1.0   0.8   0.8   0.8   0.9   0.8
United Kingdom    0.9   0.8   0.8   0.9   0.9   0.8
Hungary           0.7   0.6   0.7   0.8   0.8   0.6
Austria           0.8   0.7   0.8   0.9   1.0   0.7
Estonia           1.1   1.0   1.1   1.3   1.6   1.0

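Case 2: national sample size not fixed

$$n_C = \mathrm{Max}\left\{ \left(\frac{1.96}{\alpha}\right)^2 \left(\frac{1}{P} - 1\right),\; n\,\frac{N_C}{N} \right\}, \qquad n_{NC} = n\,\frac{N_{NC}}{N}$$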
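Case 2 can be sketched in the same spirit, assuming the rest of the country keeps its proportional share of the initial sample n while the capital city receives the larger of its proportional share and the minimum size required by the relative precision target. The function name, the capital-city shares and the figures are hypothetical, and sample sizes are left unrounded.

```python
def allocate_free_total(n, share_capital, alpha, p, z=1.96):
    """Case 2: the national sample may grow beyond the initial size n."""
    n_min = (z / alpha) ** 2 * (1 / p - 1)  # relative precision target
    n_C = max(n_min, n * share_capital)     # capital city
    n_NC = n * (1 - share_capital)          # rest of the country, proportional
    return n_C, n_NC


# Hypothetical initial sample of 8,000 with alpha = 10% and P = 20%
for share in (0.05, 0.30):
    n_C, n_NC = allocate_free_total(8000, share, alpha=0.10, p=0.20)
    print(f"share = {share:.0%}: n_C = {n_C:.0f}, n_NC = {n_NC:.0f}, "
          f"total = {n_C + n_NC:.0f}")
```

When the capital city is small, the total sample grows above the initial n. When its proportional share already exceeds the target, nothing changes.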

Impact on the national accuracy 7/8

Table 5: Relative margin of error (%) for the national victimization rate – national sample size not fixed (Case 2)

France            5.7   5.8   5.8   5.8   5.9   5.9
Germany           5.4   5.4   5.5   5.5   5.6   5.6
Switzerland       4.8   4.8   4.9   4.9   5.0   5.0
Italy             5.6   5.6   5.6   5.7   5.7   5.8
Poland            5.0   5.0   5.1   5.1   5.2   5.2
Netherlands       4.2   4.2   4.3   4.3   4.4   4.4
Portugal          6.2   6.3   6.3   6.4   6.4   6.4
Denmark           4.9   4.9   5.0   5.2   5.2   5.2
Greece            5.5   5.6   5.7   5.8   5.8   5.8
Spain             6.4   6.5   6.7   6.9   6.9   6.9
Sweden            4.9   5.0   5.1   5.3   5.3   5.3
Finland           5.9   6.0   6.3   6.5   6.5   6.5
Norway            5.2   5.4   5.6   5.7   5.7   5.7
Ireland           4.3   4.5   4.6   4.7   4.7   4.7
Belgium           4.4   4.5   4.6   4.7   4.7   4.7
United Kingdom    3.6   3.8   3.9   3.9   3.9   3.9
Hungary           5.8   6.3   6.5   6.5   6.5   6.5
Austria           5.5   6.2   6.3   6.3   6.3   6.3
Estonia           4.0   4.8   4.9   4.9   4.9   4.9

Impact on the national accuracy 8/8

Table 6: Absolute margin of error (% points) for the national victimization rate – national sample size not fixed (Case 2)

France            0.7   0.7   0.7   0.7   0.7   0.7
Germany           0.7   0.7   0.7   0.7   0.7   0.7
Switzerland       0.9   0.9   0.9   0.9   0.9   0.9
Italy             0.7   0.7   0.7   0.7   0.7   0.7
Poland            0.8   0.8   0.8   0.8   0.8   0.8
Netherlands       0.8   0.8   0.8   0.8   0.9   0.9
Portugal          0.6   0.7   0.7   0.7   0.7   0.7
Denmark           0.9   0.9   0.9   1.0   1.0   1.0
Greece            0.7   0.7   0.7   0.7   0.7   0.7
Spain             0.6   0.6   0.6   0.6   0.6   0.6
Sweden            0.8   0.8   0.8   0.8   0.8   0.8
Finland           0.7   0.8   0.8   0.8   0.8   0.8
Norway            0.8   0.8   0.9   0.9   0.9   0.9
Ireland           1.0   1.0   1.0   1.0   1.0   1.0
Belgium           0.8   0.8   0.8   0.8   0.8   0.8
United Kingdom    0.8   0.8   0.8   0.8   0.8   0.8
Hungary           0.6   0.6   0.6   0.6   0.6   0.6
Austria           0.6   0.7   0.7   0.7   0.7   0.7
Estonia           0.8   1.0   1.0   1.0   1.0   1.0

Specific issues

• The initial difficulty lies in obtaining a sampling frame suitable for over-sampling the inhabitants of the capital cities. For the countries conducting a face-to-face survey, this should not be a serious issue. On the other hand, the countries which plan to conduct the survey by telephone might be unable to do so, unless specific phone numbers are allocated to the households in the capital city (e.g., when the first digits of a phone number represent the city code).

• Since individuals in capital cities are in general more difficult to contact, over-sampling them will require more contact attempts, which will likely imply higher costs and more time to reach the minimum sample size required for the survey.

• Finally, over-sampling might make the problem of anonymisation of the data more acute.

Questions for the TF

1. Is over-sampling the inhabitants of the capital cities policy-relevant? Which geographical areas might be over-sampled instead?

• NUTS2 or NUTS3 regions
• Groups of cities (like in Eurostat’s Urban Audit)
• Densely populated areas (based on degree of urbanization)
• City areas…

2. What level of accuracy is needed for the capital cities/other geographical areas?

3. What about higher non-response?

4. What about telephone surveys?
