Report on WP4 Case studies

ISTAT, CBS, GUS, INE, SSB, SFSO, EUROSTAT

ESSnet on Data Integration

Contents

Preface

1 Register-based employment statistics. Micro-integration and quality-perspective life-cycle. A case study.
  1.1 Introduction
    1.1.1 Terminology and notation
  1.2 Considerations when choosing a micro-integration process
  1.3 Administrative registers involved
  1.4 The main parts of the micro-integration process
  1.5 Part 1 of the micro-integration: constructing sets of potential employees and self-employed
    1.5.1 Constructing set of potential employees
      1.5.1.1 Details on the merging keys
    1.5.2 Constructing the set of potential self-employed
  1.6 Part 2 of the micro-integration: selecting the most important work relation
    1.6.1 The most important employee relation
      1.6.1.1 Distributing the WSR wages onto the group 1 employee relations
      1.6.1.2 Harmonising the dates for employee relations
      1.6.1.3 Selecting the most important employee relation
    1.6.2 The most important self-employment relation
    1.6.3 Selecting the most important work relation
  1.7 Part 3 of the micro-integration: classification of employment status
  1.8 Life cycle of register process from a quality perspective
    1.8.1 Alternative life cycle formulation

2 The approach to quality evaluation of the microintegrated employment statistics
  2.1 Introduction
  2.2 The estimated standard deviation of the LFS-proportion
  2.3 Modelling the bias of the register-based statistics
    2.3.1 An alternative model
    2.3.2 Fitting the multilevel model
  2.4 Results
    2.4.1 Parameter estimates
    2.4.2 Comparison of REG-employment and REG-employment using the distribution of the underlying bias estimator
    2.4.3 Comparison of REG-employment and REG-employment using the EBLUP estimator
  2.5 Adjusting the EBLUP estimator
  2.6 Conclusions

3 Combining data from administrative sources and sample surveys; the single-variable case. Case study: Educational Attainment
  3.1 Introduction
  3.2 Sources on education and their quality
    3.2.1 The Labour Force Survey
    3.2.2 Administrative education registers
      3.2.2.1 Primary education
      3.2.2.2 Secondary education
      3.2.2.3 Higher or tertiary education
      3.2.2.4 Other administrative registers
    3.2.3 Sources on education and framework for errors
  3.3 Micro-integration
    3.3.1 From education sources to Education Archive
    3.3.2 From Education Archive to Educational Attainment File
      3.3.2.1 Selection from the Education Archive
      3.3.2.2 Determination of education levels at reference date
      3.3.2.3 Assessing the validity of education levels at reference date
      3.3.2.4 Determining educational attainment at reference date
  3.4 Weighting strategy
    3.4.1 Population weights for the Educational Attainment File
    3.4.2 Selection strategy for weighting models, theoretical framework
    3.4.3 Selection strategy for weighting models applied to Educational Attainment File
  3.5 Measurement of accuracy
  3.6 Some concluding remarks

4 First Steps in Profiling Italian Patenting Enterprises
  4.1 A general description of patenting administrative flows
  4.2 Data sources
  4.3 Data pre-processing and standardisation
  4.4 The record linkage process
    4.4.1 Search space reduction
    4.4.2 Deterministic record linkage
  4.5 Preliminary Results
  4.6 Conclusions and future plans

5 The framework for error in an integrated survey
  5.1 Introduction
  5.2 Integration process steps of SME survey according to a framework on errors in register based statistics
    5.2.1 Data sources of integration process
    5.2.2 Measurement side
    5.2.3 Representation side
  5.3 “Life cycle and errors” in an integrated survey

6 Statistical matching: Polish case study
  6.1 Introduction
  6.2 Data integration and statistical matching
  6.3 Methods of statistical matching
  6.4 Integrating databases from the Microcensus and LFS
    6.4.1 Description of the datasets
    6.4.2 The integration algorithm
    6.4.3 The assessment of the integration
  6.5 Summary

Bibliography

List of Figures

List of Tables

A Appendix A

B Appendix B

Preface

Johan Fosen

Statistics Norway

The use of administrative registers in the production of official statistics is becoming more and more widespread. It saves costs and imposes less response burden than sample surveys. Administrative registers have been established for administrative and not statistical purposes, and therefore we usually have to link and harmonise information from several data sources to produce a given official statistic based on administrative registers. Some of these data sources can also be sample surveys.

The data integration involved in the process above is commonly divided into record linkage, statistical matching, and micro-integration. Record linkage is about linking the record of one unit in one data source (register or sample survey) with the records for the same unit in another data source. In statistical matching, we assume that the data sources contain different units, and try to link one unit from one data source with the most similar unit from another data source. Finally, micro-integration addresses the challenges met after the units have been successfully linked: often one variable of interest is not defined as desired by one data source, but can be defined so by carefully combining different variables from the different data sources.

The ESSnet Data Integration project was set up to meet the challenges in data integration. The state of the art in data integration methodology is described in Work Package 1 of the ESSnet Data Integration project, whereas new methodology is presented in Work Package 2. The current document is the report from the case studies, which is Work Package 4 in the project. The main objective of the case studies is ‘To foster knowledge transfer by developing a case study and associated recommendations on representative problems in data integration in the ESS’ (quoted from the ESSnet Data Integration project application). The case studies illustrate methodological topics in data integration.

There are five case studies presented in this report. Case study 1 is about the register-based employment statistics in Norway. These statistics are disseminated annually and are produced using micro-integration. The first paper in this report describes the micro-integration and relates this micro-integration process to a quality-perspective life cycle of micro-integration. The second paper considers the important topic of evaluating the quality of the output of the micro-integration at small area level, using a small area model.

The third paper describes case study 2, which is about the derivation of educational attainment by combining administrative sources and sample surveys. The administrative registers and the Labour Force Survey together contain much information on education programmes (including exam results) of the population in the Netherlands. By micro-integration, Statistics Netherlands (CBS) constructs an educational attainment file with the people’s highest education. The case study describes the micro-integration and the solution of the methodological challenges involved.

Case study 3 is described in the fourth paper, and is about record linkage. In the Italian National Statistical Institute (ISTAT), patent applications from an international database have been linked with the Italian Official Business register, which gives a statistical database enabling e.g. analyses of the connection between patenting activities and the profit of enterprises. The linkage had to be based mainly on the comparison between the applicant names and the enterprise names. The case study illustrates the linkage methodology that was used for this purpose.

Case study 4 is treated in the fifth paper. In ISTAT, an integration process has been developed between the Sample Survey of Small and Medium Enterprises and administrative registers. The paper focuses on relating the integration process to a quality-perspective life cycle of micro-integration. In this scheme, sources of errors (from the measurement and coverage sides) have been identified, and it has been possible to adopt some actions in order to reduce their impact on final estimates.

Paper 6 presents case study 5, which is a simulation study where the Polish 2005 Microcensus (a sample survey) and the Labour Force Survey from 2005 are matched. The first dataset has about 2 million records on economic and demographic variables, the second one about 220,000 records on economic activity.

The main contributors to this report are Bart Bakker, Vincent de Heij, Daniela Ichim, Frank Linder, Filippo Oropallo, Giulio Perani, Francesca Romana Pogelli, Dominique van Roon, Wojciech Roszka, Giovanni Seri, Marcin Szymkowiak, Li-Chun Zhang, and Johan Fosen.

Special thanks are due to Marcin Szymkowiak for the efforts in transforming all the files in this document into LaTeX.

Chapter 1

Register-based employment statistics. Micro-integration and quality-perspective life-cycle. A case study.

Johan Fosen

Statistics Norway

1.1 Introduction

In the ESSnet Data Integration (ESSnet-DI) project, the Norwegian register-based employment statistics have been selected as a case study on micro-integration. The employment statistics are based on micro-integration of several administrative registers (Aukrust et al. 2010; in Norwegian). They are published annually and cover employment for the third week of November.[1]

Below I will not try to give a complete and accurate account of the micro-integration of the register-based employment statistics. As an example of micro-integration, I will give a simplified overview of what I think are the essential ideas and steps of this micro-integration, emphasising simplicity and clarity.

After defining concepts and describing notation in Section 1.1.1, we will in Section 1.2 look at which considerations have to be taken into account when deciding on a micro-integration system. Then in Section 1.3 we will describe some of the registers involved before we present an overview of the micro-integration in Section 1.4. In Section 1.5 to Section 1.7 we present the three main parts of the micro-integration. The last part, Section 1.8, is devoted to relating the micro-integration to the life-cycle register processes described by Bakker (2010) and the life cycle described by Zhang (2011a).

[1] http://www.ssb.no/english/subjects/06/01/regsys_en/

1.1.1 Terminology and notation

An administrative register is a register made for administrative purposes. A statistical register, on the other hand, is made for statistical purposes, usually by a statistical agency and from several administrative registers.

Employment status measures whether a person is employed or not. Employed persons can be further divided into employees and self-employed.

Heuristically, a work relation is a job; more precisely, it is a relation between a person and the employer, either a company or an enterprise.[2]

The unit (or record) in the administrative registers considered in the present document is either person or work relation. We will denote all such units as register relations, and these relations can be further divided into employee relations, concerning work as an employee, and self-employment relations.

A register relation is active between the start date and the termination date of the register relation. Briefly, if a register relation is active at the reference week, we just say that the register relation is active.
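The activity rule above can be written as a small date check. The following is a minimal sketch: the function name `is_active` and the treatment of a missing termination date as "still running" are assumptions made for illustration, and the test used (overlap with the reference week) is one possible reading of "active at the reference week".

```python
from datetime import date
from typing import Optional


def is_active(start: date, end: Optional[date],
              week_start: date, week_end: date) -> bool:
    """Return True if the relation [start, end] overlaps the reference week.

    end=None is assumed to mean that the relation has not been terminated.
    """
    started_in_time = start <= week_end
    not_yet_ended = end is None or end >= week_start
    return started_in_time and not_yet_ended


# Hypothetical reference week: third week of November 2010.
WEEK_START, WEEK_END = date(2010, 11, 15), date(2010, 11, 21)
print(is_active(date(2010, 1, 1), None, WEEK_START, WEEK_END))
```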

The most important work (relation) refers to the person’s main work. All work-related variables, such as industry, usually refer to the most important work relation. In the same manner we can talk of the most important employee relation and the most important self-employment relation. The most important work relation is the more important of a person’s most important employee relation and his/her most important self-employment relation.

Employment status is measured at a certain week, the reference week or reference time. The year of the reference week is the reference year. A person is employed at the reference week if he/she has worked for pay or profit at least one hour during the reference week, or has been temporarily absent from such work. This definition is the ILO (International Labour Organisation) definition, and it is beyond the scope of the present document to go further into this definition by e.g. looking at the precise definition of work.

[2] Enterprise is at the level above company, i.e. an enterprise can consist of several companies.

1.2 Considerations when choosing a micro-integration process

The choice of micro-integration process in a system such as that of the register-based employment statistics is typically based on several considerations.

The classification should not just be into employed and not employed, but into employees, self-employed, and not employed. The micro-integration ends up with a classification into these three categories, and the employees and the self-employed are then by definition employed. Thus, the classification of employment is done indirectly.

Experience with the quality of different administrative registers is an important tool for keeping the unit misclassification low. Also, to the extent that the answers in the Labour Force Survey (LFS) can be assumed to be a good approximation to the correct employment status, the unit misclassification can be measured by comparing the individual register-based employment status with the individual LFS-employment status within the LFS-sample. The unit misclassification should also be kept low for the classification of employees and self-employed.

Micro-integration is done so that the total national levels of employment, employees, and self-employed are equal to the estimated levels based on the quarterly Labour Force Survey (LFS).

The micro-integration should end up with a statistical register of employment status, employee/self-employed and other register variables, meaning that all tables on employment can be made by simple tabulation based on this statistical register, just as if the statistical register were a census.

The classification should be robust towards administrative changes in the future, and it should be simple.

Additional work-related variables such as industry should also be of as good a quality as possible.

1.3 Administrative registers involved

• The Central population register (CPR). The register is maintained by the National Tax Administration and contains e.g. all persons with residence in Norway.

• The Employer/employee register (ER), being a short name for “The Register of Employers and Employees”. The register is maintained by The Norwegian Labour and Welfare Administration and is the main source for classifying employment status. It contains e.g. start and termination dates of all employee relations scheduled to last more than one week and to average more than four hours a week. However, the register does not include work performed by contractors.

• Wage Sum Register (WSR). The actual name of the register is “The end of year certificate register”, but we will use “Wage Sum Register” since this name better reflects the content of the register. The register is maintained by the National Tax Administration, and contains the annual wage for each work relation at the enterprise level.

• Register of Conscripts.

• Register of Civilian National Service.

• Tax Register. The real name is “the Register of Personal Tax Payers”, and it is maintained by the National Tax Administration. It contains e.g. information on income, both from work as employee and work as self-employed.

• The Central Sickness Absence Register. The register is maintained by The Norwegian Labour and Welfare Administration and contains start and termination dates of all absences from work that are certified by a physician.

• The Parental Benefit Register. The register is maintained by The Norwegian Labour and Welfare Administration, and contains information on start and termination of maternity and paternity leave.

• The Unemployment Register (UR).

• The Central Coordinating Register for Legal Entities

There are also additional registers involved in the micro-integration.

1.4 The main parts of the micro-integration process

The population is defined as all persons living in Norway at the reference week according to the Central population register.

The micro-integration process consists of three main parts. In the first part, a set of potential employees is constructed, and each person’s types of register relations are identified. Based on the origin of a register relation, e.g. whether the work relation was found in one or both of the registers ER and WSR, the register relations form four groups. A person can have work relations in all of these groups.

A set of potential self-employed is also constructed, and here a person’s register relations fall into two groups. Just as with employee relations, a person can have work relations in both groups.

The purpose of part 2 is to find each person’s most important work relation. As we will see later, wage information has to be distributed, and this is done first. Then the date information from the registers is harmonised, resulting in activation of some register relations and inactivation of other relations. Afterwards, each person’s most important employee relation is selected.

For self-employment relations, date information is harmonised before selecting the most important self-employment relation.

Finally, the more important of these two most important relations is identified, giving the most important work relation.

In part 3, each person is classified as employee, self-employed, or not employed, mainly based on their most important work relation. Those being employees or self-employed are classified as employed.

1.5 Part 1 of the micro-integration: constructing sets of potential employees and self-employed

1.5.1 Constructing set of potential employees

Any active ER-relation implies that the person receives a wage for the work relation. Therefore, any work relation found in ER should also be traced in WSR. In order to identify these connections, the work relations in ER and WSR are merged by the personal id and the enterprise id. The reason for using enterprise id in the merge instead of company id is described in Section 1.5.1.1.

Figure 1.1. Flowchart illustrating the steps described in Section 1.5–Section 1.7

The merge generates three groups of register relations:

• Group 1 – work relations found in both registers (ER and WSR),

• Group 2 – work relations only found in ER (ER only), and

• Group 3 – work relations only found in WSR (WSR only).

We also have another group,

• Group 4 – active register relations found in the Register of Conscriptsand in the Register of Civilian National Service.

These four groups constitute the set of potential employees, and will be central in the classification described later. Briefly, active relations in group 1 and group 4 are a rather certain indication of employment. Active relations in group 3 are uncertain, and some of these persons will be classified as employed and some as not employed.
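The three-way split produced by the ER/WSR merge can be sketched with plain set operations. This is only an illustration: the function name `group_relations` and the representation of a relation as a `(person_id, enterprise_id)` tuple are assumptions, and the example ids are invented. Group 4 comes from separate registers and is not part of the merge.

```python
from typing import Dict, Iterable, Set, Tuple

Key = Tuple[str, str]  # (person_id, enterprise_id), the merging key


def group_relations(er_keys: Iterable[Key],
                    wsr_keys: Iterable[Key]) -> Dict[int, Set[Key]]:
    """Split work relations into groups 1-3 by register of origin."""
    er, wsr = set(er_keys), set(wsr_keys)
    return {
        1: er & wsr,  # found in both ER and WSR
        2: er - wsr,  # found in ER only
        3: wsr - er,  # found in WSR only
    }


# Invented example data.
er = [("p1", "e1"), ("p2", "e1")]
wsr = [("p1", "e1"), ("p3", "e2")]
groups = group_relations(er, wsr)
print(groups[1], groups[2], groups[3])
```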

1.5.1.1 Details on the merging keys

An active ER-relation is identified by the personal id together with the company id. On the other hand, the corresponding WSR-relation is identified by the personal id together with the enterprise[3] id. The ER-relations then have to be merged with the WSR-relations at the enterprise level, i.e. using personal id and enterprise id as merging key.

Since the enterprise population changes continuously, the enterprise id of ER and of WSR for the same enterprise may deviate. There is a routine that manages to merge some of these ER- and WSR-relations, resulting in an increase of group 1.

For the public sector in Norway, the situation is a bit more complicated, since the WSR-relation is sometimes identified by personal id and the unit at the level above the company (which corresponds to enterprise in the private sector), and sometimes by the unit two levels above the company. There are some routines handling this.

1.5.2 Constructing the set of potential self-employed

The set of potential self-employed consists of those having income (including negative income) from self-employment according to the Tax Register. Because of the long production time of this register, the income for the year before the reference year is used.

[3] Every company has to belong to an enterprise. An enterprise could contain many companies, but sometimes only one company.

For each of the potential self-employed, an algorithm tries to attach an enterprise to the person in order to find the industrial code of the self-employment activity. We will not go into this routine, although it is necessary for the separation between farming/forestry/fishing and other industries below. Notice that the routine is also important for finding the values of the additional variable industry in the statistical register.

The set of potential self-employed is divided into two groups, one consisting of persons believed to have their self-employment income from farming, forestry, or fishing, and another group consisting of the other persons.

1.6 Part 2 of the micro-integration: selecting the most important work relation

In part 2 of the micro-integration, each person’s most important employee relation and most important self-employment relation are selected. Then one of these is selected as the most important work relation.

1.6.1 The most important employee relation

Before the most important employee relation can be identified, some harmonisation of information has to be done. Firstly, a person’s wage information has to be distributed. Then, a harmonisation is done between dates of start and termination of register relations. Afterwards, the selection between employee relations starts.

1.6.1.1 Distributing the WSR wages onto the group 1 employee relations

For group 1 of the potential employee relations from Section 1.5.1, there is one work relation for each company that the person is working for. However, since the wage information comes from WSR, which only contains information at the enterprise level (Section 1.5.1.1), the wage information is a sum over all work relations for companies within the enterprise. In the cases where a person has more than one ER-relation within the same enterprise,[4] an algorithm distributes his/her wage from the enterprise onto each of these work relations.

[4] I.e. active work relations for more than one company within the enterprise.
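The text does not say which rule the distribution algorithm uses. Purely as an illustration, the sketch below assumes that the enterprise-level wage is split in proportion to the expected working hours of each ER-relation, with an equal split when no hours are recorded. Both the rule and the name `distribute_wage` are assumptions, not the method actually used.

```python
from typing import Dict


def distribute_wage(enterprise_wage: float,
                    hours_by_company: Dict[str, float]) -> Dict[str, float]:
    """Split one enterprise-level WSR wage onto company-level ER-relations.

    Assumed rule (not from the report): proportional to expected working
    hours, with an equal split as fallback when all hours are zero.
    """
    n = len(hours_by_company)
    if n == 0:
        return {}
    total_hours = sum(hours_by_company.values())
    if total_hours == 0:
        return {company: enterprise_wage / n for company in hours_by_company}
    return {company: enterprise_wage * hours / total_hours
            for company, hours in hours_by_company.items()}


# A person with two ER-relations (companies A and B) in one enterprise.
print(distribute_wage(300000, {"A": 30, "B": 10}))
```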

1.6.1.2 Harmonising the dates for employee relations

The employee relations of group 1 and group 2 have dates of start and termination coming from ER (cf. Section 1.5.1). These are harmonised with information on sickness absence and on unemployment.

A person who is absent from work due to sickness or maternity or paternity leave is only temporarily absent from work and is employed according to the employment definition. However, for administrative purposes, a person absent from work is defined as inactive in ER from the day the absence has lasted two weeks until the person returns to work. To reactivate these relations, ER is harmonised with information on sickness absence (from the Central Sickness Absence Register) and on maternity and paternity leave (from the Parental Benefit Register): if a person has only inactive group 1 relations but has an active sickness absence relation (or an active maternity and paternity leave relation), then the most recent group 1 relation (measured by date of start) is reactivated.

If a person has both an active group 1/group 2 relation and an active unemployment relation (from the Unemployment Register, UR), all group 1 relations whose start date is prior to, or less than a week after, the start date of the unemployment are terminated at the start date of the unemployment.

If a person, in addition to a group 1/group 2 relation, has both an active sickness absence relation (or maternity and paternity leave relation) and an active unemployment relation, then harmonisation is done as if the unemployment relation did not exist.

For group 3 and group 4 relations, no harmonisation is done with sickness absence information or maternity and paternity leave information. However, the harmonisation with unemployment information is done in the same way as above.
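The harmonisation rules for group 1 relations can be sketched as follows. The sketch covers only the three rules stated for group 1/group 2 relations. The `Relation` class, the `harmonise` function and its two inputs (a flag for an active sickness or parental leave relation, and the start date of an active unemployment relation) are hypothetical names, and marking a terminated relation as inactive is an assumption about how termination affects the reference week.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Relation:
    group: int
    start: date
    active: bool
    end: Optional[date] = None


def harmonise(relations: List[Relation], on_leave: bool,
              unemployment_start: Optional[date]) -> List[Relation]:
    """Apply the sickness/leave and unemployment rules to group 1 relations."""
    group1 = [r for r in relations if r.group == 1]
    if on_leave:
        # Leave takes precedence: unemployment is treated as non-existent.
        if group1 and not any(r.active for r in group1):
            # Reactivate the most recent group 1 relation (by start date).
            max(group1, key=lambda r: r.start).active = True
        return relations
    has_active_12 = any(r.active and r.group in (1, 2) for r in relations)
    if unemployment_start is not None and has_active_12:
        limit = unemployment_start + timedelta(days=7)
        for r in group1:
            if r.start < limit:  # prior to, or less than a week after
                r.end = unemployment_start
                r.active = False  # assumed inactive once terminated
    return relations
```

The leave branch returns early so that an active unemployment relation is ignored whenever a sickness absence or parental leave relation is active, matching the third rule.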

1.6.1.3 Selecting the most important employee relation

A person’s most important employee relation is selected according to thefollowing priority list:

• Priority 1: active employee relation in group 4.

• Priority 2: active employee relation in group 1. If several: the one with longest expected working hours is selected. In the case of a tie, the one having the latest date of start has highest priority.

• Priority 3: active employee relations in group 2. If several: the same sorting as for group 1.

• Priority 4: active employee relation in group 3. If several: the one with highest wage[5] is selected.
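The priority list above translates into a two-step selection: pick the best available group, then break ties within it. The sketch below is illustrative only. The class `EmployeeRelation`, its field names and the function name are assumptions, and since the text gives no tie-break within group 4, the first such relation is taken.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Lower number means higher priority: group 4, then 1, then 2, then 3.
GROUP_PRIORITY = {4: 0, 1: 1, 2: 2, 3: 3}


@dataclass
class EmployeeRelation:
    group: int
    active: bool
    start: date = date.min
    hours: float = 0.0  # expected working hours (ER-based groups)
    wage: float = 0.0   # annual wage from WSR


def most_important_employee_relation(
        relations: List[EmployeeRelation]) -> Optional[EmployeeRelation]:
    active = [r for r in relations if r.active]
    if not active:
        return None
    best_group = min(active, key=lambda r: GROUP_PRIORITY[r.group]).group
    candidates = [r for r in active if r.group == best_group]
    if best_group in (1, 2):
        # Longest expected hours first; ties go to the latest start date.
        return max(candidates, key=lambda r: (r.hours, r.start))
    if best_group == 3:
        return max(candidates, key=lambda r: r.wage)
    return candidates[0]  # group 4: no tie-break given, first one taken
```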

1.6.2 The most important self-employment relation

The self-employment relations have no dates of start and end, and these dates are therefore set to 1 January and 31 December. These dates are harmonised with the Unemployment Register in the same way as for employee relations in Section 1.6.1.2.

A person can have one self-employment relation from agriculture/forestry/fishing and one from other industries (cf. Section 1.5.2). The one of these two with the highest income is selected as the most important self-employment relation.

1.6.3 Selecting the most important work relation

The most important employee relation becomes the most important work relation if a person has no self-employment relations, and the most important self-employment relation becomes the most important work relation if a person has no employee relations. Otherwise, the most important employee relation is compared to the most important self-employment relation, and the most important work relation is selected according to this priority list:

• Priority i: most important employee relation from group 4.

• Priority ii: most important employee relation from group 1 where expected working hours exceed 30 hours a week.

• Priority iii: most important employee relation from group 1 where the sum of the wages as an employee exceeds the sum of income from self-employment relations.

• Priority iv: most important self-employment relation

• Priority v: most important employee relation from group 2 or 3.
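A minimal sketch of this comparison is given below. The function name, its arguments and the returned labels are hypothetical. Note that when a person has both kinds of relation, priority v can never be reached, because the self-employment relation (priority iv) always precedes a group 2 or 3 employee relation.

```python
from typing import Optional


def most_important_work_relation(
    employee_group: Optional[int],
    expected_hours: float,
    total_wages: float,
    has_self_employment: bool,
    total_self_income: float,
) -> Optional[str]:
    """Return 'employee', 'self-employed' or None (no work relation).

    employee_group is the group (1 to 4) of the most important employee
    relation, or None if the person has no employee relation.
    """
    if employee_group is None:
        return "self-employed" if has_self_employment else None
    if not has_self_employment:
        return "employee"
    if employee_group == 4:                                     # priority i
        return "employee"
    if employee_group == 1 and expected_hours > 30:             # priority ii
        return "employee"
    if employee_group == 1 and total_wages > total_self_income:  # priority iii
        return "employee"
    # Priority iv; group 2 or 3 employee relations (priority v) lose here.
    return "self-employed"
```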

5 More specifically, what in Norwegian is denoted "kontantlønn".

1.7 Part 3 of the micro-integration: classification of employment status

A person is classified as employed if the most important work relation⁶ is

I from group 1 or 4. Then the person is also classified as employee.

II a self-employment relation and in addition we require that the person’s income from self-employment is higher than the self-employment cut-off value. Then the person is also classified as self-employed. The cut-off takes the value where the number of self-employed is equal to the number of self-employed estimated by the quarterly Labour Force Survey (LFS).

III a self-employment relation (with income lower than the self-employment cut-off value), and where the person also has a most important employee relation from group 1. Then the person is also classified as employee.

IV an employee relation from group 2 or 3, and in addition we require that the total annual wage exceeds the employee cut-off value. Then the person is also classified as employee. The employee cut-off value is chosen as the value where the total number of employees (from I, III, and IV) equals the number of employees according to the quarterly Labour Force Survey (LFS).
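Rules I to IV and the calibration of a cut-off value can be sketched as follows. All names are hypothetical, and two simplifications are assumptions of this sketch: an income exactly equal to the cut-off is treated as "lower", and the cut-off is chosen from the observed values so that the number of values strictly above it equals the LFS benchmark (ties at the cut-off would make the count smaller).

```python
from typing import List, Optional


def classify_employment(
    kind: str,
    employee_group: Optional[int],
    self_income: float,
    annual_wage: float,
    self_cutoff: float,
    employee_cutoff: float,
) -> Optional[str]:
    """Return 'employee', 'self-employed' or None (not employed).

    kind is the type of the most important work relation ('employee' or
    'self-employed'); employee_group is the group of the person's most
    important employee relation, or None if there is none.
    """
    if kind == "employee":
        if employee_group in (1, 4):                       # rule I
            return "employee"
        if employee_group in (2, 3) and annual_wage > employee_cutoff:
            return "employee"                              # rule IV
        return None
    if kind == "self-employed":
        if self_income > self_cutoff:                      # rule II
            return "self-employed"
        if employee_group == 1:                            # rule III
            return "employee"
    return None


def calibrate_cutoff(values: List[float], target_count: int) -> float:
    """Cut-off such that `target_count` values lie strictly above it."""
    ranked = sorted(values, reverse=True)
    if target_count >= len(ranked):
        return float("-inf")   # every value exceeds the cut-off
    return ranked[target_count]
```

In practice the employee cut-off would be calibrated on the group 2 and 3 wages after subtracting the employees already found by rules I and III from the LFS benchmark; that bookkeeping is left out here.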

1.8 Life cycle of register process from a quality perspective

Figure 1.2 shows the life cycle of the register process from a quality perspective suggested by Bakker (2010). The process follows two lines, a measurement line and a representation line, and each administrative register involved in a micro-integration follows a process along these two lines. We will relate the register-based employment statistics process to this life cycle, and we will focus on the employment register ER, which is by far the most important register in the process.

6 Notice: this means that the relation is active during the reference week, otherwise it would not be the most important work relation.

The first part of the representation line and measurement line will be denoted the single source part, and the second part will be referred to as the micro-integration part of the life cycle. The single source part covers the life cycle of each administrative register until it is considered for micro-integration.

The single source part of the representation line starts with the specification of the target population. For ER we can, for simplicity, state the target population as all jobs making the employee eligible for receiving benefit in case of sickness absence. The next stage, 'registered population elements', is then all jobs scheduled both for more than four hours a week and to last more than 14 days. The latter limitation does not imply less coverage of the target population⁷. However, the 'four hours a week' limitation is a practical limitation which implies a deviation between the target population and registered population elements. Another deviation comes from the difference between what should be and what is registered population elements, and is due to e.g. delayed registrations from the employers. These deviations are examples of the coverage error.

The registered elements from each administrative register involved are linked to each other at an appropriate unit level, a process that can create linking errors. In our case, the registers are linked partly at the job level, but mostly at the person level. The rest of the representation line is the micro-integration part, which we will return to below.

At the single source part of the measurement line, we start with a definition of the target concept, denoted administrative concept. In ER, 'employment' is taken care of during the representation line since the presence of an element (job) means that the person attached to the element (employee) is employed according to ER. Concepts left for the measurement line are then the necessary additional concept 'person identification' and other additional concepts such as the magnitude of the job, the identification of the employer etc. At the operationalisation stage these concepts have been operationalised into scheduled working hours, date of start of job, identification number of the employee and employer of the job. This operationalisation may cause validity errors.

The next stage after the operationalisation is the response, and here measurement errors can appear for all variables involved. As an example, the person id or number of scheduled working hours can be wrong either from the company payroll system or from the registration process in ER after the message of a new job has been sent from the payroll system of the company. The rest of the measurement line is the micro-integration part.

7 Sickness absence benefit is only paid by the National Labour and Welfare Administration for sickness absence after 14 days of absence. For the first 14 days, the employer will provide the sickness absence benefit.

When we enter the micro-integration part of the life cycle, there has in our case in fact been a switch between the representation line and the measurement line because of the change of elements from jobs to persons. Then 'person identification' belongs to the representation line, whereas 'employment' belongs to the measurement line since the new element 'person' carries no information of employment.

For the representation line of the micro-integration part, the first stage is linked population elements, where the person information from all the relevant sources is linked, in our case by a person id number. If linking had to be done by other means, we would have to consider the linking errors. The final stage is post-linking corrections, and here correction errors can be made.

For the measurement line of the micro-integration part, the next stage after the response is the corrected response for the statistical concept, which in our case is the final register-based employment status after comparing the employment information from all the linked registers and doing the final classification, as described in Sections 1.4 to 1.7.

The target populations and target concepts are typically different in different administrative registers, and the target population and concept for the micro-integrated register can be different from what we find in the administrative registers. In our case, the target population of the micro-integrated register is all inhabitants of Norway between 15 and 74 years old, which as we have seen differs from that of ER, which was jobs of a certain kind. The target concept is 'employed' according to the ILO definition, which is not the target concept in any of the administrative registers.

The change of target concept and target population can be incorporated within the Figure 1.2 life cycle by first considering this life cycle for each administrative register only, in which case the last stages are not micro-integration stages. Then the register outcome from each of these administrative registers is used as input into a similar second-phase life cycle which concerns the micro-integration process, starting with target concept and target population.

Figure 1.2. The life cycle suggested by Bakker (2010; Figure 5, page 74).

1.8.1 Alternative life cycle formulation

Zhang (2011a) has described a two-phase life cycle which also explicitly allows for the switch of concepts between the measurement line and representation line that we experience in our case of register-based employment, cf. Figure 1.3. The first phase covers the life cycle of each administrative register involved, and the second describes that of micro-integration.

Along the representation line of the first phase, the first stage is target set (of objects), where 'objects' is used instead of 'units' and 'set' instead of 'population' to distinguish from the second phase describing the life cycle of the micro-integration. After the target set there is a new stage, 'accessible set', not present in Figure 1.2, being in our case the set of jobs which ER is supposed to contain, e.g. jobs scheduled for more than four hours a week and more than 14 days. Errors occurring between target and accessible set are frame errors. The next stage is 'accessed set', which is identical to the 'Registered population elements' in Figure 1.2. On the way to the accessed set, there is the possibility of making selection errors, in our case when jobs have not been registered because of e.g. delayed registration. The next stage is 'observed/validated set', being in our case the final set of jobs that later will be input to the second phase. The errors made when producing the observed/validated set are 'missing/redundancy' errors, which means errors due to e.g. partial non-response. For ER this could mean that a job was erroneously considered to be just a duplication of another job.

Along the measurement line of phase 1, the stages are similar to those of Figure 1.2.

For the 2nd phase, representation line, we start with the target population, persons 15-74 years old. For the measurement line we start with the target concept, which is employment according to the ILO definition. We see that the variables/concepts on the measurement line and objects on the representation line from the 1st phase are opposite to the concept and unit that we are considering in the 2nd phase. Therefore, the 2nd phase life cycle has a stage of transformation where we can make this switch. During this transformation, the inputs from the representation line and measurement line of the 1st phase are transformed to the measurement line and representation line, respectively, when they are included in the 2nd phase. In our case, the job objects are transformed into the variable 'employment', and the variable 'person id' is transformed into the unit 'person id'.

For the measurement part of phase 2, the stage after the target concept is harmonisation, in which we compare the transformed employment concepts (from objects) of the different single sources (administrative registers) of phase 1. Here we identify that the transformed concept from ER is narrower than the target employment concept since smaller jobs are excluded and only employees are included. We then decrease the gap by e.g. including 'income from business activity (including farming, forestry and fishing)' from the Tax Register, but this latter concept is too wide since we then do not restrict ourselves to income earned during the reference week. The same applies to 'salary from a job' from the Wage Sum Register.

After harmonisation we have reached the classification stage, where we use the knowledge from the harmonisation stage to create a set of classification rules that assign an employment status to each person based on his/her registered information from all the single sources combined (Section 1.7, cf. also Sections 1.4 to 1.6). The errors made at this stage are denoted mapping errors.

Finally we have the adjustment stage, which in our case is the part of the classification where a cut-off value of income and salary is set to obtain the LFS employment level (Section 1.7). The errors made here are compatibility errors.

For the representation part of the 2nd phase, the stage after the target population (persons 15-74 years old) is data linkage, where linkable units from all the single sources (administrative registers) are gathered. In our case this stage is skipped⁸ and linkable units are linked directly, but the set of all linkable units (on person level) in our case basically is all persons with an active job on the reference date in ER, the persons with salary from job(s) according to the Wage Sum Register, the persons with income from business activity (including farming, forestry and fishing) from the Tax Register, all persons 15-74 years old from the Central Population Register (CPR) etc. Of these persons, only persons from CPR are considered. This set of units will not cover the target population completely, and this error is the coverage error.

The alignment stage considers how the linkable units are related to the target population. In our case this is a trivial discussion, but if the target population were households and the linkable units were persons, some alignment would be necessary (Zhang 2011a). This stage resembles harmonisation, but with units considered instead of the measurement line variable.

After the alignment, the statistical units are produced. Here again, this is non-trivial in the case where the target population is households; then unit errors would be made whenever there is a false match between linkable units (persons) and the statistical units (households).

Acknowledgments

Thanks to Li-Chun Zhang for very valuable discussions. Thanks also to Inge Aukrust and Tonje Køber for discussions on details in the micro-integration process, and to Helge Næsheim and Ole Villund for comments.

8 Since the next stage, 'alignment', turns out to be trivial.

Figure 1.3. The two-phase life-cycle suggested by Zhang (2011a).

Chapter 2
The approach to quality evaluation of the microintegrated employment statistics¹

Johan Fosen, Li-Chun Zhang

Statistics Norway

2.1 Introduction

The register-based employment statistics of Norway are disseminated annually by Statistics Norway. The employment variable measures whether or not a person is employed during the third week of November, and is constructed for each person in the target population by means of micro-integration. During the micro-integration, several administrative registers are linked at micro level and the information is harmonised before the integration ends in a classification of a person as employed or not employed. The micro-integration is described in Chapter 1.

We shall evaluate the accuracy of register-based employment statistics, REG-employment, for small areas such as municipalities, of which there are about 430 in Norway, more than half of which have fewer than 5000 inhabitants. In the terminology of Zhang (2011b), the REG-employment data can be considered as surrogate data for a traditional census, and in Norway the traditional census was replaced by REG-employment for Census 2001. In the presence of a traditional census, the census would then be target data, since it is a natural target or "benchmark" for evaluating the quality of REG-employment. In the absence of a traditional census, the Labour Force Survey (LFS) is a natural target data source for the REG-employment data. We will compare the mean squared error (MSE) of REG-employment, MSE-REG, and of LFS-employment, MSE-LFS. Our approach does not require linkage of the two data sources at the individual level. This can be useful in situations where such individual-level linkages are either impossible, difficult, or prohibited by law.

1 As a part of the ESSnet Data Integration project, a slightly shorter version of this chapter was presented at the ISI 2011 conference in Dublin, with the title 'Quality evaluation of employment status in register-based census'.

The main source of MSE for the REG-employment proportion is bias, and for simplicity we will assume no variance at all. For LFS we will assume no non-sampling error such as nonresponse error. Then, the contribution to the MSE of the LFS-employment proportion comes only from variance, and only from bias in the case of the REG-employment proportion. For short we will just write LFS-variance and REG-bias, being the LFS-MSE and the square root of REG-MSE respectively. As opposed to the LFS-variance, the REG-bias is not by nature a function of the sample size $n$. At the national level, the squared REG-bias is expected to be higher than the LFS-variance. However, if we consider smaller and smaller regions, the LFS-variance can be expected to eventually dominate over the REG-bias.

In Section 2.2 we estimate LFS-MSE using a smoothing method. The estimation of MSE-REG is the subject of Section 2.3, where we will use small area modelling. In Section 2.4 we look at the results of the MSE comparison for the two models, using two different approaches. Finally, in Section 2.5 we look at some properties of the bias estimators and suggest some improvements.

2.2 The estimated standard deviation of the LFS-proportion

Let $i$ denote the municipality, and $Y_i$ the number of employed according to LFS in this subpopulation. We consider the simple sample average $\bar{y}_i$ as the LFS-employment proportion estimate. The standard deviation of $\bar{y}_i$ is given by $\sqrt{\psi_i} = \sqrt{\theta_i(1-\theta_i)/n_i}$, where the parameter $\theta_i$ is the true employment proportion. Due to small municipalities, the estimates $\hat{\theta}_i$ have a large variation and then also the direct estimator $\mathrm{sd}_{\mathrm{dir}}(\bar{y}_i) = \sqrt{\hat{\theta}_i(1-\hat{\theta}_i)/n_i}$. Instead we use the generalised variance function (GVF)

$$\mathrm{sd}_{\mathrm{GVF}}(\bar{y}_i) = e^{-0.749}\, n_i^{-1.030/2}, \tag{2.1}$$

found by regressing $\log[\mathrm{sd}_{\mathrm{dir}}(\bar{y}_i)]$ onto $\log(\sqrt{n_i})$.

Figure 2.1 shows the result. We see that the GVF interpolates the data well, and is somewhat better than the upper limit

$$\mathrm{sd}_{\mathrm{upper}}(\bar{y}_i) = \sqrt{\frac{0.5\,(1-0.5)}{n_i}}.$$

Another alternative to GVF, less extreme than the upper limit, could have been

$$\mathrm{sd}_{\mathrm{mean}}(\bar{y}_i) = \sqrt{\frac{\bar{\theta}\,(1-\bar{\theta})}{n_i}},$$

where $\bar{\theta}$ is the average employment proportion over the $m$ municipalities.
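The three standard deviation functions of this section can be written out as a short Python sketch. The coefficients $-0.749$ and $-1.030$ are those of (2.1); the function names and the example value of the average employment proportion are illustrative assumptions only.

```python
import math


def sd_gvf(n: float) -> float:
    """GVF standard deviation (2.1): exp(-0.749) * n^(-1.030/2)."""
    return math.exp(-0.749) * n ** (-1.030 / 2)


def sd_upper(n: float) -> float:
    """Upper limit, obtained with employment proportion 0.5."""
    return math.sqrt(0.5 * (1 - 0.5) / n)


def sd_mean(n: float, theta_bar: float) -> float:
    """Alternative based on an average employment proportion theta_bar."""
    return math.sqrt(theta_bar * (1 - theta_bar) / n)


if __name__ == "__main__":
    # theta_bar = 0.7 is a made-up value used only to show the call.
    for n in (2, 10, 33, 100, 1000):
        print(n, round(sd_gvf(n), 4), round(sd_upper(n), 4),
              round(sd_mean(n, 0.7), 4))
```

Since $e^{-0.749} < 0.5$ and the GVF exponent is slightly steeper than $-1/2$, the GVF curve lies below the upper limit for every sample size $n \ge 1$.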

Figure 2.1. The standard deviation of $\bar{y}_i$ against the square root of the LFS sample size. Direct estimates as filled circles, the upper limit function as dashed line and the GVF function as solid line.

2.3 Modelling the bias of the register-based statistics

We can write

$$\bar{y}_i = \theta_i + e_i, \quad i = 1, 2, \ldots, m, \tag{2.2}$$

where $e_i = \bar{y}_i - \theta_i$. We notice that $\mathrm{Var}(e_i) = \mathrm{Var}(\bar{y}_i) = \psi_i$. Let $Z_i$ be the number of employed according to REG in municipality $i$. We assume the model

$$\bar{Z}_i = \theta_i + b_i, \quad i = 1, 2, \ldots, m, \tag{2.3}$$

where $\bar{Z}_i$ is the employment proportion, $b_i$ is the bias of $\bar{Z}_i$, and $m$ is the number of municipalities in the data set. With $X_i = \bar{Z}_i - \bar{y}_i$ we have, when combining (2.2) and (2.3),

$$X_i = b_i + \tilde{e}_i, \tag{2.4}$$

where $\tilde{e}_i = -e_i$. Since $\tilde{e}_i$ has the same expectation (zero) and variance ($\psi_i$) as $e_i$, we write $e_i$ for this error term in the models below. The directly observed bias is $\hat{b}_{i,\mathrm{naive}} = X_i$, which is not a suitable estimator for small municipalities due to the large variance of $\bar{y}_i$. We assume the bias to follow a linear model

$$b_i = \beta + v_i, \tag{2.5}$$

of the underlying bias $\beta$ and the random variable $v_i$ representing the unexplained variation between the biases. When not otherwise specified, we will by bias refer to $b_i$. Putting (2.4) and (2.5) together we then have the linear mixed model

$$X_i = u_i^T \beta + v_i + e_i, \quad u_i = (1, \ldots, 1)^T, \tag{2.6}$$

where the observation $X_i$, the covariate matrix $u_i^T$, and the error term $e_i$ have been aggregated from level 1 (person) to level 2 (area). This linear mixed model is known as an area level model (cf. Rao 2003).

2.3.1 An alternative model

For each person, we have register information about the register source for being classified as employed or not employed (cf. Chapter 1). Based on prior knowledge on the quality of different register sources, we can divide the population into a high-quality group 1, and a group 2 containing the rest of the population.

We use subscript $g$ to denote group. Thus, $Z_{gi}$ and $Y_{gi}$ are the number of REG-employed and LFS-employed in group $g$ in municipality $i$. The number of persons in this group is $N_{gi}$, and the proportions being REG-employed and LFS-employed are $\bar{Z}_{gi}$ and $\bar{Y}_{gi}$.

For group 1 we have two special properties: firstly, it contains only persons being classified as employed, i.e. $Z_{1i} = N_{1i}$ and thus $\bar{Z}_{1i} = 1$. Secondly, we assume no bias in this group: the group contains only persons being employed according to LFS. Then we can estimate $Y_{1i}$ by

$$\hat{Y}_{1i} = Z_{1i}, \tag{2.7}$$

and further, model (2.6) for group 1 reduces to $X_{1i} = e_{1i}$.

For group 2, we assume model (2.6), i.e.

$$X_{2i} = \beta_2 + v_{2i} + e_{2i}. \tag{2.8}$$

We then have

$$X_i = \bar{Z}_i - \bar{Y}_i = \frac{N_{1i}}{N_i} + \bar{Z}_{2i}\frac{N_{2i}}{N_i} - \bar{Y}_{1i}\frac{N_{1i}}{N_i} - \bar{Y}_{2i}\frac{N_{2i}}{N_i}, \tag{2.9}$$

which simplifies into

$$X_i = \frac{N_{1i}}{N_i} - \bar{Y}_{1i}\frac{N_{1i}}{N_i} + X_{2i}\frac{N_{2i}}{N_i}, \tag{2.10}$$

since $X_{2i} = \bar{Z}_{2i} - \bar{Y}_{2i}$.

We model $\bar{Z}_{2i}$ in the same way as before and let $\bar{Z}_{2i} = \bar{Y}_{2i} + b_{2i}$, where $b_{2i} = \beta_2 + v_{2i}$, and $v_{2i}$ is a random effect at the municipality level. Using (2.7), an estimator of $\bar{Y}_{2i}$ is given by

$$\hat{\bar{Y}}_{2i} = (\hat{Y}_i - Z_{1i})/N_{2i} = (N_i \bar{y}_i - Z_{1i})/N_{2i}, \tag{2.11}$$

where $\bar{y}_i$ is the LFS-employment level in the sample and $\hat{Y}_i = N_i \bar{y}_i$. If we add and subtract $\bar{Y}_{2i}$, we get

$$\hat{\bar{Y}}_{2i} = \bar{Y}_{2i} + \frac{\hat{Y}_i - Y_{2i} - Z_{1i}}{N_{2i}} = \bar{Y}_{2i} + \frac{\hat{Y}_i - Y_i}{N_{2i}} = \bar{Y}_{2i} + e_{2i}, \tag{2.12}$$

where $e_{2i}$ is the associated sampling error $(\hat{Y}_i - Y_i)/N_{2i}$, which is our best estimate since we are unable to identify the groups within LFS. The expected value of $e_{2i}$ is zero and the variance is $V(\bar{y}_i)(N_i/N_{2i})^2$.

From (2.11) we now have

$$\bar{Z}_{2i} - \hat{\bar{Y}}_{2i} = (Z_i - Z_{1i})/N_{2i} - (\hat{Y}_i - Z_{1i})/N_{2i} = (Z_i - N_i\bar{y}_i)/N_{2i} = (\bar{Z}_i - \bar{y}_i)(N_i/N_{2i}) = x_i (N_i/N_{2i}), \tag{2.13}$$

where we notice that $x_i$ is the observed difference between the REG-employment proportion in the municipality population and the LFS-employment proportion in the sample.

Similarly as for (2.4), we now write

$$\bar{Z}_{2i} - \hat{\bar{Y}}_{2i} = b_{2i} - e_{2i}, \tag{2.14}$$

which we insert into the left-hand side of (2.13). Then we have the following linear mixed model

$$x_i = \frac{N_{2i}}{N_i} b_{2i} - \frac{N_{2i}}{N_i} e_{2i} = \frac{N_{2i}}{N_i} b_{2i} + \varepsilon_i, \tag{2.15}$$

where

$$E(b_{2i}) = \beta_2 \quad\text{and}\quad V(b_{2i}) = \sigma_{v2}^2$$

and

$$E(\varepsilon_i) = 0 \quad\text{and}\quad V(\varepsilon_i) = V(\bar{y}_i).$$

Compared to the model of $x_i$ earlier, the knowledge of $Z_{1i} = N_{1i} = Y_{1i}$ is incorporated as a shrinkage factor (i.e. always between 0 and 1) of the random effect $b_{2i}$. This corresponds with our intuition: the assumption of no bias in group 1 implies a smaller bias in the complete municipality $i$ than in its subset belonging to group 2.

We assume the linear relation $b_{2i} = \beta_2 + v_{2i}$, similar to (2.5). Then we can write (2.15) as

$$x_i = u_i\beta_2 + c_i v_{2i} + \varepsilon_i, \tag{2.16}$$

where $u_i = c_i = N_{2i}/N_i$ and $b_i = u_i\beta_2 + c_i v_{2i}$. This linear mixed model is an alternative to (2.6).
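The identity (2.13) and the shrinkage factor $N_{2i}/N_i$ can be checked numerically with a few lines of Python. The municipality figures used in the usage note are invented for illustration and do not come from the data.

```python
from typing import Tuple


def group2_difference(N: int, N1: int, Z: int, y_bar: float) -> Tuple[float, float, float]:
    """Return (lhs, rhs, shrinkage) for identity (2.13).

    N, N1  : persons in the municipality and in group 1
    Z      : number of REG-employed in the municipality
    y_bar  : LFS-employment proportion in the sample
    """
    N2 = N - N1
    Z1 = N1                                # group 1: all are REG-employed
    Z2_bar = (Z - Z1) / N2                 # REG proportion in group 2
    Y2_bar_hat = (N * y_bar - Z1) / N2     # estimator (2.11)
    x = Z / N - y_bar                      # observed difference x_i
    lhs = Z2_bar - Y2_bar_hat
    rhs = x * N / N2
    return lhs, rhs, N2 / N
```

For example, `group2_difference(1000, 600, 680, 0.66)` gives a group 2 difference of about 0.05, while the municipality-level difference is 0.02; the ratio between the two is the factor $N_{2i}/N_i = 0.4$.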

2.3.2 Fitting the multilevel model

We assume $v_i$ and $e_i$ to be independent with expectation zero. In the case where the distribution of $e_i$ is known, we have a special case of a basic Type A area level small area model of Rao (2003, Chapter 5), which can be written as

$$\phi_i(X) = u_i^T\beta + c_i v_i + e_i.$$

We identify our models (2.6) and (2.16) as special cases, with $\phi(\cdot) = I(\cdot)$. For the simpler model (2.6) we have $c_i = u_i = 1$, whereas $c_i = u_i = N_{2i}/N_i$ under model (2.16).

We let $\psi_i$ denote the variance of $e_i$, i.e. of $\bar{y}_i$, and we assume it known and given by the square of (2.1). We want to estimate the bias

$$b_i = u_i\beta + c_i v_i, \tag{2.17}$$

where $u_i\beta$ is the underlying bias and $c_i v_i$ is the unexplained variation between the areas. The best linear unbiased prediction (BLUP) estimator² is given by $u_i\hat{\beta} + c_i\hat{v}_i$, and for our model the BLUP estimator becomes (Rao 2003; Section 7.1.1)

$$\hat{b}_i = \hat{X}_i = \gamma_i X_i + (1-\gamma_i)\, u_i\hat{\beta}, \tag{2.18}$$

where

$$\gamma_i = \frac{\sigma_v^2 c_i^2}{\psi_i + \sigma_v^2 c_i^2} \quad\text{and}\quad \hat{\beta} = \sum_i \frac{u_i X_i}{\psi_i + \sigma_v^2 c_i^2}\left[\sum_i \frac{u_i^2}{\psi_i + \sigma_v^2 c_i^2}\right]^{-1}. \tag{2.19}$$

The BLUP is a weighted average of the directly observed difference $X_i$ and the model-induced bias $u_i\hat{\beta}$. The relative trust $\gamma_i$ put in the direct observation equals the proportion of the variance being between-area variation $\sigma_v^2 c_i^2$. For areas with few observations $n_i$, the sampling variance, i.e. the within-area variation $\psi_i$, is large, hence $\gamma_i$ is small and little trust is put into $X_i$. This is intuitively very reasonable: remember that $X_i$ is the difference between the register-based employment and the LFS employment. So when LFS is very uncertain, little trust should be put into the LFS estimate. On the other hand, when $n_i$ is large, causing LFS to be more precise, we get larger $\gamma_i$ and then more weight should be put on $X_i$. When $n_i \to \infty$, we will have $\gamma_i \to 1$ and BLUP equals the direct observation $X_i$. By inserting the estimated variances $\hat{\psi}_i$ and $\hat{\sigma}_v^2$ into (2.18), we get the EBLUP estimator.

2 'Estimator' is now commonly used instead of 'predictor' about BLUP. The term 'prediction' in BLUP originated because BLUP estimates the values of the random variables $v_i$, but these values are realised and therefore BLUP is today usually referred to as estimator instead of predictor.
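Equations (2.18) and (2.19) translate directly into code. The sketch below uses plain Python lists; the function name and argument names are hypothetical, and `sigma2_v` is taken as given (its estimation is a separate step).

```python
from typing import List, Sequence, Tuple


def blup(
    X: Sequence[float],
    psi: Sequence[float],
    u: Sequence[float],
    c: Sequence[float],
    sigma2_v: float,
) -> Tuple[float, List[float], List[float]]:
    """Return (beta_hat, gamma, b_hat) following (2.18) and (2.19).

    X   : observed differences X_i
    psi : sampling variances psi_i (assumed known)
    u, c: covariate and random-effect multipliers u_i and c_i
    """
    # Inverse total variance 1 / (psi_i + sigma_v^2 c_i^2) for each area.
    w = [1.0 / (p + sigma2_v * ci ** 2) for p, ci in zip(psi, c)]
    beta_hat = (
        sum(ui * xi * wi for ui, xi, wi in zip(u, X, w))
        / sum(ui ** 2 * wi for ui, wi in zip(u, w))
    )
    gamma = [sigma2_v * ci ** 2 * wi for ci, wi in zip(c, w)]
    b_hat = [
        g * xi + (1.0 - g) * ui * beta_hat
        for g, xi, ui in zip(gamma, X, u)
    ]
    return beta_hat, gamma, b_hat
```

With `u = c = [1, ...]` this corresponds to model (2.6), and with `u = c = [N2i / Ni, ...]` to model (2.16). Areas with large `psi` get a small `gamma` and are pulled strongly towards `u_i * beta_hat`.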

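The estimation of $\sigma_v^2$ is done using the iterative Fay-Herriot method suggested in Rao (2003; Section 7.1.2), where the $(a+1)$-th iteration is

$$\sigma_v^{2(a+1)} = \sigma_v^{2(a)} + \frac{m - p - h\left(\sigma_v^{2(a)}\right)}{h'\left(\sigma_v^{2(a)}\right)}, \tag{2.20}$$

where

$$h\left(\sigma_v^{2(a)}\right) = \sum_i \left(X_i - u_i\hat{\beta}\right)^2 \left(\psi_i + \sigma_v^{2(a)} c_i^2\right)^{-1}$$

and

$$h'\left(\sigma_v^{2(a)}\right) = -\sum_i c_i^2 \left(X_i - u_i\hat{\beta}\right)^2 \left(\psi_i + \sigma_v^{2(a)} c_i^2\right)^{-2}.$$

Usually 10 iterations are sufficient for convergence of the algorithm.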
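A minimal sketch of iteration (2.20) is shown below. Several details are assumptions of this sketch rather than statements from the text: $p$ is taken to be the number of regression coefficients (here $p = 1$), the iteration starts at zero, $\hat{\beta}$ is recomputed by weighted least squares as in (2.19) at every step, and negative iterates are truncated to zero.

```python
from typing import Sequence, Tuple


def wls_beta(X, psi, u, c, sigma2_v: float) -> float:
    """Weighted least squares estimate of beta, as in (2.19)."""
    w = [1.0 / (p + sigma2_v * ci ** 2) for p, ci in zip(psi, c)]
    num = sum(ui * xi * wi for ui, xi, wi in zip(u, X, w))
    den = sum(ui ** 2 * wi for ui, wi in zip(u, w))
    return num / den


def fay_herriot_sigma2(
    X: Sequence[float],
    psi: Sequence[float],
    u: Sequence[float],
    c: Sequence[float],
    p: int = 1,
    iterations: int = 10,
    start: float = 0.0,
) -> Tuple[float, float]:
    """Iterate (2.20) and return (sigma2_v, beta_hat)."""
    m = len(X)
    sigma2 = start
    for _ in range(iterations):
        beta = wls_beta(X, psi, u, c, sigma2)
        total = [ps + sigma2 * ci ** 2 for ps, ci in zip(psi, c)]
        resid2 = [(xi - ui * beta) ** 2 for xi, ui in zip(X, u)]
        h = sum(r / t for r, t in zip(resid2, total))
        h_prime = -sum(ci ** 2 * r / t ** 2
                       for ci, r, t in zip(c, resid2, total))
        if h_prime == 0.0:
            break  # all residuals are zero; nothing left to update
        # Newton-type step of (2.20); truncation at zero is an assumption.
        sigma2 = max(0.0, sigma2 + (m - p - h) / h_prime)
    return sigma2, wls_beta(X, psi, u, c, sigma2)
```

In the special case $u_i = c_i = 1$ with a common $\psi$, the fixed point of the iteration is the sample variance of the $X_i$ minus $\psi$, which gives a simple way to check an implementation.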

2.4 Results

When fitting the multilevel models (2.6) and (2.16) we use the data set of all municipalities where the LFS net sample size is at least two persons.

2.4.1 Parameter estimates

For model (2.6) we have $u_i = c_i = 1$, and the parameter estimates become $\hat{\beta} = 0.00302$ and $\hat{\sigma}_v = 0.03825$. Since this model has no covariates, the underlying bias $u_i\hat{\beta}$ of the bias (2.17) is $\hat{\beta} = 0.00302$.

For the heterogeneity model (2.16) we have $u_i = c_i = N_{2i}/N_i$. We use superscript $H$ to distinguish from model (2.6), and get $\hat{\beta}_2^H = 0.00914$ and $\hat{\sigma}_{v2}^H = 0.09228$. The estimated underlying bias $\hat{\beta}_2^H u_i$ of the bias (2.17) is on average 0.0039 (median 0.0038), thus about 30 percent higher than the underlying bias of model (2.6) above.

2.4.2 Comparison of REG-employment and LFS-employment using the distribution of the underlying bias estimator

The small area model assumes that the underlying REG-bias $u_i\beta$ (as well as the single biases $b_i$) does not depend on sample size. However, the LFS-variance is decreasing with sample size. If we decrease the sample size, we expect at some point that REG-employment outperforms LFS-employment in terms of MSE. For model (2.6) this is illustrated in the left-hand panel of Figure 2.2 and represents one approach to MSE comparisons. The estimated underlying REG-bias $u_i\hat{\beta} = \hat{\beta}$ is positive and its 95 percent confidence interval is also a 95 percent confidence interval of the square root of the REG-MSE. On the other hand, the estimated LFS-employment standard deviation is the estimated square root of LFS-MSE. We see that when log sample size is less than 3.5, i.e. sample size less than 33, we can be 95 percent certain that REG-employment is better, since the MSE of LFS-employment is then higher than the 95 percent confidence interval of the MSE of REG-employment. For larger sample sizes, we cannot conclude one way or the other from this plot.

Figure 2.2. Square root of MSE for LFS-employment (dashed line), upper 95% confidence interval for square root of MSE for REG-employment (dotted lines), and the estimated underlying REG-bias $u_i\hat{\beta}$ (solid line); lower confidence interval truncated to zero. Model (2.6) in left panel and model (2.16) in right panel.

For model (2.16) we see in the right panel of Figure 2.2 that when $\log(n_i) < 3.1$, i.e. sample size up to 22 and municipality size roughly below 3500, we can conclude that REG-employment has a lower MSE than LFS-employment. For $3.1 < \log(n_i) < 3.8$, i.e. sample size between 22 and 44 (municipality size roughly below 6500), we can draw this conclusion for some of the municipalities, whereas for larger sizes this way of comparison is inconclusive. For model (2.6) we remember that $\log(n_i) < 3.5$ makes REG better, otherwise this comparison is inconclusive.
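The conversion between the log sample size thresholds and sample sizes, and the sample size at which the GVF curve (2.1) crosses a given bound, can be sketched as follows. The bound itself has to be read off the upper confidence limit in Figure 2.2; no numerical value is given in the text, so the function takes it as an input. Function and variable names are illustrative.

```python
import math

GVF_INTERCEPT = -0.749      # from (2.1)
GVF_SLOPE = -1.030 / 2      # exponent of n in (2.1)


def sd_gvf(n: float) -> float:
    """GVF standard deviation of the LFS proportion, equation (2.1)."""
    return math.exp(GVF_INTERCEPT) * n ** GVF_SLOPE


def crossover_sample_size(sqrt_mse_reg_upper: float) -> float:
    """Sample size n at which sd_gvf(n) equals the given bound.

    For smaller n the LFS standard deviation exceeds the bound, so the
    comparison favours REG-employment.
    """
    return (sqrt_mse_reg_upper / math.exp(GVF_INTERCEPT)) ** (1.0 / GVF_SLOPE)


# Natural-log thresholds quoted in the text, converted to sample sizes.
thresholds = {log_n: math.exp(log_n) for log_n in (3.1, 3.5, 3.8)}
```

The converted thresholds are roughly 22, 33 and 44 persons, in line with the sample sizes quoted above.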

2.4.3 Comparison of REG-employment and LFS-employment using the EBLUP estimator

We now compare the individual municipality EBLUP estimates of REG-bias against the LFS GVF function. The left panel of Figure 2.3 shows that when we use the EBLUP estimator $\hat{b}_i$ of model (2.6) for the REG-employment bias, the REG-MSE is smaller than LFS-MSE for almost all the municipalities. In the right panel we see that for model (2.16) we have a similar pattern.

Figure 2.3. Results when fitting Type A model: the square root of MSE of REG-employment against that of LFS-employment. MSE of REG-employment based on EBLUP estimates. Left panel is model (2.6) and right panel is model (2.16).

2.5 Adjusting the EBLUP estimator

Under the assumed model, which in this section will be (2.6), the EBLUP estimator $b_i$ gives the best area specific estimates among all linear estimates, i.e. the best estimates for the municipalities when regarded one by one. According to the same model, the bias of REG-employment has expectation $\beta$ and variance $\sigma_v^2$. However, the EBLUPs in general do not possess this ensemble property, in which we often may be interested, as in the comparison of MSE between REG and LFS in Figure 2.3. The EBLUP estimator (2.18) shrinks $X_i$ towards the underlying common bias $\beta$, causing the empirical variance of $b_i$ to be smaller than $\sigma_v^2$, for which reason the problem is known as overshrinkage.
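A short derivation may help to see where the overshrinkage comes from. Equation (2.18) is not reproduced in this section, so the sketch below assumes the usual composite form of the EBLUP, with $\psi_i$ (a symbol introduced here) denoting the sampling variance of the direct estimator $X_i$, and with $\beta$ and $\gamma_i$ treated as known:

```latex
\begin{align*}
b_i &= \gamma_i X_i + (1-\gamma_i)\,\beta,
  \qquad \gamma_i = \frac{\sigma_v^2}{\sigma_v^2+\psi_i},\\
\operatorname{Var}(b_i) &= \gamma_i^2 \operatorname{Var}(X_i)
  = \gamma_i^2\,(\sigma_v^2+\psi_i)
  = \gamma_i\,\sigma_v^2 \;\le\; \sigma_v^2 .
\end{align*}
```

Under these assumptions the inequality is strict whenever $\psi_i > 0$, so the spread of the predicted biases falls short of the model variance $\sigma_v^2$.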

Assume that $b_i \sim N(\beta, \sigma_v^2)$. We can construct the overshrinkage-adjusted estimator $b_i^{*G}$ by first sorting the $b_i$ into $b_{(i)}$, $i = 1, \ldots, m$. Then we replace the $i$-th smallest $b_{(i)}$ of the $\{b_j\}$ by the $\frac{i}{m}$-quantile in the $N(\beta, \sigma_v^2)$-distribution, giving a new set of predicted biases $\{b_i^{*G}\}$ having the desired distribution.

Such a simultaneous approach has been considered by Zhang (2003). This algorithm limits the deviations $b_i^{*G} - b_i$ from the best area-specific estimate, by keeping the order of the municipality-estimates before and after the adjustment.
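The sorting-and-replacement steps described above can be sketched as follows. This is a minimal illustration, not the implementation used for the report: the function name is hypothetical, and because the $i/m$-quantile of a normal distribution is infinite for $i = m$, the sketch assumes the plotting position $(i - 0.5)/m$ instead.

```python
from statistics import NormalDist
from typing import List, Sequence


def gaussian_adjust(b: Sequence[float], beta: float, sigma_v: float) -> List[float]:
    """Replace the i-th smallest estimate by a quantile of N(beta, sigma_v^2).

    b       : area-specific bias estimates (e.g. EBLUPs), one per municipality
    beta    : common underlying bias
    sigma_v : standard deviation of the area effects
    The rank order of the municipalities is preserved.
    """
    m = len(b)
    dist = NormalDist(mu=beta, sigma=sigma_v)
    # Indices of the municipalities, ordered from smallest to largest estimate.
    order = sorted(range(m), key=lambda j: b[j])
    adjusted = [0.0] * m
    for rank, j in enumerate(order, start=1):
        p = (rank - 0.5) / m  # assumption: avoids the infinite quantile at p = 1
        adjusted[j] = dist.inv_cdf(p)
    return adjusted


example = gaussian_adjust([0.3, -0.1, 0.2], beta=0.0, sigma_v=1.0)
```

Because each estimate is replaced by the quantile matching its rank, the municipality with the smallest estimate before the adjustment still has the smallest estimate afterwards.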

Since the $b_i$ are over-shrunk towards the global expectation, the largest REG-MSEs are underestimated, which in such cases favours REG-employment in the comparison with LFS-MSE. After overshrinkage reduction, we see in Figure 2.4 that there are some more municipalities where LFS-MSE becomes smaller than REG-MSE, although the overall conclusion remains.

Figure 2.4. The square root of MSE of REG-employment against that of LFS-employment, for municipalities. REG-bias based on the Gaussian approach, with bias estimates $b_i^{*G}$. Model (2.6).

Figure 2.5 shows that the variation of $b_i$ increases with the municipality sample size. However, we have no reason to believe that the true variation should depend on the sample size in this way. From (2.18) we see that $\gamma_i$ is an increasing function in the municipality sample size. Therefore, to make the variation of $b_i$ increase less rapidly, we may use a transformed $\gamma_i$, such as $\gamma_i^{of} = (\gamma_i)^{1/\lambda}$. By selecting $\lambda = 2$, we get an estimator $b_i^{of}$ whose empirical variance $S^2_{b^{of}}$ becomes almost identical to the estimated variance $\sigma_v^2$; hence this approach also results in an overshrinkage-corrected estimator.

Figure 2.5. The bias of REG-employment based on the EBLUP estimates. Model (2.6).

A constrained empirical Bayesian (CEB) justification of this particular value $\lambda = 2$ was given by Spjøtvoll and Thomsen (1987). We notice that $\gamma_i^{of}$ is larger than $\gamma_i$ for all sample sizes, and greater emphasis is put on the direct estimator $X_i$ compared to (2.18) for all municipalities. Figure 2.6 illustrates the difference between $\gamma_i^{of}$ and $\gamma_i$.
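The effect of the transformed factor can be illustrated with a small sketch. It assumes, since (2.18) and (2.19) are not reproduced in this section, that the estimator is the composite $\gamma_i X_i + (1 - \gamma_i)\beta$ with $\gamma_i = \sigma_v^2 / (\sigma_v^2 + \psi_i)$; the symbol $\psi_i$ for the sampling variance of $X_i$, the function names and the numbers are all hypothetical.

```python
def shrinkage_factor(sigma_v2: float, psi: float) -> float:
    """Assumed form of gamma_i: share of the model variance in the total variance."""
    return sigma_v2 / (sigma_v2 + psi)


def adjusted_factor(gamma: float, lam: float = 2.0) -> float:
    """Overshrinkage-adjusted factor gamma^(1/lambda); lambda = 2 is the CEB choice."""
    return gamma ** (1.0 / lam)


def composite(x: float, beta: float, weight: float) -> float:
    """Weighted combination of the direct estimate x and the common bias beta."""
    return weight * x + (1.0 - weight) * beta


# Illustrative numbers only (not taken from the report).
sigma_v2, psi = 4.0, 12.0
gamma = shrinkage_factor(sigma_v2, psi)      # 0.25
gamma_of = adjusted_factor(gamma)            # 0.5, larger than gamma
b_eblup = composite(10.0, 2.0, gamma)        # 4.0
b_of = composite(10.0, 2.0, gamma_of)        # 6.0, closer to the direct estimate
# Variance of the adjusted estimator under the assumed model.
var_of = gamma_of ** 2 * (sigma_v2 + psi)    # equals sigma_v2
```

Under the assumed form, the variance of the adjusted estimator is $\gamma_i(\sigma_v^2 + \psi_i) = \sigma_v^2$, which is one way to read why $\lambda = 2$ restores the model variance.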

Figure 2.6. The parameter estimate $\gamma_i$ (2.19) (solid line), the overshrinkage-factor adjusted $\gamma_i^{of} = (\gamma_i)^{1/2}$ (dashed line).


Figure 2.7. Bias of REG-employment using EBLUP estimates (’1’) and using adjusted estimates (’2’), for the following methods: Gaussian-based overshrinkage-adjusted $b_i^{*G}$ (left panel), and constrained empirical Bayesian (CEB) overshrinkage-adjusted $b_i^{of}$ (right panel). Model (2.6).

Figure 2.7 shows that $b_i^{of}$ (right panel) is more uniformly distributed with regard to the municipality sample size than $b_i^{*G}$ (left panel).

We notice from the right panel of Figure 2.7 that $b_i^{of}$ hardly adjusts the EBLUP estimator $b_i$ for the 15–20 largest municipalities.

Figure 2.8. The square root of MSE of REG-employment against that of LFS-employment, for municipalities. REG-bias based on the CEB overshrinkage-adjusted bias estimate $b_i^{of}$. Model (2.6).

This is in contrast to $b_i^{*G}$, which clearly adjusts $b_i$ also for these municipalities, and even for the largest municipality, Oslo. Intuitively, we would want the adjustment for Oslo and other larger municipalities to be limited, since the best area specific estimates $b_i$ are more precise for these municipalities. For Oslo, $b_i$ takes as much as 96 percent of its value from the direct estimator and only 4 percent from the common $\beta$. The emphasis on the direct estimator is even stronger for $b_i^{of}$, which for Oslo takes 99 percent of its value from the direct estimate. Figure 2.7 reveals that the overshrinkage adjustment method $b_i^{*G}$ has a drawback in that $b_i$ is modified with no regard to the sample size of the municipalities. Meanwhile, in Figure 2.8 we see that the number of municipalities where REG-MSE exceeds LFS-MSE is approximately the same with the CEB overshrinkage adjustment method as with the approach underlying Figure 2.4.

2.6 Conclusions

We have described an approach for comparing register-based statistics with survey-based statistics that does not require linkage of the data across the sources on the individual level. Essentially, this comes down to the trade-off between the bias of the register data and the sampling variance of the survey data. Small area estimation techniques are used to estimate the bias of the register-based statistics. Adjustment for overshrinkage of the area-specific best estimates has been considered, which may affect the comparison. The methodology was illustrated using the Norwegian register-based employment and LFS data.


Chapter 3
Combining data from administrative sources and sample surveys; the single-variable case. Case study: Educational Attainment

Frank Linder, Dominique van Roon, Bart Bakker1

Statistics Netherlands

Preface

The introduction in the last few decades of administrative registers in various ESS countries offers new opportunities to National Statistical Institutes (NSIs) to develop register-based statistics. Although registers originally served administrative purposes, their quality often turns out to be sufficient for statistical purposes as well. As a modern way of data-collection, administrative registers can even replace the traditional sample surveys of the past, in particular if there is a proper equivalent for each sample variable in the register. On the other hand, administrative sources do not always provide sufficient coverage of the target population. In that case an NSI could consider retrieving statistical information for a variable from administrative registers as well as from sample surveys, and combining that information.

1 The authors would like to thank Vincent de Heij for his major contribution to section 3.4 and Appendix B.


There are several methodological issues involved when data from administrative sources and sample surveys are combined for the purpose of constructing one single variable. These will be discussed in this report referring to a special case study: the method of deriving educational attainment.

In the Netherlands, statistical information on educational attainment was for a long time exclusively the domain of the Labour Force Survey (LFS). The recent introduction of administrations on education allows the construction of a variable educational attainment that is substantially based on registers. A quality-check confirms that most of these sources are appropriate for this purpose. An important advantage of the new method is that in general, estimates on education level are more reliable than those exclusively based on the LFS, in particular when smaller populations are involved. As education registers do not cover the entire population, the LFS still plays an important part in filling the gaps. Most older citizens, for example, completed their education before these administrations were set up. For this part of the population, the use of the LFS data source is indispensable.

The innovative aspect in this micro-integration approach is that data from registers and sample surveys are combined to produce one single variable. Statistics Netherlands has a great deal of relevant experience with the combination of administrative sources and sample surveys. However, so far data for one single variable originated from either registers or sample surveys, never from both sources at the same time. With the new approach some thorny methodological issues need to be solved, such as the question of how to deal with out-of-date information on education and the weighting mechanism.

The proceedings of this case study are of practical value for those who consider describing a phenomenon for a population in statistical terms by combining data from registers (with partial coverage of the population and full coverage for some subpopulations) and sample surveys. For an NSI it is interesting that, alongside increasing the accuracy, the registers have the additional value of allowing severe cuts in the expenses for conducting a survey, in particular in the areas where registers already give sufficient coverage.

Keywords: Accuracy, administrative register, bias indicator, bootstrap-method, calibration, combined estimator, consistency, domain, educational attainment, framework for errors, Generalized Regression Estimator, micro-integration, micro-linkage, model selection, nonresponse, Not Missing At Random, sample survey, sampling design, scaling, selective loss, Social Statistical Database, (sub)population, variance indicator, weighting strategy


3.1 Introduction

In the last decade, combining data from administrative sources and surveys has become more and more customary in the field of social statistics in the Netherlands. It is a development for which the Dutch Virtual Census of 2001 was something of a pacesetter. The traditional population censuses with their national field enumeration came to an end due to their high costs and because citizens felt less inclined to participate. A much cheaper alternative was found in the virtual census, by using the Social Statistical Database as a key source (Linder, 2004, and Schulte Nordholt and Linder, 2007).

The Social Statistical Database (SSD) is a statistical framework that provides information on demographic and socio-economic issues, and is constructed by micro-linkage and micro-integration of administrative registers and household sample surveys. Micro-integration is applied in order to ensure coherence, consistency2 and completeness of the SSD-data (Linder, 2004).

A general rule for the SSD is that it gives priority to administrative sources, whenever these are available. Sample surveys are explored to compensate for information that is not (yet) in registers, which means that their role is mainly supplementary. Sample survey questionnaires tend to be shortened and in some cases surveys are even abolished3, as administrative registers are becoming more and more serious competitors for them. It is not difficult to understand this, as sample surveys are a relatively costly means of data-collecting, and they impose a high response burden on the interviewees. Moreover, sample survey results suffer from sampling errors (in particular for small subpopulations) and non-sampling errors (e.g. nonresponse). So sample surveys are in fact only needed where registers do not provide proper and accurate information.

Educational attainment is an example of a variable which at the time of the Virtual Census of 2001 was available in the Labour Force Survey (LFS) samples, but not in administrative sources accessible to Statistics Netherlands. So, when the Census 2001 asked for the level of educational attainment of employed persons in the labour market, employment data on employees and self-employed persons from the SSD jobs register had to be combined with data on education level from the LFS. Consistency between the SSD jobs register (sub)population totals and the LFS (sub)population totals in the census tables was achieved by using the method of Consistent Repeated Weighting; for more details see Bakker (2010). It is important to notice that with this method of combining data, one data source is fully responsible for the information content of one variable, whereas another data source is the sole provider of data for the other variable. In our example: on the one hand, education levels exclusively from the LFS and on the other hand, the jobs register to define the population of employed persons. This multi-variable use from different sources was until then the standard way of combining register and survey information at Statistics Netherlands.

2 In the present-day SSD the requirement of consistency is not imposed on provisional figures for up-to-date SSD information, since it is a time-consuming part of the micro-integration process which may prevent the figures being ready on time.

3 At Statistics Netherlands the Survey on Employment and Earnings has vanished completely and has been replaced by a register with information on jobs and wages.

Times have changed, however. From the beginning of the new millennium a wide variety of administrative education registers have become available to Statistics Netherlands. This enabled us to construct the educational attainment variable in an alternative way, with the use of register data. Unfortunately, it has not (yet) been possible to replace the LFS-based version of educational attainment by one which is fully register-based, the main reason being that recording of education activities of a person for register purposes started not so long ago4. This means that most older citizens completed their education career before education registers came into consideration for official statistics. Therefore, education registers suffer from undercoverage and to gain an insight into the education levels of people above their thirties and forties we are still dependent on the LFS. Besides, in the present-day situation education registers only cover publicly-financed education, and so neglect education in private schools or outside the country.

In the particular case of constructing an educational attainment variable, education registers are therefore an important data source. However, sample surveys with education data are still needed to compensate for the lack of information in the administrative source. The innovative aspect in this approach of micro-integration is that data are combined from registers as well as from sample surveys for the purpose of producing just one single variable. Constructing a combined estimator of the variable educational attainment from registers and the LFS supplement is far from simple. It confronts us with various methodological problems that have to be solved by applying sophisticated methodology. This involves for example duration analysis in order to utilize out-of-date information on education levels, search strategies for adequate weighting models to cope with the complex sampling design, and advanced methods to measure the accuracy of the estimator.

4 In that respect the Netherlands lags behind the Nordic countries (Denmark, Finland, Iceland, Norway and Sweden), which have a long experience with register-based statistics. For education, this experience goes back as far as the seventies and eighties (UNECE, 2007).

The general value of this case study is that it illustrates what can be done when one considers deriving a single variable from multiple sources, registers as well as sample surveys. It also implicitly gives some insights to NSIs into how to collect statistical input in an economic way, the message in fact being that surveys should only be conducted in the areas where registers provide insufficient coverage.

Section 3.2 describes the various sources that are explored for the variable educational attainment and their quality. Section 3.3 deals with the micro-integration aspects of deriving educational attainment. Section 3.4 goes into the weighting strategy for representativeness purposes. Section 3.5 focuses on the accuracy of the combined estimator. Section 3.6 completes the report with some concluding remarks.


3.2 Sources on education and their quality

Dominique van Roon, Frank Linder, Bart Bakker

Statistics Netherlands

Nowadays, ever more statistical information comes from administrative registers. These registers originally served administrative purposes, which does not automatically imply that the data are fit for statistical purposes. To be able to decide which sources contain valuable information about the education of people in the Netherlands for statistical use, it is essential to gain insight into the data quality of these sources, or in other words the errors in the data sources. What should be examined is: which data are reliable enough for statistical use, and which source is to be preferred in case of contradicting information?

In practice, it is difficult to trace errors in administrative sources, as a substantial number of them occur in the office of the register keeper. These errors may be less well documented because register keepers are not always concerned about diminished quality, certainly if the information is not so relevant for the purpose of the registration. Apart from that, it is also difficult to implement systems to measure errors in the processes at the statistical offices due to a lack of suitable indicators. However, even if it were possible to develop such indicators, operationalising and implementing them may require large investments. Nevertheless, one can always start research projects to examine the quality of sources that are to be combined. A more elaborate treatment of quality issues with respect to data sources (e.g. a framework for errors in statistics based on combined sources) is to be found in Bakker (2010) and Bakker et al. (2008).

The following sections present the data sources with information on education used for the construction of the variable educational attainment. This also includes a discussion on the quality aspects related to these sources. Section 3.2.1 deals with the only non-administrative source for that purpose: the Labour Force Survey. Section 3.2.2 focuses on administrative sources with suitable data on education. To conclude, section 3.2.3 gives an overview in which the data sources are presented in relationship to the ’life cycle’ and ’framework for errors’ mentioned in Bakker (2010).


3.2.1 The Labour Force Survey

The Dutch Labour Force Survey (LFS) is an annual household sample survey that started in 1987. Its target population is people of 15 years and older living in private households. Each year well over 100 thousand individuals are included, just under 1 percent of the total population. However, as from 2009 there has been a major reduction in the LFS sample size of some 15 percent. The size of the sample can be considered sufficient for reliable estimates on educational attainment in larger subpopulations, such as the unemployed in the eastern region of the Netherlands or male foreigners in Rotterdam.

Problems may arise with smaller subpopulations: small municipalities, for example. Normally, this is solved by presenting outcomes based on the unified sample survey for two or three consecutive years. Standard errors will generally decrease as a result of the higher number of observations. As long as the variable concerned changes slowly over time this is an acceptable method. Education levels are rather stable within a period of one or two years, and so may be considered suitable for such an approach. For this reason the census variable educational attainment in the 2001 Census (reference date 1 January) was based on the LFS of 2000 and 2001.
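The gain from pooling consecutive survey years can be made concrete with a small sketch. It assumes simple random sampling and a proportion that is stable across the pooled years, which is a simplification of the actual LFS design; the function name and the numbers are illustrative only.

```python
import math


def se_proportion(p: float, n: int) -> float:
    """Standard error of an estimated proportion p under simple random sampling."""
    return math.sqrt(p * (1.0 - p) / n)


# Illustrative subpopulation: 30 percent with a given education level.
p = 0.30
n_one_year = 1000
se_one = se_proportion(p, n_one_year)
se_two = se_proportion(p, 2 * n_one_year)   # two pooled years, same yearly size
reduction = se_two / se_one                 # equals 1 / sqrt(2)
```

Under these assumptions, pooling two years of equal size reduces the standard error by a factor of $1/\sqrt{2}$, i.e. by roughly 29 percent.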

The LFS provides the complete education career of a respondent until the date of the interview, including education abroad. Education programmes are coded according to the Standard Classification of Education (SCED2006), see Schaart et al. (2008). These SCED codes can easily be converted into ISCED levels of the International Standard Classification of Education 1997 (ISCED1997), issued by UNESCO.

Unlike most of the registers, the Labour Force Survey lacks a personal identifier such as the Citizen Service Number. Linkage of the LFS with other sources often occurs with a key that consists of the identifiers sex, date of birth, postal code and house number (Schulte Nordholt and Linder, 2008)5.

When linked to the Population Register, linkage rates as high as 98 percent are achieved, at least for the LFS from 1995 onwards. For earlier years linkage with the Population Register is more problematic.

The quality of the measurement of the education data in the LFS differs over time. Problems, in terms of comparability in the long run, may arise when using education data older than 1996. For the period 1996-2003 the quality can be considered as reasonable. Further, there is a quality problem with regard to foreign education programmes in 2004-2007. In those years a less elaborated questionnaire was used than was desirable. This resulted in an overestimation of the level of education of these programmes. In general, the measurement of the start and end dates of the programmes lacks accuracy.

Nonresponse is high in the LFS (De Leeuw and De Heer, 2002), and what is even worse, it is selective. From 1987 onwards, the response fluctuates around 60 percent (Cobben, 2009). The response rates that sample surveys used to have in the early seventies, up to 88 percent, have never been reached again.

In our effort to determine educational attainment from a combination of registers with education data and the LFS, we make use of all the available LFS from 1996 onwards.

3.2.2 Administrative education registers

3.2.2.1 Primary education

Until recently there was no administrative register on primary education in the Netherlands. In 2010 a first version of the Primary Education Number Register (school year 2008/’09) was launched. Since then, first releases of register data for the school years 2009/’10 and 2010/’11 have become available, although their quality is not yet fully satisfactory. Before presenting results from this register, a quality-check has to be done first. For instance, does it completely cover the target population? As long as this is under investigation, the educational attainment of the younger population will be imputed, depending on its age.

5 A linkage key as used for linkage with the LFS will usually be successful in distinguishing people. However, it is not a hundred percent unique combination of identifiers. Linking may result in a mismatch in the case of twins of the same sex. False matches may also occur when part of the date of birth or the postal code and house number is unknown or wrong. Another drawback is that the linkage key is not person but address related, which may cause linkage problems if someone has recently moved. When linking the Population Register and the LFS with this alternative key, and tolerating a variation between sources in a maximum of one of the variables sex, year of birth, month of birth or day of birth, the result is that close to a hundred percent of the LFS records will be linked. In its linkage strategy, Statistics Netherlands tries to maximize the number of matches and to minimize the number of mismatches. So, in order to achieve a higher linkage rate, more efforts are made to link the remaining unlinked records by means of different variants of the linkage key, for example leaving out the house number and tolerating variations in the numeric characters of the postal code. To keep the probability of a mismatch as small as possible, some ’safety’ devices are built into the linkage process. This last linking attempt accomplishes an extra one percent of matches.
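The linkage key of footnote 5 (sex, date of birth, postal code and house number, tolerating a difference in at most one of sex, year, month or day of birth) can be sketched as follows. This is an illustration under stated assumptions, not the procedure used at Statistics Netherlands: the record layout, the function names and the rule of returning no link when several candidates remain (a stand-in for the ’safety’ devices, which the text does not specify) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Person:
    sex: str
    birth_year: int
    birth_month: int
    birth_day: int
    postal_code: str
    house_number: str


PERSONAL_FIELDS = ("sex", "birth_year", "birth_month", "birth_day")


def personal_mismatches(a: Person, b: Person) -> int:
    """Number of personal key variables on which two records differ."""
    return sum(getattr(a, f) != getattr(b, f) for f in PERSONAL_FIELDS)


def link(record: Person, register: Sequence[Person],
         max_mismatches: int = 1) -> Optional[int]:
    """Return the index of the linked register record, or None if no safe link.

    The address part of the key must agree exactly; at most `max_mismatches`
    of the personal variables may differ.
    """
    candidates = [
        i for i, r in enumerate(register)
        if r.postal_code == record.postal_code
        and r.house_number == record.house_number
        and personal_mismatches(record, r) <= max_mismatches
    ]
    exact = [i for i in candidates if personal_mismatches(record, register[i]) == 0]
    if len(exact) == 1:
        return exact[0]          # a unique exact match wins
    if not exact and len(candidates) == 1:
        return candidates[0]     # a unique near match is accepted
    return None                  # ambiguous (e.g. same-sex twins) or no candidate
```

In this sketch, same-sex twins at one address yield two exact candidates and are therefore left unlinked, and a respondent who has recently moved finds no candidate at all, mirroring the two weaknesses of the key mentioned in the footnote.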


3.2.2.2 Secondary education

There are several administrative registers on secondary education.

From 1999 onwards there is a so-called Exam Results Register (ERR), which contains data of students taking exams in secondary general and in lower secondary vocational education. Apart from the exam results there is also information on the type and level of the education programme. The information is processed by the Ministry of Education, Culture and Science. The quality of the register information is not so relevant to the register keeper, since no financing depends on it. Therefore, the data possibly contain a few measurement and processing errors. For Statistics Netherlands it is easy to derive the required statistical information from the administrative information. The representativeness of the ERR can be considered as very high. Only a few schools do not provide the required information. In addition, exam results of privately financed schools and of exams taken before 1999 are not covered by the ERR.

An alternative way of obtaining information on secondary education graduates is to use registers on Higher Education (see section 3.2.2.3). These registers record how the student complied with the preliminary admission requirements for higher education (CREHE Preliminary). Compared with the ERR the information goes back over a longer period, as relevant data collection had already started in the early 1980s. The information about the education level required for admission to higher education is perhaps less accurate, as that is an irrelevant detail for the accountants.

Quite new are the Education Number Registers (ENR) for secondary education, including upper secondary vocational education and adult education programmes. The ENR focus on all grades a student passes through. The ENR for secondary general and lower secondary vocational education started in 2002/’03; for upper secondary vocational and secondary adult education in 2004/’05.

In principle, the ENR are used for financial purposes, such as improving the transparency of public financing of education. For that reason some of the ENR variables are audited by accountants. Both the register keeper and accountants concentrate on variables also in use for statistical purposes. Therefore, it may be assumed that the variables are measured accurately. There is hardly any problem deriving the required statistical variables from the administrative ones. However, there are some problems with regard to representation. In the year the register was introduced the register target population was only covered partially. For upper secondary vocational education, this was even the case in the second year of its existence. Further, as ENR registers are restricted to publicly financed educational institutes there is no coverage for students in private schools6 and institutes outside the Netherlands.

3.2.2.3 Higher or tertiary education

The Central Register for Enrolment in Higher Education (CREHE) records information on students at universities (from 1983) and in non-academic tertiary education (from 1986, with some underreporting in the initial two years) on an annual basis. This includes data on certificates. As the CREHE has existed for a considerable time, it is nowadays possible to represent a whole generation of (former) students in higher education, roughly aged 18–40, by CREHE data. The exception is those who were enrolled in higher education outside the country or at privately financed institutions7.

The CREHE register is hampered by some small measurement errors. As it is largely used for financing purposes, the register is audited by accountants. Both the register keeper and accountants concentrate for their data checks on variables also used for statistical purposes. In addition, the processes to correct for errors are designed by the Ministry of Education, Culture and Science in close cooperation with experts from Statistics Netherlands and the register keeper. Altogether the information in the CREHE can be considered accurate.

The CREHE suffers from some errors in the representation of the statistical target population. The register covers only higher education that is publicly financed. Courses taken at private colleges and universities and at institutions abroad are outside the scope of the register. Further, coverage started from the early 1980s, so there is no information about the preceding period.

3.2.2.4 Other administrative registers

Besides the education registers mentioned above there are some administrative registers which can be useful when supplementary data on education levels are needed.

6 A rough LFS-based estimate indicates that some 5 percent of the student population of 2005 obtained secondary or higher education (of more than 12 months) that was not publicly financed. For the near future the intention is to add data on private secondary education to the ENR.

7 See previous footnote.


The Public Employment Service Register (PESR) is potentially a promising source of information on education levels, owing to the large scale of its target population. Between 1990 and 2009 information was collected in the register on 5.4 million people, which is some thirty percent of the present Dutch population. The PESR registers people who have sought a job through the Public Employment Service (since 1990), as well as benefit claimants. It also includes the population that had its education abroad, which is interesting because there is hardly any other register information about education outside the country. Another reason why PESR-data are useful is that they also contain education information on people in their forties, fifties and sixties, who are scarcely covered by education registers. The education information in the PESR is obtained from face-to-face interviews, but more recently also via a webform. During the period of registration in the PESR, the education level of the client is updated for any possible change.

Unfortunately, the educational attainment information in the present PESR data is of insufficient quality. As it is now, educational attainment in the PESR probably has more to do with competencies for the labour market than with education career. Micro-linkage with other sources, such as the LFS, gives indications of this. Another weak point of the PESR is that the information is less detailed, since only five different education levels are distinguished.

However, promising developments are now taking place. The Public Employment Service has recently produced new data files which contain more detailed information on educational attainment. It is likely that the education level information in these new files approaches more closely the desired statistical concept of education, which is 'real' education in the sense of what one has learned at educational institutes, and has less to do with work experience. This still has to be confirmed by thorough examination. If the new PESR data on education appear to be of good quality, there is a good chance that they will be used in the near future as a target variable in the process of estimating the distribution of education levels.

In the present PESR data the administrative variable educational attainment can still be useful as an auxiliary variable. In spite of the differences with the desired statistical concept of educational attainment, there is undeniably a high correlation between the two. See also section 4 on weighting matters.

The linking error which arises when linking PESR with the Population Register is within acceptable bounds, but higher than in the above-mentioned registers.

Another administrative register is the Student Finance Register (SFR) with information on study grants from the Dutch government. In the Netherlands, most students in higher education (university and non-academic tertiary education), and also students over eighteen in upper-level secondary vocational education, receive such a grant for a number of years. As regards quality, there is a slight degree of linking error, somewhat higher than in the above-mentioned registers except the PESR. The SFR could be a rich source, with numerous students registered from 1995, but unfortunately there are no data on certificates and the information on study stages (e.g. bachelor, master) lacks detail. Therefore, it is of limited use for deriving educational attainment.

Then there is a financial register dating from 2001, which registers students in secondary education for whom school fees have to be paid by law. Originally, the target group was students over 16, but from 2006 it is restricted to the student population over 18. The problem with this Register of School Fees (RSF) is that we do not know to what type of secondary education the school fees refer, nor is there any information available about certificates. Therefore, the RSF cannot be used for the purpose of deriving education levels.

Unfortunately, except for the Public Employment Service Register mentioned above, there is no register with information on the highest attained education level for older people. To get some more insight into their education levels, it has often been suggested to make use of the last traditional Dutch population census of 1971, at least for persons who had reached the end of their education career in or before 1971. However, linking data of the 1971 Census to present-day files is impossible because the personal identifiers in that Census were deleted for reasons of security. Accordingly, the idea of using the 1971 Census has to be abandoned.

3.2.3 Sources on education and framework for errors

In Bakker (2010) and Bakker et al. (2008) a so-called life cycle and framework for errors is introduced to classify measurement and representation errors in register-based research and in studies based on combined register and survey data.

This section will illustrate the theoretical framework by applying it to the sources on education presented in the preceding sections. The results are shown in the overview on the following pages.


Measurement

The first element in the measurement part of the error framework is the theoretical concept that has to be measured. In the case of registers this is the administrative concept to be recorded by the register keeper for administrative purposes, for example 'exam result (passed / not passed)' in the ERR. As long as the administrative concept can be clearly interpreted, and misunderstandings about it are precluded, the validity of the concept may be considered sufficient.

The second element is operationalisation, i.e. how information about the theoretical concept is collected. It is important that questionnaires and protocols are designed in such a way that they minimise possible measurement errors. A well-known problem with sources on education arises when foreign education is reported. Even if foreign education levels are classified as well as possible, perhaps with the aid of a professional agency, it remains difficult to incorporate these levels perfectly within the national standard classification of education. So, in the case of education abroad it is hard to avoid measurement errors.

The third element concerns the response, or in other words the set of (administrative) data reported or recorded. These data will be processed by the register keeper, and to detect irregularities in them, checks are carried out during or after data collection. Depending on the interest of the register keeper these may vary from rough to very thorough checks. As we have seen, many education registers, such as CREHE and ENR, were mainly set up to examine whether public resources allocated to education are spent properly. Accountants intensely audit the relevant variables for this purpose, so it may be assumed that the number of processing errors in this specific case is negligible. Processing errors can also arise or even be exacerbated when correction methods for irregularities fail to do what they should do.

The fourth element deals with the statistical concept, which may differ considerably from the register concept. A clear example of this is the concept of education level as measured in the PESR. The register concept focuses more on a labour market oriented interpretation, whereas statistical agencies may prefer a statistical concept that is oriented more towards actual education. It is not always easy to derive the statistical variable from the administrative variables, the more so if the administrative data lack detail.

Representation

The first element in the representation part of the framework for errors is defining the target population. It should be kept in mind that the statistical target population may differ from the register target population. The ENR target population consists of students in publicly financed secondary education, in other words excluding private education. As statisticians are usually more interested in the complete population of secondary education students, their statistical target population faces undercoverage when based on ENR.

As an equivalent of nonresponse in traditional surveys, mismatches and missed matches occur when registers are linked with other registers, or with surveys. If linking errors are severe, and moreover selective, estimates based on the linked population elements might be biased. Fortunately, in the case of registers with education data hardly any problems are to be expected when they are linked with other registers. For the LFS this is a different story, however, because even though linkage rates with, for example, the Population Register are high, they do not reach one hundred percent and, therefore, problems with selectivity in the linkage cannot be excluded (Schulte Nordholt and Linder, 2008). Analysis in the past has indicated that young people, in the 15-24 age bracket, show a lower linkage rate in household sample surveys than other age groups. The reason for this is that they move more frequently and are therefore often registered at the wrong address. The linkage rate for persons living in the four large cities Amsterdam, Rotterdam, The Hague and Utrecht is lower than for persons living elsewhere. Ethnic minorities also have a lower linkage probability, among other things because their date of birth is often less well registered.

Apart from correcting these errors in the administrative processes of the separate sources, a solution for measurement and representation errors will often be found in combining various administrative sources and surveys. This is the stage of the so-called micro-integration process, which is what the next two sections are about.


3.3 Micro-integration

In order to derive the educational attainment variable from a combination of sources on education a micro-integration process is set up. This section and the following one give a detailed description of that process in line with the elements of the micro-integration model as presented in Bakker (2010).

3.3.1 From education sources to Education Archive

The education registers mentioned in the preceding section are all provided with a personal identifier for each student, the Citizen Service (CS) number, previously called the social security and fiscal number. The CS-number can be used as a linkage key. However, for reasons of data protection it is first encrypted into a unique personal Record Identification Number (RIN-person).

The LFS, the other source with data on education, lacks a CS-number. A personal RIN is assigned to each person in the survey after identifying them by sex, date of birth, postal code and house number.
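The assignment of a RIN to LFS respondents can be pictured as a deterministic match on the four identifying variables. The sketch below is only an illustration under stated assumptions: the field names, the normalisation of the key and the treatment of ambiguous keys (left unlinked) are hypothetical and not taken from the actual linkage procedure.

```python
from typing import NamedTuple, Optional


class LinkKey(NamedTuple):
    sex: str
    date_of_birth: str  # ISO date string, e.g. "1980-05-17"
    postal_code: str
    house_number: str


def normalise(sex, date_of_birth, postal_code, house_number) -> LinkKey:
    """Build a comparable key; these normalisation choices are assumptions."""
    return LinkKey(
        sex.strip().upper(),
        date_of_birth.strip(),
        postal_code.replace(" ", "").upper(),
        str(house_number).strip(),
    )


def build_index(population_register):
    """Map each key to a RIN, keeping only keys that identify exactly one person."""
    index, ambiguous = {}, set()
    for rec in population_register:
        key = normalise(rec["sex"], rec["dob"], rec["postal_code"], rec["house_number"])
        if key in index or key in ambiguous:
            # Two or more persons share the key (e.g. twins): do not link on it.
            index.pop(key, None)
            ambiguous.add(key)
        else:
            index[key] = rec["rin"]
    return index


def assign_rin(respondent, index) -> Optional[str]:
    """Return the RIN for a survey respondent, or None for a missed match."""
    key = normalise(respondent["sex"], respondent["dob"],
                    respondent["postal_code"], respondent["house_number"])
    return index.get(key)
```

Respondents for whom `assign_rin` returns `None` correspond to the missed matches discussed in section 3.2.3.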

The first step in the process of constructing a combined estimator of the variable educational attainment is to bring together the information from the registers with education data and the LFS, and to sort it at the personal level (RIN-person) in a huge database, the so-called Education Archive. This archive shows the education careers of the total population as extracted from the sources.

The statistical unit in the Education Archive is RIN-person × education programme. The archive is updated annually with new information on education programmes. In other words, the Education Archive 2009 stores all available information on education careers up to 2009 in cumulative form, resulting in 76.6 million records. The information in the Education Archive can be conflicting at times, when data on the same subject come from different sources.

The information from the sources is kept as authentic as possible, as ought to be the case in an archive. Therefore, micro-integration at this stage of the process is restricted to correcting obvious errors and a bit of harmonization. An example of the latter is the standardization of internal education codes of the different sources to unique SCED-codes for each education programme.

Figure 3.1. Education Archive 2009, fragment (fictitious example).

For RIN-person R006274060 in the figure above the only register information on lower secondary vocational education is from ERR (exam results). After passing his exam at 16, this person must have attended secondary education for two years, as the RSF shows. This was probably upper secondary vocational education, but it cannot be confirmed as there were no registers on that sort of education until 2004. Between August 2004 and September 2007 there are no observations at all for this person. There can be various reasons for that. Maybe he chose to work instead of studying, or he may have gone abroad to go to college. The latter will remain unknown as there is no register information on studies outside the country. It is clear from this example that it cannot be guaranteed that the information on someone's education career in the Education Archive is always complete.

RIN-person R046879435 looked for a job in 1994, according to PESR'94. He was registered as someone with higher secondary education. There are no further details on his education.

RIN-person R165529489, who was interviewed for the LFS'96, was evidently attending secondary general education from 1993 to 1996, but there is no information on what happened to her afterwards, because there were no ENR registers at that time. Did she pass her exam in '96/'97 or '97/'98? That will remain an open question, as the ERR did not then exist.

3.3.2 From Education Archive to Educational Attainment File

The ultimate goal is to build an Educational Attainment File containing the highest attained education level of an individual at a reference date. Education levels in the Educational Attainment File are presented as 2-digit SCED-level codes⁸ (Schaart et al., 2008). The figure below shows the education levels according to the 2006 version of the SCED classification, and their relationship with the international ISCED level classification of 1997.

8 Each education programme is provided with a unique 6-digit SCED-code. The corresponding education level is presented with a 2-digit SCED level code, within a hierarchic classification. The 6-digit SCED-code itself is unique, but obviously there are numerous education programmes with the same 2-digit SCED level code.

To derive someone's educational attainment, data have to be selected from the Education Archive. Data in the Education Archive are stored practically in their original state, depending on the source they come from. Therefore differences in definition, level of detail, quality, measurement and coverage in the data of all these sources are inevitable. In order to get an unambiguous presentation of educational attainment, micro-integration has to be performed in successive stages.

SCED2006 level code   Description SCED2006 level code                      ISCED1997 level code
10/20                 Not more than primary education                      0 / 1
30                    Secondary education, 1st stage                       2
31                    Secondary education, 1st stage, lower level          2
32                    Secondary education, 1st stage, intermediate level   2
33                    Secondary education, 1st stage, higher level         2/3
41                    Secondary education, 2nd stage, lower level          3
42                    Secondary education, 2nd stage, intermediate level   3
43                    Secondary education, 2nd stage, higher level         3/4/5
51                    Higher education, 1st stage, lower level             5
52                    Higher education, 1st stage, intermediate level      5
53                    Higher education, 1st stage, higher level            5
60/70                 Higher education, 2nd and 3rd stage                  5/6

ISCED1997 level code  Description ISCED1997 level code
0                     Pre-primary education
1                     Primary education
2                     Lower secondary education
3                     Upper secondary education
4                     Post secondary non-tertiary education
5                     First stage of tertiary education
6                     Second stage of tertiary education

3.3.2.1 Selection from the Education Archive

The first step to obtain education levels at the reference date is to select all records in the Education Archive representing education careers until that date. Before this can be done, it may in some cases be necessary to derive or impute missing start and end dates of an education programme. This is the so-called micro-integration step of imputation or derivation for missing values.

If an education programme has been completed with a certificate, it is the end date of that programme that is relevant to mark the moment that someone gets an upgrade of his education level. If one is only interested in the highest education level of education programmes that have been attended, even if not completed with a certificate, then it is enough to know the start date.

In the second stage of the micro-integration process the differences in definitions between source and target population are adjusted.

Only those records are selected from the Education Archive that belong to people defined by the Population Register (PR), as that is the target population of the Educational Attainment File. The Education Archive also has information on foreign students, but they are not part of the target population, so their records are left out.

On the other hand the Education Archive hardly contains any information on youngsters up to 12. This means that another micro-integration activity has to be performed: completion of populations. Accordingly, records are added from the PR for boys and girls under 12 at the reference date, as long as there is no source on primary education ready for operation. They are all classified with education level 'not more than primary education' (SCED2006-level 10/20). Apart from that, those 12 to 14 year-olds who are not represented in the Education Archive are also included from the PR. The idea behind this is that, although they are missing in the Education Archive, school attendance is compulsory at these ages. Their absence from the Education Archive most likely has to do with an omission in the administrative registers. For this particular population the highest attained education level will be 'not more than primary education' (SCED2006-level 10/20). The highest attended education level is determined as 'secondary education first stage' (SCED2006-level 30).
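The completion step described above can be sketched as follows. This is a minimal illustration, not the production procedure: the record layout and function names are hypothetical, and it is our reading of the text that children under 12 receive level 10/20 for both the attained and the attended level and are added whether or not they occur in the archive.

```python
from datetime import date

PRIMARY = "10/20"        # 'not more than primary education'
SECONDARY_FIRST = "30"   # 'secondary education, first stage'


def age_at(birth: date, reference: date) -> int:
    """Age in completed years at the reference date."""
    years = reference.year - birth.year
    if (reference.month, reference.day) < (birth.month, birth.day):
        years -= 1
    return years


def completion_records(population_register, archive_rins, reference):
    """Records to add from the PR so that the young population is complete."""
    added = []
    for person in population_register:
        age = age_at(person["birth"], reference)
        if age < 12:
            # Under 12: no source on primary education is available yet.
            added.append({"rin": person["rin"], "attained": PRIMARY,
                          "attended": PRIMARY, "source": "PR"})
        elif age <= 14 and person["rin"] not in archive_rins:
            # 12 to 14 and missing from the archive: schooling is compulsory,
            # so attendance of the first stage of secondary education is assumed.
            added.append({"rin": person["rin"], "attained": PRIMARY,
                          "attended": SECONDARY_FIRST, "source": "PR"})
    return added
```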

To conclude, not all source information in the Education Archive is useful for deriving educational attainment. RSF records have no information on education levels and are therefore not selected. SFR records, on the other hand, are only selected if the SFR education codes have sufficient detail, which is not always the case. The information on education levels in the present PESR records also suffers from lack of detail. For that reason, these data are not used to determine the highest education level, although they will perform an important role in the weighting strategy, as will be set out in section 3.4. Things may change if new-style PESR data appear to be fit for use, as was discussed in section 3.2.

Micro-integration here is a matter of quality assessment of the sources (see section 3.2), and a decision is made which sources to use.

3.3.2.2 Determination of education levels at reference date

In order to be able to derive someone's educational attainment level, the 6-digit SCED code of each education programme is converted into the corresponding SCED or ISCED level code. This makes it possible to compare the education levels of the different programmes in the individual's education career. Deriving comparable levels is in fact a micro-integration activity of harmonization.

Determining educational attainment now simply appears to be a matter of selecting the highest attained education level (with and without certificate) of all programmes. Unfortunately, however, it is not that simple. As has been remarked before, the coverage of education careers in the Education Archive is to some extent fragmented. Looking back to the figure in subsection 3.3.1 with a part of the Education Archive of 2009, there is no evidence about the educational attainment of RIN-person R006274060 in 2005, because of an information gap for school years 2004/'05 and 2005/'06. Next, what about the educational attainment of RIN-person R165529489 in 1997 or 1998? Did it change after she was interviewed for the LFS in 1996? The information may be out of date for later years. It is clear that before education levels can be compared at reference date, their validity has to be assessed for that date. In terms of micro-integration, we can interpret the problem as one of measurement errors that must be corrected for. This will be tackled in the next subsection.

3.3.2.3 Assessing the validity of education levels at reference date

A decision strategy has been developed to assess whether the information on someone's education level is still valid at reference date. Various deterministic and probabilistic decision rules are distinguished.

An example of a deterministic decision rule applied is the following. If a record in the Education Archive originates from the ENR register, its level will be assessed as still valid at reference date, no matter from which year the source information dates. The argument behind this is that if someone had moved up to a higher education level there would have been evidence of it in the Education Archive, because nearly all possible continuation programmes are to be found in that archive. One exception is that the student may have continued in private education, which is not yet covered by registers in the Education Archive. In that case the validity may have been assessed wrongly.

A trivial deterministic decision rule is the one in which all records with SCED-level code 60/70 are assessed valid at reference date. The simple reason is that there is no higher level attainable.

Another trivial deterministic decision rule is that education levels in the Labour Force Survey (LFS) should automatically be considered valid, if the LFS interview date lies beyond the reference date for which educational attainment has to be determined. This is because in the LFS the complete education career is registered until interview date.

Notice that the decision rules can validate different education levels simultaneously. For example, suppose we find for some person in the Education Archive an ENR record with SCED-level code 43 and another record with SCED-level code 60/70. Of course, it is only possible to have just one highest attained level at reference date. Therefore, in the end educational attainment will be determined as the highest level declared valid. In other words, the lower levels will be overruled. Hence, if an education level is declared valid at reference date, this does not necessarily mean that it is the person's highest education level. In fact, validating education levels is nothing more than selecting valid candidates eligible for the highest level position.
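The three deterministic rules mentioned so far, followed by the selection of the highest validated level, might be sketched as below. The record layout is hypothetical, and coding the levels as integers (with 60 standing in for the combined code 60/70) is an assumption made only so that levels can be compared. The sketch covers the deterministic rules alone and does not handle the exception for rejected higher levels discussed at the end of this subsection.

```python
from datetime import date

TOP_LEVEL = 60  # assumed integer stand-in for the combined SCED level code 60/70


def deterministically_valid(record, reference_date):
    """Apply the three deterministic rules described in the text."""
    if record["source"] == "ENR":
        return True  # ENR levels stay valid regardless of the year of the record
    if record["level"] >= TOP_LEVEL:
        return True  # no higher level is attainable
    if record["source"] == "LFS" and record["interview_date"] > reference_date:
        return True  # the LFS covers the whole career up to the interview date
    return False


def highest_valid_level(records, reference_date):
    """Highest level among the records declared valid, or None if there is none."""
    valid = [r["level"] for r in records if deterministically_valid(r, reference_date)]
    return max(valid) if valid else None
```

For the example in the text, an ENR record with level 43 and another record with level 60/70 are both validated, and the lower level is overruled by the higher one.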

Probabilistic decision rules are based on the probability that an education level is still valid at reference date. These probabilities are calculated by means of non-parametric survival analysis models (Life Tables method).

Probabilistic decision rule 1

The first probabilistic decision rule judges until when an education level can still be considered valid after the last observed date of attainment D. Decision rule 1 is intended for use in situations where there is no information on the continuation of someone's education career after this person attained a certain education level at attainment date D. This means that rule 1 will be mostly applied to records from education registers, as in these sources registration generally ends as soon as an education programme has been completed. However, rule 1 would also be valid for use in the LFS in the special case when the attainment date D is very close to interview date. In other LFS cases, probabilistic decision rule 2, which is formulated below, should preferably be used.

The education level measured at date of attainment D will be defined as valid at reference date R (> D) if the probability that the level has remained unchanged between D and R is 95 percent or more.

With the Life Tables method an upper bound U is determined for the number of years after attainment date D during which the education level has remained unchanged with a probability of at least 95 percent. The bounds are dependent on age, background (Dutch, other Western or non-Western origin), and education level.


The Life Tables survival function

$$S_A(t) = P_A[\,T \ge t \mid D\,] \qquad (3.1)$$

is defined as the probability that, given the last observed date D at which a certain education level has been attained, the educational attainment of a person with attributes A (e.g. background, age, education level) remains unchanged within t years of D. T is the number of years that elapse from D until an event occurs, in other words until a change in education level takes place.

$S_A(t)$ is a monotonically decreasing function of t.

Rule 1 demands determination of an upper bound $U_A$ such that $S_A(U_A) \ge 0.95$.

The survival function $S_A(t)$ is determined empirically on the basis of the LFS for a succession of years.

In other words, the education level attained at date D is still valid at reference date R, if R is situated within the interval $[D;\, D+U_A]$. This is because the probability that the education level at R is still the same as it was at D (last time observed) is at least 95 percent.

For more details about the application of decision rule 1, illustrated with an example, see Appendix A.
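To make rule 1 concrete, the following sketch estimates a yearly survival function with a simple actuarial life table and derives the bound as the largest whole number of years t for which S(t) is still at least 0.95. The actuarial treatment of censored observations, the yearly grid and all function names are assumptions made for illustration; the report does not specify which life-table variant is used. In practice the calculation would be repeated for each combination of attributes A.

```python
def life_table_survival(durations, changed, horizon):
    """Return [S(0), ..., S(horizon)] with S(t) = P[T >= t] on a yearly grid.

    durations: whole years observed after attainment date D
    changed:   True if the level changed in that year, False if censored
    """
    survival = [1.0]
    at_risk = len(durations)
    for year in range(horizon):
        events = sum(1 for d, c in zip(durations, changed) if c and int(d) == year)
        censored = sum(1 for d, c in zip(durations, changed) if not c and int(d) == year)
        # Actuarial assumption: censored cases are at risk for half the interval.
        effective = at_risk - censored / 2.0
        hazard = events / effective if effective > 0 else 0.0
        survival.append(survival[-1] * (1.0 - hazard))
        at_risk -= events + censored
    return survival


def upper_bound(survival, threshold=0.95):
    """Largest t (whole years) such that S(t) >= threshold."""
    bound = 0
    for t, s in enumerate(survival):
        if s >= threshold:
            bound = t
        else:
            break
    return bound


def valid_rule1(attained_year, reference_year, bound):
    """True if the reference year lies in the interval [D; D + U]."""
    return attained_year <= reference_year <= attained_year + bound
```

With a bound of two years, for instance, a level attained in 2005 would be accepted for reference year 2007 but not for 2008.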

Probabilistic decision rule 2

The second probabilistic decision rule applies exclusively to LFS records, and like decision rule 1 it judges how long an education level can still be considered valid after the last observed transition of education level. Decision rule 2 is different in that it also makes use of additional information that the LFS has to offer. In addition to information about the last observed transition date in education level, the LFS also contains empirical evidence that during the period from this transition date until the interview date nothing has changed in the level of education. For the calculation of probabilities decision rule 2 also takes into account the length of this period.


In the same way as in decision rule 1, the education level measured at date of attainment D will be defined as valid for a reference date R if the conditional probability that the level has remained unchanged between D and R is 95 percent or more, given the knowledge that this education level has not changed in the period from D until interview date.

Suppose J is the number of years between D, the last observed transition in education level in the LFS, and the interview date. In other words, J denotes the period for which there is empirical evidence that education level has not changed since the last transition.

With the Life Tables method an upper bound M (≥ J) has to be determined for the number of years after date of attainment D for which the conditional probability that educational attainment has not changed is at least 95 percent. In the same way as in the first probabilistic decision rule, the bounds are dependent on age, background and level of education.

The Life Tables conditional survival function

$$S_{J,A}(t) = P_A[\,T \ge t \mid J, D\,] \qquad (3.2)$$

is defined as the probability that, given the last observed date D at which a certain education level has been attained and given empirical knowledge that this education level has remained unchanged for J years, the educational attainment of a person with attributes A remains unchanged within t years of D. T is the number of years that elapse from D until an event, i.e. until education level changes.

Rule 2 demands determination of an upper bound $M_A(J)$ such that

$$S_{J,A}(M_A(J)) \ge 0.95.$$

For each J there is a separate upper bound $M_A(J)$.

Notice that the survival function $S_{J,A}(t)$ can be written (in terms of the original survival function) as:

$$S_{J,A}(t) = P_A[\,T \ge t \mid J, D\,] = \frac{P_A[\,T \ge t \wedge T \ge J \mid D\,]}{P_A[\,T \ge J \mid D\,]} = \begin{cases} 1, & \text{when } t \le J \text{ (trivial case)}, \\[4pt] \dfrac{S_A(t)}{S_A(J)}, & \text{when } t > J. \end{cases} \qquad (3.3)$$


In other words, the education level attained at date D is still valid at reference date R, if R is situated within the interval $[D+J;\, D+M_A(J)]$. This is because the conditional probability that the education level at R is still the same as it was at D (last time observed) is at least 95 percent.

In general, when J is large enough, $M_A(J)$ will be infinite, which in fact means that the education level of date D can be considered as valid for the rest of this person's life.

For more details about the application of decision rule 2, illustrated with an example, see Appendix A.
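Equation (3.3) translates directly into code once the unconditional survival function is available on a yearly grid. In the sketch below, the survival values are invented for illustration, the function names are hypothetical, and returning infinity when the condition holds up to the end of the table rests on the assumption that the table covers the remaining period of interest.

```python
import math


def conditional_survival(survival, j, t):
    """S_{J,A}(t) from equation (3.3), for whole years j and t."""
    if t <= j:
        return 1.0  # trivial case: no change was observed up to the interview
    return survival[t] / survival[j]


def upper_bound_rule2(survival, j, threshold=0.95):
    """M_A(J): largest tabulated t >= j with conditional survival >= threshold.

    Returns math.inf if the condition holds up to the end of the table
    (an assumption about how an 'infinite' bound would be represented).
    """
    bound = j
    for t in range(j + 1, len(survival)):
        if conditional_survival(survival, j, t) >= threshold:
            bound = t
        else:
            return bound
    return math.inf


# Invented yearly values of S_A(t) for t = 0, ..., 6 (illustration only).
example_survival = [1.0, 0.98, 0.96, 0.86, 0.84, 0.84, 0.84]
```

With these invented values, two years of observed stability (J = 2) do not extend the validity beyond the interview date, whereas three years (J = 3) make the level valid for the whole tabulated period, in line with the remark that $M_A(J)$ becomes infinite when J is large enough.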

Apart from the fact that probabilistic decision rule 2 only applies to LFS records, the main difference between rules 1 and 2 is that the first rule in general deals with level changes observed not too long ago, whereas the second rule is mostly used for cases in which the last observed level change dates from quite some time ago.

Keep in mind that in some cases the outcome of the validation process might be that a higher level is not declared valid at reference date, whereas some lower levels are. In such cases, the strategy is that none of these levels is used to determine this person's highest education level. Selecting one of the valid levels would mean neglecting the fact that the person has at some point completed an education programme at a higher level, even though its validity at the reference date has been rejected. The reason this higher level is not declared valid is not that there are doubts about the correctness of its measurement, but that there is no certainty about whether or not this higher level has been surpassed at reference date.
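The strategy in the preceding paragraph amounts to requiring that the highest level ever observed for a person is itself among the validated levels. A minimal sketch, assuming levels are coded as comparable integers and that each assessed record is given as a hypothetical (level, is_valid) pair:

```python
def highest_usable_level(assessed):
    """Return the highest observed level if it was declared valid, else None.

    assessed: list of (level, is_valid) pairs for one person.
    """
    if not assessed:
        return None
    top = max(level for level, _ in assessed)
    # If the overall highest level was rejected, lower valid levels are not used.
    if any(valid for level, valid in assessed if level == top):
        return top
    return None
```

A person with a valid level 43 and a rejected level 53 therefore receives no register-based value at all, rather than the possibly too low level 43.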

3.3.2.4 Determining educational attainment at reference date

After the validity of all education levels has been assessed, it is possible to select the highest of all levels declared valid at the reference date and to fill the Educational Attainment File. In principle it does not matter which source supplied the highest level. There is one exception: if both an LFS record and a register contribute information on the highest level, and this highest level is the same in both sources, it is taken from the register. This distinction is relevant, because LFS records get a sample weight, while register records do not. The reason for preferring register information is that administrative education data are in general thought to be more accurate (see section 3.2), whereas sample surveys may suffer from memory errors.
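The selection rule described above, together with the rule that no level is used when a higher level has been rejected, can be sketched as follows. This is only an illustration: the record layout, the function name and the assumption that level codes are ordinal (a larger code means a higher level) are not taken from the report.

```python
from typing import NamedTuple, Optional


class LevelRecord(NamedTuple):
    level: int    # assumed ordinal code: larger means a higher level
    source: str   # "register" or "LFS"
    valid: bool   # outcome of the validation step at the reference date


def select_highest_level(records: list[LevelRecord]) -> Optional[LevelRecord]:
    """Return the record that supplies the highest valid level, or None."""
    if not records:
        return None
    top = max(r.level for r in records)
    valid_top = [r for r in records if r.level == top and r.valid]
    # A higher level was observed but rejected: none of the levels is used.
    if not valid_top:
        return None
    # Same highest level in both sources: the register record is preferred.
    for r in valid_top:
        if r.source == "register":
            return r
    return valid_top[0]
```

With two valid records at the same level, one from the LFS and one from a register, the function returns the register record, so that the person is counted with weight 1 rather than with a sample weight.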

The Educational Attainment File contains two types of educational attainment.

For the first type it is essential that someone is qualified at the highest level attained [9]. For the second type it is sufficient to be educated at the highest level without necessarily being qualified. This means that, apart from those who have been qualified, the second type also applies to students who are still attending school and have not yet taken their examination, to students who failed their examination, and to those who left school without a certificate. Consequently, the highest level according to the first type is lower than or at best equal to the level according to the second type.

In order to determine someone’s highest completed level of educational attainment with certificate, it will not always suffice to compare all of this person’s certificates. One should not overlook the education programmes that this person attended without being qualified, because it may be necessary to perform a so-called downgrade process. The following example illustrates this. Suppose there are two records in the Education Archive representing a part of someone’s education career. The first record refers to a lower secondary education programme (SCED-level code 33) that has been completed with certificate. The second record describes the person attending university. It would be wrong to assign the student, during his university period, a highest completed level of educational attainment of SCED-level code 33. After all, a higher preliminary education level is required for university admission. Apparently, the Education Archive lacks information on an intermediate education programme. In such cases the method of downgrading is applied: the highest completed level with certificate is set to the minimum (downgrade) level that is required for attending the education programme concerned. In the example of the university student this would be SCED-level code 43, a level which corresponds to higher secondary education.
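The downgrade process can be sketched in a few lines. The lookup table of minimum entry levels and the function name are hypothetical; only the university example (certificate at SCED-level code 33, required entry level 43) comes from the text, and the codes are assumed to be ordinal.

```python
# Hypothetical lookup: minimum SCED level required for admission to a programme.
# Only the university entry (43) is taken from the example in the text.
MIN_ENTRY_LEVEL = {"university": 43}


def highest_completed_level(certified_levels, attended_programmes):
    """Highest completed level with certificate, after downgrading.

    certified_levels: SCED codes of programmes completed with certificate
    attended_programmes: names of programmes attended (qualified or not)
    """
    best = max(certified_levels, default=None)
    for programme in attended_programmes:
        required = MIN_ENTRY_LEVEL.get(programme)
        # Attending the programme implies at least its minimum entry level.
        if required is not None and (best is None or required > best):
            best = required
    return best


print(highest_completed_level([33], ["university"]))  # 43
```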

In the end all valid educational attainment records are stored in an Educational Attainment File (EAF). For reference date September 2008 this file contains 7.5 million records in total, of which 6.7 million come from registers and 770 thousand from the LFS (see the graph below). In terms of integration, a difference remains between the observed population and the total population (16.5 million individuals). This means that a large gap of 9.0 million, or 54.7 percent, still has to be bridged. The next section describes the strategy to achieve this.

[9] For some education programmes there is no examination requirement. To acquire a certificate, it is sufficient to have attended the complete programme.

Figure 3.2. Population coverage by source in Educational Attainment File (EAF), September 2008

PR = Population Register; ERR = Exam Results Register; ENR = Education Number Register; CREHE = Central Register for Enrolment in Higher Education; LFS = Labour Force Survey; CREHE prelim = CREHE preliminary.
Note: The contribution of the SFR is so small that it is imperceptible in the graph.

3.4 Weighting strategy

Subsection 3.4.1 presents the weighting process to make the Educational Attainment File representative for population estimation purposes. Subsection 3.4.2 provides a theoretical framework for a selection strategy for weighting models. The qualities of the models are judged on two important estimation properties: variance and bias of the population estimator. In connection with this, the subsection also discusses the phenomenon of selectivity of nonresponse. In subsection 3.4.3 the selection strategy is worked out for the case of the Educational Attainment File.

3.4.1 Population weights for the Educational Attainment File

The Educational Attainment File (EAF) consists of a mixture of register and sample records. For 2008 the EAF covers about 45 percent of the total population. However, the coverage is not spread uniformly over the population, as the graph below shows. It is not surprising that younger people are better represented by the education registers, since these registers have existed for no more than a few decades.

Figure 3.3. Population coverage by source and age in Educational Attainment File, September 2008

PR = Population Register; LFS = Labour Force Survey.

The EAF provides full coverage for children aged 0-14 at the reference date, as was discussed earlier in section 3.3.2.1 [10]. At the age of 15 there is coverage of 97 percent, predominantly by registers.

As age rises, the register coverage curve shows a sharp decline in the first section, from 98 percent coverage for age 18 to 61 percent for 25 year-olds. Thereafter, the decline is less steep. The coverage of the register part drops to 10 percent for the age of 50. The curve for registers shows a gradual decrease to less than 3 percent when people are over 60.

The LFS coverage curve starts very modestly with a contribution of no more than 2 percent up to age 30. This is, of course, because this age range is strongly represented by the registers [11]. The LFS curve shows a gradual rise up to 10 percent for those who are aged 65. For people over 65 the LFS curve shows a decline. This is mainly because there is less sampling in the LFS among these ages. The undersampling is compensated by giving these respondents a higher final LFS weight. For people older than 75, the LFS curve almost coincides with the curve which represents total coverage; in other words, the LFS is almost entirely responsible for coverage in that age class.

[10] In ages 0 to 11 the Education Attainment File records come only from the Population Register (PR). From ages 12 to 14 records come either from registers or the PR. There is not much response from this group in the LFS, so it was decided not to use this source at all for these ages.

[11] As mentioned in section 3.2.4 registers are given priority if the highest attained level is observed in both registers and LFS.

Since older ages are underrepresented in the EAF, this file is not representative of the total population. The EAF will also not be representative with respect to some other population characteristics, because of the relatively large share of register information in the EAF. After all, the education registers were set up to register a specific target population attending education, e.g. the student population in higher education, whereas the purpose of the EAF is to describe education levels of the total population. Another reason for the EAF not being representative is the selective loss which may arise when records are declared invalid by the different decision rules mentioned in section 3.3.2.3.

A weighting strategy has been applied to make the EAF representative. This has been done as follows.

The register records (6.7 million in 2008) are supposed to give an integral representation of the register’s target population. Therefore a final weight equal to 1 is attached to the register records, which in fact amounts to counting them. Accordingly, in 2008, a total of 6.7 million register records got a weight equal to 1.

What remains is the LFS (770 thousand records in 2008). The LFS records are reweighted to bridge the gap (9.010 + 0.770 million in 2008) between the register population and the total population. For 2008 this implies that an average calibrated LFS weight of 12.7 (9.010 + 0.770 divided by 0.770) is needed to cover the remaining population including the LFS. As the LFS has its own specific sampling design, it is better to use the LFS weights as initial weights. These, for their part, have to be reweighted [12] in order to get a representative outcome.

A complicating factor is that several LFSs from different years are involved. For EAF 2008, for example, fourteen LFSs (1996,..,2009) have been used [13]. Respondents generally only participate once [14] in the LFS, so when the aim is to collect as much data as possible, it is wise to make use of all the available LFSs. It should be kept in mind that not all LFS observations will be included in the EAF, as some of them are rejected in the validation process after applying probabilistic decision rule 2.

[12] Register records are not affected by reweighting. Register weights remain equal to 1.

[13] Therefore, the sampling design is in fact a matter of sampling in space and time.

[14] In fact, in order to have continuous measurement, a LFS respondent is approached up to five times in one year.

Each LFS has its own year-specific sampling design and sample size. It is advisable to prevent the LFS from having a more dominant contribution for some years than for others. To realize this, a scaling procedure is applied to the LFS weights so that: (1) the total sum of all the scaled weights is equal to the remaining population including LFS (9.010 + 0.770 million in 2008); (2) the average scaled weight per year is the same for each year [15].

Table 3.1 shows that without scaling, later years would have been assigned higher average LFS sample weights. Scaling provides equal average weights (12.7), and these weights add up to the remaining population (9,779,781 people).
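As a check on the scaling procedure, the sketch below recomputes the scaling factors from the sample sizes and weighted totals reported in Table 3.1, using the formula for the scaling factor given in footnote 15. The variable and function names are illustrative only.

```python
# (n_t, sum of LFS weights) per LFS year, as read from Table 3.1
LFS_YEARS = {
    1996: (57_525, 5_782_653),
    1997: (56_805, 5_905_449),
    1998: (51_661, 6_281_878),
    1999: (45_045, 6_553_554),
    2000: (50_336, 7_063_156),
    2001: (50_501, 7_302_712),
    2002: (53_279, 7_606_316),
    2003: (57_546, 8_085_128),
    2004: (66_391, 8_760_946),
    2005: (63_910, 9_021_215),
    2006: (58_839, 9_245_660),
    2007: (58_099, 9_369_058),
    2008: (56_385, 9_028_528),
    2009: (45_676, 8_853_437),
}
N_REMAIN = 9_779_781  # remaining population, LFS inclusive


def scaling_factors(years, n_remain):
    """lambda_t = (n_t / sum of n_t) * (N_remain / sum of w_t,i)."""
    n_total = sum(n for n, _ in years.values())
    return {t: (n / n_total) * (n_remain / w_sum)
            for t, (n, w_sum) in years.items()}


lam = scaling_factors(LFS_YEARS, N_REMAIN)
# Sum of scaled weights and average scaled weight per year
scaled_totals = {t: lam[t] * w_sum for t, (_, w_sum) in LFS_YEARS.items()}
avg_scaled = {t: scaled_totals[t] / n for t, (n, _) in LFS_YEARS.items()}

for t in sorted(LFS_YEARS):
    print(t, round(lam[t], 3), round(scaled_totals[t]), round(avg_scaled[t], 1))
```

By construction the scaled totals add up to the remaining population, and the average scaled weight is the same in every year: the remaining population divided by the total number of LFS records.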

Together with the register records, the scaled LFS weights add up to the total population. Although complete coverage has now been achieved, the weighted sums for subgroups will still differ from the population margins. Therefore, apart from scaling, one more step has to be taken, which is calibration (reweighting) of the scaled weights. Besides achieving consistency with population margins, calibration also serves to lower the variances of the estimates of the target variable and to reduce selective nonresponse bias.
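The idea of calibrating scaled weights to population margins can be illustrated with its simplest special case: post-stratification on a single categorical variable. This is a deliberately simplified stand-in, not the linear weighting model with many auxiliary variables that the report applies; the group labels, weights and margins below are invented for illustration.

```python
from collections import defaultdict


def poststratify(records, margins):
    """Rescale weights so that weighted counts per group match the margins.

    records: list of (group, scaled_weight) pairs for the survey records
    margins: dict mapping group -> population count to be reproduced
    """
    weight_sums = defaultdict(float)
    for group, weight in records:
        weight_sums[group] += weight
    # Each weight is multiplied by (population margin / current weighted count).
    return [weight * margins[group] / weight_sums[group]
            for group, weight in records]


# Toy example with two groups
records = [("m", 10.0), ("m", 14.0), ("f", 12.0)]
margins = {"m": 30, "f": 15}
print(poststratify(records, margins))  # [12.5, 17.5, 15.0]
```

After this adjustment the weighted count of each group equals its population margin, which is the consistency property that calibration is meant to deliver.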

For the EAF of 2008 several auxiliary variables were used in the weighting model [16]:

1. demographic variables such as gender x age (mainly 5-year classes), marital status, country of origin (7 categories plus a distinction by first/second generation [17]) and region;

[15] Scaling procedure in formulae, worked out for the LFS of 1996,..,2009. Let $w_{t,i}$ be the LFS weight of observation $i = 1, \dots, n(t)$ in the LFS of year $t$; $N_{remain}$ the remaining population (LFS inclusive) to be covered by the LFS; and $n_t$ the sample size of the LFS in year $t$ ($t = 1996, \dots, 2009$). With scaling factor

\[ \lambda_t = \frac{n_t}{\sum_{t=1996}^{2009} n_t} \cdot \frac{N_{remain}}{\sum_{i=1}^{n(t)} w_{t,i}} \]

we get the combined result that:
1) the scaled weights $\lambda_t w_{t,i}$ altogether add up to $N_{remain}$: $\sum_t \sum_i \lambda_t w_{t,i} = N_{remain}$;
2) the average scaled weights $\sum_{i=1}^{n(t)} \lambda_t w_{t,i} / n_t$ are identical for each year $t$: $N_{remain} / \sum_{t=1996}^{2009} n_t$.

[16] For a solution of the weighting model a method of linear weighting is applied. The solution is found with BASCULA, a tool for weighting sample survey data (reference: Bascula 4). For further details on weighting models see sections 3.4.2 and 3.4.3.

[17] A first generation foreign background refers to the country of birth of the person. A second generation foreign background refers to the country of birth of the person’s mother, or if she is a native the father’s country of birth.

Table 3.1. Scaling process, Educational Attainment File 2008

Columns: (1) year t; (2) number of sample records in the LFS of year t, n_t; (3) weighted number of sample records, LFS weights of year t, Σi w_t,i; (4) average LFS weight, Σi w_t,i / n_t; (5) scaling factor λ_t; (6) weighted number of sample records, scaled LFS weights of year t, Σi λ_t w_t,i; (7) average scaled weight, Σi λ_t w_t,i / n_t.

(1)                    (2)        (3)            (4)      (5)      (6)          (7)
1996                   57,525     5,782,653      100.5    0.126    728,735      12.7
1997                   56,805     5,905,449      104.0    0.122    719,614      12.7
1998                   51,661     6,281,878      121.6    0.104    654,449      12.7
1999                   45,045     6,553,554      145.5    0.087    570,636      12.7
2000                   50,336     7,063,156      140.3    0.090    637,664      12.7
2001                   50,501     7,302,712      144.6    0.088    639,754      12.7
2002                   53,279     7,606,316      142.8    0.089    674,946      12.7
2003                   57,546     8,085,128      140.5    0.090    729,001      12.7
2004                   66,391     8,760,946      132.0    0.096    841,051      12.7
2005                   63,910     9,021,215      141.2    0.090    809,621      12.7
2006                   58,839     9,245,660      157.1    0.081    745,381      12.7
2007                   58,099     9,369,058      161.3    0.079    736,006      12.7
2008                   56,385     9,028,528      160.1    0.079    714,293      12.7
2009                   45,676     8,853,437      193.8    0.065    578,630      12.7
remaining population   771,998    108,859,689    141.0             9,779,781    12.7

2. socio-economic variables such as regular labour income (yes/no), entrepreneurial income (yes/no), other labour income [18] (yes/no), unemployment benefit (yes/no), disablement benefit (yes/no), income support benefit (yes/no), pension/life insurance benefit (yes/no), other benefit not mentioned before (yes/no), socio-economic category, and income level (20 percent income brackets);

3. educational attainment from the Public Employment Service Register (PESR) is added as an auxiliary variable. As mentioned in section 3.2.2.4 the PESR version of educational attainment was rejected for use as a target variable [19], but it was considered suitable as an auxiliary variable in a weighting model because of the high correlation with the desired concept of educational attainment.

Once scaling and calibration are carried out, one can obtain representative estimates of the distribution of the educational attainment level for (sub)populations. The (sub)population estimator of the number of people with education level y is a combined estimator and is given by the formula:

\[ t_y = \sum_{i \in Reg} y_i + \sum_{i \in S \cap NReg} w_i y_i \qquad (3.4) \]

where

$Reg$ is the register population;

$S \cap NReg$ is the population in the sample of LFSs that is not recorded in registers ($NReg$ is the non-register population);

$y_i = 1$ for people with education level y and 0 otherwise;

$w_i$ are the scaled weights for records in $S \cap NReg$ after calibration.
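A minimal sketch of the combined estimator (3.4): register records are counted with weight 1, and LFS records of people not found in the registers contribute their calibrated weight. The data layout, the function name and the toy values are illustrative assumptions.

```python
def combined_estimate(register_levels, survey_records, level):
    """Estimate the number of people with a given education level.

    register_levels: education level of each register record (weight 1)
    survey_records: (education level, calibrated weight) pairs for LFS
        respondents who are not recorded in the registers
    level: the education level y to be counted
    """
    from_registers = sum(1 for lv in register_levels if lv == level)
    from_survey = sum(w for lv, w in survey_records if lv == level)
    return from_registers + from_survey


# Toy data: three register records and two weighted LFS records
registers = ["lower", "tertiary", "tertiary"]
survey = [("tertiary", 12.5), ("lower", 13.0)]
print(combined_estimate(registers, survey, "tertiary"))  # 14.5
```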

The tables below show educational attainment for people in the Netherlands in September 2008 (source EAF 2008). Three basic categories are distinguished (international ISCED level coding): lower education (ISCED 0/1/2), intermediate education (ISCED 3/4) and higher or tertiary education (ISCED 5/6).

[18] Other labour income: labour income from a non-permanent job such as free-lance work, work of an artist and volunteer work.

[19] As mentioned in section 3.2.2.4 new PESR-data have recently become available. These data contain a variable education level in a version that probably is much closer to the desired statistical concept of education, and therefore more suitable for use as a target variable. It is, however, still a subject of study. If it appears to be an appropriate target variable, there is no need at all to incorporate the new version as an auxiliary variable in the weighting model.

The first table maps the contribution of registers. As was discussed before, EAF 2008 gives complete coverage of the age group 0-14 by data from registers. All the people of age 0-14 have a lower education level (ISCED 0/1/2).

Table 3.2. Education levels for ages 0-14 and 15+ in the Netherlands, September 2008 1), contribution from registers

Age / Education level (ISCED 1997)           Registers    Share of age group    Share of total
                                             (x 1,000)    registered (%)        age group (%)
0-14
  ISCED 0/1/2 Lower Education 2)             2,923        100.0                 100.0
  Subtotal age 0-14 (registered) 3)          2,923        100.0                 100.0
15+
  ISCED 0/1/2 Lower Education 2)             1,173        31.1                  8.7
  ISCED 3/4 Intermediate Education           1,143        30.3                  8.4
  ISCED 5/6 Tertiary Education               1,453        38.6                  10.7
  Subtotal age 15+ (registered)              3,769        100.0                 27.8
  Non-registered age 15+                     9,780                              72.2
  Subtotal age 15+                           13,549                             100.0
Total (all ages)
  ISCED 0/1/2 Lower Education 2)             4,097        61.2                  24.9
  ISCED 3/4 Intermediate Education           1,143        17.1                  6.9
  ISCED 5/6 Tertiary Education               1,453        21.7                  8.8
  Total registered                           6,692        100.0                 40.6
  Non-registered                             9,780                              59.4
  Total                                      16,472                             100.0

1) based on EAF 2008.
2) including those who had no formal education, e.g. children aged 0-3 years.
3) main part is population imputation (see section 3.3.2.1).

The coverage of the age group 15+ by registers amounts to 27.8 percent; that is, 72.2 percent is not observed. Judged from the registers alone, a tertiary (or higher) education level would appear to be the largest category among people aged 15 and older (38.6 percent). However, this is an overestimate of the real share of the higher educated. The main reason is that there is a relatively large amount of CREHE information on higher education, since this register has existed since the early eighties (see section 3.2.2.3), while register information on lower and intermediate education has only been available since the late nineties, some fifteen years later (see section 3.2.2.2). A much better estimate of the share of tertiary education for age 15+ is given in the second and third tables, see hereafter.

The second table maps the combined contribution of registers and LFS surveys.

Table 3.3. Education levels for ages 0-14 and 15+ in the Netherlands, September 2008 1), contribution from registers and LFS (weighted with scaled LFS weights)

Columns: (A) Registers; (B) Survey weighted (scaled LFS weights); (C) Registers plus survey weighted (scaled LFS weights). For each, the number (x 1,000) and the share of the age group (%) are given.

Age / Education level (ISCED 1997)       (A) x 1,000   (A) %    (B) x 1,000   (B) %    (C) x 1,000   (C) %
0-14
  ISCED 0/1/2 Lower Education 2)         2,923         100.0                           2,923         100.0
  Subtotal age 0-14                      2,923         100.0                           2,923         100.0
15+
  ISCED 0/1/2 Lower Education 2)         1,173         31.1     3,719         38.0     4,892         36.1
  ISCED 3/4 Intermediate Education       1,143         30.3     4,091         41.8     5,233         38.6
  ISCED 5/6 Tertiary Education           1,453         38.6     1,970         20.1     3,424         25.3
  Subtotal age 15+                       3,769         100.0    9,780         100.0    13,549        100.0
Total
  ISCED 0/1/2 Lower Education 2)         4,097         61.2     3,719         38.0     7,815         47.4
  ISCED 3/4 Intermediate Education       1,143         17.1     4,091         41.8     5,233         31.8
  ISCED 5/6 Tertiary Education           1,453         21.7     1,970         20.1     3,424         20.8
  Total                                  6,692         100.0    9,780         100.0    16,472        100.0

1) based on EAF 2008.
2) including those who had no formal education, e.g. children aged 0-3 years.

The survey records are weighted by scaled LFS weights, as described above (see also footnote 15). Compared with the results for age 15+ in the first table, the share of lower and intermediate education has increased, while the share of tertiary education has dropped considerably. The task of the LFS is to compensate for information on education that has not been observed by registers. There are 770 thousand LFS records, of which 35.6 percent represent lower education, 43.5 percent intermediate and 20.9 percent tertiary education (not in the table). In other words, a relatively high proportion of the compensating LFS records represents lower and intermediate education. It is interesting to notice that these percentages do not change strongly when scaled LFS weighting is applied: 38.0, 41.8 and 20.1 percent respectively (see the column ’Survey weighted (scaled LFS weights)’).

The share of tertiary education for the age group 15+, when adding LFS information to that of the registers, drops from 38.6 to 25.3 percent, which is a more realistic estimate.

The third table, like the second, gives the results of combining registers and LFS surveys, although now with reweighted (calibrated) LFS records.

The distribution of the combined estimator does not differ very much from that in the second table. The largest difference is to be found in the category tertiary education, where the share drops from 25.3 to 23.3 percent when calibration is applied. Apparently, most of the overrepresentation and underrepresentation has already been corrected with scaled LFS weighting. The effect of the calibration of the scaled LFS weights on the distribution of education levels can be considered as adding the ultimate ’finishing touch’, in order to reflect the real population structure [20].

Table 3.4. Education levels for ages 0-14 and 15+ in the Netherlands, September 2008 1), contribution from registers and LFS (weighted with calibrated LFS weights)

Columns: (A) Registers; (B) Survey weighted (calibrated LFS weights); (C) Registers plus survey weighted (calibrated LFS weights). For each, the number (x 1,000) and the share of the age group (%) are given.

Age / Education level (ISCED 1997)       (A) x 1,000   (A) %    (B) x 1,000   (B) %    (C) x 1,000   (C) %
0-14
  ISCED 0/1/2 Lower Education 2)         2,923         100.0                           2,923         100.0
  Subtotal age 0-14                      2,923         100.0                           2,923         100.0
15+
  ISCED 0/1/2 Lower Education 2)         1,173         31.1     3,812         39.0     4,985         36.8
  ISCED 3/4 Intermediate Education       1,143         30.3     4,262         43.6     5,405         39.9
  ISCED 5/6 Tertiary Education           1,453         38.6     1,706         17.4     3,159         23.3
  Subtotal age 15+                       3,769         100.0    9,780         100.0    13,549        100.0
Total
  ISCED 0/1/2 Lower Education 2)         4,097         61.2     3,812         39.0     7,908         48.0
  ISCED 3/4 Intermediate Education       1,143         17.1     4,262         43.6     5,405         32.8
  ISCED 5/6 Tertiary Education           1,453         21.7     1,706         17.4     3,159         19.2
  Total                                  6,692         100.0    9,780         100.0    16,472        100.0

1) based on EAF 2008.
2) including those who had no formal education, e.g. children aged 0-3 years.

There is an impression that the weighting model for the EAF, as mentioned above, may have been too generously specified with auxiliary variables. This is based on the occurrence of some unexpected and implausible outcomes that were found for some small subpopulations. One of the objectives of the weighting model, besides variance reduction, is to achieve consistency with as many population margins as possible. However, the drawback of such an abundant model is that fluctuation may affect the final weights too much. This can have a disrupting effect on the accuracy of the target variable educational attainment, in particular in cells with a small number of observations. In a simulation study, Nascimento Silva and Skinner (1997) point out that adding auxiliary variables to a regression model causes the variance of the regression estimator to drop initially. However, there may be a turning point where adding still more variables leads to higher variances instead, and consequently produces less accurate estimates. Perhaps the auxiliary information should be restricted to the most relevant variables that show a strong association with educational attainment. The following two sections 3.4.2 and 3.4.3 focus on a selection strategy for optimal weighting models in which the estimators are examined with respect to properties such as accuracy (in terms of low variances and limited bias).

[20] The effect of calibrating the scaled LFS weights appears to be rather small where total population is concerned. Obviously, for smaller subpopulations the effect might occasionally be more substantial.

If consistency is also considered important for other variables in addition to the weighting variables, the method of repeated weighting could be considered. An extensive treatment of this method is to be found in Bakker (2010). This method was also used for the Virtual Census 2001, see Linder (2004). Repeated weighting is in fact based on the repeated application of the regression method, and is not the same as calibration (reweighting). Calibration results in a fixed set of survey weights for the sample concerned, whereas with repeated weighting a new set of weights (based on the survey weights) is derived for each table in order to attain consistency with population margins or other tables previously estimated.

3.4.2 Selection strategy for weighting models, theoretical framework [21]

Introduction

If a sample survey is used for statistical inference on population characteristics Y, one should weight the sample observations. Estimates of population totals for Y are found by adding up the products of weight $w_i$ and population characteristic $y_i$ over all respondents:

\[ \sum_i w_i y_i \]

The sampling design sets the values of the weights in the first stage (Sarndal, Swensson and Wretman, 1992, chapters 2 and 3). These are the so-called initial or inclusion weights $\pi_i^{-1}$ (the reciprocal of the inclusion probability $\pi_i$). Weighting with inclusion weights $\pi_i^{-1}$ may lead to biased estimates $\sum_i \pi_i^{-1} y_i$ of the population totals in the case of selective nonresponse.

This can be adjusted in the second stage by applying correction weights $g_i$, which should result in an improved estimate of the population total. One way of finding correction weights is by making use of available background characteristics as auxiliary information in linear regression estimation.

The Generalized Regression Estimator assumes a linear relationship between the target variable $Y_i$ and auxiliary variables $\vec{X}_i^t = (X_{i1}, \dots, X_{im})$ for all $i = 1, \dots, N$ in population P.

The linear relationship is given by the formula

\[ E\left(Y_i \,\middle|\, \vec{X}_i = \vec{x}_i\right) = \vec{\beta}^t \vec{x}_i \qquad (3.5) \]

[21] Based on Heij (2011)

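In linear regression the coefficient vector $\vec{\beta}$ is estimated as:

\[ \vec{\beta} = \left( \sum_{k \in R} \pi_k^{-1} \vec{x}_k \vec{x}_k^t \right)^{-1} \left( \sum_{k \in R} \pi_k^{-1} \vec{x}_k y_k \right) \qquad (3.6) \]

with R the response population in a sample S taken from population P, and with response data $\vec{x}_k$ and $y_k$.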

Notice that there is information on background characteristics $\vec{X}_i$ for the entire population, so for all $i = 1, \dots, N$ in population P.

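The Generalized Regression Estimator of the population total can be written as:

\[ t_Y = \sum_{i \in R} \pi_i^{-1} g_i y_i \qquad (3.7) \]

with correction weight:

\[ g_i = 1 + \left( \sum_{k \in P} \vec{x}_k - \sum_{k \in R} \pi_k^{-1} \vec{x}_k \right)^t \left( \sum_{k \in R} \pi_k^{-1} \vec{x}_k \vec{x}_k^t \right)^{-1} \vec{x}_i \qquad (3.8) \]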

More detailed information on the Generalized Regression Estimator can be found in, for example, chapters 6 and 7 of Sarndal, Swensson and Wretman (1992).
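The correction weights of formula (3.8) can be illustrated with a small sketch for the simplest non-trivial case: an auxiliary vector consisting of a constant and one variable, so that the 2x2 matrix can be inverted in closed form. The function name and the toy numbers are illustrative assumptions, not values from the report.

```python
def greg_weights(x_resp, pi_inv, x_pop_total):
    """Correction weights g_i of (3.8) for auxiliary vector x_i = (1, x_i1).

    x_resp: value x_i1 of the auxiliary variable for each respondent
    pi_inv: inclusion weight 1/pi_i for each respondent
    x_pop_total: (N, population total of x_k1), known for the population
    """
    # T = sum over respondents of pi^-1 * x x^t (symmetric 2x2 matrix)
    t00 = sum(pi_inv)
    t01 = sum(d * x for d, x in zip(pi_inv, x_resp))
    t11 = sum(d * x * x for d, x in zip(pi_inv, x_resp))
    det = t00 * t11 - t01 * t01
    # Population totals minus their inclusion-weighted sample estimates
    d0 = x_pop_total[0] - t00
    d1 = x_pop_total[1] - t01
    # Row vector (d0, d1) multiplied by the inverse of T
    a0 = (d0 * t11 - d1 * t01) / det
    a1 = (-d0 * t01 + d1 * t00) / det
    return [1.0 + a0 + a1 * x for x in x_resp]


# Toy example: four respondents, population of 500 with total age 22,000
ages = [20.0, 35.0, 50.0, 65.0]
pi_inv = [100.0, 100.0, 100.0, 100.0]
g = greg_weights(ages, pi_inv, (500.0, 22000.0))

# The final weights pi^-1 * g reproduce the known population totals.
print(sum(d * gi for d, gi in zip(pi_inv, g)))
print(sum(d * gi * x for d, gi, x in zip(pi_inv, g, ages)))
```

The two printed sums equal, up to floating-point rounding, the population size and the population total of the auxiliary variable. This is the consistency with population margins that the weighting model is meant to achieve.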

Although application of the Generalized Regression Estimator provides an estimate of the true value of the population parameter, just like any other estimator it is subject to potential variance and bias. The variance depends on the number of respondents and the extent of bias depends on the selectivity of the response. The higher the number of respondents, the lower the variance of the estimator will be. A less selective response will result in a less biased estimator. The size of variance and bias can to some extent be controlled by the choice of auxiliary variables in the linear regression model. A high correlation between the target variable and the auxiliary variables will reduce the variance and bias of the estimator, whereas a low correlation may even result in an increase of the variance, see Kish (1992) and Little and Vartivarian (2005). It is therefore important to handle the model selection with great care.

Two indicators, a variance and a bias indicator, will be introduced which can be used in a selection strategy for weighting models. These indicators measure the effect of the choice of auxiliary variables in a model on the variance and bias of the regression estimator. A comparison of the estimated variances and biases of different models will show which selection of auxiliary variables gives the best results.

The ideal selected set of auxiliary variables will provide a regression estimator with the lowest variance as well as the smallest bias. In practice this cannot always be realised. So, one can either opt for an estimator with the lowest variance while tolerating some bias, or vice versa: an estimator with a minimum of bias and a restricted amount of variance. Some may prefer a model specification which generates an estimator with the lowest mean square error (MSE), that is the sum of the variance and the bias squared. However, the use of the MSE has its pros and cons.

Variance indicator

The variance indicator judges whether the use of a specific selection of auxiliary variables in a regression improves population estimation in terms of a lower variance. It compares the variance of the regression estimator for that specific selection with the variance of a direct estimator in which no correction weights and only inclusion weights are used ($g_i = 1$).

A value of the variance indicator of less than one indicates that, for the selected set of auxiliary variables, linear regression improves the quality of estimation in terms of variance. For values higher than one, regression with this selection of auxiliary variables performs less well (higher variance) than estimation based on initial weights only.

The variance indicator $w_V$ is expressed as a ratio of two variance estimators:

\[ w_V(t_Y) = \frac{\sum_{i \in R} \left( n_R \pi_i^{-1} g_i (y_i - \hat{y}_i) - t_{Y - \hat{Y}} \right)^2}{\sum_{i \in R} \left( n_R \pi_i^{-1} y_i - t_Y \right)^2} \qquad (3.9) \]

with $n_R$ the number of respondents in the sample response population R and $\hat{y}_i = \vec{\beta}^t \vec{x}_i$. The advantage of a ratio expression is that it makes the indicator invariant to the number of respondents in the sample [22].
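The ratio in (3.9) can be sketched as follows. Two readings are assumed here and are not confirmed by the text: the term in the numerator with subscript Y minus fitted Y is taken to be the regression-weighted total of the residuals, and the total in the denominator is taken to be the direct estimator based on inclusion weights only. The function name is illustrative.

```python
def variance_indicator(pi_inv, g, y, y_hat):
    """Ratio (3.9) of the two variance estimators, under stated assumptions.

    pi_inv: inclusion weights 1/pi_i of the respondents
    g: correction weights g_i of the respondents
    y: observed target values y_i
    y_hat: fitted values from the regression on the auxiliary variables
    """
    n = len(y)
    # Numerator terms: n_R * pi^-1 * g_i * residual
    z_reg = [n * d * gi * (yi - fi)
             for d, gi, yi, fi in zip(pi_inv, g, y, y_hat)]
    # Denominator terms: n_R * pi^-1 * y_i
    z_dir = [n * d * yi for d, yi in zip(pi_inv, y)]
    # Assumed totals: weighted total of residuals and direct estimator of Y
    t_res = sum(z_reg) / n
    t_dir = sum(z_dir) / n
    numerator = sum((z - t_res) ** 2 for z in z_reg)
    denominator = sum((z - t_dir) ** 2 for z in z_dir)
    # Note: the denominator is zero when all pi^-1 * y_i are equal.
    return numerator / denominator
```

Under these assumptions the indicator equals one when no auxiliary information is used (all correction weights equal to 1 and all fitted values equal to 0), and it equals zero when the auxiliary variables predict the target variable perfectly.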

[22] The variance indicator is set up as a ratio. The advantage of a ratio expression is that it makes the indicator invariant to the number of respondents in the sample: as shown below, the factor $[n_R (n_R - 1)]^{-1}$ appears in both numerator and denominator and cancels out.
The numerator contains the estimator for the variance of the regression estimator, which is constructed by combining the results of section 6.6 and formula 11.2.6 of Särndal, Swensson and Wretman (1992):

$$
\hat{V}(\hat{t}_Y) = \frac{1}{n_R (n_R - 1)} \sum_{i \in R} \left( n_R \pi_i^{-1} g_i (y_i - \hat{y}_i) - \hat{t}_{Y-\hat{Y}} \right)^2
$$

The denominator contains the estimator for the variance of the estimator which only uses initial weights ($g_i = 1$) and no auxiliary variables ($\hat{y}_i = 0$):

$$
\hat{V}(\hat{t}_Y) = \frac{1}{n_R (n_R - 1)} \sum_{i \in R} \left( n_R \pi_i^{-1} y_i - \hat{t}_Y \right)^2
$$

In order to derive $\hat{V}(\hat{t}_Y)$ it is assumed that the sample is drawn with replacement, which is acceptable if the sampling fraction is small enough. In the case of a sample without replacement the formula above overestimates the real variance somewhat, but the overestimation will be less if the sampling fraction decreases.

Bias indicator

There is no direct estimator which measures the influence of a selection of auxiliary variables on the bias of the regression estimator. However, it is possible to find an estimator for an upper bound of the bias.

The bias indicator, $w_B$, is constructed in such a way that it compares an upper bound of the absolute value of the bias of the regression estimator with an upper bound of the absolute value of the bias of the estimator that is only based on initial weights.

It should be noticed that the bias indicator deals with a ratio of upper bounds of bias, and not with a ratio of bias itself. This means that in theory, if $w_B < 1$, there is no guarantee that the regression estimator itself suffers less bias than the estimator based on initial weights. However, simulation studies (see Appendix B) have indicated that what results from the bias indicator (ratio of upper bounds) in general also applies for the ratio of bias itself.

The bias indicator has the following expression:

$$
w_B(\hat{t}_Y) = \sqrt{1 - \gamma_S^2(\vec{\beta}^t \vec{X}, I_R)} \, \sqrt{1 - \gamma_R^2(\vec{\beta}^t \vec{X}, Y)}. \qquad (3.10)
$$

$I_R$ is the response indicator, a dichotomous 0-1 variable, that indicates for each person in the sample whether this person responds (1) or not (0). $\gamma_S(X_1, X_2)$ is the correlation between $X_1$ and $X_2$ for sample elements; $\gamma_R(X_1, X_2)$ is the correlation between $X_1$ and $X_2$ for the responding elements in the sample[23].

[23] The bias indicator is constructed as a ratio of upper bounds of bias.
For the numerator an estimator is needed for an upper bound of the absolute value of the bias of the regression estimator. Such an estimator $B_{MAX}$ can be found in Schouten (2007). The author distinguishes different response mechanisms, of which Not Missing At Random (NMAR) can be considered as the most extreme form. With NMAR the response probability depends on the value of the target variable $Y$. In the case of NMAR the upper bound of the bias (because of NMAR's extreme form the upper bound can be interpreted as a sort of 'worst case' nonresponse bias, and so the bound will also be valid for other more specific response mechanisms) will be:

$$
B_{MAX}(\hat{t}_Y) = 2 N_P \sigma_Y \sqrt{\frac{1 - \theta}{\theta}} \, \sqrt{1 - \gamma_S^2(\vec{\beta}^t \vec{X}, I_R)} \, \sqrt{1 - \gamma_R^2(\vec{\beta}^t \vec{X}, Y)}.
$$

$N_P$ is the size of total population $P$; $\theta$ is the average response probability.
For the denominator an estimator is needed for the upper bound of the absolute value of the bias if only initial weights are used. This is found by substituting zero for $\gamma_S^2(\vec{\beta}^t \vec{X}, I_R)$ and $\gamma_R^2(\vec{\beta}^t \vec{X}, Y)$ in the formula for $B_{MAX}$. The result is:

$$
B_{MAX}(\hat{t}_Y) = 2 N_P \sigma_Y \sqrt{\frac{1 - \theta}{\theta}}.
$$

It is not difficult to see that $w_B$ is bounded: $0 \le w_B \le 1$.

If there is no sample and no response correlation at all, the auxiliary variables will have no contribution in explaining the target variable and the response indicator. The bias indicator will then have value one, which implies that the upper bound of the bias of the regression estimator is equal to that of the estimator based on initial weights only.

A bias indicator of zero would imply that the regression estimator is unbiased. In this case either the target variable or the response indicator, or even both, are perfectly explained by the auxiliary information.

From the formula of the bias indicator $w_B$ it is easy to see that the value of the indicator depends on the correlation between the auxiliary variables $\vec{X}$ and the target variable $Y$. A high correlation implies a value of $\gamma_R^2(\vec{\beta}^t \vec{X}, Y)$ near to one, and consequently a small value for indicator $w_B$, or in other words a low upper bound for bias. Therefore, the selection strategy should preferably opt for auxiliary variables $\vec{X}$ that have a strong relationship with the target variable $Y$. It is not only the bias upper bound for which this strategy would have a positive effect, but it is expected that bias itself is also reduced.

The bias indicator $w_B$ is also dependent on the correlation $\gamma_S^2$ between the auxiliary variables $\vec{X}$ and the response indicator $I_R$. It should be noticed, however, that this dependency is weaker than the relationship between indicator $w_B$ and $\gamma_R^2$. The reason for this is that in $\gamma_S^2(\vec{\beta}^t \vec{X}, I_R)$, $\vec{X}$ is pre-multiplied by the estimated coefficient vector $\vec{\beta}$, which in turn is dependent on the correlation between the auxiliary variables $\vec{X}$ and target variable $Y$. So, the correlation $\gamma_S^2$ between $\vec{X}$ and $I_R$ is in fact mainly determined by those variables in the set of auxiliary variables $\vec{X}$ which strongly correlate with $Y$, and therefore have a substantial contribution to the product $\vec{\beta}^t \vec{X}$. The remaining variables in $\vec{X}$ have no significant correlation with $Y$, and therefore their contribution will be hardly visible in $\vec{\beta}^t \vec{X}$[24].

Summarizing, the bias indicator is basically influenced by the correlation $\gamma_R^2$ between the auxiliary variables $\vec{X}$ and the target variable $Y$, and less by the correlation between $\vec{X}$ and the response indicator $I_R$.
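The expression for the bias indicator can be sketched in a few lines once the linear predictor is available for every sample element. The code below is an illustration only: `pearson`, `bias_indicator` and the toy data are hypothetical, the regression coefficients are taken as given rather than estimated, and the response indicator is assumed not to be constant (otherwise the sample correlation is undefined).

```python
import math


def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (z - mean_b) for x, z in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((z - mean_b) ** 2 for z in b)
    return cov / math.sqrt(var_a * var_b)


def bias_indicator(lin_pred, responded, y):
    """Product of the two square-root terms that make up w_B.

    lin_pred   linear predictor (beta^t x_i) for every sample element
    responded  response indicator I_R (1 = respondent, 0 = nonrespondent)
    y          target variable; only the values of respondents are used
    """
    # Correlation over all sample elements.
    gamma_s = pearson(lin_pred, responded)
    # Correlation over the responding elements only.
    resp = [i for i, r in enumerate(responded) if r == 1]
    gamma_r = pearson([lin_pred[i] for i in resp], [y[i] for i in resp])
    # max(0, .) guards against tiny negative values caused by rounding.
    term_s = math.sqrt(max(0.0, 1.0 - gamma_s ** 2))
    term_r = math.sqrt(max(0.0, 1.0 - gamma_r ** 2))
    return term_s * term_r


# Hypothetical sample of eight elements, six of which respond.
lin_pred = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.6, 0.7]
responded = [1, 1, 0, 1, 1, 1, 0, 1]
y = [0, 0, None, 1, 1, 0, None, 1]
w_b = bias_indicator(lin_pred, responded, y)

# A predictor that explains y perfectly among respondents drives w_B to zero.
w_b_perfect = bias_indicator([1, 2, 3, 4], [1, 1, 0, 1], [2, 4, None, 8])
```

The second call mirrors the limiting case discussed in the text: when the target variable is perfectly explained by the auxiliary information, the indicator is (numerically) zero.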

Appendix B gives an extensive treatment of how the variance and bias indicator operate in practice, by applying them to an imaginary population[25]. It is a simulation study in which the effect of different sets of selected auxiliary variables on these indicators is studied. The advantage of such an imaginary population is that every relevant fact is known for every member of this population, even their response behaviour. The simulation study can test whether the predicted theoretical effects, as discussed above, will prove to be true in practice. Although the results of the simulation study are valid for many representative situations, one should not generalise them for all possible situations. In theory it will always be possible to find a counterexample for which the results of the simulation study do not apply.

3.4.3 Selection strategy for weighting models applied to Educational Attainment File

Introduction

As an instrument for selecting weighting variables, two indicators were introduced in section 3.4.2: a variance and a bias indicator. Variance and bias are two important attributes for the quality of the regression estimator in the weighting process. In the search for an optimal model, preference should be given to models with a low outcome for the indicators mentioned. The correlation of target variable and response probability with the auxiliary variables in the model plays an important role in this respect. Appendix B presents an extensive treatment of how these indicators operate in the situation of an imaginary dataset.

[24] The statement can be made intuitively clear by using a trivial example. Suppose an auxiliary variable set $\vec{X} = (X_1, X_2)^t$ with no correlation between $X_1$ and $X_2$, and $Y = (1, 0) \cdot \vec{X}$ and $I_R = (0, 1) \cdot \vec{X}$. In other words, $Y$ is perfectly correlated with $X_1$ while $I_R$ is perfectly correlated with $X_2$. The expression $\vec{\beta}^t \vec{X} = X_1$ has correlation with $Y$, but not at all with $I_R$. So, $\gamma_S^2(\vec{\beta}^t \vec{X}, I_R)$ is zero, and will therefore not contribute to the determination of $w_B$.

[25] The results of the variance and bias indicator in Appendix B are shown for three different response mechanisms: Missing Completely At Random (MCAR), Missing At Random (MAR) and Not Missing At Random (NMAR).


The present section presents the variance and the bias indicator as a means to select first-rate weighting models with the purpose of making the Educational Attainment File (sections 3.3 and 3.4.1) representative for population estimation.

The first part of this section deals with educational attainment of the complete Dutch population.

The second part focuses on education levels of subpopulations (or domains). A particularly interesting target group of research is, for example, the relatively small subpopulation of Moroccans aged 18-30[26]. For most Moroccans younger than thirty years who attended their final stage of education in the Netherlands there will be information on their educational attainment in registers. Problems may arise with first-generation Moroccans. Most of them probably attended education in Morocco, and information about this will not be present in Dutch education registers. For a few of them the LFS might have data on education level; however, the number of observations will be limited as nonresponse is relatively high among first-generation Moroccans.

The Educational Attainment File (EAF) contains data on education levels for about 7.5 million people in the country (reference date September 2008). Most of these data, about ninety percent, originate from registers, and the other ten percent are sample data (for more details, see section 3.4.1). It should be kept in mind that the weighting process only refers to the sample part of EAF. The response data in the sample will be reweighted in such a way that the weights add to the so-called remaining population (LFS inclusive), in other words the non-register population. This is the part of total population for which there is no information from education registers.

The variance and bias indicators will be applied only to the sample part. Obviously, the only relevant information that should be known are the education levels for (sub)populations as a whole, or in other words the result of the combined estimator from both sample and register data. As registers are considered not to be a source of variance, the variance of the combined estimator will be equal to that of the sample part. Likewise, as it is assumed that there is no bias involved with data from registers, the combined estimator will not differ from the sample estimator in terms of bias either. However, it is obvious that the coefficient of variation (relative standard error) and relative bias of the combined estimator will be reduced due to the contribution of the register data. For more on the coefficient of variation, see section 3.5.

[26] Similarly, in section 3.5 the relatively small subpopulations of males aged 18-24 and 25-30 of Turkish origin in 2008 are the target group in a study on the measurement of the accuracy of estimation of education levels.

It is also important to bear in mind that in the specific case of EAF the inclusion weights $\pi_i^{-1}$ in the formula of the variance indicator are in fact the scaled LFS weights that were discussed in section 3.4.1. These LFS weights are not genuine inclusion weights in the sense of inverse inclusion probabilities; they themselves are a product of the calibration applied to obtain representative LFS estimates. Therefore, in the estimator of the non-register population (NReg) in formula (3.7) $\pi_i^{-1}$ should be substituted by scaled LFS weights $\lambda_t w_{t,i}$ (see also footnote 15).

$$
\hat{t}_{y,\mathrm{NReg}} = \sum_{i \in R \cap \mathrm{NReg}} \lambda_t w_{t,i} \, g_i \, y_i \qquad (3.11)
$$

$R$ is the response population in the sample part.

The combined estimator is then given by the following formula:

$$
\hat{t}_y = \sum_{i \in \mathrm{Reg}} y_i + \sum_{i \in R \cap \mathrm{NReg}} \lambda_t w_{t,i} \, g_i \, y_i \qquad (3.12)
$$

Scaled LFS weights adjust overrepresentation and underrepresentation in the EAF, as was shown in section 3.4.1. However, because of the selective loss involved in the statistical process of building EAF, the adjustments will not be sufficient and therefore the LFS weights will have to be calibrated again.
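The combined estimator (3.12) is simply a register count plus a weighted sample total, which a short sketch may make concrete. The function name `combined_estimate`, the argument names and the toy numbers below are hypothetical and only serve as an example; the correction weights are taken as given.

```python
def combined_estimate(y_register, y_response, scaled_weights, g):
    """Register total plus weighted sample total, as in formula (3.12).

    y_register      y_i for every person in the register part (Reg)
    y_response      y_i for respondents in the non-register part
    scaled_weights  scaled LFS weights (lambda_t * w_{t,i}) of those respondents
    g               correction weights g_i of those respondents
    """
    # Register data enter unweighted: every register person counts once.
    register_total = sum(y_register)
    # Respondents represent the non-register population through their weights.
    sample_total = sum(
        w * gi * yi for w, gi, yi in zip(scaled_weights, g, y_response)
    )
    return register_total + sample_total


# Hypothetical toy data for one dichotomous target variable.
y_register = [1, 0, 1, 1]
y_response = [1, 0, 1]
scaled_weights = [10.0, 12.0, 8.0]
g = [1.5, 1.0, 0.5]
total = combined_estimate(y_register, y_response, scaled_weights, g)
```

In this toy case the register part contributes 3 and the sample part 15 + 0 + 4 = 19, so the combined estimate is 22.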

In this section the education levels will be classified as 1-digit level codes of the SCED 2006 classification (for more details, see section 3.3.2). SCED 1 and 2 are taken together, as is the case with SCED 6 and 7. Altogether, five different SCED levels are distinguished:

• SCED = 1+2; primary education (or less)

• SCED = 3; secondary education, 1st stage

• SCED = 4; secondary education, 2nd stage

• SCED = 5; higher education, 1st stage

• SCED = 6+7; higher education, 2nd and 3rd stage

In section 3.4.1 seventeen auxiliary variables were introduced that are used by the Generalized Regression Estimator in the weighting model. These variables are gender, age class, marital status, country of origin plus a distinction by generation (groups), region, socio-economic category, regular labour income (yes/no), entrepreneurial income (yes/no), other labour income (yes/no), unemployment benefit (yes/no), disablement benefit (yes/no), income support benefits (yes/no), pension/life insurance benefit (yes/no), other benefit not mentioned before (yes/no), income class and educational attainment according to the PESR source. They are either categorical or dichotomous variables.

Weighting models for the complete Dutch population

A weighting model which consists of no fewer than seventeen weighting variables can rightly be considered as a complex and an abundantly specified model. It is not unlikely that at least some of the variables (or accompanying terms) in the model will have little or no association with the target variable. So, as discussed in section 3.4.2, there is the risk of a needless increase of variance, while at the same time irrelevant auxiliary variables may not contribute to reduction of bias. In other words, eliminating the redundant variables from the complete model (with 17 auxiliary variables) could result in an estimator which is hardly more biased, but has lower variance.

To find simpler models with optimal variance and bias qualities a selection strategy could be to eliminate one or two variables from the complete model (with 17 auxiliary variables), and to compare the values of the bias and variance indicator before and after elimination. In the process of testing different models, all eight dichotomous socio-economic variables, such as regular labour income, entrepreneurial income, and the different types of benefits will be considered as one entity. So, to test which socio-economic variables are eligible for elimination from the complete model, it is the whole block with these variables that is (or is not) eliminated, not just one separate variable.

Taking this into account, the result is that nine ($\binom{17-8}{1}$) weighting models remain with 'one' omitted variable and thirty-six ($\binom{17-8}{2}$) with 'two' omitted variables from the complete model.

For computation of the bias indicator $w_B$ it is necessary to calculate the sample correlation $\gamma_S^2(\vec{\beta}^t \vec{X}, I_R)$ between the auxiliary variables and the response indicator. In the case of the Educational Attainment File we know that the response data come from more than one sample. In EAF 2008, for example, there is response from fourteen LFSs (1996,..,2009). As noticed before, together these LFSs can be considered as a sample in space and time. Moreover, not all the observations in these samples are included, only a selective part of them (see section 3.4.1). So the sampling design concerned is quite complex, and it is not easy to define the corresponding sample precisely. For this reason $\gamma_S^2$ will not be calculated for sample data, but in fact for the total remaining population as defined in section 3.4.1. This is the population for which there is no education level information from registers. The response indicator $I_R$[27] is defined for persons in the remaining population and its value is one if LFS has education level information for these persons, and zero if not.

To get an impression of the extent to which individual auxiliary variables contribute in terms of variance and bias, the indicators will also be calculated for basic weighting models with no more than 1 or 2 variables from the set of 17 auxiliary variables in the complete model. This gives 9 weighting models with only one variable and 36 with a combination of two variables.

Altogether, values of the indicators will have to be determined for 91 models. This is the sum of one complete model (17 auxiliary variables), 9 models with combinations of 16 (9 if the block of socio-economic variables is left out) auxiliary variables, 36 model combinations of 15 (8 if the block of socio-economic variables is left out) auxiliary variables, 9 models with 1 auxiliary variable, and finally 36 models with combinations of 2 auxiliary variables.
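The count of 91 models can be checked by enumerating the model set. The sketch below assumes, as the text does, nine selectable units with the block of socio-economic variables counted as one unit; the unit labels are placeholders and not the report's variable names.

```python
from itertools import combinations
from math import comb

# Nine selectable units; the block of eight dichotomous socio-economic
# variables counts as a single unit (labels are illustrative placeholders).
units = [f"unit_{k}" for k in range(1, 10)]

models = [tuple(units)]  # the complete model
for k in (1, 2):
    # Complete model with k units omitted.
    models += [
        tuple(u for u in units if u not in drop)
        for drop in combinations(units, k)
    ]
    # Basic model containing only k units.
    models += list(combinations(units, k))

# Same count from the binomial coefficients quoted in the text.
counts = (1, comb(9, 1), comb(9, 2), comb(9, 1), comb(9, 2))
n_models = len(models)
```

Both routes give 1 + 9 + 36 + 9 + 36 = 91 distinct models.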

This time the calculations were done for the Educational Attainment File of a previous year, EAF 2005. EAF 2005 contained sample response data from LFSs (1996,..,2006) for 674,711 persons, and the remaining population consisted of 10,477,704 persons[28]. As calculating the indicators for all 91 models was tedious and time-consuming, it was decided not to use the data for all the 674,711 persons, but only for a sample from them. A simple random sample without replacement was taken of 300,000 persons. As discussed before, the variance indicator $w_V$ is invariant to the number of respondents. The same applies to the bias indicator $w_B$, as this indicator only depends on correlations between auxiliary variables and target variable, and between auxiliary variables and the response probabilities. These correlations are not affected by the number of respondents. So, there is no problem with reducing the sample size to 300,000 as this will not result in systematic differences in the values of these indicators.

[27] It is most likely that a Not Missing At Random (NMAR) response situation arises in the building process of EAF (for the NMAR concept see Appendix B). This is because the probabilistic decision rules employed to validate records (see section 3.3.2.3) cause selective loss of response data. These decision rules depend on known characteristics of the non-register population such as age and background, but also for example on education level, the target variable in the weighting model, for which in the non-register population no information is available except from LFS observations. So, the response probability depends on known and unknown characteristics of the person, which classifies it as an NMAR response type. As the selective loss is of an NMAR type, estimation of education levels of the population will suffer irreparable bias to some extent. Auxiliary variables in a weighting model such as age and background (country of origin) will be able to adjust for bias somewhat; however, there is no auxiliary information for unknown characteristics such as education level. One more reason to expect bias is that the probabilistic decision rule will give preference to stable education patterns, in which education levels remain unchanged during a longer observation period (see Appendix B), which means that dynamic patterns are underrepresented. Notwithstanding the appearance of bias, the selection strategy, guided by the bias and variance indicator, is to search for weighting models with as little bias and variance as possible.

[28] In EAF 2008 total sample response was 771,998 persons and the remaining population consisted of 9,779,781 persons (see section 4.1).

The target variable, level of educational attainment, is in fact an ordinal variable with five categories: SCED 1+2, SCED 3, SCED 4, SCED 5 and SCED 6+7. To calculate values for the bias and variance indicators, each of these five categories will be represented by a separate dichotomous variable. In this way five corresponding dichotomous variables $Y_{12}$, $Y_3$, $Y_4$, $Y_5$ and $Y_{67}$ will be defined. E.g. $Y_{12} = 1$ if the educational attainment level is SCED 1+2 and $Y_{12} = 0$ if not. Notice that these five target variables are not completely independent as $Y_{12} + Y_3 + Y_4 + Y_5 + Y_{67} = 1$. However, this is not really a problem as the same set of auxiliary variables is used for each of the five target variables, and so the five regression equations may be estimated separately. Both the variance and the bias indicator will be calculated for the 91 weighting models mentioned before. See the graph below.

In the graph[29] four clearly distinguishable groups are immediately noticeable in the pattern of weighting models for the higher education levels SCED 5 and SCED 6+7.

[29] Throughout the presentation of the results of the bias indicator the term bias is used. It should be kept in mind that the bias indicator is in fact a ratio of upper bounds for bias. However, in Appendix B it is shown that in general there is a clear correlation between the bias indicator $w_B$ and bias $B$ for MAR and NMAR response situations.

All the weighting models in group 1 contain at least the auxiliary variables income level and educational attainment as measured in the PESR source. The model encircled in this group has only these two variables in its auxiliary variable set.

Weighting models belonging to group 2 have in common that they contain at least the auxiliary variable educational attainment (PESR), and that income level is not included in the auxiliary variable set. The encircled model in the pattern of group 2 is the model with the two variables educational attainment (PESR) and socio-economic category.

Models from group 3 include at least the variable income level, but exclude educational attainment (PESR).

Models in group 4 have neither of the two mentioned variables as auxiliary information.

The complete model is by definition of course a member of group 1, and is referred to in the graph as a square.


Figure 3.4. Values of indicators $w_S$ ($= \sqrt{w_V}$) and $w_B$, five education levels and 91 weighting models, EAF 2005.

The square denotes the complete model; the encircled number 1 is a model with two auxiliary variables: educational attainment (PESR) and income level; the encircled number 2 is a model with two auxiliary variables: educational attainment (PESR) and socio-economic category.


It is not without reason that out of all the auxiliary variables, income level and educational attainment (PESR) are responsible for the remarkable patterns in the graph. The two variables correlate strongly with the target variable, as the following table shows for education level SCED 5.

Table 3.5. Values of indicators $w_S$ and $w_B$ and correlation of response indicator $I_R$ and target variable $Y$ with the auxiliary variable set $\vec{\beta}^t \vec{X}$ for a number of weighting models, education level SCED 5.

| Auxiliary variable set in weighting model | $w_S$ | $w_B$ | $\gamma_S(\vec{\beta}^t \vec{X}, I_R)$ | $\gamma_R(\vec{\beta}^t \vec{X}, Y)$ |
|---|---|---|---|---|
| Complete model | 0.85 | 0.92 | 0.03 | 0.40 |
| Educational attainment (PESR) + Income level | 0.84 | 0.92 | 0.03 | 0.38 |
| Educational attainment (PESR) | 0.89 | 0.94 | 0.02 | 0.33 |
| Income level | 0.92 | 0.97 | 0.02 | 0.23 |
| Gender * Age | 0.98 | 0.99 | 0.04 | 0.12 |

In the table, the correlation $\gamma_R(\vec{\beta}^t \vec{X}, Y)$ in a model with both auxiliary variables, income level and educational attainment (PESR), approaches that of the complete model. With these two variables in the auxiliary variable set, the bias indicator $w_B$ and the square root of the variance indicator, $w_S = \sqrt{w_V}$, are clearly reduced compared to a model with the term gender * age.

It is instantly clear from the graph that none of the overall 91 weighting models is optimal in terms of lowest variance and lowest bias for all the different education levels. While a model from group 1 may prevail for higher education levels, for the lower education levels a representative from group 2 appears to be superior.

The encircled number 1 in the graph, a model with the two auxiliary variables educational attainment (PESR) and income level, would certainly be an excellent choice as weighting model for the higher education levels. This model gives the lowest value for the variance indicator. It also performs very well in terms of bias, although for that one can find slightly better models in group 1, such as the complete model.

It is a bit more difficult to find the 'best' model for lower education levels. Models which generate the lowest $w_B$, the complete model for example, certainly do not always provide the lowest $w_S$. The encircled model from group 2, which contains the two auxiliary variables educational attainment (PESR) and socio-economic category, has one of the lowest values for $w_S$, but performs less well with respect to the bias indicator.

The table below shows the explicit values of the variance and bias indicators for three selected models in the graph:


1. Complete model with 17 auxiliary variables (the square in the graph).

2. Model 1 with the 2 auxiliary variables, income level and education level PESR (encircled number 1 in the graph).

3. Model 2 with the 2 auxiliary variables, education level PESR and socio-economic category (encircled number 2 in the graph).

Table 3.6. Values of estimator $\hat{t}_Y$ and of indicators $w_S = \sqrt{w_V}$ and $w_B$ for the complete model, model 1 (note 1) and model 2 (note 2).

| Education level | $\hat{t}_Y$ | Complete model $w_B$ | Complete model $w_S$ | Model 1 $w_B$ | Model 1 $w_S$ | Model 2 $w_B$ | Model 2 $w_S$ |
|---|---|---|---|---|---|---|---|
| $Y_{12}$ | 1,433,375 | 0.92 | 0.96 | 0.94 | 0.96 | 0.94 | 0.92 |
| $Y_3$ | 2,877,523 | 0.94 | 0.98 | 0.95 | 0.94 | 0.96 | 0.92 |
| $Y_4$ | 4,440,732 | 0.93 | 0.94 | 0.95 | 0.90 | 0.94 | 0.89 |
| $Y_5$ | 1,154,937 | 0.92 | 0.84 | 0.92 | 0.84 | 0.94 | 0.88 |
| $Y_{67}$ | 571,137 | 0.89 | 0.83 | 0.89 | 0.77 | 0.92 | 0.85 |

1) Model with the two auxiliary variables income level and education level PESR.
2) Model with the two auxiliary variables socio-economic category and education level PESR.

The complete model is definitely the front-runner of the three selected models in terms of bias of the estimator. However, it should be noticed at the same time that the difference with models 1 and 2 is marginal in that respect.

Model 1 is to be preferred for the higher education levels when looking at the variance indicator and the same can be said for model 2 where lower education levels are concerned. Once again the differences are small.

This leads to the conclusion that, based on a 91-model study, the choice for a complete model to be applied to all five education categories for the complete population would certainly be very satisfactory, even if it is not always the best model for each separate education level.
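The comparison of the three models can be reproduced mechanically from the values in Table 3.6. The sketch below only illustrates the selection logic; the dictionary layout, the model labels and the helper `best_models` are hypothetical, and ties are reported rather than broken.

```python
# Values of (w_B, w_S) copied from Table 3.6.
table_3_6 = {
    "Y12": {"complete": (0.92, 0.96), "model 1": (0.94, 0.96), "model 2": (0.94, 0.92)},
    "Y3": {"complete": (0.94, 0.98), "model 1": (0.95, 0.94), "model 2": (0.96, 0.92)},
    "Y4": {"complete": (0.93, 0.94), "model 1": (0.95, 0.90), "model 2": (0.94, 0.89)},
    "Y5": {"complete": (0.92, 0.84), "model 1": (0.92, 0.84), "model 2": (0.94, 0.88)},
    "Y67": {"complete": (0.89, 0.83), "model 1": (0.89, 0.77), "model 2": (0.92, 0.85)},
}


def best_models(values, position):
    """Names of all models sharing the lowest indicator value (ties are kept)."""
    lowest = min(v[position] for v in values.values())
    return [name for name, v in values.items() if v[position] == lowest]


# Position 0 is the bias indicator w_B, position 1 the variance indicator w_S.
lowest_bias = {level: best_models(v, 0) for level, v in table_3_6.items()}
lowest_variance = {level: best_models(v, 1) for level, v in table_3_6.items()}
```

On these values the complete model is among the models with the lowest bias indicator at every education level, while the lowest variance indicator is found with model 2 for the three lowest levels and with model 1 for the highest level, in line with the discussion above.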

Weighting models for subpopulations (domains) in the Dutch population

Of course, the Educational Attainment File is not only intended for the estimation of the education of the complete Dutch population, but also for subpopulations (domains) within that population. The strength of the EAF is that it enables education estimation even for smaller subpopulations, as it contains a great deal more data than any other source on education.

Understandably, the variance and bias of (small) domain estimators also depend on the weighting model selected. Several weighting models were tested on a selected domain of young Moroccans aged 18-30. Once again, the testing process was applied to data in EAF 2005.

As in the case of the complete population of the Netherlands, five dichotomous variables $Y_{12}$, $Y_3$, $Y_4$, $Y_5$ and $Y_{67}$ are defined for each separate subpopulation or domain. So, for example, $Y_4 = 1$ indicates that the person belongs to the subpopulation Moroccans, aged 18-30, with education level SCED 4. If not, $Y_4 = 0$. As Moroccans are a minority group with a relatively high survey nonresponse rate, it is not surprising that of the 300,000 subsample, $Y_i = 1$ ($i$ = 12, 3, 4, 5, 67)[30] for only 306 persons.
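The construction of the five domain-specific dichotomous variables can be sketched as follows. The level codes, the function `domain_indicators` and the example records are hypothetical; the point is only that a person outside the domain scores zero on all five variables, while a person inside the domain scores one on exactly one of them.

```python
LEVELS = ("12", "3", "4", "5", "67")


def domain_indicators(sced_level, in_domain):
    """Five 0-1 variables Y12..Y67 for one person.

    sced_level  one of LEVELS (hypothetical coding of the five SCED groups)
    in_domain   True if the person belongs to the domain of interest
    """
    return {f"Y{lv}": int(in_domain and sced_level == lv) for lv in LEVELS}


# Hypothetical records: (SCED group, belongs to domain).
records = [("4", True), ("12", True), ("4", False), ("67", True)]
indicators = [domain_indicators(level, flag) for level, flag in records]

# Number of records with some Y_i = 1, i.e. the domain members in the data.
domain_size = sum(sum(row.values()) for row in indicators)
```

Counting the records for which one of the five variables equals one gives the number of domain members in the data, which is the quantity reported as 306 in the text.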

The following graph gives the values of the variance and bias indicators for Moroccans in the age group 18-30. The values are shown for each of the 91 weighting models. The graph distinguishes four groups of models, recognisable by their number:

• Group 1 stands for weighting models with the term gender * age (15-29; 30-49; 50+) * country of origin + country of origin * generation in the auxiliary variable set.

• Group 2 stands for weighting models with the term gender * age (15-29; 30-49; 50+) * country of origin in the auxiliary variable set. The term country of origin * generation is not in the model.

• Group 3 stands for weighting models with the term country of origin * generation in the auxiliary variable set. The term gender * age (15-29; 30-49; 50+) * country of origin is not in the model.

• Group 4 stands for weighting models with neither of the two terms.

It is clear that both terms correlate strongly with the definition of the domain, Moroccans aged 18-30, and therefore also with $Y_{12}$ to $Y_{67}$.

Models in group 4 do not reduce the value of the bias indicator. The modelsin the other groups certainly do, in particular for the lower education levels.The reduction can be up to 20 percent, see for example group 1 and educationlevel SCED 4. For the education levels SCED 1+2 to SCED 4, the largest

30 The total number of Moroccans aged 18-30 was 78,763 in September 2005. Therewere register data for 30,609 of them. This means that data on 306 respondents are neededto cover the remaining population of 48,154 (=78,763-30,609) Moroccans, i.e. 1 samplerecord for every 157 Moroccans in the remaining population. On average, there is onesample record for about every 75 persons in the total remaining population aged 18-30.So, the response rate for Moroccans is about half the total population in the age class18-30.


Figure 3.5. Values of the indicators $w_S = \sqrt{w_V}$ and $w_B$ for EAF 2005, five education levels and 91 weighting models, Moroccans aged 18-30. The square is the symbol for the complete model.


reduction in the value of the bias indicator is achieved by models in which both terms as defined above are included, in other words models belonging to group 1. Notice that for SCED 6+7 there is hardly any perceptible bias reduction, no matter which weighting model is applied. Altogether this is a strong indication that in general estimates for subpopulations will be less biased if the weighting models are specified with auxiliary variables that have a strong connection with the classification of the subgroup.

With respect to the variance, the story is quite different. Notice that $w_S > 1$ regardless of the weighting model applied, which means that all these models are subject to more variance than if no auxiliary information is used. For the complete model (symbolised as a square) the ratio can get as high as 2.5 to 3, except for education level SCED 6+7. A reduction of the value of the bias indicator is often coupled with an even larger increase in the variance indicator. In this case it is the relatively low number of respondents (306) in the target population that is to blame for the high values of $w_S$ (see footnote 31). Section 3.4.1 has already addressed the occurrence of some unexpected and implausible outcomes found for some small subpopulations. What one can learn from this is that auxiliary information in a weighting process does not necessarily improve the quality of domain estimators in terms of variance, in particular for small domains.

Let there be no mistake about weighting for small domains. Weighting may deteriorate the variance quality of the estimators, but this applies to domains that are part of the remaining population. However, what matters is the outcome of the combined estimator. This estimator not only focuses on the remaining population, but also adds register information from the register population. In relative terms (coefficient of variation) a larger value of the variance indicator may be offset by the register contribution in the domain. If there are sufficient data from registers in the small domain the combined estimator can still be more accurate than an estimator which is exclusively sample-based.

Concluding, a uniform model such as the complete model works fine with the complete population, and has the advantage of consistency with population margins. For subpopulations it may also improve the bias conditions, in particular if at least some of the auxiliary variables in the complete model have an important role in defining these subpopulations. The effect of the

31 With relatively little response data there will be much more fluctuation in the correction weights $g_i$ in the formula of the variance indicator:

$$w_V(t_Y) = \frac{\sum_{i \in R} \left( n_R \pi_i^{-1} g_i (y_i - \hat{y}_i) - t_{Y-\hat{Y}} \right)^2}{\sum_{i \in R} \left( n_R \pi_i^{-1} y_i - t_Y \right)^2}$$

More dispersion in the correction weights produces an upward effect on $w_V$.


complete model on the variance in the case of subpopulations can be less positive, as shown by the example above. However, as long as there are enough register observations for the subpopulation in question, even with the complete model the coefficient of variation will still be lower for a combined estimator than if the estimation is only sample-based.

3.5 Measurement of accuracy

Statisticians generally want to know if the results of their estimates are reliable. When samples are used as data source instead of registers one is faced with sampling errors. The standard literature on sampling theory mostly refers to estimation on sample data.

There is less literature in the case of combined register and sample data. It is intuitively clear that a higher share of register data reduces the effect of sampling errors. Even then, these errors are still there, so we need statistical measures to determine whether (sub)population estimates are accurate enough.

The ideal solution to measurement of the accuracy is to derive exact formulae for the variance estimators. Unfortunately, that would be a difficult project in the case of combined register and sample survey data. It is much easier to derive formulae which are approximations of the exact variance. However, the problem with that kind of formulae is that they only hold above a minimum number of sample observations. This is unfortunate, because for bigger subpopulations in general there is no need to worry about the accuracy. It is mainly the accuracy of the smaller subpopulations that we are interested in, in particular if the sample size drops below that minimum.

An alternative approach to the estimation problem for the measurement of accuracy is presented in Kuijvenhoven and Scholtus (2010). These two methodologists introduce a method of bootstrap resampling inspired by Canty and Davison (1999). It can be summarised as follows. First of all, the sample part of the Education Attainment File (EAF) is inflated by adding copies of each sample record, so that it represents the entire population. Then a number of so-called bootstrap samples without replacement is taken from the population file so constructed, with each sample as big as the original sample part in the EAF. Kuijvenhoven and Scholtus find that 500 bootstrap


replications are sufficient for convergence of the bootstrap variance (see footnotes 32 and 33).
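The resampling scheme summarised above can be illustrated with a minimal Python sketch. It is not the implementation of Kuijvenhoven and Scholtus: the function name `bootstrap_variance`, the rounding of weights to whole numbers of copies, and the use of a fixed expansion factor instead of recalibrating the weights in every replicate are all simplifying assumptions made for illustration.

```python
import random


def bootstrap_variance(register_count, sample, n_boot=500, seed=0):
    """Bootstrap variance of a combined cell estimate (illustrative sketch).

    register_count: number of register observations in the cell (fixed).
    sample: list of (weight, in_cell) pairs, in_cell being 0 or 1.
    """
    rng = random.Random(seed)
    # Inflate the sample: each record is copied round(weight) times
    # (assumption: weights are close enough to integers for this to be fair).
    pseudo_population = [
        in_cell for weight, in_cell in sample for _ in range(round(weight))
    ]
    n = len(sample)
    # Assumption: a single expansion factor replaces the recalibrated weights.
    expansion = len(pseudo_population) / n

    estimates = []
    for _ in range(n_boot):
        # Draw without replacement, same size as the original sample part.
        draw = rng.sample(pseudo_population, n)
        estimates.append(register_count + expansion * sum(draw))

    mean = sum(estimates) / n_boot
    # Variance over the replicates with divisor B - 1.
    return sum((e - mean) ** 2 for e in estimates) / (n_boot - 1)
```

In this sketch the register part contributes no variance, so a cell dominated by register observations has a small bootstrap variance relative to its total, which is in line with the argument made in the text.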

The table below compares the accuracy of the educational attainment estimates from the LFS and EAF for an example of a small subpopulation. In this case it concerns young people of Turkish origin in the Netherlands with an education level equivalent to a Master's degree or higher. The LFS 2008 does not contain enough observations to give accurate information (coefficient of variation (CV) of 50 percent or higher). With the EAF 2008 the number of sample observations is also rather low, but thanks to the contribution of the register observations the accuracy is much better, in particular for the age group 25-30 (CV of at most 15 percent).

Table 3.7. Coefficient of variation (CV) for some highly educated (Master's degree level or higher) Turkish males and females in the Netherlands, aged 18-30, September 2008.

                               LFS 2008              EAF 2008
Subpopulation                  n(g)  N(g)  CV 1)     N1(g)  n2(g)  N2(g)  N(g)  CV 2)
Turkish males; age 18-24          1   146  100%         71      1     71   142    40%
Turkish females; age 18-24        3   535   58%         92      4    185   277    36%
Turkish males; age 25-30          4   612   50%        461     11    435   896    15%
Turkish females; age 25-30        2   356   71%        533     13    349   882    13%

LFS 2008: n(g) is the number of observations in cell g; N(g) is the estimate (weighted number of sample observations) of the total population in cell g.

EAF 2008: N1(g) is the number of register observations in cell g; n2(g) is the number of sample observations in cell g (mind that EAF 2008 has sample observations from fourteen LFSs (1996,..,2009)); N2(g) is the weighted number of sample observations in cell g; N(g) = N1(g) + N2(g).

1) Derived with an approximation formula. Suppose N is the population size; n is the sample size; N(g) is the total population in cell g; n(g) is the number of sample observations in cell g; p = N(g)/N; q = 1 - p and f = n/N. With n large enough (the sample size of LFS 2008 is almost 90 thousand), the variance of N(g) can be approximated as $\mathrm{var}(N(g)) = N^2 p q (1 - f)/n$. With the assumption that the average sample fraction f is very small (e.g. the LFS sample fraction is about 1 percent), and p is very small (i.e. a relatively small subpopulation), the coefficient of variation (CV) can be approximated as $[q(1 - f)/(np)]^{1/2} \approx [1/n(g)]^{1/2}$.

2) Bootstrap estimation.
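As a check on note 1), the LFS 2008 column CV 1) can be reproduced from the cell counts n(g) alone using the final approximation CV ≈ [1/n(g)]^(1/2). The short Python sketch below does this; the function name `approx_cv` is chosen here for illustration only.

```python
import math


def approx_cv(n_g):
    """Approximate coefficient of variation for a small cell: 1 / sqrt(n(g))."""
    return 1.0 / math.sqrt(n_g)


# n(g) for the four LFS 2008 cells of Table 3.7, in table order.
lfs_cell_counts = [1, 3, 4, 2]
cv_percent = [round(100 * approx_cv(n_g)) for n_g in lfs_cell_counts]
print(cv_percent)  # [100, 58, 50, 71]
```

The rounded values agree with the percentages reported in the table, which shows how strongly the accuracy of a purely sample-based estimate depends on the handful of observations in the cell.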

3.6 Some concluding remarks

The introduction of administrative sources on education has enabled the creation of an Educational Attainment File (EAF) for the population of the

32 The following formula is used as variance estimator for cell g with bootstrap samples b = 1,...,B:

$$\mathrm{Var}_B(g) = \frac{1}{B-1} \sum_{b} \left[ N_b(g) - \frac{1}{B} \sum_{b} N_b(g) \right]^2$$

$N_b(g)$ is the estimate of the total population in cell g for bootstrap sample b = 1,...,B. It is found by totalling the number of register observations in cell g and the weighted number of sample observations in cell g.

33 For estimation of confidence intervals the literature on bootstrap methods suggests that the number of bootstrap samples be increased to at least one thousand.


Netherlands. The educational attainment variable is derived by combining data from these administrative sources and sample surveys. Combining information from different sources will in general improve data quality. Data from single administrative sources and surveys are subject to measurement and representation errors. Micro-integration is the method that aims at improving the data quality in combined sources by searching and correcting for the errors on unit level. The quality of the sources on education combined to create the EAF has been examined by applying the life cycle and framework for errors as introduced by Bakker (see section 3.2.3 and footnote 34).

The use of a combined estimator to determine education levels of (sub)populations can be considered as a new development in the field of micro-integration, one that demands sophisticated methodology to solve all the complex issues involved. An important advantage of the new method is that estimates on education level have improved in accuracy, in particular when smaller populations are involved. As education registers do not cover the entire population, sample surveys are still indispensable. This is the case, for example, for older citizens who had completed their education before administrations came into being. In spite of this, the increased contribution of registers in recent years clears the way for substantial cost reduction for surveys, particularly in areas where registers already give sufficient coverage.

In general, the results of combined estimation appear to be quite satisfactory when judged on statistical properties such as variance and bias. For this reason the use of EAF as data source for educational attainment figures for the 2011 Census is under serious consideration. In the last few decades, the Labour Force Survey (LFS) has been the conventional source for this, but the present Census is much more demanding with regard to the level of detail, and it is not unlikely that the LFS has insufficient observations to comply with all the requirements of the Census table programme. Expectations are that more detailed table cells can be published with EAF as a source, because of its relatively large number of observations.

Even though EAF has proved its usefulness for estimation of education levels there is always room for quality improvement. Considerable effort is being made, for example, to find and exploit new register sources to fill the major gaps in the EAF and to reduce the contribution of surveys. Weak spots in EAF are of course the coverage for older people, people with a foreign background whose education took place abroad, and private education. The new files of the

34 Zhang (2011) has recently expanded and developed Bakker's model in a so-called two-phase life cycle of multiple-source integrated statistical micro data.


Public Employment Service Register (PESR), for example, are a potentially promising source of information, owing to the improved registration and the large number of clients of the Public Employment Service. PESR is one of the very few registers that includes information on the population whose education took place abroad. Another reason why PESR data are useful is that they also contain education information on people in their forties, fifties and sixties who are scarcely covered by education registers.

One of the complexities of the statistical process of building an EAF is the sophisticated weighting strategy that has to be applied to the sample part. Weighting is intended to achieve consistency with population margins, to lower the variance for the estimates of the target variable and to reduce selective nonresponse bias, in particular bias caused by selective loss of sample records that have been replaced by register data. If low variance and bias are the desired statistical properties, it is wise to define a weighting model with auxiliary variables that are correlated with the dependent variable education level. For the amount of bias, the correlation between the auxiliary variables and the response indicator is also relevant, but to a lesser degree.

In the present EAF a rather ambitious weighting model has been specified with no fewer than 17 auxiliary variables. An advantage of so many weighting variables is of course the broad consistency with many population margins. On the other hand, for a number of reasons the generous model specification can also be considered to be too abundant, an aspect that is reflected by the occurrence of some unexpected and implausible outcomes found for some small subpopulations. With a small number of observations in a cell and a multitude of weighting variables in the model, weights may be affected by rather too much fluctuation. The damage is restricted if a large number of register data are available in the cell for combined estimation.

It should be noticed that the combined estimator given by formula (3.4) (see section 3.4.1),

$$t_y = \sum_{i \in Reg} y_i + \sum_{i \in S \cap NReg} w_i y_i$$

where

Reg is the register population,

S ∩ NReg is the population in sample S of the LFSs that is not recorded in registers (NReg is the non-register population),

$y_i$ = 1 for people with education level y and 0 otherwise,

$w_i$ are the scaled weights for records in S ∩ NReg after calibration,


makes use of register information on a person's education level $y_i$ in the overlap S ∩ Reg. Sample information is neglected in the overlap. After all, the basic principle in the building process of EAF was to give priority to all the available register data, as data from registers were believed to be measured more accurately.

Sample data are only used for estimation of education levels of the non-register population. This estimation does not make use of register data in S ∩ Reg as auxiliary information.
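The structure of the combined estimator of formula (3.4) can be made concrete with a small Python sketch. The function name `combined_estimate`, the tuple layout of the sample records and the toy numbers are assumptions introduced here for illustration; the calibration that produces the weights is taken as given.

```python
def combined_estimate(register_y, sample_records):
    """Register count plus weighted sample count for the non-register part.

    register_y: iterable of 0/1 indicators y_i for all persons in Reg.
    sample_records: list of (w_i, y_i, in_register) tuples for sample S.
    """
    register_total = sum(register_y)
    # Sample records that also occur in the register are ignored:
    # the register value has priority in the overlap.
    sample_total = sum(
        w * y for w, y, in_register in sample_records if not in_register
    )
    return register_total + sample_total


# Toy data (hypothetical): three register persons, three sample records.
register_y = [1, 0, 1]
sample_records = [(10.0, 1, False), (12.5, 0, False), (8.0, 1, True)]
print(combined_estimate(register_y, sample_records))  # 12.0
```

The third sample record lies in the overlap S ∩ Reg and therefore contributes nothing; its information enters only through the register total.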

Of course, it is possible to formulate alternative versions of a combined estimator. One could for example decide to make use of sample data instead of register data in the overlap S ∩ Reg. This idea is the basis for the so-called Additive Combined Estimator as introduced by Kuijvenhoven and Scholtus (2011a). The sample data in the overlap are not weighted:

$$t_y = \sum_{i \in Reg} z_i + \sum_{i \in S \cap Reg} (\upsilon_i - z_i) + \sum_{i \in S \cap NReg} w_i \upsilon_i \qquad (3.13)$$

$z_i$ is $y_i$ as measured by the register,

$\upsilon_i$ is $y_i$ as measured by the sample.
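A minimal Python sketch of formula (3.13) may help to see how the overlap term works. The function name, the dictionaries keyed by a person identifier and the toy values are assumptions made for illustration only.

```python
def additive_combined_estimate(register_z, overlap_v, nonregister_sample):
    """Additive combined estimate following the three sums of formula (3.13).

    register_z: dict person_id -> z_i (register measurement, 0/1) for Reg.
    overlap_v: dict person_id -> v_i (sample measurement, 0/1) for S and Reg.
    nonregister_sample: list of (w_i, v_i) for sample records outside Reg.
    """
    register_total = sum(register_z.values())
    # Unweighted correction: replace the register value by the sample value
    # for every person observed in both sources.
    overlap_correction = sum(
        v - register_z[person_id] for person_id, v in overlap_v.items()
    )
    weighted_total = sum(w * v for w, v in nonregister_sample)
    return register_total + overlap_correction + weighted_total


# Toy data (hypothetical): persons 2 and 3 appear in both register and sample.
register_z = {1: 1, 2: 0, 3: 1}
overlap_v = {2: 1, 3: 1}
nonregister_sample = [(10.0, 1), (5.0, 0)]
print(additive_combined_estimate(register_z, overlap_v, nonregister_sample))  # 13.0
```

When register and sample agree for every person in the overlap, the correction term is zero and the result coincides with the register-priority estimator.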

Kuijvenhoven and Scholtus (2011a) also introduce a Regression-Based Combined Estimator:

$$t_y = \sum_{i \in S \cap Reg} w_{i,Reg} \upsilon_i + \sum_{i \in S \cap NReg} w_{i,NReg} \upsilon_i \qquad (3.14)$$

In the weights $w_{i,Reg}$ auxiliary register information on education level is included.

Kuijvenhoven and Scholtus have also studied the statistical properties of combined estimators. In Kuijvenhoven and Scholtus (2011b) they show the conditions under which the combined estimator has a lower mean square error (MSE) than a direct sample-based estimator. To simplify the expressions they developed, Kuijvenhoven and Scholtus had to make some assumptions. They also tolerate some bias in the register data caused by measurement errors.


Chapter 4

First Steps in Profiling Italian Patenting Enterprises

Daniela Ichim, Giulio Perani, Giovanni Seri

ISTAT – Italian National Statistical Institute1

Abstract

The paper describes the record linkage scheme followed at the Italian national statistical institute to match micro-data on patent applications from the international database PATSTAT with the data available from the Italian Official Business Register (ASIA).

The target data in PATSTAT are the applicants based in Italy that registered one or more patents in the period 1985-2010. Patent applicants can be 'individuals' or 'establishments'. Within the latter category we aim at identifying business enterprises that were active (as recorded in ASIA) in the period 1998-2008. The desired output of the linkage process is, for each patenting enterprise, a pair composed of the 'applicant identification code in PATSTAT' and the 'enterprise identification number in ASIA'. The latter gives access to the repositories of official statistical data and therefore allows economic data to be linked to patenting enterprises. Statistical analyses, such as identifying the premises of patenting propensity or evaluating the impact of patenting on enterprise profitability, can then be performed.

On the methodological side, linkage of patent data has to rely on the applicants' names. Consequently, a great effort has been put into the pre-processing

1{ichim,perani,seri}@istat.it


phase of the process to standardise the applicant/enterprise names and extract the 'legal form' from the name string. During the linkage process, two practical problems were faced: the reduced number of comparison variables and the huge dimension, in terms of number of records, of the Italian Business Register. These issues were addressed within a rule-based deterministic record linkage approach. In this paper, together with the results obtained, we will illustrate the main features of the sequential searching and linkage methodology we adopted.

Introduction

In this report we will describe a preliminary stage of an Istat project aiming, mainly, at monitoring and profiling Italian patenting enterprises. A complete characterisation of such enterprises might allow, for example, updating the survey frame list of potential Research and Development (R&D) performers, and might favour the investigation of specific subpopulations like biotech enterprises. Moreover, from a statistical analysis point of view, identification of patenting enterprises enables their linking to structural characteristics. Thus, factors influencing the patenting propensity of enterprises might be studied, as well as the economic impact of patenting activity.

The preliminary stage we are concerned with in this report is the design of a strategy aiming at the unambiguous identification of Italian patenting enterprises.

This document is divided into six sections. In section 4.1 a general description of patenting administrative flows is given. In section 4.2 we discuss the selection of the databases to work with. A brief description of these databases is provided. In section 4.3 details on the applied standardization procedure are reported. The record linkage methodology as applied to these particular datasets is illustrated in section 4.4. Due to the reduced number of comparison variables and to the huge amount of data we had to deal with, in section 4.4 emphasis is put on search space reduction methods. Finally, in section 4.5 we present the obtained results. Some conclusions and ideas for further implementations, analyses and research are given in the last section.

4.1 A general description of patenting administrative flows

A patent is an exclusive right granted for an invention, which is a product or a process that provides, in general, a new way of doing something, or


offers a new technical solution to a problem. In order to be patentable, the invention must fulfill certain conditions. Namely, it must be of practical use; it must show an element of novelty, that is, some new characteristic which is not known in the body of existing knowledge in its technical field. This body of existing knowledge is called "prior art". The invention must show an inventive step which could not be deduced by a person with average knowledge of the technical field. Finally, its subject matter must be accepted as "patentable" under law.

A patent is granted by a national patent office or by a regional office that does the work for a number of countries, such as the European Patent Office. Under such regional systems, an applicant requests protection for the invention in one or more countries, and each country decides whether to offer patent protection within its borders. A patent provides protection for the invention to the owner of the patent. The protection is granted for a limited period.

A patent owner has the right to decide who may - or may not - use the patented invention for the period in which the invention is protected. The patent owner may give permission to, or license, other parties to use the invention on mutually agreed terms. The owner may also sell the right to the invention to someone else, who will then become the new owner of the patent. Once a patent expires, the protection ends, and the invention enters the public domain, that is, the owner no longer holds exclusive rights to the invention, which becomes available for commercial exploitation by others.

The first step in securing a patent is the filing of a patent application. The patent application generally contains the title of the invention, as well as an indication of its technical field; it must include the background and a description of the invention.

There are 3 main actors in any administrative patenting flow: the inventor, the owner and the applicant. A special feature of a patenting process is that the inventor, the owner and the applicant might be different subjects (each referring to one or more entities).

A special case of the inventor-owner-applicant relationship is provided by patents whose original idea was 'born' in enterprises where the general manager (head of the company) is also the owner of the enterprise. Sometimes the manager is the patent owner, while in other cases the patent owner is the enterprise itself. In both cases, the inventor might be a completely different person (for example a researcher employed by the enterprise), as might the applicant (for example a notary's office offering patenting services).


4.2 Data sources

To our knowledge, the most complete and updated database on patents is the European Patent Office (EPO) database "Worldwide Patent Statistical Database", called PATSTAT. Much of the raw data in PATSTAT is extracted from the EPO's master bibliographic database DOCDB, also known as the EPO Patent Information Resource. PATSTAT is updated twice a year (April and October). PATSTAT is a relational database containing 20 tables with more than 70 million records (63 million patent applications) from over 80 countries. Other sources on patents either concern only regional applications (like Ufficio Italiano Brevetti e Marchi) or offer only data extraction and analysis services.

PATSTAT registers mainly information on patent applications. To reach our goal (identification of Italian patenting enterprises), in this work we concentrated only on the two tables depicted in Figure 4.1. The link between them is given by the unique values of the field Application Number or, alternatively, Publication Number. The Application Number also contains the patent year of registration. The time period covered by the database is given by the years 1985-2010. There is no explicit database field concerning the legal form of the inventor, owner or applicant. PATSTAT registers both the inventor and applicant name; only the latter was used in this work. The possible legal form has to be extracted from those names. About the applicant, PATSTAT also registers its address (street, city, postal code) and its country code. Only applicants based in Italy, i.e. COUNTRY CODE = "IT", were selected from PATSTAT tables. At this stage of the work, the postal code was used as geographical location, assuming it has the same accuracy as the address. We plan to rely on the detailed address to assess the linkage quality or to elucidate special cases like those in which the applicant is the manager (or owner) of an enterprise. These aspects are not reported in this document.

About the patent, PATSTAT registers its IPC (International Patent Classification) code and its application and publication numbers. It is worth noting that a patent may be assigned more than one IPC code. Indeed, while in the second table of Figure 4.1 each record corresponds to a unique Application Number (about 70000), in the first table the number of records is around 300000. Moreover, it should be stressed that there is no formal/well-defined relationship between IPC codes and the principal economic activity classification (NACE).

The most recent versions of PATSTAT also include a standardized version


Figure 4.1. Used database tables from PATSTAT; COUNTRY CODE = ”IT”.

of the applicant name (see footnote 2). In our case study, this standardization was ignored because it is not fully compliant with Italian enterprise names: it includes the legal form. Moreover, as our goal is to link the patent applications to an enterprise register, we should apply the same standardization process to the selected enterprise register too.

Applicants may be classified as individuals or establishments. The latter, according to the Frascati manual, see OECD (2002), could be: business enterprises, public institutions, non-profit institutions and private or public universities. In this work, the aim is the identification of patenting enterprises. The complete classification of applications will be performed in later stages. A distinction between business enterprises and natural individuals could be favoured by a catalogue of Italian first names. Istat provides such a list, stemming from surveys on the population register. Alternatively, a list of Italian first names may be downloaded from www.nomix.it. From preliminary investigations, other data sources potentially helpful in profiling the Italian patent applications might concern the general managers of large enterprises and academia researchers.

Additional details on PATSTAT may be found at www.epo.org.

On enterprises, many registers might be available in Italy, with different degrees of accessibility. From the Istat point of view, the most important is surely ASIA (Archivio Statistico delle Imprese Attive). ASIA is developed, updated and maintained through the statistical integration of different administrative sources (Tax Register, Register of Enterprises and Local Units, Social Security Register, Work Accident Insurance Register, Register of the Electric Power Board), covering the entire population of enterprises of industry and services, other minor archives available (covering particular sectors), and structural business statistics currently produced by Istat. ASIA is a business register used in many different business survey stages, e.g. sampling frame, post-stratification, calibration, etc.

2 See OECD "Harmonised Applicants' Names" available at http://www.oecd.org/dataoecd/52/17/43846611.pdf

Among the variables included in ASIA, one may specify:

a) Enterprise Identification Number (an Istat internal identification code allowing linkage to any economic information on the same unit collected by Istat); this identification code is unique for each enterprise

b) Enterprises Name

c) Zip Code

d) NACE code

e) Geographical information (address, municipality, province, region),

f) Legal form

Other variables that could be useful are the Fiscal Code, Number of employees and Turnover.

It should be observed that only Enterprise Name and Zip Code overlap with the information contained in PATSTAT.

Given the reliability and availability of ASIA, only enterprises that were active in the period 1998-2008 have been taken into consideration. Consequently, in this work, it was assumed that an enterprise was active during the year it applied for a patent. In this report we refer only to the selection of ASIA corresponding exclusively to active enterprises.

To give an idea of the numerical complexity of the problem, table 4.1 reports the number of active enterprises per year. For each year, we also present the percentage of active enterprises with more than 1 employee, which is almost constant at around 40%. As may be observed, the union of the different versions of ASIA contains more than 47 million records.

Given that our goal is to identify the patent applicants, it was considered useful and efficient to concentrate first on lists of enterprises showing a high research and innovation propensity. To this aim, we took into consideration the survey frame of the Research and Development survey, which is conducted yearly by Istat. Only the 2006 and 2007 waves were available in


Table 4.1. Statistics on the number of active enterprises in Italy, period 1998-2008.

YEAR   Thousands of active enterprises   % of active enterprises with more than 1 employee
1998   3871                              40
1999   3950                              40
2000   4223                              40
2001   3992                              40
2002   4323                              40
2003   4327                              40
2004   4367                              40
2005   4458                              40
2006   4484                              40
2007   4554                              40
2008   4577                              40

a standardized form. In the 2006 R&D data file, there are 26237 records, while in the 2007 data file, there are 16730 records. Since ASIA is the sampling frame for any business survey conducted at Istat, the information included in the R&D survey frames is similar to that contained in ASIA. When using the R&D survey frames, we only assumed that the linkage probability would be higher (due to the innovation propensity) for the R&D survey frames than for the entire business register ASIA.

4.3 Data pre-processing and standardisation

PATSTAT counts 299769 applications identified by an Application Number and a Publication Number; the latter is redundant information and was therefore ignored in this work. The number of Italian applications reduces to 72037. Each Application Number is assigned an applicant name (and id code) and a Zip Code. Additional information may be derived from these data: year of application; year of first/last application by applicant; number of patent applications filed by each applicant; region of residence of the applicants.

Variable Applicant Name has been subject to the following standardisation operations:

1. extraction of the application year, recorded in a new variable ”Application Year”

2. transformation of all letters into upper case letters

3. removal of punctuation:


a. accents

b. symbols and special characters (e.g. ’$’, ’%’ , ’&’, ’/’, ’*’)

c. double spaces (transformed in a single space)

d. dots (e.g. L.T.D. transformed in LTD)

4. standardisation of known abbreviations (e.g. we found about 150 ways to say ”in short”) into a unique value (typically Italian words)

5. standardisation of the most frequent words using a deterministic record linkage procedure in Relais, see Istat (2011)

a. input files: we considered a file of words (sequences of characters separated by a blank character) with frequencies greater than 1000 against a file of words with frequencies greater than 100 but smaller than 1000;

b. parameters: comparison function = ”Edit distance”; threshold = 0.8; greedy algorithm to perform the one-to-one assignment;

c. output check: the word pairs declared ”match” were subject to a clerical review;

d. standardization: the 122 pairs declared as equivalent were standardized in the same way; they generally concerned singular – plural or Italian – English versions of the same words;

e. examples: TERMOIDRAULICA – TERMOIDRAULICI; SOLUTION – SOLUTIONS; MULTISERVICE – MULTISERVIZI.

6. removal of duplicated words in the same name (each second occurrence of the same word was removed). This means that each name is composed of words that each occur only once, e.g. AAA BB AAA CCCCC was transformed into AAA BB CCCCC

7. ordering of words in alphabetical order, e.g. CC BB AA was transformed into AA BB CC

8. identification and removal of the legal form, if any. Information on the legal form was stored in a standardized manner in another variable, called Legal Form. About 80 ways of expressing 6 main standardized legal forms were identified. The 6 main legal form categories are ’SPA’, ’SRL’, ’SAS’, ’SNC’, ’COOP’ and ’NONE’.

The resulting variable is called Standardized Name.
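Part of the standardisation above can be illustrated with a minimal Python sketch. It covers only operations 2, 3, 6, 7 and 8; operations 1, 4 and 5 (year extraction, abbreviation lists and the Relais-based word standardisation) are left out. The function name `standardize_name` and the small `LEGAL_FORMS` lookup are hypothetical: the report mentions about 80 legal-form spellings but does not list them, and the actual processing was not necessarily implemented this way.

```python
import re
import unicodedata

# Hypothetical lookup: only the five standard spellings themselves are listed here.
LEGAL_FORMS = {"SPA": "SPA", "SRL": "SRL", "SAS": "SAS", "SNC": "SNC", "COOP": "COOP"}


def standardize_name(raw: str) -> tuple[str, str]:
    """Return (standardized name, legal form) for a raw applicant name."""
    # Operation 2: upper case.
    text = raw.upper()
    # Operation 3a: strip accents by decomposing and dropping combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Operation 3d: drop dots so that L.T.D. becomes LTD.
    text = text.replace(".", "")
    # Operation 3b: replace remaining symbols and punctuation by a blank.
    text = re.sub(r"[^A-Z0-9 ]", " ", text)
    # Operations 3c, 6 and 7: split() ignores repeated blanks, set() drops
    # repeated words, sorted() puts the words in alphabetical order.
    words = sorted(set(text.split()))
    # Operation 8: move the legal form, if any, to a separate variable.
    legal_form = "NONE"
    kept = []
    for word in words:
        if word in LEGAL_FORMS:
            legal_form = LEGAL_FORMS[word]
        else:
            kept.append(word)
    return " ".join(kept), legal_form
```

For instance, under these assumptions `standardize_name("Termo-Idraulica  Rossi S.r.l.")` gives the pair `("IDRAULICA ROSSI TERMO", "SRL")`.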


Then, some additional variables have been derived from the standardized Applicant Name:

a. the Standardised name, without abbreviations, duplications, etc.

b. the standardised Applicant Name, without some very common words, e.g. ITALIA

c. acronyms and abbreviations

d. number of characters

e. longest and shortest words

f. length of the longest/shortest words

Since, in this stage, only enterprises should be subject to any linkage process, universities and known public administrations were eliminated from the file (those records were identified as names containing words like ”UNIVERSITY”, ”POLITECNICO”, etc.).

Except for standardisation operations 1 and 8, the same pre-processing was applied to ASIA. Operations 1 and 8 are not necessary since ASIA already contains information on the year and legal form of enterprises. Additionally, the same unique standard values identified when performing operation 5 on PATSTAT were also used for ASIA.

In this linkage stage, the comparison variables are the only three variables shared by PATSTAT and ASIA: Standardised Name, Zip Code and Legal Form (stemming from the Applicant Name).

Finally, the PATSTAT data file was deduplicated by treating as duplicates those records having the same values for all three comparison variables mentioned above. Thus, the number of records was reduced from 72037 to 23833. It should be noted that records in ASIA are supposed to be unique, each enterprise being assigned a unique identification code, i.e. a key number. This unique identification number allows the enterprise to be traced in any business survey conducted by Istat.
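The deduplication rule can be sketched as follows. This is an illustration only: the record layout (dictionaries with the hypothetical keys `standardized_name`, `zip_code` and `legal_form`) is an assumption, and keeping the first record of each group is one possible choice that the report does not specify.

```python
def deduplicate(records):
    """Keep one record per (Standardized Name, Zip Code, Legal Form) triple.

    `records` is a list of dicts; the key names are hypothetical.
    The first record encountered for each triple is retained.
    """
    seen = set()
    unique = []
    for rec in records:
        key = (rec["standardized_name"], rec["zip_code"], rec["legal_form"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```

Two records that differ in any of the three comparison variables (for example, the same name with a different Zip Code) are kept as distinct units.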

Figure 4.2 shows histograms of the length of Standardized Name and of the number of words (sequences of characters separated by a blank) in the standardized PATSTAT database. It may be observed that, on average, the Standardized Name has a length of 15, while the mean number of words in a name equals 2.2.


Figure 4.2. Distribution of both length of Standardized Name and number of words in a name, PATSTAT database.

Table 4.2 shows the distribution of variable Legal Form. It may be observed that for almost 40% of records no legal form was identified, while the majority of records, about 56%, is concentrated in categories ”SPA” and ”SRL”.

Table 4.2. Distribution of Legal Form, PATSTAT database

Legal Form   NONE    COOP   SAS    SNC    SPA     SRL     Total
Frequency    8979    63     501    756    6164    7370    23833
%            37.67   0.26   2.10   3.17   25.86   30.92   100

4.4 The record linkage process

As illustrated by the red arrow in Figure 4.3, the desired output of the linkage problem is the pair Applicant Identification Number (PATSTAT) - Enterprise Identification Number (ASIA). The latter allows linking structural and economic information stemming from Istat official surveys to patenting enterprises, shown in gray in Figure 4.3.

As described in the previous sections, the only overlapping information between the two datasets PATSTAT and ASIA concerns the Applicant Name – Enterprise Name and the Zip Code. It should be noted that, in PATSTAT, Applicant Name is missing in 40 records, while Zip Code is missing in about 10% of records. Besides the missing value problem, variable


Figure 4.3. The PATSTAT-ASIA linkage problem and its opportunities.

Zip Code in PATSTAT also presents about 9.4% of values representing the geographical location only at an aggregated level.

4.4.1 Search space reduction

Due to the huge amount of data and, consequently, the huge number of candidate matching pairs, the use of search space reduction techniques was necessary. This section gives details on the search space reduction techniques applied to PATSTAT and ASIA. Moreover, a blocking technique by neighbourhoods of words is introduced. Some classical blocking techniques based on the patent year and 2-digit ZIP Code proved to be extremely ineffective; these are not further detailed here.

A) Reduction of PATSTAT

After the removal of duplicated records, the number of records in PATSTAT equals 23833. It should be recalled that records showing exactly the same values for Standardized Name, Zip Code and Legal Form were considered duplicated records.

Since, in this phase, our goal is to link PATSTAT enterprises to ASIA enterprises, PATSTAT was reduced in order to contain only units probably representing enterprises. Unfortunately, given its meaning, variable Legal Form does not provide a perfect discrimination between enterprise and non-enterprise units. Thus a list of Italian First Names, containing about 1600 units, was used. From the PATSTAT database, we removed those records whose Standardized Name simultaneously satisfies the following conditions:

1. it contains an Italian First Name

2. it has an empty Legal Form


3. it does not contain several special words indicating a business activity (e.g. enterprise, group, systems, hotel, holding, etc.). These special words were found by a manual inspection of those records satisfying only the first two conditions. About 63 such special words were found.

This procedure does not offer a 100% discrimination between enterprise and non-enterprise units. For example, those Standardized Names containing a non-Italian First Name or an extremely rare (almost unique) Italian First Name, e.g. Karl Dietriech, Jean-Pierre or Odoardo, would not be correctly classified. However, since it was considered very difficult to discover these situations automatically, the PATSTAT reduction was not further improved. Probably some very simple record linkage technique would help in finding some typing errors, e.g. Eduardo instead of Odoardo.

Based on the above separation procedure, PATSTAT was divided into two parts: the first containing 7700 records considered non-enterprises and the second containing 16132 records considered enterprises. The record linkage process was applied to the latter.
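The three-condition rule used to set aside probable individuals can be sketched in a few lines of Python. The word lists below are tiny hypothetical stand-ins for the list of about 1600 Italian first names and the roughly 63 business words mentioned above, and the third condition is read here as "contains none of the special words", which is an assumption about the intended meaning.

```python
# Hypothetical, very small stand-ins for the lists used in the report.
FIRST_NAMES = {"MARIO", "GIOVANNI", "ANNA"}
BUSINESS_WORDS = {"ENTERPRISE", "GROUP", "SYSTEMS", "HOTEL", "HOLDING"}


def looks_like_individual(standardized_name: str, legal_form: str) -> bool:
    """True when all three removal conditions hold simultaneously."""
    words = set(standardized_name.split())
    has_first_name = bool(words & FIRST_NAMES)            # condition 1
    has_no_legal_form = legal_form in ("", "NONE")        # condition 2
    has_no_business_word = not (words & BUSINESS_WORDS)   # condition 3
    return has_first_name and has_no_legal_form and has_no_business_word
```

With these toy lists, a name such as ODOARDO BIANCHI is not flagged, because ODOARDO is missing from the first-name list; this mirrors the limitation on rare first names discussed above.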

B) Reduction of ASIA

Obviously, there is a large number of enterprises that are active in consecutive years. This means that the same enterprise, if active in consecutive years, should be registered in consecutive versions of ASIA. Moreover, the same reasoning holds for non-consecutive years, too. In order to reduce the search space, the 11 versions of ASIA (1998 - 2008) were prepared in such a way that an active enterprise is included only once in their union. ASIA 2008 was considered the most complete and up-to-date version. Then, from ASIA 2007, enterprises that were also active in 2008 were deleted, since they are already included in ASIA 2008. Next, from ASIA 2006, enterprises that were active in 2007 and/or 2008 were removed, and so on backwards until ASIA 1998. The Enterprise Identification Number was used to perform the recursive selection of records. The final number of ASIA records we had to deal with is shown in Table 4.3. For each year, the percentage (third column) was computed over the total number of unique enterprises (union of ASIA 1998 – ASIA 2008).

In the union of the different waves of ASIA, except for 885 records (out of more than 7 million), the ZIP Code is always registered with 5 digits.
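The backward selection of ASIA records over the years can be sketched as a single pass from the most recent version to the oldest one. The input layout (a dictionary from year to a collection of Enterprise Identification Numbers) and the function name `build_search_space` are assumptions made for illustration only.

```python
def build_search_space(asia_by_year):
    """Retain each enterprise only in the most recent ASIA version listing it.

    `asia_by_year` maps a year to an iterable of enterprise identification
    numbers (hypothetical layout). Returns a dict year -> set of retained ids.
    """
    seen = set()
    remaining = {}
    # Start from the latest year and move backwards.
    for year in sorted(asia_by_year, reverse=True):
        ids = set(asia_by_year[year])
        remaining[year] = ids - seen   # drop enterprises found in a later year
        seen |= ids
    return remaining
```

By construction, the retained sets are pairwise disjoint and their union is the set of all distinct enterprises, which corresponds to the property reported by the Total row of Table 4.3.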

Table 4.4 shows the percentage of enterprises in several ASIA datasets by Legal Form and year. Only information stemming from even years was used to derive Table 4.4. First, a high percentage of enterprises without a ”Legal form” may be observed; such enterprises are probably individual


Table 4.3. Number of active enterprises in the search space, by year

YEAR    Thousands of remaining   Percentage of remaining
        active enterprises       active enterprises
1998    292                      3.76
1999    170                      2.19
2000    484                      6.24
2001    127                      1.64
2002    336                      4.34
2003    321                      4.14
2004    322                      4.15
2005    347                      4.47
2006    367                      4.73
2007    417                      5.37
2008    4577                     58.99

Total   7760                     100

enterprises. Second, a quite stable temporal trend of the Legal form distribution may be noticed, too. Finally, these ASIA distributions seem quite different from the PATSTAT distribution shown in Table 4.2.

Table 4.4. ASIA: percentage of enterprises by Legal Form and year

Legal form\YEAR   1998    2000    2002    2004    2006
NONE              71.03   79.17   73.68   72.94   72.40
COOP              1.07    0.48    0.77    0.86    1.09
SAS               6.92    5.65    6.74    6.59    6.38
SNC               8.09    7.26    7.82    7.43    6.66
SPA               0.58    0.46    0.48    0.43    0.37
SRL               12.31   6.98    10.51   11.76   13.10

Finally, it should be mentioned that, due to the huge computational burden, ASIA 2008 was divided into three parts: a) with more than 10 employees, b) with 1-9 employees, with non-empty Legal Form, and c) with less than 1 employee, with non-empty Legal Form.

A subpopulation receiving special attention is represented by the R & D enterprises. Indeed, it was assumed that the patenting enterprises have an increased probability of performing research and development activities. Consequently, the first record linkage procedures were applied using the 2006 and 2007 R & D survey frames. Since a significant number of patenting enterprises were linked to enterprises in the R & D survey frames, this approach also allowed the reduction of the number of patenting enterprises still to be linked.

A similar reasoning was applied to ASIA 2008. It was assumed that the largest enterprises, in terms of number of employees, have an increased


probability of performing patenting activities. This assumption is supported by the complexity of the technical, legal and administrative procedure an applicant must follow in order to be granted a patent. Consequently, enterprises with more than 10 employees in ASIA 2008 were considered in the second record linkage step.

C) Blocking by neighbourhood

Due to the huge number of records we had to deal with, some search space reduction blocking technique was still necessary. As previously discussed, PATSTAT and ASIA share only three variables: Standardized Name, Zip Code and Legal Form. Unfortunately, none of these variables is reliable enough to be used as a blocking variable. The idea of a neighbourhood of words was then introduced. For a pair of records, it was assumed that a necessary matching condition was that their Standardized Names share at least one word. Here ”word” means a sequence of characters not including a blank. In other words, it was assumed that at least one word is registered correctly. Then, for each record in PATSTAT, the list of words defining its Standardized Name was found. These words are illustrated by the coloured (non-gray) symbols in the main horizontal row of Figure 4.4. Next, for each such word, the list of enterprises in ASIA containing that word was identified (the vertical columns in Figure 4.4). The union of these lists of enterprises was named the Neighbourhood of the Standardized Name under consideration. If an exact match on Standardized Name exists, it should belong to this Neighbourhood; more precisely, it should belong to the intersection of the lists of enterprises forming the Neighbourhood (see the rows with all coloured non-gray symbols in each column). In such cases, a merge operation would be equivalent. It might also happen that an exact match on Standardized Name does not exist. In such cases, a rule-based deterministic record linkage should be applied in situations like the one depicted in the fifth row in the third column. Finally, for each record in PATSTAT, the record linkage procedure was applied using only its Neighbourhood, i.e. the Neighbourhood was used as blocking variable.
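The construction of a Neighbourhood can be sketched with an inverted index from words to enterprises. This is only an illustration under stated assumptions: the input layout (a dictionary from Enterprise Identification Number to Standardized Name) and the function names are hypothetical, and the actual processing was carried out with other tools.

```python
from collections import defaultdict


def build_word_index(asia_names):
    """Map each word to the set of ASIA enterprises whose name contains it.

    `asia_names` is a dict enterprise id -> Standardized Name (hypothetical).
    """
    index = defaultdict(set)
    for ent_id, name in asia_names.items():
        for word in name.split():
            index[word].add(ent_id)
    return index


def neighbourhood(applicant_name, index):
    """Union of the ASIA lists attached to each word of the applicant name."""
    result = set()
    for word in applicant_name.split():
        # .get() avoids creating empty entries for words unknown to ASIA.
        result |= index.get(word, set())
    return result
```

An applicant name whose words never appear in ASIA gets an empty set, which corresponds to the empty Neighbourhood case; an exact match, when it exists, appears in the list of every word of the name.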

Several considerations hold. First, blocking by Neighbourhood allows us to divide the enormous search space into a huge number of much smaller search spaces. Obviously, the number of search spaces equals the number of records in PATSTAT, and RELAIS can deal with many search spaces in an automatic manner. Second, each search space has a reduced dimension. Table 4.5 shows some statistics on the dimension of such search spaces. It may be observed that the maximum dimension of a search space equals 15570, a very reasonable dimension to deal with in record linkage problems. Third, it should be mentioned that, by construction, each Neighbourhood contains


Figure 4.4. The PATSTAT-ASIA neighbourhoods.

at most one correct link. For this reason, and because of the dependency between Neighbourhood and Standardized Name, this blocking procedure, as defined here, cannot be used in a probabilistic record linkage procedure based on the Standardized Name as comparison variable (because blocks and comparison variables are not independent). Moreover, it might be difficult or ineffective to apply the Neighbourhood blocking procedure when units are individuals (natural persons), because the variability of names of natural persons is much smaller than the variability shown by names of enterprises. Hence, when dealing with natural persons, Neighbourhoods might still contain a huge number of records, and a real reduction of the search space would not be obtained.

Table 4.5. ASIA 2008: statistics on the dimension of the Neighbourhoods

               # of ASIA enterprises   # of Neighborhoods containing
               in a Neighborhood       the same ASIA enterprise
MIN            1                       1
1st QUARTILE   5                       1
MEDIAN         77                      3
MEAN           760                     8
3rd QUARTILE   837                     10
MAX            15570                   124

Of course, very short Standardized Names (1-2 letters) or very common words (e.g. ITALIA, GROUP, etc.) could generate huge neighbourhoods. PATSTAT records containing only very short words, i.e. with the length of the longest word equal to 1 or 2, were excluded from the Neighbourhood creation phase. Such PATSTAT records were searched for


by a simple merging procedure. Of 649 such PATSTAT records, 169 were identified by a complete search in ASIA. As for the very common words, it was considered that no reliable record linkage procedure could be performed on the basis of such words alone; the reasoning is similar to the one applied for the names of natural persons.

Moreover, some Standardized Names might have an empty Neighbourhood. This is generally the case for Standardized Names consisting of a single word. If such words are registered differently in PATSTAT and ASIA, the corresponding Neighbourhood is empty because of the way it is derived (at least one word must be registered in exactly the same manner in both databases). Table 4.6 shows the number and percentage of records with empty Neighbourhood; the reported percentages are consistent with a base of the 16132 PATSTAT records considered enterprises (e.g. 1870/16132 = 11.6%). Even if, for each database, the number of records with empty Neighbourhood is not so small, it was observed that only 836 records were classified as ”WITHOUT Neighbourhood” through the entire search space creation flow.

Table 4.6. Number and percentage of records without Neighbourhood.

ASIA                                        # records without   percentage of records
                                            neighbourhood       without neighbourhood
1998                                        1870                11.6
1999                                        1983                12.3
2000                                        1794                11.1
2001                                        2053                12.7
2002                                        1888                11.7
2003                                        1793                11.1
2004                                        1832                11.4
2005                                        1818                11.3
2006                                        1812                11.2
2007                                        1734                10.7

2008 less than 1 employee                   2481                15.4
2008 more than 10 employees                 2114                13.1

R & D 2006                                  4144                25.7
R & D 2007                                  3664                22.7

2008 more than 1 employee with Legal Form   5046                31.3

The 836 Standardized Names with empty Neighbourhood generally have 1 or 2 words, as illustrated in Figure 4.5. Indeed, the number of words in Standardized Names corresponding to an empty Neighbourhood has a median equal to 1, while the number of words in all Standardized Names has a median equal to 2.

Of course, neighbourhoods could also be defined by an approximate matching of at least one word (e.g. using a similarity measure instead of exact equality). This idea will be explored in future implementations.


4.4.2 Deterministic record linkage

Even if the Neighbourhood was used as blocking variable, the use of similarity criteria was still necessary. Indeed, the Neighbourhood contains all ASIA records having at least one exact word in common with the PATSTAT record under study. A similarity criterion between Standardized Names was used to give an overall measure of the similarity of the records.

Figure 4.5. Number of words in Standardized Names corresponding to empty Neighbourhoods.

A deterministic rule was used in this work. It is a compound rule, stating that at least one of the following string comparators is greater than 0.8:

1. Jaro

2. Levenshtein

3. Jaro-Winkler

4. Dice

5. 3-Grams

6. equality rule (only in this case the threshold was equal to 1)

Details on the implementation of these comparison functions may be found in the Relais manual. Other thresholds, different from 0.8, were tested, but 0.8 proved to be the most efficient.

The selection of the unique links was also performed using Relais, by means of a greedy solution already implemented in the software. In this work, equal weights for all rules were always used.

Finally, the pairs declared matches were subject to a clerical review.
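The compound rule and the greedy selection of unique links can be illustrated by the following self-contained Python sketch. It is not the Relais implementation: only a normalised Levenshtein similarity, a Dice coefficient on character bigrams and the equality rule are shown, the normalisations are common textbook choices that may differ from those used in Relais, and the Jaro, Jaro-Winkler and 3-grams comparators are omitted. All function names are hypothetical.

```python
def levenshtein_similarity(a: str, b: str) -> float:
    """1 - (edit distance / length of the longer string)."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def dice_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient on the sets of character n-grams."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not grams_a or not grams_b:
        return 1.0 if a == b else 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def is_candidate_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Compound rule: equality, or at least one comparator above the threshold."""
    return (a == b
            or levenshtein_similarity(a, b) > threshold
            or dice_similarity(a, b) > threshold)


def greedy_one_to_one(scored_pairs):
    """Pick links by decreasing score, using each applicant and enterprise once.

    `scored_pairs` is an iterable of (score, applicant id, enterprise id).
    """
    used_applicants, used_enterprises, links = set(), set(), []
    for score, app, ent in sorted(scored_pairs, key=lambda p: p[0], reverse=True):
        if app not in used_applicants and ent not in used_enterprises:
            links.append((app, ent))
            used_applicants.add(app)
            used_enterprises.add(ent)
    return links
```

With these definitions, SOLUTION and SOLUTIONS differ by one edit over nine characters, so their Levenshtein similarity exceeds 0.8 and the pair is declared a candidate match, to be confirmed by clerical review.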


To conclude this section, we summarize the record linkage procedure. Only the blocking procedure and the selection of databases were varied. The deterministic rule used when comparing Standardized Names and the threshold were kept constant. In a first phase, blocking by Neighbourhood, Zip Code and Legal Form was used. In a second phase, only blocking by Neighbourhood and Legal Form was used. The matching pairs were always subject to a clerical review. The records in PATSTAT were linked against the following databases:

1. 2006 and 2007 R & D survey frames

2. ASIA 2008 with more than 10 employees

Then an update of the PATSTAT database was performed.

3. ASIA 2008 with more than 1 employee

4. ASIA 1998 – ASIA 2007

Then an update of PATSTAT database was performed.

5. ASIA 2008 with less than 1 employee

4.5 Preliminary Results

At this stage, the number of ”correct” links found is 12510 out of 16132 (applicant names potentially referring to individuals have been stored for later analysis), i.e. 78%. By ”correct” link we mean a (non-duplicated) pair (Applicant Identification Code - Enterprise Code) stemming from one of the linkage steps performed during the project: in each step, the links found have been classified as ”correct” (true according to the available information), ”maybe” (possibly subject to a more detailed and sophisticated clerical review) or ”false” (discarded), and stored removing duplications. Even if the pairs Applicant Identification Code - Enterprise Code are non-duplicated, some of them may represent duplications of applicants (more than one Applicant Identification Code may be linked to the same Enterprise Code) or of enterprises (the same Applicant Identification Code may be linked to more than one Enterprise Code). The first case may happen when a multi-patenting applicant has been registered with different names in different applications; the standardisation process does not compensate for these differences. Consequently, the same applicant might be assigned to more than one enterprise. On the other hand, as different steps of the linkage procedure have been performed on the same applicants dataset (identified applicants were removed from the


file only twice), several applicants linked ”correctly” to the same enterprise code cannot be excluded. However, the impact on the total number of links seems to be limited to a few cases: the number of unique enterprise codes at this stage is 12488 out of 12510. Moreover, in order to assess the quality of the results, a small experiment has been conducted on a set of 190 codes randomly selected from the Espacenet web database (the ”application number” field has been used to download patent information from the EPA web-site; see footnote 3). We found 5 mismatches out of 190 records (about 2.6%). This means that, even if the available standardised information coincides in the two sources, it is not possible to guarantee a 100% exact link, because of very similar (or common) names. Other possible sources of misclassification that should be taken into account when checking the quality of the linkage process are that enterprises belonging to the same enterprise group often register their patents with similar names, and the changes enterprises undergo during their life (changes of address, legal form, etc.).

Table 4.7 reports the patenting enterprises (corresponding to the Enterprise Codes found in ASIA) by size, i.e. classes of employees. Frequencies are shown for two subsequent phases; in the second one, the similarity criterion adopted in searching for a link in the ’neighbourhood’ has been relaxed by removing the postal code from the set of blocking variables. As expected, more than half of the population of patenting enterprises has a size greater than or equal to ten employees (the highest class considered).

Table 4.7. Patenting enterprises by size (classes of employees)

Classes of   First phase                     Second phase
Employees    Freq (Cum Freq)   % (Cum %)     Freq (Cum Freq)   % (Cum %)

1            793               8.1           2345              18.8
(1-10)       1995 (2788)       20.4 (28.5)   2334 (4679)       18.7 (37.5)
[10,         6985              71.5          7809              62.5
Total        9773                            12488

Table 4.8 reports the frequency distribution of the patenting enterprises by division of economic activity (NACE 2-digit code), in descending order of frequency. Only the ten most frequent NACE divisions are shown, out of the 45 assigned to the whole set of matched pairs. These ’most important’ divisions cover more than 65% of the total and mostly belong to the Manufacturing sector.

3 Ten application numbers can be downloaded per trial, for a maximum of 200 application numbers. Moreover, the information provided needs to be processed before use.


Table 4.8. Patenting enterprises (active in 2008) by economic activity (2 digit NACE 2007): the ten most frequent NACE divisions

NACE   NACE Description                                                           Frequency   %      Cumulative   Cumulative
                                                                                                     Frequency    %
28     Manufacture of machinery and equipment n.e.c.                              2186        22.4   2186         22.4
25     Manufacture of fabricated metal products, except machinery and equipment   885         9.1    3071         31.5
46     Wholesale trade, except of motor vehicles and motorcycles                  695         7.1    3766         38.6
22     Manufacture of rubber and plastic products                                 601         6.2    4367         44.8
27     Manufacture of electrical equipment                                        461         4.7    4828         49.5
26     Manufacture of computer, electronic and optical products                   456         4.7    5284         54.2
20     Manufacture of chemicals and chemical products                             324         3.3    5608         57.5
68     Real estate activities                                                     282         2.9    5890         60.4
29     Manufacture of motor vehicles, trailers and semi-trailers                  280         2.9    6170         63.3
32     Other manufacturing                                                        251         2.6    6421         65.9

4.6 Conclusions and future plans

In this report we described the path followed at the Italian national statistical institute (Istat) in designing a linkage strategy to match micro-data on patent applications from the international database PATSTAT with the data available from the Italian Official Business Register (ASIA). The overall aim of this project is to identify the Italian patenting enterprises and characterise them through the economic information surveyed by Istat. This might allow, for example, investigating which factors influence the patenting propensity of enterprises and/or whether patenting activities have an impact on economic performance. On the other hand, monitoring such a subpopulation could be useful in maintaining the survey frame list for surveys related to the Research and Development area, such as the biotechnology sector.

In PATSTAT, the applicants resident in Italy and registering at least one


patent in the period 1985-2010 have been considered. Patent applicants can be ’individuals’ or ’establishments’. At this stage, the linkage process aimed at identifying, among establishments, the business enterprises recorded in ASIA in the period 1998-2008. The desired output of the linkage process is to assign to each patenting enterprise the ’applicant identification code in PATSTAT’ and the ’enterprise identification number in ASIA’. The latter allows access to the repositories of official statistical data and, therefore, the linking of economic data to patenting enterprises.

The overlapping information between the two archives that is reliable as matching variables in the linkage process consists mainly of the ’applicant names’ and the ’postal code’. Moreover, the size of the business register ASIA, in terms of number of records, represents a computational problem to be faced. Therefore, a great effort has been put into the pre-processing phase to standardise the applicant/enterprise names, and some ’search space’ reduction techniques have been adopted. Among the latter, the ’blocking by neighbourhood’ introduced in this work has proved particularly effective. Assuming that, for a given patenting enterprise, at least one word in the ’applicant name’ (in PATSTAT) and the ’enterprise name’ (in ASIA) is correctly registered in both archives, the ’neighbourhood’ of an applicant name is defined as the set of enterprises whose name contains at least one word equal to (written in the same way as) a word in the applicant name. Then, the correct link for the given applicant has been searched for in its neighbourhood.

At this stage of the study, the share of around 77% of enterprises identified as patenting (12488 out of 16132 applicants identified as ’establishments’) can be considered a promising result, but further investigations and improvements are planned. The next step will be to define the ’neighbourhood’ on the basis of similarity between words instead of equality, in order to manage possible typing errors in the names. Moreover, the set of names without a neighbourhood has to be investigated.

As further developments, it would be desirable to classify the whole set of patenting establishments as business enterprises, public institutions, non-profit institutions and private or public universities, according to the Frascati Manual (2002). A subset of applicants identified as ’individuals’ (with no legal form) has not been investigated so far, on the assumption that the probability of finding a correct link to enterprises is low. For this kind of applicant it is planned to use different archives (such as the List of enterprise managers or the List of company partners) or information (checking by address).


Finally, a probabilistic approach to the record linkage can be considered. In order to reduce the computational size of the problem, the R&D survey frame can be used as a test set, given that it proved to define a subpopulation with a high concentration of patenting enterprises.


Chapter 5
The framework for error in an integrated survey

Francesca Romana Pogelli, Filippo Oropallo

ISTAT – Italian National Statistical Institute

5.1 Introduction

As part of a general plan to modernise structural business statistics, the Italian NSI has also decided to intensify the use of administrative data, with the aim of reducing the statistical burden on enterprises and improving the statistical quality of surveys, in terms of comparability with other sources and reduction of non-response bias.

In this context a new integration process, concerning the sample survey on Small and Medium Enterprises (SME), has been developed using all available administrative sources that contain relevant economic information for a business survey.

The sample size of the SME survey is approximately 100,000 units, but the response rate is less than 40%. The actions adopted to speed up or increase the response are reminders by postal service and by phone.

The main variables of interest collected on the SME sampled enterprises are: Turnover, Value added, Employment, Total purchases of goods and services, Personnel costs, Wages and salaries, Production value.

Until 2007, only Financial Statements data had been used to impute non-response, with an increase of about 5% in the response rate, but these financial statements refer only to companies with a number of employees in the range 20-99. In the last two years other administrative sources have been added (Sector Studies survey data and Tax Return data), and as a result there was a large increase in sample coverage, up to about 95% [1].

In this integration process the fiscal code is the natural key to link these different sources.

In this paper we focus on a methodological framework for register based statistics, proposed by Bart Bakker [2], and try to classify the tasks of the integration process between survey data and administrative data for the SME survey, for the reference year 2007 [3].

5.2 Integration process steps of SME survey according to a framework on errors in register based statistics

The following framework, based on the original "life cycle" and errors in a survey (Groves et al. 2004), refers to a combined registration situation, with the assumption that errors that normally emerge in surveys will also occur in registration and that most of the registration data are collected with the aid of a survey technique. The columns of "Measurement" and "Representation" errors refer to all sources used to produce a statistical outcome (Bakker, 2010).

Figure 5.1. Framework on errors in register based statistics (Bakker-CBS)

[1] Without considering list errors or out-of-coverage units coming from the frame of interest, the Statistical Register of Active Enterprises (called ASIA).

[2] Statistics Netherlands.

[3] 2007 was the data reference year of the experimental integration data for the SME sample survey.

5.2.1 Data sources of integration process

• Small and Medium Enterprise (SME) survey: a business survey whose purpose is to investigate the profit-and-loss account of enterprises with less than 100 persons employed, as requested by SBS EU Council Regulations n. 58/97 and 295/2008.

• Financial Statements are the balance sheet data of all corporate enterprises. They represent less than 20% of the enterprise universe, although they account for about 57% in terms of persons employed. This source is the best harmonized according to the SBS Regulation definitions.

• Tax Return data: all enterprises are obliged to declare their taxable income to the Fiscal Authority by filling in tax forms. Depending on their legal type and accounting regime, enterprises have to fill in different types of tax forms. Under the simplified accounting regime, sole proprietorships (PF) have to fill in either the PF-RE form, if they are freelancers, or the PF-RG form, if they are firms in a simplified accounting regime; unincorporated firms (SP) are liable to fill in the SP-RG form, and corporate ones (SC) have to compile the SC-RS form.

• Sector Studies survey (or Fiscal Authority Survey) data: a fiscal survey aiming to evaluate the capacity of enterprises to produce income and to support tax control on small and medium firms (sole proprietorships, partnerships and corporate firms with turnover between 30,000 and 7.5 million euro).


Almost all enterprises are obliged to fill in the Sector Studies form together with the tax return form, in order to declare cost and income items in detail. Since part of the Sector Studies questionnaire is common to a financial statement, these data contain more useful information than tax return data for imputation purposes.

5.2.2 Measurement side

On the measurement side, the starting point is the "administrative concept" [4], in order to distinguish the data collection process of the register keeper from that of the statistical agency. In a context of micro integration based on administrative data only, an "administrative concept" would be the more suitable starting point. Our case is different, since it is a sample survey integrated with administrative data in order to impute the missing responses. For this reason, it is more appropriate to say that the survey follows the aims of a statistical agency, according to the EU regulations.

Actions belonging to the "validity of the concept" can be considered to include:

• comparisons between SBS and administrative definitions, which imply reviewing the administrative sources useful to produce Structural Business Statistics according to SBS EU reg. 58/97, 410/98, 2700/98, 2056/02, 1670/03, 295/2008;

• reconciliations of the definitions and the values among sources whenever a variable does not have the same definition or value across different sources (e.g. because the data are collected for different purposes).

These analyses have highlighted a high comparability for balance-sheet variables, such as Turnover, Value of Production, Intermediate costs, Value Added, Personnel costs, Gross and net operating surplus, while more difficulty was found for accounting variables in some cases, such as those of freelancers.

Following the left side of the framework, we show some micro integration techniques that have been carried out for this business survey.

In order to evaluate the statistical coherence between the administrative sources and the survey, an indicator based on the Kolmogorov-Smirnov statistic [5] has been applied, since it makes it possible to rank the administrative sources. This ranking is used as the criterion to impute the units with missing responses.

[4] The original framework of Groves et al. (2004) contains only the word "concept".

[5] That is, the statistic used for the Kolmogorov–Smirnov test, a nonparametric test for the equality of continuous, one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution (one-sample K–S test), or to compare two samples.

The following table displays this priority: the first column shows the number of variables with comparable definitions, and the second shows the number of variables with a similar distribution according to the results of the KS indicator.

Table 5.1. KS indicator gives a ranking between the sources
Kolmogorov Smirnov indicator satisfied on comparable variables by sources

Source | Comparable | Test.KS
Financial Statements | 21 | 13
Fiscal Authority Survey | 15 | 8
Tax Return data - PF-RE | 13 | 6
Tax Return data - PF-RG | 14 | 6
Tax Return data - SP-RG | 14 | 6
Tax Return data - SC-RS | 16 | 2

In this way, for imputation we choose first the Financial Statements source (because of its almost total coverage of corporate firms and the greatest number of variables comparable with the SME survey: 21, of which 13 have a similar distribution), then the Sector Studies survey (15 comparable variables, 8 of them with a similar distribution), and lastly the tax return data (6 coherent variables), leaving the SC-RS variables for cases where no other data are available.
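The two-sample statistic behind such an indicator can be sketched as below. The report does not state how the statistic is turned into the "Test.KS" count, so the cut-off of 0.1, the function names and the toy values are assumptions made only for illustration.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    for x in a + b:  # the largest gap is attained at an observed value
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def count_similar(survey, source, cutoff=0.1):
    """Count shared variables whose KS statistic is below a (hypothetical) cut-off."""
    shared = survey.keys() & source.keys()
    return sum(1 for var in shared if ks_statistic(survey[var], source[var]) < cutoff)


# Toy values for linked units: one variable agrees, the other does not.
survey = {"turnover": [100, 120, 150, 200], "personnel_costs": [10, 20, 30, 40]}
source = {"turnover": [100, 120, 150, 200], "personnel_costs": [50, 60, 70, 80]}
print(count_similar(survey, source))
```

Ranking the sources by such a count, as in Table 5.1, favours the source whose variables are distributed most like the survey responses.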

A further comparison between the variables consists in analysing the frequency of linked units with respect to the differences between the values observed in the survey and in the administrative sources (Financial Statements, Fiscal Authority Survey, Tax Return PF and Tax Return SP). Figure 5.2 shows these comparisons for Turnover.

For instance, the upper-left panel shows the differences between turnover from financial statement data and turnover from survey data. About 73% of linked units have the same turnover value (zero difference), and they represent about 78% of the total turnover value from administrative data and the same share of survey turnover.

The distribution is centred on zero and positively skewed, which means that the source and the survey variables correspond, although the administrative value tends to be higher than the corresponding survey value.


Figure 5.2. Distribution of respondent units linked with administrative data by range of differences for Turnover

5.2.3 Representation side

On the representation side of the framework for micro integration (Figure 5.1), the first component represents the target population, which in this case consists of small and medium-sized enterprises (SME) with less than 100 persons employed belonging to the following economic activities according to the Nace Rev.1.1 classification:

• Sections C, D, E, F, G, H, I, J (division 67), K;

• Sections M, N and O for the enterprises operating in the private sector.

The second framework component deals with unit coverage.

The following figure outlines the coverage of the 2007 SME survey sample in terms of number of units and in terms of their information content, according to legal type and size class.

Unit coverage is represented as a large set containing all sample units (about 103,000), inside which subsets are drawn in proportion to the number of linked administrative units.


Apart from coverage and list errors, Financial Statements and Sector Studies, together with the Tax Return forms, cover almost all sample enterprises: what remains are only the large and the very small sole proprietorships. The large ones (under an ordinary accounting regime) are asked to fill in the RF form of the Tax Return, which is not comparable with the profit and loss scheme. The very small ones, called minimum taxpayers, have been required to fill in a special tax return form, named CM, only since 2008.

Figure 5.3. Coverage analysis by legal type and size class

In Table 5.2 the sample coverage figures are shown by administrative source used, following the priority rule defined above according to the K-S indicator. If we do not take into account list errors and units outside the SME frame, the total coverage is about 95%, half from respondent units and half from administrative sources.

So, this integration process makes it possible to impute the missing responses for about half of the sample.
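The priority rule can be illustrated with a minimal sketch that links sources to the sample through the fiscal code and takes the value from the first source, in priority order, that holds the unit. The source labels, fiscal codes and turnover values below are hypothetical, and a single variable is imputed for simplicity.

```python
# Priority order suggested by the K-S ranking (labels are illustrative).
PRIORITY = ["financial_statements", "sector_studies", "tax_return"]


def impute_turnover(sample, sources):
    """Fill missing survey turnover from the first source that holds the fiscal code.

    sample  -- dict: fiscal code -> survey turnover, or None for a non-respondent
    sources -- dict: source name -> dict of fiscal code -> turnover
    Returns a dict: fiscal code -> (turnover, origin of the value).
    """
    result = {}
    for fiscal_code, turnover in sample.items():
        origin = "survey"
        if turnover is None:
            origin = "no source"
            for name in PRIORITY:
                value = sources.get(name, {}).get(fiscal_code)
                if value is not None:
                    turnover, origin = value, name
                    break
        result[fiscal_code] = (turnover, origin)
    return result


# Hypothetical sample of four enterprises, three of them non-respondents.
sample = {"FC001": 500.0, "FC002": None, "FC003": None, "FC004": None}
sources = {
    "financial_statements": {"FC002": 820.0},
    "sector_studies": {"FC002": 815.0, "FC003": 95.0},
    "tax_return": {"FC003": 90.0},
}
for code, (value, origin) in impute_turnover(sample, sources).items():
    print(code, value, origin)
```

Units such as the hypothetical FC004, found in no source, correspond to the "No sources" row of Table 5.2.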

Although the available sources cover a large portion of the sample, some difficulties arise in terms of variable harmonization.


Table 5.2. Coverage of the initial sample by type of response and administrative data (thousands of units)

Initial theoretical sample
Source | Non respondents | Respondents | Total
Financial Statements | 10.370 | 19.74 | 30.11
Fiscal Authority Survey (F) | 24.655 | 17.80 | 42.45
Fiscal Authority Survey (G) | 1.343 | 1.22 | 2.57
Tax Return data - PF-RG | 2.312 | 0.990 | 3.30
Tax Return data - PF-RE | 0.747 | 0.483 | 1.23
Tax Return data - SP-RG | 0.810 | 0.378 | 1.19
Tax Return data - SC-RS | 4.546 | 1.84 | 6.38
From survey only | - | 1.25 | 1.25
Total | 44.783 | 43.70 | 88.48

Out of coverage and list errors | 10.22
No sources | 4.34
Total sample units | 103.04

Going on along the "life cycle and error", we meet the "Linking error" step. The key variable, the enterprise fiscal code, is available, but it is sometimes reported with errors that cause mismatches. In these cases, these codes have been processed like missing responses.

Many linking problems concern the list frame of the survey, that is the Statistical Register of Active Enterprises (ASIA), which contained 4.5 million active enterprises for the year 2007.

The most relevant questions refer to:

• Identifying changes in business units (developed for the sampling list)

• Changes involving a single unit (changes in kind of business classification, in legal form or localisation)

• Changes in the number of units (deaths, births, break-ups and split-offs, mergers and acquisitions)

The next step deals with the methods to correct the under/over coverage errors.

Correction factors for the initial sampling weights, accounting for unit non-response and under-coverage, are calculated by applying the methodology based on calibration estimators (Deville and Särndal, 1992).

The final weight $w_k$ is obtained as a product of three factors:

$$ w_k = d_k \, \gamma_{1,k} \, \gamma_{2,k} $$

where

• $d_k$ is the direct weight (the reciprocal of the inclusion probability);

• $\gamma_{1,k}$ is the total non-response correcting factor;

• $\gamma_{2,k}$ is the "post-stratification" factor.

After calculating the total non-response correcting factors as the ratio of the number of sampled units to the number of respondent units in appropriate "weighting adjustment cells", the weight of every single enterprise is further modified in order to match known, or alternatively estimated, population totals called benchmarks. In particular, known totals of selected auxiliary variables from the ASIA register (average number of employees in year t-1, number of enterprises) are currently used to correct for sample-survey non-response or for coverage error resulting from frame undercoverage or unit duplication.
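The three-factor weight can be illustrated with a simplified sketch. It assumes a single categorical benchmark (the number of enterprises per post-stratum), which is the simplest special case of calibration; the production procedure described above also calibrates on other auxiliary totals. All field names and values are hypothetical.

```python
def final_weights(units, benchmarks):
    """Compute w_k = d_k * gamma_1k * gamma_2k for responding units.

    units      -- list of dicts with keys: id, d (direct weight), cell
                  (weighting adjustment cell), post (post-stratum), responded
    benchmarks -- dict: post-stratum -> known number of enterprises
    """
    sampled, responded = {}, {}
    for u in units:
        sampled[u["cell"]] = sampled.get(u["cell"], 0) + 1
        if u["responded"]:
            responded[u["cell"]] = responded.get(u["cell"], 0) + 1

    # gamma_1: sampled units / respondent units in the adjustment cell
    weights = {}
    for u in units:
        if u["responded"]:
            gamma1 = sampled[u["cell"]] / responded[u["cell"]]
            weights[u["id"]] = u["d"] * gamma1

    # gamma_2: benchmark total / weighted count of respondents in the post-stratum
    estimated = {}
    for u in units:
        if u["responded"]:
            estimated[u["post"]] = estimated.get(u["post"], 0.0) + weights[u["id"]]
    for u in units:
        if u["responded"]:
            weights[u["id"]] *= benchmarks[u["post"]] / estimated[u["post"]]
    return weights


# Four sampled units in one cell, two respondents, benchmark of 50 enterprises.
units = [
    {"id": 1, "d": 10.0, "cell": "a", "post": "p", "responded": True},
    {"id": 2, "d": 10.0, "cell": "a", "post": "p", "responded": True},
    {"id": 3, "d": 10.0, "cell": "a", "post": "p", "responded": False},
    {"id": 4, "d": 10.0, "cell": "a", "post": "p", "responded": False},
]
print(final_weights(units, {"p": 50.0}))
```

In this toy case $\gamma_{1,k} = 4/2 = 2$ and $\gamma_{2,k} = 50/40 = 1.25$, so each respondent carries a final weight of 25 and the weights sum to the benchmark.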

The following figure shows the final result of integrating the SME data with administrative sources.

Figure 5.4. Integrated survey outcome

The left part of the quadrant refers to the variables common to the sources and is divided into two horizontal parts, the respondents and the non-respondents, with the corresponding number of units linked by the fiscal code and of units not linked for different reasons (mismatches, out of coverage).

The right part corresponds to the other survey variables, which until now it has not been possible to link to administrative sources.

5.3 "Life cycle and errors" in an integrated survey

In trying to place the phases of the SME survey integration process within the "life and error cycle" proposed by Bakker, some inconsistencies were met, since the model in Figure 5.1 refers only to integrated administrative sources.

On the other hand, if we consider the original "life and error cycle" of Groves et al. (2004), we have a conceptual problem, since the "representation side" of this framework refers only to sample data.

If the micro integration process concerns a sample survey, such as the SME survey, which framework is the most suitable? Should we consider a different "life and error cycle" depending on the methods used to collect information?

A proposal for this situation is a mixed "life and error cycle", combining the original one by Groves and the one proposed by Bakker, in which the micro integration steps of a sample survey with administrative data fit more consistently. So we can build a "Life cycle and errors in an integrated survey", which is described in Figure 5.5.

On the measurement side, the steps of a sample survey correspond to the original scheme of Groves (starting with a concept, defined for a statistical survey, up to the editing and imputation tasks), while on the representation side we can consider two parts: the upper one refers to the sample survey steps, and the bottom one, concerning the integration tasks to impute the non-respondent units, can be considered pertinent to the administrative "life-cycle" steps.

We prefer to keep the same measurement side as Groves (with the same terminology), since it fits a more generic description better, where we can include the usual steps of a sample survey with micro integration techniques, as described above.


Figure 5.5. ”Life cycle” and errors in an integrated survey

So, the word "validity" refers both to the actions needed to obtain a correct statistical concept and to those needed to obtain a valid administrative concept. The same holds for the "measurement and processing error" and for the actions corresponding to the steps of "operationalisation" and "response".

For the representation side, it seems more appropriate to join two parts of different "life cycles": the upper one (by Groves) concerns only the sample survey issues (frame error, sampling error), and the other one (by Bakker) can be used to describe the integration with administrative data, where the starting point is represented by the "Respondents" and not by the target population, as it is for surveys based on administrative sources.


At the end of the integration process, we have an "integrated survey outcome".


Chapter 6
Statistical matching: Polish case study

Wojciech Roszka

Statistical Office in Poznan – Poland

6.1 Introduction

The present case study is empirical in scope, presenting the results of a simulation study consisting in matching datasets from two surveys: the 1995 Microcensus (a census conducted as a sample survey) and the Labour Force Survey carried out in the same year. The first dataset contains about 2 million records on various economic and demographic variables describing the Polish population, while the second one contains only about 220,000 records (over the four quarters in which the survey was conducted) describing the economic activity of the population in different cross-sections. The study is a simulation, in which the choice of datasets and of the matching method was motivated by the availability of data.

6.2 Data integration and statistical matching

Statistical matching is a method of matching records from two or more databases which do not contain information about the same units. In this case, one does not create a unique key, and records are matched on the basis of similarity (see Figure 6.1). In the literature, the most common measure of similarity is the distance function. The file from which records are added is called the donor, while the file to which records are attached is referred to as the recipient (cf. Scanu 2011). The decision as to which file should be the donor or the recipient depends on the character of the study. In one approach, the file with more records is treated as the recipient, to prevent a loss of information (cf. S. Raessler (2002)).

Given the assumption of no overlap between the files to be matched, matching the same units in both datasets is not possible. That is why the matching process is conducted using different algorithms, which rely on different measures of similarity between records.

With the help of statistical matching methodology, a new dataset is created which contains all variables from both files; in other words, variables that were not jointly observed are observed jointly following the matching (cf. D’Orazio, Di Zio, Scanu (2006)). Units created in the process of matching are not real. The new dataset, called a synthetic dataset, contains the so-called unreal units, created by adding information from similar records (cf. Figure 6.1).

Figure 6.1. The process of statistical matching
Source: produced by the author on the basis of (S. Raessler, H. Kiesl (2006))

Consequently, they can in no way be referred to really existing units [1]. Nevertheless, when applied appropriately, the method can successfully handle aggregate data, since it ensures the equality of distributions.

[1] The synthetic nature of units in the newly created data set prevents a disclosure of the so-called sensitive information, which is protected under the Statistical Confidentiality Act.

The method of statistical matching will be discussed in more detail later on, in the empirical part describing the process of integrating data from the 1995 Microcensus and the LFS.

6.3 Methods of statistical matching

Statistical matching is a method of integrating two datasets (marked A and B) which do not contain information about the same units. To integrate such datasets, some assumptions must be met. The most important one is the presence of the so-called common variables in both datasets (marked X; see Figure 6.2). They are used to determine the measure of similarity (or lack of it) between records and to find the most similar records. These variables must have the same (or very similar) definitions, and their units of measurement or coding must be the same. Statistical units in the datasets to be integrated must also be sampled from the same population and should be of the same category (e.g. persons or households). Another requirement is that the datasets should come from the same or a similar reference period.

Figure 6.2 presents the input data in the process of statistical matching. $(X, Y, Z)$ are random variables with density $f(x, y, z)$, where $x \in \mathcal{X}$, $y \in \mathcal{Y}$, $z \in \mathcal{Z}$.

$X = (X_1, \ldots, X_P)^T$, $Y = (Y_1, \ldots, Y_Q)^T$, $Z = (Z_1, \ldots, Z_R)^T$ are vectors of random variables of dimension $P$, $Q$ and $R$ respectively.

$A$ and $B$ are two independent samples consisting of $n_A$ and $n_B$ independently sampled observations. $Z$ is not observed in dataset $A$, and $Y$ is not observed in dataset $B$.

$$ (x^A_a, y^A_a) = (x^A_{a1}, \ldots, x^A_{aP};\; y^A_{a1}, \ldots, y^A_{aQ}), \quad a = 1, \ldots, n_A, $$

are the observed values of the variables for units in dataset $A$, and

$$ (x^B_b, z^B_b) = (x^B_{b1}, \ldots, x^B_{bP};\; z^B_{b1}, \ldots, z^B_{bR}), \quad b = 1, \ldots, n_B, $$

are the observed values of the variables for units in dataset $B$ (cf. Di Zio (2007)).

Figure 6.2. Datasets to be integrated using the method of statistical matching
Source: Di Zio (2007)

One of the datasets, usually the larger one, is the recipient for records from the other dataset [2]. The dataset from which records are added is called the donor. The integrated dataset is the size of the recipient dataset and contains all variables from the two input sets (see Figure 6.3).

Figure 6.3. Integrating two datasets using the method of statistical matching
Source: van der Putten, Kok, Gupta (2002)

A synthetic file created in this manner can be used to carry out a range of statistical analyses at the level of aggregation that the recipient file allows. Considering the information demand that public statistics agencies are faced with, such a solution saves the time and cost required to conduct additional surveys.

[2] The decision as to which of the two datasets is to be the recipient or the donor is made on the basis of the researcher’s own knowledge about the studied phenomenon. Raessler (2002), however, suggests choosing the larger dataset as the recipient to prevent loss of information.

6.4 Integrating databases from the Microcensus and LFS

The simulation study presented in this article was designed to enlarge the Microcensus dataset by adding variables from the LFS, and to increase the sample size of the LFS. Since both databases are based on surveys, they most probably do not contain data about the same persons, nor do they have a unique linkage key. Consequently, such data sources cannot be integrated using the deterministic approach. In order to achieve the desired objective, it is necessary to resort to statistical matching.

6.4.1 Description of the datasets

The Microcensus was conducted using survey methodology between 18 and 31 May 1995, with 17 May as the reference date. A 5 percent sample, selected by a two-stage stratified sampling scheme, was large enough to generalize the results for the urban and rural subpopulations within each of the then 49 provinces of the country. The country’s entire population was the target population of the survey, which focused on various economic and demographic variables. The survey yielded a dataset consisting of 2 040 062 records, each containing 51 variables (see Table 6.1).

Table 6.1. Comparison of the Microcensus and the LFS

Survey description | Microcensus | LFS
Reference time frame | 18 - 31 May 1995, at 17 May | selected week in each of the 4 quarters
Target population | the population of Poland | the population of Poland aged 15 and over
Method | sample survey | sample survey
Sample scheme | two-stage stratified sampling | two-stage stratified sampling with a rotating panel design
Generalisation of results | 49 provinces broken down into urban and rural areas | the whole country broken down by town class, sex and age group
Aim | various demographic and economic variables of the population | variables related to the economic activity of the population
Number of variables | 51 | 93
Number of respondents | 2 040 062 | 218 369 (from 4 quarters)

Source: own tabulation.

The Labour Force Survey is the largest (in terms of coverage) cyclical (quarterly) survey in Poland. In a selected week of each quarter, survey respondents are selected using two-stage stratified random sampling (with a rotating panel design) to provide information about their economic activity in the week preceding the survey. The relatively large sample, amounting to 218 369 persons over the whole year (1995), enables the generalization of results for the whole country broken down by town class, sex and age group. The target population includes people aged 15 and over (see Table 6.1).

Table 6.1 provides a synthetic overview that helps to verify the hypothesis that both samples come from the same general population. Comparing the data, one can conclude that the reference periods are similar and that the target populations must be unified (to include only people aged 15 and over).

6.4.2 The integration algorithm

The algorithm for integrating the LFS and the Microcensus can be broken down into 6 basic steps (cf. Gołata (2009), D’Orazio, Di Zio, Scanu (2006)):

1. Variable harmonisation

2. Selection of matching variables and their standardization or dichotomization

3. Stratification

4. Calculation of distance

5. Selection of records in the recipient and donor datasets with the least distance

6. Calculation of the estimated value of variables

The harmonization of variables involves adjusting the Microcensus population to that of the LFS. The operation consisted in:

1. removing duplicate records (resulting from the rotating panel sample design) from the LFS;

2. removing respondents below the age of 15 from the Microcensus.
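The two harmonisation operations can be sketched as follows. The record layout and the field names `person_id` and `age` are hypothetical; the report does not describe how repeated respondents were identified in the LFS files.

```python
def harmonise(lfs_records, microcensus_records, min_age=15):
    """Return (unique LFS records, Microcensus records aged min_age and over)."""
    seen, unique_lfs = set(), []
    for rec in lfs_records:
        # keep only the first interview of each person in the rotating panel
        if rec["person_id"] not in seen:
            seen.add(rec["person_id"])
            unique_lfs.append(rec)
    adults = [rec for rec in microcensus_records if rec["age"] >= min_age]
    return unique_lfs, adults


# Toy input: person 1 was interviewed in two quarters; one child in the Microcensus.
lfs = [
    {"person_id": 1, "quarter": 1},
    {"person_id": 2, "quarter": 1},
    {"person_id": 1, "quarter": 2},
]
microcensus = [{"age": 9}, {"age": 15}, {"age": 42}]
unique_lfs, adults = harmonise(lfs, microcensus)
print(len(unique_lfs), len(adults))
```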

Table 6.2 shows the size of the two datasets at the successive stages of harmonization:

Table 6.2. The size of the datasets to be integrated at successive stages of harmonization

Characteristics of respondent | Microcensus | LFS
Total number of respondents | 2 040 062 | 218 369
number of unique respondents | 2 040 062 | 126 353
number of respondents aged 15 and over | 1 553 493 | 126 353

Source: own tabulation.

The next stage consisted in selecting the variables used to estimate the measure of similarity between records (the so-called matching variables). These included sex, age, education level, class of the place of residence and marital status. Then, the coding of qualitative variables was unified and the variables were dichotomized (transformed into binary variables). The quantitative variable "age" was standardized (rescaled to have mean 0 and standard deviation 1).
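Dichotomization and standardization can be sketched as below. The sketch assumes that category codes have already been unified across the two files and uses the population standard deviation; whether the study standardized age within each file or on the pooled data is not stated, so that choice is left to the caller.

```python
from statistics import mean, pstdev


def dichotomise(values):
    """Turn a qualitative variable into one 0/1 indicator per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def standardise(values):
    """Rescale a quantitative variable to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("a constant variable cannot be standardised")
    return [(v - mu) / sigma for v in values]


# Toy values for one qualitative and one quantitative matching variable.
print(dichotomise(["single", "married", "single"]))
print(standardise([20, 40, 60]))
```

Each qualitative variable contributes as many binary columns as it has categories, which is why the distance formula sums over categories within variables.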

Then the two datasets were stratified. Two variables were used as stratification variables: "province of residence" and "employment status". The stratification procedure was conducted on the assumption that persons living in a given province and characterized by a given employment status will be more similar to one another than persons living in different provinces and having a different employment status. Another reason for stratifying the dataset was to optimize the algorithm [3]. As a result, 147 strata were created [4].

The measure of record similarity used in this case was the squared Euclidean distance, given by the formula:

$$ d_{A,B} = \sum_{i=1}^{N} \sum_{k=1}^{K_i} \left(a^A_{ik} - a^B_{ik}\right)^2 + \sum_{j=1}^{J} \left(x^A_j - x^B_j\right)^2, $$

where:

$a_{ik}$ - binary variables created in the process of dichotomization of qualitative variables ($k$-th category of the $i$-th variable),

$x_j$ - standardized quantitative variables.

For a given record in the recipient file, the algorithm searches for the record in the donor file for which the distance measure is minimal.
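The search described above can be sketched as follows. This is a minimal illustration and not the SPSS implementation used in the study: the function names, the two-variable encoding (sex and standardized age) and the example records are all assumptions made for the sketch.

```python
from typing import List, Sequence, Tuple

SEX_CATEGORIES = ["male", "female"]  # hypothetical coding of one qualitative variable


def dichotomize(value: str, categories: Sequence[str]) -> List[float]:
    """Return one 0/1 indicator per category of a qualitative variable."""
    return [1.0 if value == c else 0.0 for c in categories]


def encode(sex: str, age_z: float) -> List[float]:
    """Binary indicators for sex followed by the standardized age."""
    return dichotomize(sex, SEX_CATEGORIES) + [age_z]


def squared_euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two encoded records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_donor(recipient: Sequence[float],
                  donors: Sequence[Sequence[float]]) -> Tuple[int, float]:
    """Index of, and distance to, the donor record with minimal distance."""
    distances = [squared_euclidean(recipient, d) for d in donors]
    best = min(range(len(donors)), key=distances.__getitem__)
    return best, distances[best]


# Hypothetical records within a single stratum.
donors = [encode("male", -0.5), encode("female", 0.2), encode("male", 1.0)]
recipient = encode("male", 0.8)
best_index, best_distance = nearest_donor(recipient, donors)
```

With this encoding, a mismatch on a single qualitative variable contributes 2 to the distance, because two of its binary indicators differ.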

The choice of the Euclidean Squared Distance was motivated by the use of the integration algorithm developed by Bacher (2002). The algorithm was modified and adjusted for the purposes of integrating the Microcensus with the LFS. The study was performed under the conditional independence assumption (CIA).

³ In spite of dividing the dataset into strata, the integration lasted about 30 hours (Intel Core i5 processor, 4 GB RAM).

⁴ 49 provinces × 3 attributes of the employment status (working, unemployed, occupationally inactive).

The integration algorithm yielded a dataset containing 1 553 493 records (the number of records in the larger of the two datasets) and 144 variables describing the demographic and economic characteristics of Poland's population.

6.4.3 The assessment of the integration

The minimum distance function, used as the criterion for the integration of derived records, was characterized by a strong right skew (skewness coefficient equal to 3.18). The mean value of this function was 0.17, with a standard deviation of 0.45, median and mode equal to 0, and a maximum of 4.8. A histogram of the minimum distance function is shown in Figure 6.4.

Figure 6.4. Distribution of the minimum distance function

Source: own chart


The strong positive skew of the distribution of this function indicates that by far the most matches between records from the two datasets were made for very small values of the distance. This means that records identified as similar did not differ from one another at all or differed only slightly, which supports the choice of matching variables and of the distance function. However, the drawback of this approach is that as many as 49 493 records from the donor dataset (39.2 percent of the total) did not meet the similarity criterion determined by the distance function and were not matched even once. Within the set of successfully matched records, one record was matched on average 20.21 times, with a standard deviation of 21.75. The median number of matches was 14 and the mode was 3. The skewness coefficient of the number of matches was 3.29, which is evidence of a strong positive skew of the distribution of the number of donor records matched to recipient records. Disregarding the unmatched records, the minimum number of matches was 1 and the maximum was 378 (see Figure 6.5).

Figure 6.5. A box plot of the number of donor records matched with recipient records

Source: own chart


One very important criterion of matching quality is the equality of the distributions of the matching variables in the donor and recipient files (cf. D'Orazio, Di Zio, Scanu (2006)). Since, after the integration procedure, the fractions in the two datasets were based on very large numbers of records, applying classic tests used to verify the hypothesis of equality of distributions (e.g. the Kolmogorov-Smirnov test or the Z-test for the difference of proportions) results in rejecting the null hypothesis even when the differences in fraction sizes are small (in Table 6.3, the only fraction in the donor and recipient files after integration⁶ that did not display a significant difference⁷ is the attribute "20 000 - 49 999" of the variable "class of the place of residence"). For this reason, for large samples, Gołata (2009) suggested using a measure of the equality of distributions based on the coefficient of similarity

$$W_{p1} = \sum_{i=1}^{k} \min_{DR}(w_i),$$

where k is the number of attributes of the qualitative variable in the comparison and $\min_{DR}(w_i)$ is the smaller of the two fractions of the i-th variable attribute from the donor and recipient datasets. Distributions are considered "adequately" similar if $W_{p1} \geq 95\%$. D'Orazio, Di Zio, Scanu (2006), on the other hand, proposed a measure of the equality of distributions called the total variation distance:

$$\Delta(w_d, w_r) = \frac{1}{2} \sum_{i=1}^{k} |w_{d,i} - w_{r,i}|,$$

where $w_d$, $w_r$ are the fractions of the study variable in the donor and recipient files, respectively. Based on this measure, distributions are considered "adequately" similar if $\Delta \leq 3\%$.
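Both measures are simple to compute from two vectors of fractions. The sketch below is illustrative only; the function names are assumptions, and the input values are the rounded percentages for "sex" taken from Table 6.3 (donor after integration and recipient), so the results can differ in the last digit from those reported in Table 6.4.

```python
from typing import Sequence


def similarity_coefficient(donor: Sequence[float], recipient: Sequence[float]) -> float:
    """Coefficient of similarity: sum over attributes of the smaller fraction."""
    return sum(min(d, r) for d, r in zip(donor, recipient))


def total_variation_distance(donor: Sequence[float], recipient: Sequence[float]) -> float:
    """Total variation distance: half the sum of absolute differences."""
    return 0.5 * sum(abs(d - r) for d, r in zip(donor, recipient))


# Fractions in percent for the attributes (male, female), rounded as in Table 6.3.
donor_after = [47.86, 52.14]
recipient = [48.34, 51.66]

wp1 = similarity_coefficient(donor_after, recipient)
delta = total_variation_distance(donor_after, recipient)

# Thresholds quoted in the text: at least 95% and at most 3%.
adequately_similar = wp1 >= 95.0 and delta <= 3.0
```

When both sets of fractions sum to 100%, the two measures are complementary (the coefficient of similarity equals 100% minus the total variation distance), which is why each pair of columns in Table 6.4 adds up to 100%.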

Despite very similar fraction sizes for the different attributes of the matching variables, the Z-test for proportions showed no significant difference in only one case⁸. On the other hand, the coefficients proposed by Gołata (2009) and D'Orazio, Di Zio, Scanu (2006) not only indicated that the distributions of matching variables after integration were very similar, but also showed that their similarity had increased as a result of integration for all variables except marital status (see Table 6.4).
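The effect of sample size on the Z-test can be illustrated with the counts from Table 6.3. The sketch below uses a plain pooled two-proportion Z-test and treats the two files as independent samples; both are simplifying assumptions, and the Bonferroni correction applied in the study is not reproduced here. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def two_proportion_z(count_a: int, n_a: int, count_b: int, n_b: int) -> Tuple[float, float]:
    """Pooled two-proportion Z statistic and two-sided p-value."""
    p_a, p_b = count_a / n_a, count_b / n_b
    pooled = (count_a + count_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


N = 1_553_493  # records in the donor (after integration) and recipient files

# Males: 47.86% against 48.34%, a difference of less than half a percentage point.
z_sex, p_sex = two_proportion_z(743_531, N, 750_886, N)

# Towns with 20 000 - 49 999 inhabitants: the counts differ by only 796 persons.
z_town, p_town = two_proportion_z(164_034, N, 164_830, N)
```

Under these assumptions the small difference for males is still highly significant, whereas the difference of 796 persons is not, which is in line with the single non-significant case reported in the text.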

In the case of the continuous variable "age", the classic Kolmogorov-Smirnov test for the equality of distributions also produced significant differences. To compare distributions of quantitative variables for large samples, D'Orazio, Di Zio and Scanu (2006) suggested using two coefficients: the ratio of the arithmetic means, $\bar{x}_R / \bar{x}_D$, and the ratio of the standard deviations, $S_R / S_D$, where subscripts D and R refer to the donor and recipient, respectively. The first of the two coefficients for the variable "age" amounted to 1.001, while the other one was equal to 0.991, which also indicates a high degree of similarity between distributions. Figure 6.6 presents distributions of the age variable in the donor file before and after integration and in the recipient file.

⁶ The Z-test for proportions with the Bonferroni correction was used.

⁷ Significance level α = 0.05 was used.

⁸ The variable: "class of the place of residence", variable attribute: "town with 20 000 - 50 000 inhabitants" - the difference in fraction size was only 796 out of about 165 000 people.

Table 6.3. Comparison of fractions of various common qualitative variables in the datasets subjected to integration

Matching variable /                Donor                      Donor                      Recipient
variable attribute                 (before integration)       (after integration) a
                                   N           %              N           %              N           %
Sex
  male                             59 933      47.43          743 531     47.86          750 886     48.34
  female                           66 420      52.57          809 962     52.14          802 607     51.66
Marital status
  unmarried                        31 302      24.77          388 827     25.03          392 847     25.29
  married                          79 846      63.19          987 724     63.58          966 171     62.19
  widowed                          11 770       9.32          148 626      9.57          152 243      9.80
  divorced                          3 435       2.72           28 316      1.82           42 232      2.72
Class of the place of residence
  over 100 000                     34 089      26.98          244 791     15.76          242 021     15.58
  50 000 - 99 999                  11 620       9.20          124 334      8.00          123 301      7.94
  20 000 - 49 999                  13 362      10.58          164 034     10.56          164 830     10.61
  10 000 - 19 999                   7 914       6.26           94 485      6.08           98 618      6.35
  5 000 - 9 999                     4 303       3.41           39 303      2.53           45 338      2.92
  2 000 - 4 999                     3 095       2.45           26 753      1.72           32 469      2.09
  less than 2 000                     368       0.29            4 831      0.31            4 299      0.28
  village                          51 602      40.84          854 962     55.03          842 617     54.24
Education
  higher education                  8 398       6.65           75 287      4.85           79 495      5.12
  post-secondary school             2 887       2.28           35 534      2.29           35 841      2.31
  secondary vocational school      22 195      17.57          246 710     15.88          245 373     15.79
  secondary school                  8 861       7.01           87 883      5.66           91 529      5.89
  vocational middle school         32 860      26.01          410 674     26.44          406 217     26.15
  completed primary school         44 263      35.03          580 015     37.34          572 461     36.85
  no schooling completed            6 889       5.45          117 390      7.56          122 577      7.89

Source: own tabulation.

a The distribution of common variables of matched records from the donor file.

Table 6.4. Values of the coefficient of similarity and the total variation distance for matching variables before and after integration

Matching variable                  Before integration                     After integration
                                   coefficient of   total variation      coefficient of   total variation
                                   similarity       distance             similarity       distance
Sex                                99.10%            0.90%               99.53%           0.47%
Marital status                     99.00%            1.00%               98.61%           1.39%
Class of the place of residence    86.48%           13.52%               98.93%           1.07%
Education                          95.58%            4.42%               99.14%           0.86%

Source: own tabulation.

The last criterion of matching quality assessment is the extent to which the strength and direction of dependence between matching variables is preserved in the donor and recipient files before and after integration. Tables 6.5 and 6.6, which show the results of the χ2 test and contingency coefficients for selected variables, indicate that the strength and direction of dependence were preserved in the integrated dataset.

Table 6.5. Values of χ2 and coefficients describing the strength of dependence between the variables "sex" and "class of the place of residence" in the donor file before and after integration

Statistics                   Before integration      After integration       Degrees of   Critical
                             value      p-value      value       p-value     freedom      value
χ2                           6117.8     0            84805.9     0           6            12.5916
Pearson's φ coefficient      0.220      0            0.234       0
Cramer's V                   0.220      0            0.234       0
Contingency coefficient      0.215      0            0.228       0

Source: own tabulation

Table 6.6. Values of χ2 and coefficients describing the strength of dependence between the variables "class of the place of residence" and "education" in the donor file before and after integration

Statistics                   Before integration      After integration       Degrees of   Critical
                             value      p-value      value       p-value     freedom      value
χ2                           13090.5    0            178105      0           42           58.124
Pearson's φ coefficient      0.322      0            0.339       0
Cramer's V                   0.131      0            0.138       0
Contingency coefficient      0.306      0            0.321       0

Source: own tabulation
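The three coefficients in Tables 6.5 and 6.6 follow from the χ2 statistic and the number of records by standard formulas. The sketch below recomputes the "before integration" column of Table 6.6; the function name is an assumption, and the 8 × 7 table dimensions are inferred from the category counts in Table 6.3 and the 42 degrees of freedom.

```python
from math import sqrt
from typing import Tuple


def association_measures(chi2: float, n: int, n_rows: int, n_cols: int) -> Tuple[float, float, float]:
    """Pearson's phi, Cramer's V and the contingency coefficient from chi-squared."""
    phi = sqrt(chi2 / n)
    cramers_v = sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))
    contingency = sqrt(chi2 / (chi2 + n))
    return phi, cramers_v, contingency


# Donor file before integration: 126 353 records,
# class of the place of residence (8 attributes) by education (7 attributes).
phi, cramers_v, contingency = association_measures(13090.5, 126_353, 8, 7)
```

The same call with the "after integration" values (χ2 = 178105, n = 1 553 493) can be used to check the second column in the same way.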


Figure 6.6. Distribution of the "age" variable in the donor file before and after integration and in the recipient file

Source: own chart.

6.5 Summary

As a result of integrating the Microcensus and LFS by means of statistical matching using the SPSS statistical software package, it was possible to create a new synthetic dataset containing 144 variables (including 51 variables from the Microcensus and 93 from the LFS) and 1 553 493 records. In this way, Microcensus and LFS variables that had not been jointly observed could be observed together in one integrated database.

Quality assessment of the matching procedure revealed that the result was satisfactory in terms of criteria adopted in the literature:

• the strong positive skew of the distribution of the minimum distance function indicates a high degree of similarity between matched records;

• differences in fractions in the files after integration, although statistically significant (owing to the large sample size), are slight;

• differences between distributions of the "age" variable are also small;

• integration preserved the strength and direction of dependence between variables.

Among the drawbacks of the approach adopted in the process of integration one should mention the high number of unmatched records, resulting in a loss of information. To prevent this, it would be worthwhile to test methods that can reduce the number of unmatched records, e.g. assigning "punitive" weights to the minimum distance for records that have already been matched, and removing from the donor file any records that have been matched a certain number of times. Additional simulation research is necessary to verify the usefulness of such methods, which can be considered an interesting topic for further studies.
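One possible reading of these two suggestions is sketched below. It is a hypothetical illustration rather than a method tested in the study: the function name, the additive form of the penalty, the parameters `penalty` and `max_uses`, and the example records are all assumptions. The result also depends on the order in which recipient records are processed.

```python
from typing import List, Optional, Sequence, Tuple


def match_with_penalty(recipients: Sequence[Sequence[float]],
                       donors: Sequence[Sequence[float]],
                       penalty: float = 0.0,
                       max_uses: Optional[int] = None) -> Tuple[List[int], List[int]]:
    """Nearest-donor matching with a usage penalty and an optional usage cap.

    The cost of a donor is its squared Euclidean distance to the recipient
    plus `penalty` times the number of times it has already been used.
    Donors used `max_uses` times are skipped.
    """
    uses = [0] * len(donors)
    assignment: List[int] = []
    for rec in recipients:
        best, best_cost = None, float("inf")
        for j, don in enumerate(donors):
            if max_uses is not None and uses[j] >= max_uses:
                continue  # donor removed from the pool
            cost = sum((x - y) ** 2 for x, y in zip(rec, don)) + penalty * uses[j]
            if cost < best_cost:
                best, best_cost = j, cost
        if best is None:
            raise ValueError("no donor left for this recipient")
        uses[best] += 1
        assignment.append(best)
    return assignment, uses


# Hypothetical one-variable example: three identical recipients, two donors.
donors = [[0.0], [0.1]]
recipients = [[0.0], [0.0], [0.0]]
plain, plain_uses = match_with_penalty(recipients, donors)
penalised, penalised_uses = match_with_penalty(recipients, donors, penalty=0.05)
```

In this toy example the unpenalised search uses only the first donor, while the penalty spreads the matches over both donors at the cost of slightly larger distances.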


Bibliography

[1] Aukrust, I., Aurdal, P.S., Brathen, M. & Køber, T. (2010), Registerbasert sysselsettingsstatistikk – Dokumentasjon (in Norwegian), Documents 8/2010, Statistics Norway.

[2] Bacher J. (2002), Statistisches Matching - Anwendungsmöglichkeiten, Verfahren und ihre praktische Umsetzung in SPSS, ZA-Informationen, 51. Jg.; [Statistical matching - applications, procedures and application in SPSS].

[3] Bakker, B.F.M., Linder, F., Van Roon D. (2008), Could that be true? Methodological issues when deriving educational attainment from different administrative datasources and surveys, paper presented at IAOS Conference on Reshaping Official Statistics, Shanghai.

[4] Bakker, B.F.M. (2010), Micro integration: State of the art, in 'ESSnet on Data Integration'.

[5] Bascula 4, A tool for weighting sample survey data and variance estimation, http://www.cbs.nl/en-gb/menu/informatie/onderzoekers/blaise-software/blaise-voor-windows/productinformatie/ bascula 4, Statistics Netherlands.

[6] Bethlehem, J. (2009), Applied Survey Methods. A Statistical Perspective, Wiley Series.

[7] Canty, A. J., Davison, A. C. (1999), Resampling-based variance estimation for labour force surveys, The Statistician 48, pp. 379–391.

[8] Cobben, F. (2009), Nonresponse in Sample Surveys. Methods for Analysis and Adjustment, PhD-thesis, Statistics Netherlands and University of Amsterdam, pp. 51–53.


[9] De Leeuw, E., de Heer W. (2002), Trends in Household Survey Nonresponse: A Longitudinal and International Comparison, In: R.M. Groves, D.A. Dillman, J.L. Eltinge and R.J.A. Little (eds.), Survey Nonresponse (New York: Wiley), pp. 41–54.

[10] Deville, J. C., Särndal, C. E. (1992), Calibration Estimators in Survey Sampling, Journal of the American Statistical Association, 87 (418), 376–382.

[11] Di Zio M. (2007), What is statistical matching, Course on Methods for Integration of Surveys and Administrative Data, Budapest, Hungary.

[12] D'Orazio M., Di Zio M., Scanu M. (2006), Statistical Matching. Theory and Practice, John Wiley & Sons Ltd., England.

[13] Draft Report of WP1. State of the art on statistical methodologies for data integration.

[14] ESSnet ISAD (2008a), Literature review on micro integration processing, Chapter 3 of ESSnet ISAD Report of WP1, State of the art on statistical methodologies for integration of surveys and administrative data, pp. 40–48.

[15] ESSnet ISAD (2008b), Recommendations on micro integration processing methodologies, Chapter 4 of ESSnet ISAD Report of WP2, Recommendations on the use of methodologies for the integration of surveys and administrative data, pp. 49–58.

[16] Fosen, J. (2010), A brief description of the micro-integration in the Norwegian register-based employment statistics. Background material for the ESSnet Data Integration project, in progress.

[17] Groves, R.M., Fowler, F.J., Couper, M.P., Lepkowski, J.M., Singer, E., Tourangeau, R. (2004), Survey Methodology, New York: Wiley-Interscience.

[18] Gołata E. (2009), Raport. Opracowanie dla wybranych metod integracji danych reguł, procedur integracji danych z różnych źródeł, GUS internal materials, Poznań, Poland [Report. Development of selected methods for data integration rules, procedures, data integration from various sources].

[19] Heij, Vincent de (2011), Kiezen van hulpvariabelen voor de Lineaire Regressieschatter (Selection of auxiliary variables to be used by the Linear Regression Estimator), Discussion paper, Statistics Netherlands (in Dutch).

[20] ISCED (1997), International Standard Classification of Education, UNESCO.

[21] Istat (2003), Metodi statistici per il record linkage, Metodi e Norme n. 16, Anno 2003, A cura di Mauro Scanu.

[22] Istat (2011), RELAIS - Record linkage at Istat, software and User's guide available at: http://www.istat.it/strumenti/metodi/software/analisi dati/relais/

[23] Kish, L. (1992), Weighting for unequal Pi, In Journal of Official Statistics, 8, pp. 183–200.

[24] Kuijvenhoven, L., Scholtus, S. (2010), Estimating Accuracy for Statistics Based on Register and Survey Data, WP2 report, task 2.1.2, ESSnet project Data Integration.

[25] Kuijvenhoven, L., Scholtus, S. (2011a), Bootstrapping Combined Estimators based on Register and Survey Data, WP2 report, task 2.1.2, 2nd version, ESSnet project Data Integration.

[26] Kuijvenhoven, L., Scholtus, S. (2011b), Het nut van het combineren van registers en steekproeven (The usefulness of combining registers and sample), Discussion paper, Statistics Netherlands (in Dutch).

[27] Linder, F. (2004), The Dutch Virtual Census 2001: A new approach by combining Administrative Registers and Household Sample Surveys, Austrian Journal of Statistics, 33, pp. 69–88.

[28] Linder, F., van Roon, D. (2009), Deriving educational attainment by combining data from administrative sources and sample surveys, paper presented at International Conference Statistics Investment in the future 2, Prague.

[29] Little, R.J.A., Vartivarian, S. (2005), Does Weighting for Nonresponse Increase the Variance of Survey Means?, Survey Methodology, volume 31, pp. 161–168.

[30] Nascimento Silva, P.L.D., C.J. Skinner (1997), Variable selection for regression estimation in finite populations, Survey Methodology, volume 23 no. 1, pp. 23–32.


[31] OECD (2002), Frascati Manual 2002: Proposed Standard Practice for Surveys on Research and Experimental Development, Paris 2002.

[32] Rao, J. N. K. (2003), Small Area Estimation, Wiley Series in Survey Methodology, Wiley.

[33] Raessler, S. (2002), Statistical Matching. A Frequentist Theory, Practical Applications, and Alternative Bayesian Approaches, Springer, New York, USA.

[34] Särndal, C.E., Swensson, B., Wretman, J. (1992), Model Assisted Survey Sampling, (Springer).

[35] Scanu, M. (2010), Introduction to statistical matching, [in:] ESSNet on Data Integration. Draft Report of WP1. State of the art on statistical methodologies for data integration, ESSNet.

[36] Schaart, R., Westerman S., Mies Bernelot Moens (2008), The Dutch standard classification of education SOI 2006, Statistics Netherlands.

[37] Schulte Nordholt, E., Linder F. (2007), Record matching for census purposes in the Netherlands, Statistical Journal of the IAOS, 24, pp. 163–171.

[38] Schulte Nordholt, E., Linder F. (2008), Combining data sources: micro linkage and micro integration, In ESSnet ISAD Report of WP1. State of the art on statistical methodologies for integration of surveys and administrative data, section 3.2.1.

[39] Schouten, B. (2007), A Selection Strategy for Weighting Variables under a Not-Missing-at-Random Assumption, In Journal of Official Statistics, 23, pp. 1–19.

[40] Spjøtvoll, E., Thomsen, I. (1987), Application of some empirical Bayes methods to small area statistics, Bulletin of the International Statistical Institute, vol. 2, 435–449.

[41] van der Putten P., Kok J. N., Gupta A. (2002), Data Fusion through Statistical Matching, Center for eBusiness, MIT, USA.

[42] Zhang, L.-C. (2003), Simultaneous estimation of the mean of a binary variable from a large number of small areas, Journal of Official Statistics, vol. 19, 253–263.


[43] Zhang, L.-C. (2010), On micro-data quality in data integration and some related topics, in progress.

[44] Zhang, L.-C. (2011a), Topics of Statistical theory for register-based statistics, paper for the conference ISI 2011 in Dublin, Ireland.

[45] Zhang, L.-C. (2011b), Topics of statistical theory for register-based statistics and data integration, Statistica Neerlandica.

[46] UNECE (2007), Register-based statistics in the Nordic countries. Review of best practices with focus on population and social statistics.


List of Figures

1.1 Flowchart illustrating the steps described in Section 1.5–Section 1.7

1.2 The life cycle suggested by Bakker (2010; Figure 5, page 74).

1.3 The two-phase life-cycle suggested by Zhang (2011a).

2.1 The standard deviation of Yi against the square root of the LFS sample size. Direct estimates as filled circles, the upper limit function as dashed line and the GVF-function as solid line.

2.2 Square root of MSE for LFS-employment (dashed line), upper 95% confidence interval for square root of MSE for REG-employment (dotted lines), and the estimated underlying REG-bias uiβ (solid line); lower confidence interval truncated to zero. Model (2.6) in left panel and model (2.16) in right panel.

2.3 Results when fitting Type A model: the square root of MSE of REG-employment against that of LFS-employment. MSE of REG-employment based on EBLUP estimates. Left panel is model (2.6) and right panel is model (2.16).

2.4 The square root of MSE of REG-employment against that of LFS-employment, for municipalities. REG-bias based on the Gaussian approach, with bias estimates b∗Gi. Model (2.6).

2.5 The bias of REG-employment based on the EBLUP estimates. Model (2.6).

2.6 The parameter estimate γi (2.19) (solid line), the overshrinkage-factor adjusted γofi = (γi)^(1/2) (dashed line).

2.7 Bias of REG-employment using EBLUP estimates ('1') and when using adjusted ('2'). For the following methods: Gaussian-based overshrinkage-adjusted b∗Gi (left panel), and constrained empirical Bayesian (CEB) overshrinkage-adjusted bofi (right panel). Model (2.6).

2.8 The square root of MSE of REG-employment against that of LFS-employment, for municipalities. REG-bias based on the CEB overshrinkage adjusted bias estimate βofi. Model (2.6).

3.1 Education Archive 2009, fragment (fictitious example).

3.2 Population coverage by source in Educational Attainment File (EAF), September 2008

3.3 Population coverage by source and age in Educational Attainment File, September 2008

3.4 Values of indicators wS = √wV and wB, five education levels and 91 weighting models, EAF 2005.

3.5 Values of indicators wS = √wV and wB for EAF 2005, five education levels and 91 weighting models, Moroccans aged 18-30.

4.1 Used database tables from PATSTAT; COUNTRY CODE = "IT".

4.2 Distribution of both length of Standardized Name and number of words in a name, PATSTAT database.

4.3 The PATSTAT-ASIA linkage problem and its opportunities.

4.4 The PATSTAT-ASIA neighbourhoods.

4.5 Number of words in Standardized Names corresponding to empty Neighbourhoods.

5.1 Framework on errors in register based statistics (Bakker-CBS)

5.2 Distribution of respondents units linked with administrative data by range of differences for Turnover

5.3 Coverage analysis by legal type and size class

5.4 Integrated survey outcome

5.5 "Life cycle" and errors in an integrated survey

6.1 The process of statistical matching

6.2 Datasets to be integrated using the method of statistical matching

6.3 Integrating two datasets using the method of statistical matching

6.4 Distribution of the minimum distance function

6.5 A box plot of the number of donor records matched with recipient records

6.6 Distribution of the "age" variable in the donor file before and after integration and in the recipient file

B.1 Average values of indicators wS = √wV and wB for fifteen models a to o (response situations MCAR, MAR and NMAR) and a model which is based only on initial weights (indicated with star symbol).

B.2 Average values of bias indicator wB and B for fifteen models a to o (response situations MCAR, MAR and NMAR) and a model based only on initial weights (indicated with star symbol).

B.3 Average values of variance indicator wS = √wV and estimates for standard error S(tY) = √V(tY) for fifteen models a to o (response situations MCAR, MAR and NMAR) and a model based only on initial weights (indicated with star symbol).


List of Tables

3.1 Scaling process, Educational Attainment File 2008

3.2 Education levels for ages 0-14 and 15+ in the Netherlands, September 2008¹⁾, contribution from registers

3.3 Education levels for ages 0-14 and 15+ in the Netherlands, September 2008¹⁾, contribution from registers and LFS (weighted with scaled LFS weights)

3.4 Education levels for ages 0-14 and 15+ in the Netherlands, September 2008¹⁾, contribution from registers and LFS (weighted with calibrated LFS weights)

3.5 Values of indicators wS and wB and correlation of response indicator IR and target variable Y with the auxiliary variable set $\vec{\beta}^{t}\vec{X}$ for a number of weighting models, education level SCED 5.

3.6 Values of estimator tY and of indicators wS = √wV and wB for the complete model, model 1¹⁾ and model 2²⁾.

3.7 Coefficient of variation (CV) for some highly educated (Master's degree level or higher) Turkish males and females in the Netherlands, aged 18-30, September 2008.

4.1 Statistics on the number of active enterprises in Italy, period 1998-2008.

4.2 Distribution of Legal Form, PATSTAT database

4.3 Number of active enterprises in the search space, by year

4.4 ASIA: percentage of enterprises by Legal Form and year

4.5 ASIA 2008: percentage of enterprises by Legal Form and year

4.6 Number and percentage of records without Neighbourhood.

4.7 Patenting enterprises by size (classes of employees)

4.8 Patenting enterprises (active in 2008) by economic activity (2 digit NACE 2007): the ten most frequent NACE divisions

5.1 KS indicator gives a ranking between the sources

5.2 Coverage of the initial sample by type of response and administrative data

6.1 Comparison of the Microcensus and the LFS

6.2 The size of the datasets to be integrated at successive stages of harmonization

6.3 Comparison of fractions of various common qualitative variables in the datasets subjected to integration

6.4 Values of the coefficient of similarity and the total variation distance for matching variables before and after integration

6.5 Values of χ2 and coefficients describing the strength of dependence between the variables "sex" and "class of the place of residence" in the donor file before and after integration

6.6 Values of χ2 and coefficients describing the strength of dependence between the variables "class of the place of residence" and "education" in the donor file before and after integration

A.1 Example, rule 1 (U≈3, education level at D is valid until 3 years after date of attainment D)

B.1 Number of persons with Y = 0 and Y = 1 as determined by the logit(p) formula for different values of the auxiliary variables X1, X2, X3, X4, X5, X7 and X9¹⁾

B.2 Fifteen weighting models a,...,o based on different sets of auxiliary variables


Appendix AProbabilistic decision rules in more detail

Life Tables Method as applied with probabilistic decision rule 1.The survival function S (t) = P [T ≥ t |D] gives the probability that educa-tional attainment has not changed within t years after the last observed dateof attainment D.

S(t) is a monotonous decreasing function of t. The distribution of the sur-vival function S(t) is determined empirically on the basis of the LFS fora succession of years.

The advantage of using the Life Tables Method is that it correctly dealswith censored cases and so avoids bias in estimation of life time, or in otherwords prevents underestimation of the number of years in which level ofeducation remains unchanged. Censored cases are cases in which a possibletermination of an event occurs beyond a time horizon of observation, andtherefore is non-observed.

Probabilistic decision rule 1.

Interval period t = 1, 2, ...: first year [0;1] after date D at which a certain education level has been attained (t=1); second year [1;2] after date D (t=2); ...; period [t-1;t] for t

Proportion terminating: probability Pr(t) that the education level has increased in interval [t-1;t]; in terms of survival analysis, the probability that an event (termination) has occurred in interval [t-1;t].

Proportion surviving: probability 1-Pr(t) that the education level has not changed within interval [t-1;t]; in terms of survival analysis, the probability that there was no event (termination) in interval [t-1;t].


Life Tables Survival Function (cumulative proportion surviving at end):

S(t) = (1-Pr(1)).(1-Pr(2))...(1-Pr(t)): probability that, since the date of attainment D of a certain education level, that level has not changed in the first interval, the second interval, ... up to interval t, or in other words in [0;t], t = 1, 2, ...

An upper bound U is sought, namely the largest U such that:

S(U) = (1-Pr(1)).(1-Pr(2))...(1-Pr(U)) ≥ 0.95


Table A.1. Example, rule 1 (U≈3, education level at D is valid until 3 years after date of attainment D)

Columns: t = interval period; 1-Pr(t) = proportion surviving; S(t) = (1-Pr(1)).(1-Pr(2))...(1-Pr(t)).

 t  1-Pr(t)  S(t)       t  1-Pr(t)  S(t)       t  1-Pr(t)  S(t)       t  1-Pr(t)  S(t)
 1  0.9962   0.9962    16  0.9921   0.5620    31  0.9977   0.5295    46  1.0000   0.5268
 2  0.9869   0.9832    17  0.9930   0.5580    32  0.9985   0.5287    47  1.0000   0.5268
 3  0.9668   0.9505    18  0.9929   0.5540    33  0.9989   0.5281    48  1.0000   0.5268
 4  0.9122   0.8671    19  0.9950   0.5512    34  0.9994   0.5277    49  0.9968   0.5251
 5  0.9271   0.8039    20  0.9948   0.5484    35  1.0000   0.5277    50  1.0000   0.5251
 6  0.9153   0.7358    21  0.9959   0.5461    36  0.9993   0.5273    51  1.0000   0.5251
 7  0.9365   0.6891    22  0.9946   0.5432    37  1.0000   0.5273    52  1.0000   0.5251
 8  0.9464   0.6522    23  0.9945   0.5402    38  1.0000   0.5273    53  0.9944   0.5221
 9  0.9613   0.6269    24  0.9997   0.5401    39  1.0000   0.5273    54  1.0000   0.5221
10  0.9708   0.6086    25  0.9961   0.5380    40  0.9989   0.5268    55  1.0000   0.5221
11  0.9810   0.5970    26  0.9977   0.5368    41  1.0000   0.5268    ..  ..       ..
12  0.9825   0.5866    27  0.9959   0.5345    42  1.0000   0.5268    ..  ..       ..
13  0.9865   0.5787    28  0.9970   0.5329    43  1.0000   0.5268    ..  ..       ..
14  0.9885   0.5720    29  0.9980   0.5319    44  1.0000   0.5268    ..  ..       ..
15  0.9902   0.5665    30  0.9978   0.5307    45  1.0000   0.5268    ..  1.0000   0.5221
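The calculation behind rule 1 can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names are hypothetical, and the input values are the first ten proportions surviving 1-Pr(t) as read from the example table for rule 1. Because those published values are rounded to four decimals, recomputed S(t) values may differ from the table in the last digit.

```python
# Proportions surviving 1 - Pr(t) for t = 1..10, copied from the rule 1 example.
PROPORTION_SURVIVING = [0.9962, 0.9869, 0.9668, 0.9122, 0.9271,
                        0.9153, 0.9365, 0.9464, 0.9613, 0.9708]


def survival_function(proportion_surviving):
    """Return [S(1), S(2), ...] as cumulative products of 1 - Pr(t)."""
    s_values, cumulative = [], 1.0
    for q in proportion_surviving:
        cumulative *= q
        s_values.append(cumulative)
    return s_values


def upper_bound(s_values, threshold=0.95):
    """Largest t with S(t) >= threshold (0 if S(1) is already below it)."""
    u = 0
    for t, s in enumerate(s_values, start=1):
        if s < threshold:
            break
        u = t
    return u


S = survival_function(PROPORTION_SURVIVING)
U = upper_bound(S)
print([round(s, 4) for s in S[:4]], U)
```

With these inputs the upper bound comes out at U = 3, in line with the caption of the example table.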


Life Tables Method as applied with probabilistic decision rule 2.

The conditional survival function SJ(t) = P[T ≥ t | J,D] gives the probability that educational attainment has not changed within t years after the last observed date of attainment D, given empirical knowledge that this education level has remained unchanged for J years.

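\[
S_J(t) = P[T \ge t \mid J, D] = \frac{P[T \ge t \wedge T \ge J \mid D]}{P[T \ge J \mid D]} =
\begin{cases}
1, & \text{when } t \le J \text{ (trivial case)} \\
\dfrac{S(t)}{S(J)}, & \text{when } t > J
\end{cases}
\tag{A.1}
\]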

Using the notation of the Life Tables Method above, SJ(t) can be written as:

SJ(t) = (1-Pr(J+1))...(1-Pr(t)) when t > J

An upper bound M has to be sought, namely the largest M such that:

SJ(M) = (1-Pr(J+1)).(1-Pr(J+2))...(1-Pr(M)) ≥ 0.95

Notice that if J=0 (date of attainment ≈ interview date) we are back in the situation of decision rule 1, as S0(t) = (1-Pr(1)).(1-Pr(2))...(1-Pr(t)) = S(t).

Decision rule 2 is in fact intended for J>0.

In the following table, the matrix SJ(t) is given for all combinations of t and J. The calculations are based on the probabilities given in the example of rule 1. All the combinations which give a probability of 0.95 or higher are shaded in the table.

The table shows that for both J=1 and J=2 the upper bound is M=3 (in terms of integers).

The shaded areas are mainly found for larger values of J. For example, if J=15, the upper bound is M=24 (in terms of integers). This means that one can consider the education level that was observed 15 years before the interview date as still valid until 9 years after that interview date.

Notice that for J ≥ 20 probabilistic decision rule 2 considers the last observed education level of a person valid for the rest of this person's life.

One can recognise in the matrix elements SJ(J+1) the proportions surviving (second column in the table of the example for rule 1). This is because SJ(J+1) = 1 − Pr(J+1).


E.g. matrix element (2nd row, 1st column) = S1(2) = 0.9869, which is equal to 1-Pr(2).

Matrix SJ(t)
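Individual elements of the matrix SJ(t), and the upper bound M for a given J, can be recomputed from the proportions surviving. The sketch below is illustrative only: the function names are hypothetical, and the list Q holds the values of 1-Pr(t) for t = 1 to 30 as read from the example for rule 1, so results for larger t are outside its reach.

```python
# 1 - Pr(t) for t = 1..30, copied from the rule 1 example (index 0 holds t = 1).
Q = [0.9962, 0.9869, 0.9668, 0.9122, 0.9271, 0.9153, 0.9365, 0.9464,
     0.9613, 0.9708, 0.9810, 0.9825, 0.9865, 0.9885, 0.9902, 0.9921,
     0.9930, 0.9929, 0.9950, 0.9948, 0.9959, 0.9946, 0.9945, 0.9997,
     0.9961, 0.9977, 0.9959, 0.9970, 0.9980, 0.9978]


def conditional_survival(q, j, t):
    """S_J(t): 1 when t <= J, else the product of 1 - Pr(s) for s = J+1..t."""
    prob = 1.0
    for s in range(j + 1, t + 1):
        prob *= q[s - 1]
    return prob


def upper_bound_m(q, j, threshold=0.95):
    """Largest t with S_J(t) >= threshold.

    Returns J itself if S_J(J+1) is already below the threshold, and None
    if the threshold is never crossed within the supplied values of q.
    """
    for t in range(j + 1, len(q) + 1):
        if conditional_survival(q, j, t) < threshold:
            return t - 1
    return None


for j in (1, 2, 15):
    print(j, upper_bound_m(Q, j))
```

For J = 1, 2 and 15 this gives M = 3, 3 and 24, matching the values quoted in the text. A return value of None only says that the 0.95 threshold is not crossed within the 30 values supplied here.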


Appendix B

A simulation study on the effects on variance and bias of the regression estimator with different selections of auxiliary variables

In a simulation study it will be shown how the variance and bias indicator operate in the case of an imaginary population, for which it is assumed that every detail is known, varying from background variables X and target variable Y to response behaviour.

Description of the imaginary population

It is assumed that the imaginary population consists of 25 thousand persons.

For each person in this population values are known of nine auxiliary variables X1, X2, ..., X9, and the target variable Y.

The nine auxiliary variables are all categorical variables, each of them with three categories.

The target variable is assumed to be dichotomous with two possible values, zero or one.

Each of the three categories of Xj (j = 1, ..., 9) is assumed to be equally probable.

The value of target variable Y is determined on the basis of a Bernoulli experiment, in which the probability p = P(Y = 1) is set to depend on the values of X1, X2, ..., X5 in the following way:

\[
\operatorname{logit}(p) = \log\frac{p}{1-p} = -I_3(X_1) - 2 \cdot I_3(X_2) - 3 \cdot I_3(X_3) - 2 \cdot I_9(X_4, X_5) \tag{B.1}
\]

The function I3(Xi) (i = 1, ..., 3) attaches one of the three values −1/2, 0, 1/2 in random order to the three categories of the categorical variable Xi. Notice that, because of the composition of logit(p), a value of −1/2 attached to a category of Xi (i = 1, ..., 3) increases the probability p = P(Y = 1), whereas a value of 1/2 reduces this probability. It is also clear that X3 has a stronger effect on logit(p) than X2 because its coefficient is larger in absolute value, and for the same reason X2 has a stronger effect than X1. In other words, X3 has a higher correlation with variable Y than variable X2, and X2, in turn, has a stronger correlation with Y than variable X1.

The function I9(Xi, Xj) attaches the values −1/2, −3/8, −1/4, −1/8, 0, 1/8, 1/4, 3/8, 1/2 in random order to the nine different combined categories of a pair (Xi, Xj) (i, j = 1, ..., 3).

After applying the probability mechanism as described, the imaginary population appears to consist of 12,618 persons for which Y = 0 and 12,382 persons for which Y = 1.
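The construction of the imaginary population can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the report: the function names, the seed and the coding of categories as 0, 1, 2 are hypothetical, and since the random draws differ, the sketch will not reproduce the exact counts 12,618 and 12,382.

```python
import math
import random


def make_scores(rng, values):
    """Attach the given values to categories in random order."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


def simulate_population(size=25_000, seed=1):
    """Return a list of (x, y) with x the categories of X1..X9 and y in {0, 1}."""
    rng = random.Random(seed)
    # I3: one score table per variable X1, X2, X3 (categories coded 0, 1, 2).
    i3 = [make_scores(rng, [-0.5, 0.0, 0.5]) for _ in range(3)]
    # I9: one score per combined category of the pair (X4, X5).
    i9 = make_scores(rng, [k / 8 for k in range(-4, 5)])
    population = []
    for _ in range(size):
        # Every category of every auxiliary variable is equally probable.
        x = [rng.randrange(3) for _ in range(9)]
        logit_p = (-i3[0][x[0]] - 2 * i3[1][x[1]] - 3 * i3[2][x[2]]
                   - 2 * i9[3 * x[3] + x[4]])
        p = 1.0 / (1.0 + math.exp(-logit_p))
        y = 1 if rng.random() < p else 0  # Bernoulli experiment for Y
        population.append((x, y))
    return population


population = simulate_population()
print(sum(y for _, y in population), "persons with Y = 1")
```

Because the scores are symmetric around zero, roughly half of the simulated persons end up with Y = 1, which is consistent with the near-even split reported above.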

From the table below it can be seen that, as expected, the highest correlation between target variable Y and the auxiliary variables Xj is the one between Y and X3. The correlation is weaker with the variables X2, X4 and X5 and even weaker where X1 is concerned. Again, in line with expectations, the target variable and the auxiliary variables X7 and X9 show hardly any correlation.

Sampling design and response models

A so-called Bernoulli sample is taken from the imaginary population (N = 25 thousand people) with a sampling fraction p of 30 percent. The expected value of the sample size is N·p = 7,500 persons, with a standard error of √(N·p·(1−p)) = 72 persons.
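The sample size figures follow from the binomial distribution of the size of a Bernoulli sample. A small sketch, with a hypothetical helper name and an arbitrary seed, checks the arithmetic and draws one such sample:

```python
import math
import random


def bernoulli_sample(units, fraction, rng):
    """Include each unit independently with probability `fraction`."""
    return [u for u in units if rng.random() < fraction]


N, p = 25_000, 0.30
expected_size = N * p                        # N.p
standard_error = math.sqrt(N * p * (1 - p))  # sqrt(N.p.(1 - p))

sample = bernoulli_sample(range(N), p, random.Random(2024))
print(round(expected_size), round(standard_error), len(sample))
```

The realised sample size varies from draw to draw, which is why the text speaks of an expected value with a standard error rather than a fixed size.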

Next, a Bernoulli experiment is carried out which divides the sample population into a responding and a non-responding part.

The response probability attached to the Bernoulli process will be denoted as θ.

Three different response models, often mentioned in the statistical literature1, will be distinguished: Missing Completely At Random (MCAR), Missing At Random (MAR) and Not Missing At Random (NMAR).

1 See for example Schouten (2007) or Bethlehem (2009).

Table B.1. Number of persons with Y = 0 and Y = 1 as determined by the logit(p) formula for different values of the auxiliary variables X1, X2, X3, X4, X5, X7 and X9 1)

X1       Category 2   Category 1   Category 3   Total
Y = 0    4,920        4,186        3,512        12,618
Y = 1    3,442        4,080        4,860        12,382

X2       Category 3   Category 2   Category 1   Total
Y = 0    5,588        4,267        2,763        12,618
Y = 1    2,761        4,090        5,531        12,382

X3       Category 2   Category 1   Category 3   Total
Y = 0    6,454        4,274        1,890        12,618
Y = 1    1,869        4,045        6,468        12,382

X4       Category 2   Category 1   Category 3   Total
Y = 0    5,129        4,085        3,404        12,618
Y = 1    3,283        4,218        4,881        12,382

X5       Category 1   Category 3   Category 2   Total
Y = 0    4,313        4,239        4,066        12,618
Y = 1    3,925        4,237        4,220        12,382

X7       Category 2   Category 1   Category 3   Total
Y = 0    4,307        4,270        4,041        12,618
Y = 1    4,115        4,235        4,032        12,382

X9       Category 1   Category 2   Category 3   Total
Y = 0    4,270        4,184        4,164        12,618
Y = 1    4,086        4,064        4,232        12,382

1) For each auxiliary variable the first column presents the category which delivers the highest number of persons with Y = 0, whereas the last column presents the category which delivers the lowest number of persons with Y = 0. Because of the random allocation among the categories of auxiliary variables and target variable, sometimes category 1 has the highest number of persons allocated with Y = 0, and sometimes it is either category 2 or 3.

Missing Completely At Random (MCAR)

With Missing Completely At Random (MCAR) the response probability is independent of a person's characteristics; in other words, every person has the same response probability. This implies that the expected value of the response mean (average over all respondents in the sample) will be equal to the expected value of the sample mean. So, in the case of MCAR no bias is caused by nonresponse if the population mean of Y is estimated, because the nonresponse is not selective (see e.g. Bethlehem, 2009).

An example of an MCAR response model is one in which the response probability is defined as logit(θMCAR) = 0 for each person in the sample (i.e. θMCAR = 0.5). It is this model that will be used for the MCAR part of the simulation process.

Missing At Random (MAR)

With Missing At Random (MAR) the response probability depends on the known background characteristics of a person, but not on the unknown target variable. In this case the population mean is expected to be different from the response mean in the sample. Therefore, MAR will usually result in biased estimation of the population mean. However, this can be adjusted for by using the same background variables that affect the response probabilities as auxiliary variables in the regression model.


An example of an MAR response model is one in which the response probability is related to the variables X1, X2, X6 and X7, defined before, in the following way:

\[
\operatorname{logit}(\theta_{MAR}) = -I_3(X_1) - 2 \cdot I_3(X_2) - I_3(X_6) - 2 \cdot I_3(X_7) \tag{B.2}
\]

It is this model that will be used for the MAR part of the simulation process.

Not Missing At Random (NMAR)

With Not Missing At Random (NMAR) the response probability depends on the known characteristics of a person as well as on the unknown characteristics, i.e. the target variable. As with the MAR model, the population mean will differ from the response mean in the sample. Accordingly, with an NMAR type of nonresponse the estimation of the population mean will in general suffer from bias. However, in the NMAR case it is not possible to adjust for bias by using the known characteristics of the person as auxiliary information.

An example of an NMAR response model is where θNMAR has the followingform:

\[
\operatorname{logit}(\theta_{NMAR}) = -I_3(X_1) - I_3(X_6) - 2 \cdot I_3(Y) \tag{B.3}
\]

It is this model that will be used for the NMAR part of the simulation process.
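The three response models differ only in what enters the logit. The sketch below turns each logit into a response probability θ. It rests on assumptions that go beyond the text: the function names and the dictionary layout are hypothetical, the score tables are example values, and since the text does not say how I3 is applied to the dichotomous Y, the scores −1/2 for Y = 0 and 1/2 for Y = 1 are an illustrative choice only.

```python
import math


def inv_logit(z):
    """Map a logit value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def theta_mcar():
    """MCAR: logit(theta) = 0 for every person."""
    return inv_logit(0.0)


def theta_mar(scores, x):
    """MAR, formula (B.2): depends on X1, X2, X6 and X7 only."""
    z = (-scores["X1"][x["X1"]] - 2 * scores["X2"][x["X2"]]
         - scores["X6"][x["X6"]] - 2 * scores["X7"][x["X7"]])
    return inv_logit(z)


def theta_nmar(scores, x, y):
    """NMAR, formula (B.3): depends on X1, X6 and the target variable Y."""
    z = -scores["X1"][x["X1"]] - scores["X6"][x["X6"]] - 2 * scores["Y"][y]
    return inv_logit(z)


# Example score tables (assumed): category index -> attached value.
scores = {name: [-0.5, 0.0, 0.5] for name in ("X1", "X2", "X6", "X7")}
scores["Y"] = {0: -0.5, 1: 0.5}  # assumption, see lead-in
person = {"X1": 0, "X2": 0, "X6": 0, "X7": 0}

print(theta_mcar(), theta_mar(scores, person), theta_nmar(scores, person, 1))
```

Whether a sampled person responds is then decided by a Bernoulli draw with the person's θ, exactly as in the sampling step.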

Selecting auxiliary variables in the weighting model

With the sampling design and the response models defined above, a set of respondents will be generated from the imaginary population. The Generalized Regression Estimator estimates the population total of target variable Y, using the data on the background variables of the respondents. Nine auxiliary variables X1, ..., X9 are available (see above), from which any combination can be tested in a model specification. The criterion for the best selection of auxiliary variables in a model will be based on the outcome of the variance and the bias indicator, wV and wB, which were introduced in section 3.4.2. In the table below 15 different models, a to o, are tested for this purpose. Many different sets can be made up from combinations of variables X1, ..., X9. Of necessity the tests are restricted to some 15 models, which are selected in such a way as to provide some basis for the conclusions drawn in section 3.4.2.

Table B.2. Fifteen weighting models a, ..., o based on different sets of auxiliary variables

Model   Set of auxiliary variables                    Properties
a       X1                                            Correlates with Y, θMAR and θNMAR.
b       X2                                            Correlates with Y, θMAR and θNMAR.
c       X3                                            Correlates with Y and θNMAR.
d       X4                                            Correlates with Y and θNMAR.
e       X5                                            Correlates with Y and θNMAR.
f       X6                                            Correlates with θMAR and θNMAR.
g       X7                                            Correlates with θMAR.
h       X8
i       X4 + X5                                       Correlates with Y and θNMAR.
j       X4 * X5                                       Correlates with Y and θNMAR.
k       X1 + X2 + X3 + X4 * X5                        Model for Y.
l       X1 + X2 + X6 + X7                             Model for θMAR.
m       X1 + X2 + X3 + X4 * X5 + X6                   Model for θNMAR.
n       X1 + X2 + X3 + X4 * X5 + X6 + X7              Model for θMAR and Y.
o       X1 * X2 * X3 + X4 * X5 + X6 * X7 * X8 * X9    Bigger model than strictly necessary

1. One conclusion of section 3.4.2 states that adding auxiliary variables which have low correlation with the target variable may even result in an increase of the variance of the regression estimator. An example of such a case is the last model o, in which the term X6 * X7 * X8 * X9 has no correlation with the target variable Y. The consequence for the variance will be tested by applying variance indicator wV.

2. A second conclusion of section 3.4.2 states that if auxiliary variables correlate more strongly with the target variable, the bias indicator wB is supposed to give a lower value. Models b and c are examples of sets of one auxiliary variable (X2 and X3 respectively) which have a higher correlation with Y than for example X7 (in model g) and X8 (in model h). The term X4 * X5 (model j) has a stronger correlation with Y than the term X4 + X5 (model i). By comparing the values of wB for the different models mentioned, the conclusions for the effect on bias should be validated.

3. The conclusion at the end of section 3.4.2 states that a higher correlation between auxiliary variables and target variable (e.g. model k) would tend to reduce the value of the bias indicator wB more than the correlation between auxiliary variables and response indicator (e.g. models f and g). This should be confirmed in the simulations by comparing the values of wB for the different models.

To test the 15 models, the values of the variance indicator and the bias indicator will be determined for each of them. It should be kept in mind that, like the estimator of the population total, the indicators themselves also have statistical margins of error. This might affect a correct assessment of the qualities of the different models. Therefore, it was decided to resample 10,000 times and, for each sample, to select a new group of respondents based on the response models. Each time the variance and bias indicator will be calculated again. The comparison of the models will not be based on just one value, but on the average of the 10,000 values of the two indicators determined in this way. Because of the averaging, the margins are much smaller.

The three graphs in the figure below display the average values of the variance and bias indicator for each of the fifteen models. The first graph describes the MCAR response situation, in the second graph the associated response model is MAR, and in the third graph NMAR. The bias indicator wB is presented along the horizontal axis, whereas the variance indicator (in fact its square root wS = √wV) is presented vertically.

A position at the left side of the graph indicates a low upper bound for the absolute value of the bias of the regression estimator. The more to the right the position in the graph, the higher this upper bound is. From what is shown hereafter, it will appear that any statement made for the upper bound will in general also be valid for the absolute value of the bias itself. A position in the upper part of the graph indicates a higher variance.

It is immediately noticeable that weighting models with a strong correlation between auxiliary variables and target variable show low values for the variance indicator and, as was discussed, also for the bias indicator. Of all fifteen weighting models in the table, the four models k, m, n and o, which have a specification very much related to the logit(p) model that sets target variable Y, display the highest correlation. So, it is no coincidence that all four are located in the bottom left-hand corner of the graph.

When comparing models a, b and c, one can see that model c has the lowest position for both indicators wB and wS. This is not surprising because, as was shown before, X3 (in model c) has a higher correlation with target variable Y than variable X2 (in model b), and even more so in comparison with variable X1 (in model a).

A similar result is found for models i and j. As mentioned above in the second conclusion, in model j (multiplicative form X4 * X5) there is a stronger correlation with target variable Y than in model i (additive form X4 + X5). From the graphs it can be seen that model j is superior in terms of wB and, to a slightly lesser extent, also in terms of wS.

The graphs also clearly show that the degree of correlation between auxiliary variables and response probabilities hardly affects the value of the bias indicator. For example, in the case of model f, which only correlates with response probabilities θMAR and θNMAR (and not with Y), indicator wB gets value one. A value of one is typical for an estimator that does not use any auxiliary information.


Figure B.1. Average values of indicators wS (= √wV) and wB for fifteen models a to o (response situations MCAR, MAR and NMAR) and a model which is based only on initial weights (indicated with a star symbol).

Another example is provided by the three models k, m and n. Model k is different from the other two, in the sense that its set of auxiliary variables only correlates with Y, whereas models m and n have auxiliary variables that also correlate with θNMAR and θMAR. However, all three models exhibit comparable values for indicator wB.

Model l is another example that shows that correlation between auxiliary variables and response probabilities does not have much effect on the value of the bias indicator. In the MAR situation the correlation is at its maximum, but even then the corresponding wB is not really different from that in simpler models, such as model b. It definitely has more bias than in the case of model k, which is highly correlated with Y.

What the graphs show less convincingly is the negative effect on the value of variance indicator wS that the addition of auxiliary variables to the model should display, if these variables have no or hardly any correlation with the target variable. See, for example, models n and o, which compared with model k possess additional variables that do not correlate with the target variable. Even then, values of wS are about the same for all three models. It is only in the MAR response model that wS is higher for models n and o, though not remarkably so.

How do variance and bias indicators correspond with variance and bias themselves?

The variance indicator wV and the bias indicator wB are not direct estimators of the variance V and bias B of the regression estimator. The bias indicator wB, for example, estimates a bias ratio2, not bias itself. This section explores how wV and wB correspond with V and B. Do low values for wV and for wB also imply low values for V and B? This will be checked in a simulation experiment applied to the imaginary population.

For each of the fifteen models 10,000 samples are taken from the imaginary population according to the sampling design, and for each separate response model the corresponding set of respondents is made up. In this way it is possible to estimate the variance V and bias B of the estimator of population total tY of the target variable Y.

Variance V and bias B of the estimator t̂Y of the population total tY are defined as:

\[
V(\hat t_Y) = E\left(\hat t_Y - E\,\hat t_Y\right)^2, \qquad B(\hat t_Y) = E\,\hat t_Y - t_Y \tag{B.4}
\]

2 The bias indicator wB is a ratio with, in the numerator, an upper bound for the absolute value of the bias of the regression estimator and, in the denominator, an upper bound for the absolute value of the bias of an estimator based only on initial weights.

Estimator V̂ of variance V and estimator B̂ of bias B are found by using the calculation results of 10,000 simulations:

\[
\hat V(\hat t_Y) = \frac{1}{10{,}000} \sum_{i=1}^{10{,}000} \left( \hat t_{Y,i} - \frac{1}{10{,}000} \sum_{j=1}^{10{,}000} \hat t_{Y,j} \right)^2, \qquad
\hat B(\hat t_Y) = \frac{1}{10{,}000} \sum_{i=1}^{10{,}000} \hat t_{Y,i} - t_Y \tag{B.5}
\]

with t̂Y,i the estimator of the population total based on the calculations for the set of respondents i = 1, ..., 10,000, and tY the actual population total for target variable Y as measured in the imaginary population.
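The simulation estimators of variance and bias are simple averages over the replicated estimates. A minimal sketch, with a hypothetical function name and four made-up estimates standing in for the 10,000 simulated ones:

```python
def monte_carlo_variance_and_bias(estimates, true_total):
    """Simulation estimates of variance and bias of an estimator of a total.

    `estimates` holds one estimated population total per simulated set of
    respondents; `true_total` is the actual total in the population.
    """
    n = len(estimates)
    mean_estimate = sum(estimates) / n
    variance = sum((e - mean_estimate) ** 2 for e in estimates) / n
    bias = mean_estimate - true_total
    return variance, bias


# Made-up illustration values, not results from the report.
variance, bias = monte_carlo_variance_and_bias(
    [12_400, 12_380, 12_390, 12_370], true_total=12_382)
print(variance, bias)
```

Note that the variance is divided by the number of simulations rather than by that number minus one, following the formula as given; with 10,000 replications the difference is negligible.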

When comparing the values of the indicator wB and those of the estimator B one should realise that wB gives an upper bound for bias for any possible nonresponse mechanism. In other words, if wB is an upper bound for bias in the situation of, for example, MAR, it will also be an upper bound for bias with MCAR, NMAR and any other nonresponse situation. Estimator B, on the other hand, is a direct estimator of bias and is therefore dependent on the nonresponse circumstances.

In Figure B.2 it is clear that there is a strong correlation between wB and B for nearly every model in the MAR and NMAR response situations. It is not surprising that with MCAR bias B is almost negligible, as it should be by definition. This is not the case for indicator wB, which means that in the MCAR situation indicator wB and the estimator for bias, B, do not correlate.

The intention of the simulations was to show that there is a basis for the statement, made in section 3.4.2, that a low value for bias indicator wB not only has positive implications for the absolute value of the upper bound of the bias, but also that bias is reduced. However, it is important not to take the value of the indicator too literally. For example, a value of 0.84 for wB, as with model k in the NMAR situation, implies that for an estimator based on auxiliary information the upper bound for the absolute value of bias is 16 percent less than for one which does not use auxiliary information. The decline of bias B itself, dropping from 5,600 (model *) to 4,000 (model k), or 29 percent in relative terms, is much larger.
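The two percentages can be checked with a line of arithmetic. The worked step below only restates the figures quoted in the text (0.84, 5,600 and 4,000):

\[
1 - w_B = 1 - 0.84 = 0.16, \qquad \frac{5{,}600 - 4{,}000}{5{,}600} = \frac{1{,}600}{5{,}600} \approx 0.286
\]

So the indicator points to an upper bound that is 16 percent lower, while the estimated bias itself falls by about 29 percent.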

What is also notable in the MAR part of the same graph is the position of models b and l. Their estimated bias B is rather low, as with the models k, m, n and o. An important difference is that wB is low (below 0.85) for k, m, n and o, but much less so (around 0.95) for b and l. So, although there is not much reduction in wB, bias B decreases considerably because of the strong correlation between response probabilities and auxiliary variables in this specific MAR situation. Compared with the models k, m, n and o, the auxiliary variables in models b and l have a lower correlation with the target variable. It is therefore not surprising that wB is reduced by less where models b and l are concerned, as it was argued before that the correlation between auxiliary variables and the target variable is the dominant factor in the formula of the upper bound of bias. The message of all this is that if there is any reason to assume an MAR response situation, and not an NMAR3 one, one should search for a model that takes into consideration not only the correlation between auxiliary variables and target variable, but also the correlation with the response probabilities.

Figure B.2. Average values of bias indicator wB and B for fifteen models a to o (response situations MCAR, MAR and NMAR) and a model based only on initial weights (indicated with a star symbol).

As discussed before, the value range of bias depends strongly on the nonresponse situation. With MCAR bias should be negligible. With MAR an appropriate weighting model should be capable of adjusting for bias. With NMAR bias will always remain, whatever model based on the known background variables is used. All these facts are confirmed in the graph. With MCAR the value of the bias is between -4 and 4. With MAR the appropriate model is able to reduce bias from some 1,500 persons to no more than 4 persons. In the NMAR situation bias remains larger than 4,000 persons, which is rather a lot in a total of 12,618 persons.

In the last graph it is immediately noticeable that wS, the square root of wV, correlates strongly with S(tY) = √V(tY) for each of the three response models. In the NMAR situation adding auxiliary variables with no correlation with the target variable may have a slight negative effect on S; compare for example models o and k. For these models there is no difference in the value of wS, although the estimated standard error S is two persons higher in model o. In the MAR situation S is more severely affected by adding irrelevant variables, as can be seen with model o, where the estimated standard error increases by some 15 persons.

3 In practice it will never be known exactly which nonresponse mechanism occurs. At Statistics Netherlands it is customary to assume an MAR nonresponse situation.


Figure B.3. Average values of variance indicator wS = √wV and estimates for standard error S(tY) = √V(tY) for fifteen models a to o (response situations MCAR, MAR and NMAR) and a model based only on initial weights (indicated with a star symbol).