
Improvement in the use of administrative data sources (ESS.VIP ADMIN WP6 Pilot surveys and their applications) Agreement no. 07112.2016.004-2016.595

Annex No.4

Theoretical solutions - methods of integrating data from different sources (deterministic, stochastic) based on best solutions developed and identified for selected variables (marital status).


Table of contents

Introduction
1. Data linkage procedure
1.1. Sources of data
1.1.1. Analysis of information resources in PBSSP
1.2. Outline of data linkage
1.2.1. Representative surveys used in the project
1.2.2. Integration of data using PESEL
1.2.3. Data integration using a created key
1.2.4. Method of integrating representative surveys
2. Data quality
2.1. Standardisation of data before integration
2.1.1. Standardisation of register data
2.1.2. Standardisation of data in representative surveys
2.2. Quality of administrative registers
2.2.1. Marital status register and information systems on divorce and separation rulings
2.2.2. The National System for Monitoring Family Benefits and the Central Register of Data on Beneficiaries of the Alimony Fund
3. Quality assessment of the integrated data sets
3.1. Quality of survey data integration
3.1.1. Quality of the linkage between the PESEL register and the surveys
3.1.2. Detailed analysis of linkage quality
4. Estimation of the actual marital status using the integrated sets
4.1. Determination of the actual marital status
4.1.1. Description of the algorithm
4.1.2. Analysis of the undetermined marital status
4.2. Estimation of the actual marital status
4.2.1. Post-stratification of the actual marital status
4.2.2. Analysis of actual marital status estimates
Summary
List of figures
List of tables
A Detailed tables - Information on the actual marital status


Introduction

This report summarises the work carried out under the following tasks:

1. Preparation of theoretical solutions – methods of integrating data from different sources (deterministic and stochastic integration methods) based on best solutions developed and identified (for selected variables).

2. Preparation of data imputation methods using information from administrative registers and surveys.

3. Description of methods of calibrating data.

4. Testing of methods – empirical, pilot application of the developed methods of integrating data from selected sources.

The main task was to estimate the actual marital status on the basis of existing, available sources of data. The legal marital status has the following categories: (1) single; (2) married; (3) surviving spouse; (4) divorced; (5) undetermined. For the actual marital status, two additional categories were distinguished: (6) partner (cohabitation) and (7) (legally) separated.

A number of assumptions were adopted in the project, which are reflected in this final report. The most important are:

- the reference date is 30.04.2017 (the day of the last update of the PESEL register that was made available for the survey),
- the target population is the population aged 15+ (Polish citizens),
- the empirical study was conducted only for data from Wielkopolskie Province, and the results are presented at the level of districts (LAU 1 level),
- both deterministic and probabilistic linkage were used; these are described in detail in a separate report dedicated to the review of literature on data integration methods,
- the datasets which were made available contained only the necessary variables,
- the issue of determining the current place of residence on the basis of registers was not addressed,
- the PESEL register was used as the main register, to which other information was linked,
- the data linkage process was divided into three stages:
    - the first stage involved the linkage of available data from representative surveys carried out by official statistics,
    - the second stage involved the linkage of information from administrative registers which were made available during the project works,
    - during the third stage, data obtained in the first and the second stage were linked to the PESEL register.
  This process is described in detail in Section 1.2 and illustrated in Figure 1.
- the legal marital status was determined using data from the PESEL register, which was updated with information from other registers and two representative surveys, taking into account the time of update (for registers) or the interview date (for surveys),
- the actual marital status was determined using two sources:
    - administrative registers – e.g. the "cohabiting partner" status,
    - representative surveys – e.g. the degree of kinship with the household head, and information about the type of relationship and separations,
- the "undetermined" category was treated as missing data and handled by reweighting.

The datasets combined during the project works contained only the relevant variables required to complete the project task, namely to estimate the actual marital status of the target population. All calculations were performed using the R statistical package and the RStudio environment.


The current version of the report consists of four chapters, which describe the procedure of data linkage (Chapter 1), input data quality (Chapter 2), the quality of linking data from representative surveys and the PESEL register, as well as selected results concerning the marital status (Chapter 3), and the estimation of the marital status (Chapter 4). The authors are aware of the limitations resulting from the sole reliance on the PESEL register as the main source to which other registers and representative surveys were linked, but the approach presented in the report was based on the sets that had been made available. Since the main goal of the project was to conduct integration and estimation on the basis of several data sources, the researchers focused mainly on these aspects.


1. Data linkage procedure

1.1. Sources of data

1.1.1. Analysis of information resources in PBSSP

Preparation for the completion of tasks planned in the VIP ADMIN project was preceded by an in-depth analysis of both the statistical survey program of official statistics (PBSSP) and the statistical analysis programme (POS). The purpose of this analysis was to identify datasets containing information about the marital status. Given the identifier of a survey and the datasets used for its implementation, it was possible to browse the catalogue of datasets of the National Statistical Information Data Records (ISODS) and locate the appropriate dataset with its ISODS identifier, provided that it existed in the resources. As a result, it was possible to compile a list of datasets that could be useful for the tasks under the VIP ADMIN project (Table 1). The list contains information about the source number in the POS, the data administrator, the location of the dataset and the reference date for the data.

Table 1. List of identified data sets containing information on the marital status.

Data source number (PBSSP) | Data administrator | Dataset description from PBSSP 2016 | Identification information | Data as at
1.21.01-01-16 | registry offices | vital records (births) | Data maintained by Statistical Office in Olsztyn | 2010-12-31, 2011-12-31, 2012-12-31, 2013-12-31, 2014-12-31, 2015-12-31, 2016-12-31
1.21.02-01-16 | registry offices | vital records (marriages) | Data maintained by Statistical Office in Olsztyn | 2010-12-31, 2011-12-31, 2012-12-31, 2013-12-31, 2014-12-31, 2015-12-31, 2016-12-31
1.21.02-03-16 | Ministry of Digitisation | PESEL register | ISODS-1359, ISODS-1167 | 2017-02-16, 2015-12-31
1.21.02-04-16 | district courts | information systems containing divorce and separation rulings | Data maintained by Statistical Office in Olsztyn | 2010-12-31, 2011-12-31, 2012-12-31, 2013-12-31, 2014-12-31, 2015-12-31, 2016-12-31
1.21.03-03-16 | commune offices | registers of inhabitants and resident registers of foreigners | Data maintained by Statistical Office in Olsztyn | 2010-12-31, 2011-12-31, 2012-12-31, 2013-12-31, 2014-12-31, 2015-12-31, 2016-12-31
1.21.09-01-16 | registry offices | vital records (deaths) | Data maintained by Statistical Office in Olsztyn | 2010-12-31, 2011-12-31, 2012-12-31, 2013-12-31, 2014-12-31, 2015-12-31, 2016-12-31
1.21.09-03-16 | Ministry of Digitisation | PESEL register | ISODS-1360, ISODS-1166 | 2017-02-16, 2015-12-31
1.25.11-01-16 | Ministry of Family, Labour and Social Policy | National System for Monitoring Social Assistance | ISODS-1203, ISODS-1204, ISODS-1205, ISODS-1206, ISODS-1399, ISODS-1400, ISODS-1401, ISODS-1402 | 2015-03-31, 2015-06-30, 2015-09-30, 2015-12-31, 2016-03-31, 2016-06-30, 2016-09-30, 2016-12-31
1.25.11-04-16 | district labour offices | registers of unemployed and job-seekers | pup2015_osoby, pup2016_osoby | 2015-12-31, 2016-12-31
1.25.15-01-16 | Ministry of Family, Labour and Social Policy | National System for Monitoring Family Benefits | ISODS-1297, ISODS-1298, ISODS-1299, ISODS-1300, ISODS-1403, ISODS-1404, ISODS-1405, ISODS-1406 | 2015-03-31, 2015-06-30, 2015-09-30, 2015-12-31, 2016-03-31, 2016-06-30, 2016-09-30, 2016-12-31
1.25.15-04-16 | Ministry of Family, Labour and Social Policy | Central register of data on alimony fund beneficiaries | ISODS-1274, ISODS-1275, ISODS-1276, ISODS-1277, ISODS-1407, ISODS-1408, ISODS-1409, ISODS-1410 | 2015-03-31, 2015-06-30, 2015-09-30, 2015-12-31, 2016-03-31, 2016-06-30, 2016-09-30, 2016-12-31
1.80.02-01-16 | The Ministry of Digitisation | PESEL register | ISODS-1244, ISODS-1434 | 2015-12-31, 2017-05-04


The next stage of work in the VIP ADMIN project consisted in evaluating the collected datasets for their suitability for integration. After the content of all datasets had been analysed, only those resources were selected that included the required variables (PESEL identifier, date of birth, full address data), which are essential in order to create a database containing information on the marital status. Information resources which meet these criteria include:

- Vital records:
    - marriages,
    - births,
    - deaths,
- Information systems containing divorce and separation rulings,
- The National System for Monitoring Family Benefits,
- The Central Register of Data on Beneficiaries of the Alimony Fund,
- The PESEL Register.

The key dataset in the list above is the PESEL register, to which information from the other resources is to be linked. These resources can be divided into two types: datasets containing the PESEL variable (a unique identification number for each person) and resources without this variable. The first group includes vital records and information systems with divorce and separation rulings.

The second group includes the National System for Monitoring Family Benefits and the Central Register of Data on Beneficiaries of the Alimony Fund. Depending on which group a given dataset belongs to, it is linked either using the PESEL identifier or by means of a key created by combining the address and the date of birth, assuming exact linkage (an exact match on all identifiers).


1.2. Outline of data linkage

1.2.1. Representative surveys used in the project

As in the case of administrative registers, the first stage was to identify potential representative surveys which contain information on the legal marital status and the actual status (e.g. informal relationship). After analysing the PBSSP and POS, the following representative surveys were selected: the Household Budget Survey (HBS), the European Union Statistics on Income and Living Conditions (EU-SILC), the Labour Force Survey (LFS) and the Household Condition Survey (HCS). Table 2 contains information about these data sets.

Table 2. List of surveys used for the purposes of the VIP ADMIN project

Data source number (PBSSP) | Data administrator | Dataset description | Identification information | Reference years
1.25.01(063) | Central Statistical Office, Social Surveys and Living Conditions Department | HBS (data on housing and population) | None; the appropriate department needs to be contacted | 2015–2016
1.25.08(066) | Central Statistical Office, Social Surveys and Living Conditions Department | EU-SILC (data on housing and population) | None; the appropriate department needs to be contacted | 2015–2016
1.25.02(064) | Central Statistical Office, Social Surveys and Living Conditions Department | Condition of households (data on housing and population) | None; the appropriate department needs to be contacted | 2015–2016
1.23.01(043) | Central Statistical Office, Social Surveys and Living Conditions Department | LFS (data on housing and population) | None; the appropriate department needs to be contacted | 2015–2016

In keeping with the goal of the project, the following questions addressed to members of households were identified as relevant (original wording of the questions, numbers of forms from http://forms.stat.gov.pl/BadaniaAnkietowe/2017/harmonogram.htm):

1. With regard to the degree of kinship with the household head:
- Degree of kinship or relationship with the reference person (HBS BR-01a),
- Degree of kinship or relationship with the household head (EU-SILC EU-SILC-G),
- Degree of kinship with the household head (LFS ZG).


2. With regard to the informal relationship:
- Are you married to or in an informal relationship with a person in this household? (HBS BR-01a),
- Do you live in a relationship with a person in this household? (EU-SILC EU-SILC-G),
- Do you live in a relationship with a person in this household? (HCS).

3. With regard to the legal status – the legal marital status (HBS BR-01a, EU-SILC EU-SILC-G, HCS, LFS ZG) – it should be noted that, in most cases, the legal marital status contains information on legal separation, which is a category of the actual marital status.

The answer options for the questions addressed to respondents differed between surveys and had to be standardised before integration. The standardisation procedure is described in section [marital-status-coding].

The final report contains results of integrating data from two representative surveys – the Labour Force Survey (LFS) and the European Union Statistics on Income and Living Conditions (EU-SILC). In both cases, two separate datasets were obtained containing information about sampled dwellings, individual persons and households.

It should be noted that information about informal relationships refers only to members of the household. To our knowledge, current representative surveys contain no questions concerning informal relationships with persons outside the household. The only exception is the Social Cohesion Survey, but it cannot be used to update administrative sources because it is not conducted every year.

To make the data linkage method easier to understand, it is illustrated in a schematic diagram shown in Figure 1. The diagram includes information from both representative surveys and available registers.


Figure 1. Schematic diagram of linking administrative registers and representative surveys, where PESEL is the main register. Continuous lines denote deterministic linkage (using a linkage key), and dashed lines denote linkage using an artificial key or probabilistic linkage.

The first stage of linkage involved the creation of one database from the individual datasets. For example, the EU-SILC and the LFS were combined into one set containing all respondents for whom information on the legal and actual marital status was available. Such a linkage is acceptable, since the likelihood that a given person is present in both surveys is low and individual persons cannot be identified. We should note that for both surveys it was first necessary to link information about sampled dwellings and respondents.

The same approach was adopted in the case of registers: integrated databases were created combining information from all the datasets obtained for the purposes of the project. Both the Central Register of Data on Beneficiaries of the Alimony Fund and the National System for Monitoring Family Benefits had been made available as sets of quarterly data. In each case, the quarterly datasets were combined into one set, which was then deduplicated. Similar procedures were applied in the case of sets from the marital status register and information systems maintained by district courts with data for the years 2010–2016.

After creating datasets containing different registers and surveys (labelled as databases in the figure), individual records were linked. In the case of registers containing PESEL, deterministic linkage was used (continuous line); other datasets were linked using an artificial key or probabilistic linkage (dashed line).

1.2.2. Integration of data using PESEL

The first registers to be integrated were those that include the PESEL variable, which is crucial for deterministic data integration. These registers contained data for a period of seven years, from 2010 to 2016. Because the PESEL variable was missing for the relevant categories (spouses, persons in separation, divorcing persons, parents and surviving spouses) in some years, only data for selected years were used:

- marriages contracted in the years 2011–2016,
- divorces in 2016,
- separations in 2016,
- births in the years 2011–2016,
- deaths in the years 2015–2016.


After deciding that resources containing information on the marital status could be used in the project, it was necessary to determine the marital status for each person included in the selected datasets. Datasets spanning more than one year were first merged; further actions were identical for all data sets. Records for which PESEL was missing or did not consist of eleven digits were rejected. Then, five separate strata were created, each with a unique PESEL number and information about the date of a specific event assigned to it (marriage, divorce, birth of a child, separation and death of the spouse). Next, the latest date of an event was selected from each stratum: for persons marrying more than once in the years 2011–2016, it was the latest date of marriage; for persons who got divorced more than once, the latest date of divorce; for persons in separation, the latest date of separation; for persons who became parents, the latest date of birth and the parent’s marital status at the time of the child’s birth; and for datasets with "deaths", the latest date of the spouse’s death. This procedure was necessary because the same person could have changed marital status many times in the reference period.

In this procedure it is important to correctly determine which event was the last one, and thus properly assign the marital status to a given person. For this purpose, a unique PESEL stratum was created, consisting of the five strata distinguished earlier, and then the PESEL identifier was used to assign the information collected in the previous steps, i.e. the date and the type of event. As a result, a marital status database was created, containing "the history of events" recorded for each person, collected from the marital status register and information systems containing divorce and separation rulings. This set contained eleven variables: the PESEL identifier, five variables representing the occurrence of a given event (marriage, divorce, birth of a child, etc.) and five others containing the respective dates of each occurrence. One record (person) could contain information from up to five sources (marriages, divorces, births, deaths, separations). If a given event did not occur for a given person, there was no entry and the field was marked as missing data. In the next step, the stratum described above could be combined with the PESEL register.
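The steps described above can be illustrated with a minimal sketch, written in Python for illustration only (the report's own calculations were done in R). The record layout, the event labels and the function names are assumptions made for this example and do not come from the project code; the PESEL value is invented.

```python
from datetime import date

# Hypothetical labels for the five event types named in the text.
EVENT_TYPES = ("marriage", "divorce", "separation", "birth", "death_of_spouse")


def is_valid_pesel(pesel):
    """Accept only identifiers that are present and consist of eleven digits."""
    return (
        isinstance(pesel, str)
        and len(pesel) == 11
        and pesel.isascii()
        and pesel.isdigit()
    )


def build_event_history(records):
    """Keep the latest date of each event type for every valid PESEL.

    `records` is an iterable of (pesel, event_type, event_date) tuples.
    Returns {pesel: {event_type: latest_date}}.
    """
    history = {}
    for pesel, event_type, event_date in records:
        if not is_valid_pesel(pesel):
            continue  # rejected: missing PESEL or not eleven digits
        events = history.setdefault(pesel, {})
        previous = events.get(event_type)
        if previous is None or event_date > previous:
            events[event_type] = event_date
    return history


def to_wide_row(pesel, events):
    """One record per person: PESEL, five occurrence flags, five dates.

    Events that did not occur are left as None (missing data).
    """
    row = {"pesel": pesel}
    for event_type in EVENT_TYPES:
        row[event_type + "_occurred"] = True if event_type in events else None
        row[event_type + "_date"] = events.get(event_type)
    return row


def last_event(events):
    """Return (event_type, date) of the most recent event, or None if empty."""
    if not events:
        return None
    return max(events.items(), key=lambda item: item[1])
```

For a person with marriages recorded in 2012 and 2014 and a divorce in 2016, `build_event_history` keeps the 2014 marriage and the 2016 divorce, and `last_event` returns the divorce as the event that determines the status.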


1.2.3. Data integration using a created key

The purpose of the next stage of works was to integrate the PESEL register with other administrative registers which do not have the PESEL variable but include information on the date of birth and full address data (TERYT code of the commune of residence (LAU 2 level), name of locality, street, building/dwelling number). Datasets containing such a set of variables can be integrated using a key consisting of a combination of variables. The following datasets were taken into account:

- The National System for Monitoring Family Benefits,
- The Central Register of Data on Beneficiaries of the Alimony Fund.

The next step consisted in preparing the datasets, that is, first of all, standardising and transforming the variables to be used as the linkage key. First, it was necessary to determine the unique population of persons receiving family benefits and benefits from the alimony fund. As already mentioned, this information is collected quarterly, which means that the number of beneficiaries varied between quarters. Some persons received financial assistance in all quarters of the two reference years, while others only in some quarters. In order to select the unique population of persons, data from all quarters of the years 2015–2016 were combined into one dataset (separately for persons receiving family benefits and alimony benefits).

Table 3. The KSMŚR database - created using data from the National System for Monitoring Family Benefits for the years 2015–2016

Year | Quarter | Number of persons | Share of persons in the general population of the created set (%)
2015 | 1 | 19 590 | 2.42
2015 | 2 | 26 358 | 3.26
2015 | 3 | 32 710 | 4.04
2015 | 4 | 64 487 | 7.97
2016 | 1 | 34 865 | 4.31
2016 | 2 | 33 383 | 4.13
2016 | 3 | 45 976 | 5.68
2016 | 4 | 551 654 | 68.19
Total | | 809 023 | 100.00
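The share column of Table 3 is the number of persons in a quarter divided by the total and multiplied by 100. The short Python check below recomputes it from the counts in the table; the variable names are chosen for this illustration only.

```python
# Counts of persons per (year, quarter), copied from Table 3.
ksmsr_counts = {
    (2015, 1): 19590,
    (2015, 2): 26358,
    (2015, 3): 32710,
    (2015, 4): 64487,
    (2016, 1): 34865,
    (2016, 2): 33383,
    (2016, 3): 45976,
    (2016, 4): 551654,
}

# Total number of persons in the created set.
ksmsr_total = sum(ksmsr_counts.values())

# Percentage share of each quarter, rounded to two decimals as in the table.
ksmsr_shares = {
    period: round(100 * count / ksmsr_total, 2)
    for period, count in ksmsr_counts.items()
}
```

The recomputed total is 809 023 and the share for the fourth quarter of 2016 is 68.19%, in agreement with the table.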


Table 4. The FA database - created using data from the Central Register of Data on Beneficiaries of the Alimony Fund for the years 2015–2016.

Year | Quarter | Number of persons | Share of persons in the general population of the created set (%)
2015 | 1 | 2 080 | 1.59
2015 | 2 | 9 585 | 7.33
2015 | 3 | 7 407 | 5.67
2015 | 4 | 3 850 | 2.94
2016 | 1 | 3 770 | 2.88
2016 | 2 | 3 368 | 2.58
2016 | 3 | 8 483 | 6.49
2016 | 4 | 92 197 | 70.52
Total | | 130 740 | 100.00

During the next stage a personal identifier was created, consisting of the date of birth, TERYT code of the commune of residence, name of locality, street and building/dwelling number. Then, using the identifier, the researchers selected only those persons who occurred only once in particular quarters of a given year. From this database, the researchers then selected unique persons with information on the marital status in the latest quarter. For instance, if a given person occurred in all quarters, then the latest information on their marital status was the entry for the 4th quarter of 2016, which is the record that was selected. By applying this procedure, two data sets were created: one with information from the National System for Monitoring Family Benefits (the KSMŚR database) and the other with information from the Central Register of Data on Beneficiaries of the Alimony Fund (the FA database), each containing only information on the marital status of persons and the variables necessary to link the datasets. The number of persons in a given quarter and year, and their percentage share in the total number of persons for both sets, are presented in Table 3 and Table 4.
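The selection of one record per person can be sketched in Python as follows. This is one possible reading of the procedure, not the project's implementation: the field names are hypothetical, the key variables are assumed to be already standardised, and records whose identifier occurs more than once within a quarter are simply skipped for that quarter.

```python
from collections import Counter

# Hypothetical field names for the five variables that form the identifier.
KEY_FIELDS = ("birth_date", "teryt", "locality", "street", "building_dwelling")


def make_key(record):
    """Build the personal identifier as a tuple of the five key variables."""
    return tuple(record[field] for field in KEY_FIELDS)


def latest_unique_records(records):
    """Return {identifier: record from the latest quarter}.

    `records` is a list of dicts with the key fields plus "year",
    "quarter" and "marital_status". An identifier that occurs more than
    once in a quarter is treated as ambiguous for that quarter and its
    records from that quarter are ignored (an assumption of this sketch).
    """
    occurrences = Counter(
        (make_key(r), r["year"], r["quarter"]) for r in records
    )
    latest = {}
    for r in records:
        key = make_key(r)
        period = (r["year"], r["quarter"])
        if occurrences[(key, r["year"], r["quarter"])] != 1:
            continue  # identifier not unique within this quarter
        current = latest.get(key)
        if current is None or period > (current["year"], current["quarter"]):
            latest[key] = r
    return latest
```

A person present in every quarter is represented by the record for the 4th quarter of 2016, which matches the example given in the text.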

The next step was the integration of data from the sets obtained as a result of the above procedures (the KSMŚR and FA databases) with the PESEL dataset. These datasets were linked using a linkage key consisting of the following variables: date of birth, TERYT code of the commune of residence, name of locality, street, building/dwelling number. However, before the datasets could be integrated, these variables had to be standardised and harmonised to remove inconsistencies in TERYT codes and names of localities between the datasets, which are described in Section 2.1.1.

1.2.4. Method of integrating representative surveys

The following statistical datasets were made available, some of which were used during the first

stage of the project:

LFS – data for persons (2015–2016), data for dwellings (2015–2016),

HCS – data for persons (2015–2016), data for dwellings (2015–2016),

EU-SILC – data for persons (2015–2016), data for dwellings (a part of 2015 and 2016),

HBS – data for persons (2015–2016), data for dwellings (2016).

Some respondents in the EU-SILC dataset, who had participated in the survey multiple times,

may have had a different address of residence in 2016, compared to 2015. The EU-SILC

database contained the current address of residence of respondents updated for 2016. This

means that there was no full information about the history of residence for 2015.

The HBS dataset contained data for dwellings only for 2016 because, according to the administrator, data for 2015 were not recorded in electronic form. As a result, the dataset had to be limited to persons surveyed in 2016.

In the case of the HCS survey, there was a problem with the identifiers of dwellings: the dataset containing addresses for one year included dwellings and households with non-unique identifiers. For this reason, no attempt was made to integrate the HCS dataset with other sources at this stage; only the LFS and EU-SILC datasets were used. Table 5 presents the sample sizes for Wielkopolskie province in each of these two surveys.

Table 5. Realised sample size in Wielkopolskie Province in LFS and EU-SILC (unique records)

Year | Set | LFS | EU-SILC
2015 | persons | 10 344 | 2 520
2015 | dwellings | 4 269 | 815
2016 | persons | 8 566 | 2 442
2016 | dwellings | 3 642 | 810

Because of missing identifiers in the representative surveys, it was necessary to identify

variables that could be used to link information with other sources. Such variables can be

divided into two groups – variables related to dwellings and those related to persons. Variables

in the first group indicate which dwellings were sampled; variables in the second one describe

person characteristics which were used for linkage.

The first set included (1) commune code, (2) street name, (3) building number and (4) dwelling number. Street names were standardised by removing abbreviations such as "ul.", "Al." or "os." and then converted to lowercase so that letter case would not affect the linkage process. Variables relating to persons included (1) sex, (2) year of birth, (3) month of birth and (4) day of birth. These data were also standardised, e.g. numbers of months were recorded as integers instead of text strings (e.g. '01', '02').
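The standardisation of street names and months can be sketched as follows. This is a minimal illustration: the function names and the regular expression are assumptions, and only the three abbreviations named in the text are handled.

```python
import re

# Leading abbreviations named in the text ("ul.", "Al.", "os."); the pattern
# itself is an illustrative assumption, not the project's actual rule set.
_PREFIX = re.compile(r"^\s*(ul\.|al\.|os\.)\s*", flags=re.IGNORECASE)


def standardise_street(name):
    """Drop a leading abbreviation, collapse spaces and convert to lowercase."""
    if name is None:
        return None
    cleaned = _PREFIX.sub("", name)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # An empty result is treated as a missing street name.
    return cleaned.lower() or None


def standardise_month(value):
    """Record the month as an integer, so that '01' and 1 compare equal."""
    return int(value)
```

With this sketch, "ul. Święty Marcin" and "święty  marcin" are both reduced to the same string, so differences in prefix, spacing or letter case no longer prevent a match.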

The representative surveys required probabilistic linkage, which consists in pairwise comparison

of records from a larger set with those from a smaller set. To reduce the number of

combinations, the following information was used for blocking:

commune code,

name of locality,

building number,

dwelling number,

sex,

year of birth,

month of birth.
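Blocking on the variables listed above can be illustrated with a short Python sketch. The field names are hypothetical; the point is only that records are compared pairwise within a block, i.e. when all blocking variables agree, instead of comparing every survey record with every register record.

```python
from collections import defaultdict

# Hypothetical field names for the seven blocking variables listed above.
BLOCKING_FIELDS = (
    "commune", "locality", "building", "dwelling",
    "sex", "birth_year", "birth_month",
)


def block_key(record):
    """Tuple of blocking values; records are compared only if the keys agree."""
    return tuple(record[field] for field in BLOCKING_FIELDS)


def candidate_pairs(register, survey):
    """Yield (register_record, survey_record) pairs that share a block."""
    blocks = defaultdict(list)
    for rec in register:
        blocks[block_key(rec)].append(rec)
    for srv in survey:
        for rec in blocks.get(block_key(srv), []):
            yield rec, srv
```

Variables that may differ between the sources, such as the street name and the day of birth, are deliberately left out of the key and are compared afterwards within each candidate pair.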

It then turned out that street names and days of birth could differ between the sources. In the case of street names, the problem was addressed by comparing text distances with the Jaro-Winkler function. Such comparisons were necessary because some street names occurred in several variants, e.g. "powstańców wielkopolskich" and "powstańców wlkp.", "św. marcin" and "święty marcin" or "księdza serafina opałki" and "ks. serefina opałki". It should be noted that street names were iteratively standardised before the linkage procedure, which is why this variable is not included in the list of variables with possible differences. Unfortunately, given the number of possible variants and time constraints, we were not able to fully standardise addresses between the surveys and the PESEL register.
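For reference, the Jaro-Winkler similarity can be written in a few lines of Python. The sketch below uses the commonly quoted defaults (prefix scaling factor 0.1, prefix length capped at 4); the report does not state which parameters or which software implementation were used, so these values are assumptions.

```python
def jaro(s1, s2):
    """Jaro similarity of two strings, a value between 0 and 1."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and not too far apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings sharing a common prefix."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * prefix_scale * (1 - sim)
```

Applied to the variants quoted above, `jaro_winkler("św. marcin", "święty marcin")` yields a much higher similarity than a comparison of "św. marcin" with an unrelated street name, which is the property the linkage relies on.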

Inconsistencies in the day of birth can be attributed to errors made by interviewers while

entering imprecise information for a given respondent or entering their own value if the

respondent could not indicate the exact date of birth (e.g. there were records where all persons

from the household had the same date of birth).

2. Data quality

2.1. Standardisation of data before integration

2.1.1. Standardisation of register data

Standardisation of TERYT

The PESEL dataset contains separate TERYT codes for the urban and rural parts of urban-rural communes and for the separate districts of the five largest Polish cities: Warsaw, Kraków, Łódź, Wrocław and Poznań (Table 6). Standardisation of the TERYT code consisted in finding those records for which the code contained "4" or "5" in the last (seventh) position and replacing this last digit with "3". For example, in the PESEL register the town of Czempiń, i.e. the urban part of an urban-rural commune, has the code 3011024, while the rural part of the commune has the code 3011025. After standardisation, both territorial units receive the same general code denoting the urban-rural commune, namely 3011023. Codes for the districts of Warsaw, Kraków, Łódź, Wrocław and Poznań were changed in a similar way: the first four digits of the code remained unchanged, while the last three were replaced with "011". Poznań, for example, consists of five districts: Poznań-Grunwald (3064029), Poznań-Jeżyce (3064039), Poznań-Nowe Miasto (3064049), Poznań-Stare Miasto (3064059) and Poznań-Wilda (3064069). The first four digits were retained ("3064") and the remaining three were replaced with "011", so that all districts of Poznań received the general code for the city, i.e. 3064011. The same procedure was applied to the other four cities.
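The two TERYT rules can be expressed as a short Python function. This is a sketch under stated assumptions: the city prefixes are taken from the standardised city codes quoted in this report (0264011, 1061011, 1261011, 1465011, 3064011), and codes are assumed to be stored as seven-character strings (padding with a leading zero is added in case they were read as numbers).

```python
# Four-digit prefixes of Wrocław, Łódź, Kraków, Warsaw and Poznań, taken
# from the standardised city codes quoted in the report.
CITY_PREFIXES = {"0264", "1061", "1261", "1465", "3064"}


def standardise_teryt(code):
    """Map district and urban/rural-part codes to the general commune code."""
    code = str(code).zfill(7)  # assumption: a leading zero may have been lost
    if code[:4] in CITY_PREFIXES:
        # City districts: keep the first four digits, replace the rest.
        return code[:4] + "011"
    if code[-1] in ("4", "5"):
        # Urban (4) or rural (5) part of an urban-rural commune (3).
        return code[:-1] + "3"
    return code
```

For the examples from the text, both `standardise_teryt("3011024")` and `standardise_teryt("3011025")` give "3011023", and `standardise_teryt("3064029")` gives "3064011".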

In the datasets based on the Central Register of Data on Beneficiaries of the Alimony Fund (the FA database, Table 7) and the National System for Monitoring Family Benefits (the KSMŚR database, Table 8), TERYT standardisation involved creating a TERYT commune identifier from the NTS code stored in the columns nts_kod_ALIM and nts_kod_QUIC: two digits starting from the second position were combined with five digits starting from the sixth position.
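The extraction of the TERYT commune identifier from the NTS code amounts to two string slices. The function name below is hypothetical; the example values are the NTS codes shown in Table 7 and Table 8.

```python
def nts_to_teryt(nts_code):
    """Build the 7-digit TERYT commune code from an NTS code string.

    Positions are 1-based in the text: two digits from position 2
    and five digits from position 6.
    """
    return nts_code[1:3] + nts_code[5:10]
```

For example, `nts_to_teryt("5306264011PP")` returns "3064011", the standardised code for Poznań.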

Table 6. The TERYT commune code and the name of the locality in the PESEL register

adr_meld_kod_gmn_PESEL adr_meld_naz_msc_PESEL

2261011 GDAŃSK

2605085 WÓLKA ZYCHOWA

1213072 POLANKA WIELKA

2465011 DĄBROWA GÓRNICZA

1437015 KOCEWO

2207011 KWIDZYN

1425011 PIONKI

1061069 ŁÓDŹ-WIDZEW

3064029 POZNAŃ-GRUNWALD

1061069 ŁÓDŹ-WIDZEW

2404042 WANATY

1009014 DZIAŁOSZYN

3211044 POLICE

1219022 WINIARY

0219084 ŻARÓW

0264029 WROCŁAW-FABRYCZNA

1816025 BŁAŻOWA GÓRNA

Standardisation of names of localities

Names of localities recorded in the PESEL dataset for Warsaw, Kraków, Łódź, Wrocław and Poznań include the names of city districts, e.g. Poznań-Grunwald. To match the standardised TERYT codes (0264011, 1061011, 1261011, 1465011, 3064011), the corresponding names of localities were therefore changed to Wrocław, Łódź, Kraków, Warsaw and Poznań, respectively. In the values of the variables used in the linkage key, spaces were removed and capital letters were changed to lowercase.

Table 7. Standardised locality names and TERYT commune codes in the Central Register of Data on Beneficiaries of the Alimony Fund (the FA database)

adr_naz_msc_ALIM nts_kod_ALIM

KRAKOW 5126261011PP

ŁÓDŹ 5106261011PP

POZNAŃ 5306264011PP

WARSAW 5146265011PP

WROCŁAW 5026264011PP

Table 8. Standardised locality names and TERYT commune codes in the National System for Monitoring Family Benefits (the KSMŚR database)

adr_naz_msc_QUIC nts_kod_QUIC

KRAKOW 5126261011PP

ŁÓDŹ 5106261011PP

POZNAŃ 5306264011PP

WARSAW 5146265011PP

WROCŁAW 5026264011PP

The operations described above were performed for the PESEL dataset, the dataset containing information from the National System for Monitoring Family Benefits (the KSMŚR database) and the dataset with information from the Central Register of Data on Beneficiaries of the Alimony Fund (the FA database).

The variables (TERYT code and locality name) had to be standardised because the datasets can only be integrated if these variables have the same format in each of them. Without these three operations, any attempt to link the PESEL dataset with the other two would have resulted in a substantial loss of information and would have made it impossible to fully exploit the potential of these sets.

2.1.2. Standardisation of data in representative surveys

Standardisation of street and locality names

As in the case of administrative registers, it was also necessary to standardise TERYT codes and the names of streets and localities in the datasets with survey data.

Names of localities were standardised first, to match the names used in the PESEL register. Before street names were standardised, all abbreviations such as ul., Al. or os. were removed to match the format used in the PESEL register. Data in both sets were converted to lowercase, unnecessary spaces were removed, and NA values were entered where the street name was missing. In the course of this procedure, more than ten records were identified that existed only in the representative surveys and were not observed in the PESEL register. Table 9 presents examples of localities that appeared only in the LFS or the EU-SILC. In the table, the occurrence of a locality is denoted by YES, while the number in brackets indicates the number of records associated with that locality. The table is the result of, among other things, checking whether the TERYT code matches the locality or street name in the LFS and the EU-SILC when compared with the PESEL register. Problematic communes and localities were excluded from further analysis.

Table 9. List of localities observed only in the surveys (different TERYT code)

TERYT Locality LFS EU-SILC

3003105 piłka YES (1)

3003105 popielarze YES (1)

3006012 panienka YES (13)

3007085 szulec YES (8)

3010015 kolebki YES (5)

3010062 brzeźno-parcele YES (4)

3020032 czarnuszka YES (6)

3023052 dolina YES (5)

3027075 józinki YES (3)

3030012 janowo YES (2)

Coding of the marital status

All the above surveys contained information on legal marital status, which had been elicited from respondents using questions with the same wording (no measurement error). The answer options in the LFS and EU-SILC (and other surveys not considered in this report) differed, however. Table 10 presents the coding of marital status in the LFS and the EU-SILC. It should be noted that the EU-SILC distinguishes between divorced persons and those in a state of (legal) separation, whereas the LFS combines them. Because the codes for the levels (first column) also differed, they needed to be standardised. Table 11 presents sample sizes in the LFS and EU-SILC for the codes defined in Table 10 (before standardisation).

Table 10. Coding of legal marital status in LFS and EU-SILC

Code | LFS | EU-SILC
1 | single man, single woman | single man/woman
2 | married man, married woman | married man/married woman
3 | widower, widow | in legal separation
4 | divorced, in separation | widower/widow
5 | – | divorced man/woman
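One way to standardise the two coding schemes is to map both onto a common set of labels. The sketch below assumes the four harmonised categories used later in Table 23 (single, married, divorced / in separation, widowed); the dictionary and function names are hypothetical, and merging the EU-SILC codes 3 and 5 is an assumption made for illustration.

```python
# Survey-specific codes from Table 10 mapped to common labels.
LFS_TO_COMMON = {
    1: "single",
    2: "married",
    3: "widowed",
    4: "divorced / in separation",
}

EUSILC_TO_COMMON = {
    1: "single",
    2: "married",
    3: "divorced / in separation",  # in legal separation
    4: "widowed",
    5: "divorced / in separation",  # divorced
}

_MAPPINGS = {"LFS": LFS_TO_COMMON, "EU-SILC": EUSILC_TO_COMMON}


def harmonise_marital_status(code, survey):
    """Translate a survey-specific code into the common label."""
    try:
        return _MAPPINGS[survey][int(code)]
    except KeyError:
        raise ValueError(f"unknown code {code!r} for survey {survey!r}")
```

With this mapping, code 3 means "widowed" when it comes from the LFS but "divorced / in separation" when it comes from the EU-SILC, which is exactly the inconsistency that standardisation has to remove.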

Table 11. Size of LFS and EU-SILC sample for Wielkopolskie province by sex and marital status (combined unweighted samples, before standardisation)

Year Marital status Men Men (%) Women Women (%)

LFS

2015 1 2434 28.63 1913 19.81

2 5590 65.74 5697 58.98

3 296 3.46 1664 17.23

4 185 2.18 385 3.99

2016 1 2050 29.23 1595 20.40

2 4524 64.50 4581 58.59

3 262 3.74 1283 16.41

4 178 2.54 360 4.60

EU-SILC

2015 1 300 31.15 228 20.63

2 605 62.82 609 55.11

3 1 0.10 2 0.18

4 26 2.70 211 19.10

5 31 3.22 55 4.98

2016 1 276 29.21 197 18.67

2 608 64.34 609 57.73

3 1 0.11 3 0.28

4 31 3.28 192 18.20

5 29 3.07 54 5.12

It should be noted that the EU-SILC sample for Wielkopolskie province included only 3–4

respondents declaring legal separation. For this reason, results for this level should be

interpreted with caution.

There was a similar problem with respect to the variable defining the degree of kinship with the

household head. Even the question respondents were asked was different. In the LFS, it read:

"The degree of kinship with the head of the household" while in EU-SILC: "The degree of kinship

or relationship with the head of the household".

Answers in the LFS were more aggregated than those in the EU-SILC. For example, the single LFS category "grandfather, grandmother, granddaughter, grandson, great-grandson, great-granddaughter" corresponds to two EU-SILC categories, "grandfather, grandmother" and "granddaughter, grandson", while the LFS category "uncle, aunt, further relative" corresponds to "other relative" in the EU-SILC. Nonetheless, the level of most interest to us in the context of actual marital status, i.e. "partner", was coded in the same way in both surveys (Table 12).

Table 12. Coding of the degree of kinship with the head of the household in LFS and EU-SILC

Code | LFS | EU-SILC
1 | head of the household | head
2 | husband, wife | husband, wife
3 | partner | partner
4 | son, daughter | son, daughter
5 | son-in-law, daughter-in-law | father, mother
6 | father, mother, father-in-law, mother-in-law | father-in-law, mother-in-law
7 | grandfather, grandmother, granddaughter, grandson, great-grandson, great-granddaughter | grandfather, grandmother
8 | brother, sister | son-in-law, daughter-in-law
9 | uncle, aunt, further relative | brother, sister
10 | not related family member (e.g. home help) | granddaughter, grandson
11 | – | other relative
12 | – | other person

Table 13 presents the sample size broken down by survey, year and sex, together with information on whether or not the respondent is in a cohabiting union with the head of the household. The number of such respondents in both surveys is very small: they account for no more than about 2% of all respondents surveyed in the entire province.

Table 13. The size of LFS and EU-SILC samples for Wielkopolskie province by sex and information on whether the respondent’s partner is the household head (combined unweighted samples)

Survey | Year | Partner? | Men | Men (%) | Women | Women (%)
LFS | 2015 | NO | 8446 | 99.33 | 9548 | 98.85
LFS | 2015 | YES | 57 | 0.67 | 111 | 1.15
LFS | 2016 | NO | 6953 | 99.13 | 7666 | 98.04
LFS | 2016 | YES | 61 | 0.87 | 153 | 1.96
EU-SILC | 2015 | NO | 946 | 98.23 | 1088 | 98.46
EU-SILC | 2015 | YES | 17 | 1.77 | 17 | 1.54
EU-SILC | 2016 | NO | 927 | 98.10 | 1037 | 98.29
EU-SILC | 2016 | YES | 18 | 1.90 | 18 | 1.71

2.2. Quality of administrative registers

2.2.1. Marital status register and information systems on divorce and separation rulings

When analysing the collected datasets, one cannot ignore the problem of data quality in the context of the operations that are to be carried out on them. The quality of the input data in all the datasets affects the quantity and quality of the information obtained in the output. For datasets where the PESEL number is the linkage key, the most important requirement is that the number of missing values for this variable should be as small as possible. In addition, its format has to be checked to make sure that it consists of eleven digits. Only records that meet these two requirements can be used in further analysis. This verification was conducted for the datasets with information on marriages, divorces, separations, births and deaths collected in the period 2010–2016: the data were checked for missing PESEL values and for cells not containing eleven digits. The tables below show the completeness of the PESEL variable in the marital status register and in the information systems on divorce and separation rulings (Tables 14–18).
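The two requirements, a non-missing value and a format of exactly eleven digits, can be checked as in the sketch below. The function names are hypothetical, and the check is deliberately limited to what the text describes: it does not validate the PESEL checksum or the encoded date of birth.

```python
import re

_ELEVEN_DIGITS = re.compile(r"[0-9]{11}")


def pesel_is_usable(value):
    """True if the value is present and consists of exactly eleven digits."""
    if value is None:
        return False
    return _ELEVEN_DIGITS.fullmatch(str(value).strip()) is not None


def completeness(values):
    """Counts and percentage of usable PESEL values in a column."""
    total = len(values)
    present = sum(pesel_is_usable(v) for v in values)
    present_pct = round(100.0 * present / total, 2) if total else 0.0
    return {"present": present, "missing": total - present,
            "present_pct": present_pct}
```

Applied to a whole column, `completeness` yields the kind of "missing" and "present" counts and percentages that are reported in the tables below.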

Table 14. Completeness of PESEL variable in the datasets on marriages in the period 2010–2016 (in %)

Item Year PESEL Number of records %

1 2010 missing 8 705 1.91

2 2010 present 447 969 98.09

𝛴 456 674 100.00

3 2011 missing 8 032 1.95

4 2011 present 404 910 98.05

𝛴 412 942 100.00

5 2012 missing 3 980 0.98

6 2012 present 403 720 99.02

𝛴 407 700 100.00

7 2013 missing 3 981 1.10

8 2013 present 356 811 98.90

𝛴 360 792 100.00

9 2014 missing 4 015 0.14

10 2014 present 372 961 99.86

𝛴 376 976 100.00

11 2015 missing 3 102 0.82

12 2015 present 374 562 99.18

𝛴 377 664 100.00

13 2016 missing 3 184 0.82

14 2016 present 383 726 99.18

𝛴 386 910 100.00

Table 15. Completeness of PESEL variable in dataset on divorces in 2016 (in %)

Item Year PESEL Number of records %

1 2016 missing 86 869 68.40

2 2016 present 40 125 31.60

𝛴 126 994 100.00

Table 16. Completeness of PESEL variable in the dataset on separations in 2016 (in %)

Item Year PESEL Number of records %

1 2016 missing 2 389 64.92

2 2016 present 1 291 35.08

𝛴 3 680 100.00

Table 17. Completeness of PESEL variable in datasets on births in the period 2010–2016 (in %)

Item Year PESEL Number of records %

1 2010 missing 27 474 3.32

2 2010 present 799 126 96.68

𝛴 826 600 100.00

3 2011 missing 23 829 3.07

4 2011 present 753 003 96.93

𝛴 776 832 100.00

5 2012 missing 18 689 2.42

6 2012 present 753 825 97.58

𝛴 772 514 100.00

7 2013 missing 17 246 2.33

8 2013 present 721 906 97.67

𝛴 739 152 100.00

9 2014 missing 16 781 2.24

10 2014 present 733 539 97.76

𝛴 750 320 100.00

11 2015 missing 17 781 2.41

12 2015 present 720 835 97.59

𝛴 738 616 100.00

13 2016 missing 15 408 2.02

14 2016 present 749 106 97.98

𝛴 764 514 100.00

Table 18. Completeness of widower’s/widow’s PESEL variable in dataset on deaths in 2016 (in %)

Item Year PESEL Number of records %

1 2016 missing 3 311 2.21

2 2016 present 146 427 97.79

𝛴 149 738 100.00

Analysis of the quality of the datasets, particularly as regards the variables essential for linking, indicates that there are missing values in the PESEL variable, which serves as the identifier enabling linkage with the PESEL register. The number of missing values varies between the analysed datasets, however. In the sets on marriages and births, the number of missing values in the PESEL variable is small and does not exceed 4% of all observations in each set in a given year. The situation is much worse in the sets containing data on separations, divorces and deaths. In the case of the first two sets, out of the seven reference years, only the sets for 2016 contain PESEL numbers, whereas in the case of the third set (deaths) this variable is available for the years 2015–2016. This means that not all the sets can be used in further work.

2.2.2. The National System for Monitoring Family Benefits and the Central Register of Data on Beneficiaries of the Alimony Fund

Analysis of data quality was also conducted for the datasets obtained from the National System for Monitoring Family Benefits and the Central Register of Data on Beneficiaries of the Alimony Fund. The analysis focused on the key variables used for linking data with the PESEL dataset, i.e. the TERYT code of the commune of residence, locality name, street name and building/dwelling number. In deterministic linkage, which relies on a specific linkage key, the values used for linking must contain exactly the same sequence of characters for records to be matched successfully. Even the slightest deviation will prevent linkage, which, in turn, will cause a loss of some information in the output. For this reason, the content of the variables used as the linkage key between the registers has to be analysed to identify potential errors and then, if possible, to correct them using available methods and tools. The formats of the TERYT code and locality names in the different registers are presented in Section 2.1.1, in Table 7 and Table 8.

Another aspect analysed in detail was information on the marital status of persons included in

the National System for Monitoring Family Benefits and the Central Register of Data on

Beneficiaries of the Alimony Fund. The marital status of the applicant can be verified using the

auxiliary variable describing the degree of kinship with a person applying for the benefit

defined as spouse - parent/guardian of the child. After comparing records of the marital status

variable with information about the marital status based on the degree of kinship, we found

cases where the same person had different marital statuses. Such records were classified as

incorrect and excluded from the integration procedure.

3. Quality assessment of the integrated data sets

3.1. Quality of survey data integration

3.1.1. Quality of the linkage between the PESEL register and the surveys

Table 19 contains a summary of information concerning the number of records which were

linked with the PESEL register. As indicated in the first chapter, probabilistic linkage was used,

which was based on the assumption that the street name could be incorrect or the respondent

could have provided an incorrect birth date.

In the process of probabilistic linkage, each respondent is assigned a linkage weight¹ from the interval [0, 1], which expresses the likelihood that two records refer to the same person. The threshold value adopted in the project was 0.88; it was determined on the basis of an analysis of linked records.

Table 19 contains two types of linkage, denoted A and B. The first type refers to respondents

who were assigned the weight of 1, which represents a perfect match (all variables match). The

second type of linkage refers to situations where the weight was less than 1, which means that

the likelihood of two records referring to the same person is less than 1.
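The decision rule implied by the threshold and the two linkage types can be summarised in a few lines of Python. This is a sketch: the function name is hypothetical, and the threshold is treated as inclusive because the lowest weight among linked records reported in this document equals 0.88.

```python
THRESHOLD = 0.88  # threshold value adopted in the project


def classify_link(weight, threshold=THRESHOLD):
    """Return 'A' for a perfect match, 'B' for an accepted uncertain match,
    and None when the pair is not linked."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("linkage weight must lie in [0, 1]")
    if weight == 1.0:
        return "A"
    if weight >= threshold:
        return "B"
    return None
```

For instance, a pair with weight 0.97 is accepted as a type B link, while a pair with weight 0.50 is left unlinked.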

In the case of EU-SILC, the percentage of linked records where all variables matched was very

high and amounted to 88% in 2015 and 70% in 2016, compared to 55.6% in 2015 and 58% in

2016 for the LFS. A possible cause of this difference is that households surveyed in 2015 did not

change their address, which was recorded for households in 2016.

Table 19. Results of probabilistic linkage of the PESEL register with LFS and EU-SILC

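Survey | Year | Type of linkage | Records linked | Sample
LFS | 2015 | A | 2 600 | 4 670
LFS | 2015 | B | 356 | 4 670
LFS | 2016 | A | 4 762 | 8 135
LFS | 2016 | B | 573 | 8 135
EU-SILC | 2015 | A | 511 | 704
EU-SILC | 2015 | B | 19 | 704
EU-SILC | 2016 | A | 1 350 | 2 016
EU-SILC | 2016 | B | 69 | 2 016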

After including records with linkage weights of less than 1, we managed to link another 7% of respondents from the LFS in both years; in the case of the EU-SILC, additional matches accounted for only 2.5% in 2015 and 3.4% in 2016. Overall, the number of uncertain records, i.e. those with a linkage weight in the interval [0, 1), was 1,017, compared to 9,223 records with a weight equal to 1.

¹ Note: this weight is not the same as the sampling weight or the final weight used to estimate characteristics of the target population.
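The totals quoted above can be reproduced directly from the counts in Table 19. The short Python sketch below only re-adds the published counts; the variable and function names are hypothetical.

```python
# Counts copied from Table 19: A = weight equal to 1, B = weight below 1.
TABLE_19 = {
    ("LFS", 2015): {"A": 2600, "B": 356, "sample": 4670},
    ("LFS", 2016): {"A": 4762, "B": 573, "sample": 8135},
    ("EU-SILC", 2015): {"A": 511, "B": 19, "sample": 704},
    ("EU-SILC", 2016): {"A": 1350, "B": 69, "sample": 2016},
}

total_certain = sum(row["A"] for row in TABLE_19.values())
total_uncertain = sum(row["B"] for row in TABLE_19.values())


def share(linked, sample):
    """Percentage of the sample that was linked."""
    return 100.0 * linked / sample


for (survey, year), row in TABLE_19.items():
    print(survey, year,
          f"A: {share(row['A'], row['sample']):.1f}%",
          f"B: {share(row['B'], row['sample']):.1f}%")
```

The sums of the A and B columns give the 9,223 certain and 1,017 uncertain records mentioned in the text, and the printed shares can be compared with the percentages quoted there.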

3.1.2. Detailed analysis of linkage quality

Table 20 presents descriptive statistics for the linkage weight, reflecting the integration error,

without a distinction between the surveys. In keeping with the specified threshold, the lowest

weight value was 0.88, while the first decile was equal to 0.97, which means that 10% of

observations were assigned weights lower than 0.97. This indicates the presence of a linkage

error, which should be taken into account during the estimation. The smaller the linkage weight

is, the greater the uncertainty of results obtained on the basis of linked records.

Table 20. Descriptive statistics of linkage weights of respondents

Min Decile 1 Average Median Decile 9 Max Standard deviation

0.88 0.97 0.99 1.00 1.00 1.00 0.03

In the following steps, we verified hypotheses about the relationship between the uncertainty

of record linkage and other variables used for linkage. The purpose was to determine the causes

of uncertainty so that they could be controlled.

Table 21 contains information about the percentage of linkage weights that were less than 1 for Wielkopolskie province by sex. A weight of less than 1 implies that some records may have been linked incorrectly as a result of wrong information about the respondent's date of birth. The percentage of uncertain records was slightly higher for women (10.51%) than for men (9.18%).

We used the χ² test with continuity correction for 2 × 2 tables to verify the null hypothesis (H₀) that there is no association between sex and having a linkage weight of less than 1. The p-value of the test statistic is smaller than the significance level (0.05), which suggests that there is a relationship between these variables (χ² = 4.8526, df = 1, p-value = 0.02761). However, this result should be treated with caution, since there may be other variables which correlate with sex (e.g. age) and contribute to the rejection of the null hypothesis. Nonetheless, this difference is worth noting, since it may be important from the perspective of further analysis.
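The test used here can be illustrated with a short Python sketch of the χ² statistic with Yates' continuity correction for a 2 × 2 table, together with the p-value for one degree of freedom. The function names are hypothetical. Since this report gives only percentages, not the underlying counts, the sketch cannot recompute the statistic itself; it can only reproduce the p-value from the reported statistic.

```python
import math


def chi2_yates_2x2(a, b, c, d):
    """Chi-square statistic with Yates' correction for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # The correction subtracts n/2 from |ad - bc|, floored at zero.
    adjusted = max(abs(a * d - b * c) - n / 2, 0.0)
    return n * adjusted ** 2 / (row1 * row2 * col1 * col2)


def p_value_df1(chi2):
    """Upper-tail probability of the chi-square distribution with df = 1."""
    return math.erfc(math.sqrt(chi2 / 2))


# The statistic reported in the text reproduces the reported p-value.
reported_p = p_value_df1(4.8526)
```

Here `reported_p` is approximately 0.0276, in line with the p-value quoted above.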

Table 21. Distribution of linkage weights by sex in Wielkopolskie province

Sex Linkage weight less than 1? %

Men No 90.82

Yes 9.18

Women No 89.49

Yes 10.51

In the next step we analysed the variation in weights across age groups: [15,25), [25,35), [35,45), [45,55), [55,65) and 65+. Results are presented in Table 22. The largest percentages of weights lower than 1 are observed for the group [25,35) and for the oldest groups of respondents, [55,65) and 65+. For the oldest groups, this can be explained by older respondents' reluctance to provide exact information or by memory problems. Once again, the χ² test was performed to check whether the variables are independent. The test statistic and the p-value indicate that the null hypothesis H₀ should be rejected, which implies that differences between groups are significant (χ² = 26.858, df = 5, p-value < 0.0001). This suggests that the uncertainty of record linkage is related to the age of respondents.

Table 22. Distribution of linkage weights by age in Wielkopolskie province

Age group Linkage weight less than 1? %

[15,25) No 92.01

[15,25) Yes 7.99

[25,35) No 89.99

[25,35) Yes 10.01

[35,45) No 90.32

[35,45) Yes 9.68

[45,55) No 92.40

[45,55) Yes 7.60

[55,65) No 88.02

[55,65) Yes 11.98

65+ No 89.40

65+ Yes 10.60


Table 23. Distribution of linkage weights by marital status in Wielkopolskie province

Legal marital status Linkage weight less than 1? %

Single No 90.83

Single Yes 9.17

Married No 90.00

Married Yes 10.00

Divorced / in separation No 89.72

Divorced / in separation Yes 10.28

Widowed No 88.24

Widowed Yes 11.76

The share of weights lower than 1 differs slightly depending on the category of the legal marital status variable. These shares are presented in Table 23. As can be seen, the largest percentages of such weights are found among respondents classified as “divorced / in separation” and “widowed” (this pattern is also correlated with age). This means that for these groups the uncertainty arising from (survey) sampling increases and should be taken into account when determining the percentage of persons with a given marital status (on the basis of integrated data sources). The χ² test applied to this contingency table found no statistically significant association between the percentage of weights smaller than 1 and the marital status (χ² = 3.1034, df = 3, p-value = 0.376). This may suggest that the linkage error is non-informative, i.e. it does not depend on the variable of interest but on other variables that can be controlled for, for example age.
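The p-values reported for the three χ² tests (sex, age, marital status) can be cross-checked from the reported statistics and degrees of freedom alone. The sketch below is not the authors' code: it evaluates the closed-form upper-tail probability of the chi-squared distribution for integer degrees of freedom, and the function name `chi2_sf` is an illustrative choice.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X >= x) for a chi-squared variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^j / j!.
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: complementary error function plus a finite correction sum.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * j + 1)
    return total


# (variable, chi-squared statistic, degrees of freedom) as reported in the text
reported = [("sex", 4.8526, 1), ("age", 26.858, 5), ("marital status", 3.1034, 3)]
for name, stat, df in reported:
    print(f"{name}: p-value = {chi2_sf(stat, df):.5f}")
```

Under these formulas the recomputed values agree with the reported ones: about 0.0276 for sex, below 0.0001 for age and about 0.376 for marital status.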

During the last step of data analysis, we evaluated the spatial distribution of linkage weights

smaller than 1 across districts (LAU 1 level) of Wielkopolskie province. Results of this analysis are

presented in Figure 2. Detailed information is shown in Table 24, which contains percentages,

together with the number of records with linkage weights smaller than 1. The higher the

percentage is, the greater the uncertainty associated with the use of those linked records for

the purposes of estimation. The highest percentages were recorded for the following districts (district codes in brackets): the city of Poznań (3064), poznański (3021), gnieźnieński (3003), turecki (3027) and wągrowiecki (3028).


A chi-squared test was conducted to check whether there was a statistical association between the district variable and having a linkage weight lower than 1. Such an association was found (χ² = 914.64, p-value = 0.0005; the p-value was estimated from 2,000 Monte Carlo iterations).
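The source does not say how the Monte Carlo p-value was obtained. One common convention, assumed here, is p = (1 + number of simulated statistics at least as large as the observed one) / (B + 1); with B = 2,000 its smallest attainable value is 1/2001 ≈ 0.0005, which is consistent with the reported figure. The sketch below uses a permutation scheme on hypothetical data (three made-up districts A, B, C); the function names and the data are illustrative only.

```python
import random
from collections import Counter


def chi2_stat(groups, flags):
    """Pearson chi-squared statistic for the cross-table of two label lists."""
    n = len(groups)
    row = Counter(groups)
    col = Counter(flags)
    obs = Counter(zip(groups, flags))
    stat = 0.0
    for g, row_total in row.items():
        for f, col_total in col.items():
            expected = row_total * col_total / n
            stat += (obs[(g, f)] - expected) ** 2 / expected
    return stat


def monte_carlo_p(groups, flags, iterations=2000, seed=0):
    """Permutation p-value: shuffle the flags and count statistics >= the observed one."""
    rng = random.Random(seed)
    observed = chi2_stat(groups, flags)
    shuffled = list(flags)
    hits = 0
    for _ in range(iterations):
        rng.shuffle(shuffled)
        if chi2_stat(groups, shuffled) >= observed - 1e-12:
            hits += 1
    return observed, (1 + hits) / (iterations + 1)


# Hypothetical data: 100 records per district, flag 1 = linkage weight below 1.
groups = ["A"] * 100 + ["B"] * 100 + ["C"] * 100
flags = ([1] * 25 + [0] * 75) + ([1] * 5 + [0] * 95) + ([1] * 2 + [0] * 98)

observed, p_value = monte_carlo_p(groups, flags, iterations=2000, seed=1)
print(observed, p_value)
```

Simulation is a natural choice for the district table because many districts contain only a handful of records with weights below 1, so the usual large-sample approximation may be unreliable.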

Figure 2. Distribution of linkage weights less than 1 for districts of Wielkopolskie province

Table 24. Percentage and number of records with linkage weights of less than 1 for districts of Wielkopolskie province

District code Number % District code Number %

3064 332 25.13 3002 9 3.40

3021 110 21.91 3005 9 6.16


3003 77 24.44 3014 6 5.94

3061 56 15.64 3006 5 1.50

3027 45 25.71 3020 5 1.90

3062 33 10.78 3030 5 1.36

3011 32 13.56 3008 4 1.57

3004 30 8.50 3029 4 1.45

3019 28 6.02 3031 4 2.33

3063 27 12.44 3015 3 1.29

3028 24 21.05 3007 2 0.86

3009 20 6.54 3012 2 0.81

3016 19 14.07 3013 2 1.20

3017 19 3.53 3001 1 0.84

3022 19 8.80 3018 1 0.62

3024 19 6.86

3023 18 12.50

3025 14 9.33

3026 14 6.57

3010 10 1.98


4. Estimation of the actual marital status using the integrated sets

4.1. Determination of the actual marital status

4.1.1. Description of the algorithm

The actual marital status was determined on the basis of 9 sources (see footnote 2), taking into account the date of the last update or the interview date (in the case of surveys). Table 25 presents the number of sources used and the number of source combinations that could be used to obtain information about the marital status. For example, if the number of sources is 1, the actual marital status was determined on the basis of only one source. If the number of sources is 2, the marital status was determined on the basis of two sources, and 18 different pairs of sources occurred (e.g. PESEL + marriages, PESEL + deaths, etc.).

Table 25. Number of sources used to determine the actual marital status

Number of sources | Combinations of sources
1 | 9
2 | 18
3 | 29
4 | 29
5 | 15
6 | 2

Table 32 contains detailed information about the sources used. Selected data are presented in Table 26. The column names represent: No = combination number, Mrg = marriages, Dvc = divorces, Sep = separations, Bth = births, Dth = deaths, FA = Alimony Fund, SB = social benefits, Srv = LFS and EU-SILC, N = domain size. A value of 1 denotes that the actual marital status was observed in a given source (regardless of its level), and an empty field denotes the lack of marital status in a given source (it was undetermined). For example, with respect to combination No. 1, for 2,473,131 persons the marital status was established only on the basis of the PESEL register (71.8% of all observations). In the case of combination No. 6, all columns contain empty fields. This means that it was not possible to determine the legal marital status for 77,464 persons.

Footnote 2: The list of sources included: PESEL, marriages, divorces, separations, births, deaths, the Alimony Fund, social benefits, and the combined LFS and EU-SILC surveys. These are the datasets indicated in the diagram of data linkage presented in Figure 1.

Table 26. Information about the coexistence of marital status information in the data sources

No PESEL Mrg Dvc Sep Bth Dth FA SB Srv N

1 1 2 817 123

2 1 1 346 453

3 1 1 139 716

4 1 1 99 728

5 1 1 1 95 986

6 77 464

7 1 1 1 63 408

8 1 1 1 1 47 828

9 1 1 1 28 320

10 1 1 16 703

...
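The tabulations behind Tables 25 and 26 can be sketched as follows: each person is reduced to a 0/1 pattern over the nine sources, the patterns are counted (domain sizes N, as in Table 26), and the distinct patterns are then grouped by the number of sources they involve (as in Table 25). The records below are hypothetical and the helper name `pattern` is illustrative; only the column abbreviations come from Table 26.

```python
from collections import Counter

# Column abbreviations as used in Table 26.
SOURCES = ("PESEL", "Mrg", "Dvc", "Sep", "Bth", "Dth", "FA", "SB", "Srv")


def pattern(record):
    """Return a tuple of 0/1 flags: 1 where the source reports a marital status."""
    return tuple(int(record.get(source) is not None) for source in SOURCES)


# Hypothetical persons: a missing key means the source holds no marital status.
people = [
    {"PESEL": "married"},
    {"PESEL": "married", "Mrg": "married"},
    {"PESEL": "single"},
    {},
    {"PESEL": "single", "Bth": "single", "Srv": "partner"},
    {"PESEL": "married", "Mrg": "married"},
]

# Domain size N for every observed combination of sources (Table 26 style).
combo_sizes = Counter(pattern(person) for person in people)

# Number of distinct combinations per number of sources used (Table 25 style).
combos_by_count = Counter(sum(flags) for flags in combo_sizes)

for flags, size in combo_sizes.most_common():
    print(flags, size)
print(dict(combos_by_count))
```

The all-zero pattern corresponds to persons whose marital status could not be determined from any source.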

Persons aged 15 to 21 years were classified as "single". This choice is a consequence of Polish law, Article 10 § 1 of the Family and Guardianship Code: “No person under the age of eighteen may enter into marriage. However, for important reasons a family court may permit a sixteen-year-old woman to marry, when it is apparent that it will be consistent with the good of the formed family.” Additionally, after internal consultations between project members, a decision was made to extend this limit up to the age of 21.
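A minimal sketch of how these rules might be combined is given below. It rests on two assumptions that the text does not spell out: that the age limit of 21 is inclusive, and that "taking into account the date of the last update" means the most recently dated report wins. The function name, the tuple layout and the example reports are hypothetical.

```python
from datetime import date


def actual_marital_status(age, reports):
    """Pick one status from (source, status, reference_date) tuples.

    Assumed rules (a simplification, not the project's actual code):
    - persons aged 15 to 21 are classified as "single";
    - otherwise the report with the latest reference date is used;
    - if no source reports a status, the result is "undetermined".
    """
    if 15 <= age <= 21:
        return "single"
    usable = [report for report in reports if report[1] is not None]
    if not usable:
        return "undetermined"
    return max(usable, key=lambda report: report[2])[1]


# Hypothetical reports for one person.
reports = [
    ("PESEL", "married", date(2014, 6, 1)),
    ("divorces", "divorced", date(2016, 3, 15)),
    ("LFS", None, date(2016, 9, 30)),
]

print(actual_marital_status(40, reports))
print(actual_marital_status(19, reports))
print(actual_marital_status(40, []))
```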

Table 27. Updated legal marital status obtained from the PESEL register (rows), and obtained after updating PESEL with information from other sources (columns)

PESEL / Int. | single man | single woman | married man | married woman | divorced man | divorced woman | widower | widow | undetermined | ∑
single (man) | 414 684 | - | 1 774 | - | 74 | - | 31 | - | - | 416 563
single (woman) | 4 | 330 831 | - | 1 192 | - | 45 | - | 71 | - | 332 143
married man | 1 627 | - | 777 641 | - | 293 | - | 149 | - | - | 779 710
married woman | - | 6 198 | - | 804 860 | - | 790 | - | 583 | - | 812 431
divorced man | 1 065 | - | 1 368 | 2 | 63 083 | - | 45 | - | - | 65 563
divorced woman | - | 1 117 | - | 1 998 | - | 93 037 | - | 373 | - | 96 525
widower | 98 | - | 425 | - | 7 | - | 37 930 | - | - | 38 460
widow | - | 333 | - | 2 232 | - | 36 | - | 192 549 + 3 | - | 195 153
undetermined | 10 969 | 10 292 | 551 | 637 | 12 | 61 | 23 | 146 | 57 884 | 80 575
∑ | 428 270 | 348 717 | 781 598 | 810 833 | 63 441 | 93 949 | 38 172 | 193 722 | 57 884 | 2 817 123
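As a consistency check, the "undetermined" row of Table 27 can be added up by hand. The persons who were undetermined in PESEL but received a status after integration sum to the figure of 22,691 quoted in the text, which also equals the row total minus those who remained undetermined:

$$10\,969 + 10\,292 + 551 + 637 + 12 + 61 + 23 + 146 = 22\,691 = 80\,575 - 57\,884$$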

Table 28. Updated actual and legal marital status

Actual / Legal | single (man) | single (woman) | married man | married woman | divorced man | divorced woman | widower | widow | undetermined
single (man) | 99.37 | - | - | - | - | - | - | - | -
single (woman) | - | 98.86 | - | - | - | - | - | - | -
married man | - | - | 99.87 | - | - | - | - | - | -
married woman | - | - | - | 99.80 | - | - | - | - | -
divorced man | - | - | - | - | 100 | - | - | - | -
divorced woman | - | - | - | - | - | 100 | - | - | -
widower | - | - | - | - | - | - | 100 | - | -
widow | - | - | - | - | - | - | - | 100 | -
male partner | 0.63 | - | - | - | - | - | - | - | -
female partner | - | 1.14 | - | - | - | - | - | - | -
in separation | - | - | 0.13 | 0.20 | - | - | - | - | -
undetermined | - | - | - | - | - | - | - | - | 100

Table 27 contains information on the updated legal marital status for the entire population of persons in the PESEL register after it was integrated with the other sources of information. The rows refer to the marital status according to PESEL, and the columns to the marital status established after updating PESEL with the other sources. By integrating the register and survey data, it was possible to determine (impute) the marital status for 22,691 persons who had an undetermined status in the PESEL register. Table 28 presents information on the actual and legal marital


status. The integrated sources provided information about any changes in the legal marital

status from "single man/woman" to "male/female partner", and from "married man/woman" to

"in separation".

4.1.2. Analysis of the undetermined marital status

The next step was to analyse the category of undetermined marital status, which was assigned to 2.07% of persons aged 15+ in Wielkopolskie province. First, we carried out a spatial analysis at the level of districts to determine whether missing data were spatially correlated. Results are presented in Figure 3, while detailed percentages are presented in Table 31.

The largest percentage of people with undetermined marital status in the population aged 15+ was observed in pilski district (7.78%), szamotulski district (7.58%), and in the city of Poznań (4.92%). High values for pilski and szamotulski districts result from very high percentages of undetermined marital status in the following municipalities: Szamotuły - 21.66% and Piła - 11.69%. In other districts this percentage was below 4%. In contrast, the lowest percentages (below 0.05%) were recorded in the following districts: the municipality of Konin, krotoszyński district, średzki district, jarociński district and pleszewski district. Figure 3 suggests that persons with undetermined marital status are clustered in the northern part of Wielkopolskie province.


Figure 3. Fraction of undetermined actual marital status in districts of Wielkopolskie province (population aged 15+)

To facilitate a more in-depth analysis of the problem of determining the actual marital status, Figure 4 presents results for single-year age groups from 15 to 90 and above. Red denotes undetermined actual marital status, whereas blue represents categories of determined marital status. For young persons (aged 21–25), the percentage of undetermined actual marital status is relatively high in comparison with other groups. This percentage is lower for people aged 25–45 and then rises to the level of 4.4% for people aged 90+.


The above information indicates that both demographic and spatial factors should be considered in estimation in order to account for the undetermined actual marital status.

Figure 4. Fraction of undetermined actual marital status by age in Wielkopolskie province


4.2. Estimation of the actual marital status

4.2.1. Post-stratification of the actual marital status

For the purpose of estimating the actual marital status, the "undetermined" category was

treated as missing data. As a result, it was necessary to create artificial weights to enable

generalisation of the results. In the first place, the weight for each respondent was determined

according to the following formula (1):

$$d_i = \begin{cases} 1 & \text{if the actual marital status was other than undetermined,} \\ 0 & \text{if the actual marital status was undetermined,} \end{cases}$$

where $i = 1, \ldots, N$ denotes the person’s number in the PESEL register, and $N$ is the size of the population of Wielkopolskie province in the PESEL register. Naturally, the sum of weights was


not equal to the size of the population ($\sum_i d_i \neq N$), therefore we had to adjust them to match

the size of the population. Additionally, weights did not add up to the number of women and

men, persons at a given age or within districts. To solve this problem, post-stratification was

applied to ensure that domains obtained by cross-classifying the following variables were

consistent with the size of the general population:

district (35 categories),

sex (2 categories),

age (76 categories; 15, 16, ..., 89, 90+).

In total, 5,320 strata were created by cross-classifying the three variables (35 × 2 × 76). Then a correction factor for each domain was determined according to the formula (2):

$$w_{dsa} = \frac{N_{dsa}}{n_{dsa}},$$

where $n_{dsa} = \sum_i d_{i,dsa}$ is the size of a given domain, i.e. the sum of weights assigned to the respondents in a given stratum, $N_{dsa}$ is the population size in a given domain, and the subscripts refer, respectively, to: d – district, s – sex, and a – age. Such correction factors were

then multiplied by output counts for the respective domains. In this way, we obtained the size

of the entire population, i.e. persons with known marital status and those for whom it was

undetermined.
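Formulas (1) and (2) can be illustrated with a short sketch on a hypothetical register extract. Each record gets $d_i = 1$ when its status is known, the stratum counts $N_{dsa}$ and $n_{dsa}$ are accumulated, and the factor $w_{dsa} = N_{dsa} / n_{dsa}$ is applied to the records with known status. The records and the function name are illustrative assumptions, not the project's data or code.

```python
from collections import defaultdict

# Hypothetical register extract: (district, sex, age, actual marital status).
register = [
    ("3064", "F", 30, "married"),
    ("3064", "F", 30, "single"),
    ("3064", "F", 30, "undetermined"),
    ("3064", "F", 30, "married"),
    ("3019", "M", 45, "divorced"),
    ("3019", "M", 45, "undetermined"),
]


def poststratification_weights(records):
    """Return w_dsa = N_dsa / n_dsa for every stratum with at least one known status."""
    population = defaultdict(int)   # N_dsa: all persons in the stratum
    determined = defaultdict(int)   # n_dsa: sum of d_i in the stratum
    for district, sex, age, status in records:
        stratum = (district, sex, age)
        population[stratum] += 1
        determined[stratum] += 0 if status == "undetermined" else 1  # d_i, formula (1)
    return {s: population[s] / determined[s] for s in population if determined[s] > 0}


weights = poststratification_weights(register)

# Weighted counts by marital status; records with d_i = 0 contribute nothing.
estimated = defaultdict(float)
for district, sex, age, status in register:
    if status != "undetermined":
        estimated[status] += weights[(district, sex, age)]

print(weights)
print(dict(estimated), sum(estimated.values()))
```

In this toy example the weighted counts add up to the full register size of 6. A stratum in which every status is undetermined has no defined factor and would need separate treatment, for example collapsing it with a neighbouring stratum.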

Table 29 presents descriptive statistics of weights obtained by applying the formula (2). The distribution of weights is strongly right-skewed, with the median equal to 1.01 and the mean equal to 1.02. Figure 5 presents a histogram of the distribution of weights, which shows that weight values for some domains are relatively high.

Table 29. Descriptive statistics of weights used to generalise the actual marital status to Wielkopolskie Province

Min Quartile 1 Median Mean Quartile 3 Max

1.00 1.00 1.01 1.02 1.01 1.35


Figure 5. Distribution of weights used to generalise the actual marital status in Wielkopolskie Province

Weights obtained after post-stratification were used in the next stage to estimate the percentage and the number of persons by actual marital status at the level of districts in Wielkopolskie Province.

4.2.2. Analysis of actual marital status estimates

Table 30 contains information about the actual marital status estimated for the population aged

15+ in Wielkopolskie Province on the basis of selected sources before post-stratification

(unweighted percentage column) and after post-stratification (weighted percentage column)

and in comparison with the Census 2011.

A comparison of the estimates for the categories "male/female partner" and "in separation" with data from the Census 2011 reveals a big difference, which is mostly due to the imperfect quality of data sources used to determine the actual marital status of particular


persons (e.g. only the relationship with the head of the household). For this reason, the three

underestimated categories should be analysed with caution.

Only weighted percentages are taken into account in further analysis, since they are based on weights used to generalise the results. In Wielkopolskie Province married women accounted for 29.27% and married men for 28.42% of the population aged 15+. The largest difference between the sexes, due to women’s longer life expectancy, can be seen for persons classified as "widower/widow". Other differences occur in the group of single persons, where the share of single men (15.35%) was higher than that of single women (12.40%), and in the group of divorced persons. Other categories of marital status were represented by very small populations (below 1% in Wielkopolskie Province).

Table 30. Estimates of actual marital status for the population aged 15+ in Wielkopolskie Province and according to Census 2011

Actual marital status | Unweighted (%) | Weighted (%) | Census 2011 (%)
Single man | 15.43 | 15.35 | 15.72
Single woman | 12.50 | 12.40 | 12.56
Male partner | 0.10 | 0.10 | 0.93
Female partner | 0.14 | 0.14 | 0.90
Married man | 28.29 | 28.42 | 28.35
Married woman | 29.33 | 29.27 | 28.31
Divorced man | 2.30 | 2.32 | 1.42
Divorced woman | 3.41 | 3.41 | 2.24
Widower | 1.38 | 1.42 | 1.41
Widow | 7.02 | 7.08 | 7.55
In separation | 0.10 | 0.10 | 0.42
Undetermined | 2.07 | – | 0.18
Total | 2 891 600
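The comparison with the Census 2011 can be made explicit by dividing each weighted estimate in Table 30 by the corresponding census percentage. The sketch below copies the values from the table (the labels "Married man" and "Divorced man" are used here for the rows given for men); the ratio is an illustrative measure chosen for this example, not one used in the report.

```python
# category: (weighted %, Census 2011 %), values copied from Table 30
table30 = {
    "Single man": (15.35, 15.72),
    "Single woman": (12.40, 12.56),
    "Male partner": (0.10, 0.93),
    "Female partner": (0.14, 0.90),
    "Married man": (28.42, 28.35),
    "Married woman": (29.27, 28.31),
    "Divorced man": (2.32, 1.42),
    "Divorced woman": (3.41, 2.24),
    "Widower": (1.42, 1.41),
    "Widow": (7.08, 7.55),
    "In separation": (0.10, 0.42),
}

# Ratio of the integrated-data estimate to the census value (1.0 = agreement).
ratios = {category: weighted / census for category, (weighted, census) in table30.items()}

# The three categories with the lowest ratios, i.e. the most underestimated ones.
most_underestimated = sorted(ratios, key=ratios.get)[:3]

for category in sorted(ratios, key=ratios.get):
    print(f"{category}: {ratios[category]:.2f}")
```

On these figures the three lowest ratios belong to the partner and separation categories, each below a quarter of the census share, while every other category reaches at least 90% of its census value.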

Figures 6-8 show the spatial distribution of percentages for particular categories of marital

status based on information from the PESEL register and additional sources.


Figure 6. Fraction of actual marital status: single man, single woman, married man and married woman for population aged 15+ across districts of Wielkopolskie Province


Figure 7. Fraction of actual marital status: divorced man, divorced woman, widower and widow for population aged 15+ across districts of Wielkopolskie Province


Figure 8. Fraction of actual marital status: male partner, female partner and persons in separation for the population aged 15+ across districts of Wielkopolskie Province


Summary

This report is a summary of the work done by two project teams from the Statistical Office in

Poznań: the Centre for Urban Statistics and the Centre for Small Area Estimation. The teams

completed the following tasks:

identified sources of data (registers and surveys), which can be used for estimating the

legal and actual marital status,

standardised data to enable the integration of sources,

integrated selected data from administrative registers,

integrated selected data from two representative surveys (LFS and EU-SILC),

made a preliminary assessment of the integration of representative surveys and registers,

imputed values of legal and actual marital status based on registers and representative

surveys,

conducted a preliminary analysis of undetermined marital status, which was treated as

missing data that had to be imputed or calibrated,

conducted post-stratification (as a special case of calibration) to evaluate the actual marital

status at the level of districts in Wielkopolskie Province.

The most important conclusions concerning data integration for the purpose of estimating the

actual marital status include:

sets of answer options in questions concerning the legal marital status are inconsistent; in addition, answer options in statistical surveys include the category of actual marital status,

there are no questions concerning civil unions with persons from outside the household. In

the future, information from other surveys may be considered, e.g. Social Cohesion (Are

you currently in a close relationship (marriage or informal relationship), even without living

together?),


representative surveys are currently not suited for regular integration with administrative

registers for both legal and practical reasons (e.g. the problem of standardisation of

addresses of dwellings),

there is no single statistical identifier which can be used to link different sources of

information,

it is difficult to determine the final reference date for data from numerous sources.

The project can be treated as a proposal for developing the concept of a rolling census based on

different sources, updated using data from representative surveys.


List of figures

Figure 1. Schematic diagram of linking administrative registers and representative surveys, where PESEL is the main register. Continuous lines denote deterministic linkage (using a linkage key), and dashed lines denote linkage using an artificial key or probabilistic linkage

Figure 2. Distribution of linkage weights less than 1 for districts of Wielkopolskie province

Figure 3. Fraction of undetermined actual marital status in districts of Wielkopolskie province (population aged 15+)

Figure 4. Fraction of undetermined actual marital status by age in Wielkopolskie province

Figure 5. Distribution of weights used to generalise the actual marital status in Wielkopolskie Province

Figure 6. Fraction of actual marital status: single man, single woman, married man and married woman for population aged 15+ across districts of Wielkopolskie Province

Figure 7. Fraction of actual marital status: divorced man, divorced woman, widower and widow for population aged 15+ across districts of Wielkopolskie Province

Figure 8. Fraction of actual marital status: male partner, female partner and persons in separation for the population aged 15+ across districts of Wielkopolskie Province


List of Tables

Table 1. List of identified data sets containing information on the marital status

Table 2. List of surveys used for the purposes of the VIP ADMIN project

Table 3. The KSMŚR database - created using data from the National System for Monitoring Family Benefits for the years 2015–2016

Table 4. The FA database - created using data from the Central Register of Data on Beneficiaries of the Alimony Fund for the years 2015–2016

Table 5. Realised sample size in Wielkopolskie Province in LFS and EU-SILC (unique records)

Table 6. The TERYT commune code and the name of the locality in the PESEL register

Table 7. Standardised locality names and TERYT commune codes in the Central Register of Data on Beneficiaries of the Alimony Fund (the FA database)

Table 8. Standardised locality names and TERYT commune codes in the National System for Monitoring Family Benefits (the KSMŚR database)

Table 9. List of localities observed only in the surveys (different TERYT code)

Table 10. Coding of legal marital status in LFS and EU-SILC

Table 11. Size of LFS and EU-SILC sample for Wielkopolskie province by sex and marital status (combined unweighted samples, before standardisation)

Table 12. Coding of the degree of kinship with the head of the household in LFS and EU-SILC

Table 13. The size of LFS and EU-SILC samples for Wielkopolskie province by sex and information on whether the respondent’s partner is the household head (combined unweighted samples)

Table 14. Completeness of PESEL variable in the datasets on marriages in the period 2010–2016 (in %)

Table 15. Completeness of PESEL variable in dataset on divorces in 2016 (in %)

Table 16. Completeness of PESEL variable in the dataset on separations in 2016 (in %)

Table 17. Completeness of PESEL variable in datasets on births in the period 2010–2016 (in %)

Table 18. Completeness of widower’s/widow’s PESEL variable in dataset on deaths in 2016 (in %)

Table 19. Results of probabilistic linkage of the PESEL register with LFS and EU-SILC

Table 20. Descriptive statistics of linkage weights of respondents


Table 21. Distribution of linkage weights by sex in Wielkopolskie province

Table 22. Distribution of linkage weights by age in Wielkopolskie province

Table 23. Distribution of linkage weights by marital status in Wielkopolskie province

Table 24. Percentage and number of records with linkage weights of less than 1 for districts of Wielkopolskie province

Table 25. Number of sources used to determine the actual marital status

Table 26. Information about the coexistence of marital status information in the data sources

Table 27. Updated legal marital status obtained from the PESEL register (rows), and obtained after updating PESEL with information from other sources (columns)

Table 28. Updated actual and legal marital status

Table 29. Descriptive statistics of weights used to generalise the actual marital status to Wielkopolskie Province

Table 30. Estimates of actual marital status for the population aged 15+ in Wielkopolskie Province and according to Census 2011

Table 31. The number and the percentage of the undetermined actual marital status in districts of Wielkopolskie Province (population aged 15+)

Table 32. Information about the coexistence of information about the legal and actual marital status in used data sources

A Detailed tables - Information on the actual marital status

Comments to Table 32. Explanations: NO=number of the combination, PESEL=the PESEL register, Mrg=marriages, Dvc=divorces, Sep=separations, Bth=births, Dth=deaths, FA=the Alimony Fund, SB=social benefits, Srv=LFS + EU-SILC surveys, N=the size of a given combination. A value of 1 means that information on marital status was present in a given source (regardless of the level). For example, combination no. 1, with a size of 2,473,131, means that for this number of persons information on marital status came only from the PESEL register (71.8% of all observations).

On the other hand, combination no. 6 means that 77,464 persons (2.25%) have an undetermined marital status (regardless of age).
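A table of source combinations of this kind could be tabulated roughly as follows. This is only a minimal sketch, not the procedure used in the project: the function name `combination_table`, the representation of each person as a set of source labels, and the toy data are all assumptions made for illustration. Only the column abbreviations are taken from the table header.

```python
from collections import Counter

# Column order taken from the header of the combination table.
SOURCES = ("PESEL", "Mrg", "Dvc", "Sep", "Bth", "Dth", "FA", "SB", "Srv")


def combination_table(records):
    """Count persons per pattern of sources holding marital status information.

    `records` is an iterable of sets of source labels, one set per person
    (hypothetical input format). Returns rows of the form
    (combination number, 0/1 pattern, N, percentage of all persons),
    ordered by decreasing N.
    """
    counts = Counter(
        tuple(1 if source in present else 0 for source in SOURCES)
        for present in records
    )
    total = sum(counts.values())
    return [
        (no, pattern, n, round(100 * n / total, 2))
        for no, (pattern, n) in enumerate(counts.most_common(), start=1)
    ]


# Toy data (not from the report): 5 persons with PESEL only,
# 2 persons with PESEL and the marriage register, 1 person with no source.
toy_records = [{"PESEL"}] * 5 + [{"PESEL", "Mrg"}] * 2 + [set()]
rows = combination_table(toy_records)
for no, pattern, n, pct in rows:
    print(no, pattern, n, pct)
```

In this sketch an all-zero pattern plays the role of the combination with undetermined marital status, since no source contributes any information for those persons.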

Table 31. The number and the percentage of the undetermined actual marital status in districts of Wielkopolskie Province (population aged 15 +)

TERYT District Percentage Number

3019 Piła District 10.20 11538

3024 Szamotuły District 8.48 6195

3064 The City of Poznań 5.14 21768

3002 Czarnków-Trzcianka District 4.73 3440

3003 Gniezno District 4.35 5099

3004 Gostyń District 3.95 2484

3031 Złotów District 3.27 1882

3011 Kościan District 3.23 2110

3063 The City of Leszno 3.15 1659

3001 Chodzież District 2.77 1076

3016 Oborniki District 2.37 1117

3027 Turek District 2.14 1500

3022 Rawicz District 1.99 978

3013 Leszno District 1.83 815

3007 Kalisz District 1.60 1100

3010 Konin District 1.58 1681

3023 Słupca District 1.57 781

3018 Ostrzeszów District 1.47 681

3005 Grodzisk Wielkopolski District 1.43 589

3008 Kępno District 1.11 514

3009 Koło District 1.10 814

3026 Śrem District 0.98 481

3014 Międzychód District 0.92 283

3017 Ostrów Wielkopolski District 0.91 1221

3028 Wągrowiec District 0.88 496

3021 Poznań District 0.80 2277

3061 The City of Kalisz 0.67 569

3015 Nowy Tomyśl District 0.58 351

3029 Wolsztyn District 0.51 236

3030 Września District 0.48 301

3062 The City of Konin 0.47 303

3012 Krotoszyn District 0.38 244

3025 Środa Wielkopolska District 0.24 112

3006 Jarocin District 0.20 116

3020 Pleszew District 0.15 76

Table 32. Information about the coexistence of information about the legal and actual marital status in used data sources

NO PESEL Mrg Dvc Sep Bth Dth FA SB Srv N

1 1 2473131

2 1 1 346453

3 1 1 139716

4 1 1 99728

5 1 1 1 95986

6 77464

7 1 1 1 63408

8 1 1 1 1 47828

9 1 1 1 28320

10 1 1 16703

11 1 1 14231

12 1 7902

13 1 1 7655

14 1 1 1 1 4593

15 1 1 1 4422

16 1 1 1 1 1 1913

17 1 1 1 1340

18 1 1 1 1197

19 1 1 1119

20 1 1088

21 1 1 1 1 960

22 1 1 1 911

23 1 1 1 890

24 1 1 674

25 1 1 1 1 565

26 1 508

27 1 1 425

28 1 1 1 387

29 1 1 1 264

30 1 262

31 1 1 1 1 224

32 1 1 1 1 221

33 1 1 1 220

34 1 1 1 208

35 1 1 1 1 185

36 1 1 1 185

37 1 1 1 1 1 177

38 1 1 1 163

39 1 96

40 1 1 1 95

41 1 1 1 73

42 1 1 1 1 70

43 1 1 1 1 62

44 1 1 1 1 53

45 1 50

46 1 1 1 1 1 49

47 1 1 41

48 1 1 39

49 1 1 1 1 39

50 1 1 1 38

51 1 1 1 1 34

52 1 1 1 33

53 1 1 28

54 1 1 1 1 27

55 1 1 25

56 1 1 1 1 1 20

57 1 1 1 1 1 16

58 1 1 1 15

59 1 1 1 1 14

60 1 1 1 14

61 1 1 1 14

62 1 1 1 1 1 14

63 1 1 1 1 1 1 13

64 1 1 1 13

65 1 1 1 1 10

66 1 1 1 1 1 1 9

67 1 1 1 1 8

68 1 1 8

69 1 1 1 1 7

70 1 1 1 1 1 5

71 1 1 1 1 5

72 1 1 1 5

73 1 1 1 1 1 4

74 1 1 1 4

75 1 1 1 1 1 3

76 1 1 1 1 1 3

77 1 1 1 3

78 1 1 1 1 3

79 1 1 1 1 3

80 1 1 1 1 2

81 1 1 1 1 1 2

82 1 1 1 1 1 2

83 1 1 1 1 2

84 1 1 1 1 1 2

85 1 1 2

86 1 2

87 1 1 1 2

88 1 1 1 2

89 1 1 1 1 1 1

90 1 1 1 1 1

91 1 1 1 1 1 1

92 1 1 1 1 1

93 1 1 1 1 1

94 1 1 1 1 1

95 1 1 1 1

96 1 1 1 1 1

97 1 1 1 1 1

98 1 1 1 1 1

99 1 1 1 1

100 1 1 1

101 1 1 1

102 1 1 1