Page 1: METHODS OF DETECTING AND TREATING OUTLIERS USED IN ...icos2017.fzs.ba/images/Presentations/Day_1_A1a/ICOS2017_Marink… · Non-sampling errors (with respect to nature) •Stochastic

METHODS OF DETECTING AND TREATING OUTLIERS USED IN REPUBLIKA SRPSKA INSTITUTE OF STATISTICS

Darko Marinković, Aleksandra Djonlaga

Republika Srpska Institute of Statistics

Introduction

• The difference between the “true value” of a parameter and its survey estimate is the total survey error (TSE) [1]

• Sampling error is the part of TSE that is under the control of statisticians

• Adequate allocation of the sample and use of auxiliary information at the design and estimation stages usually solve the problem

• Non-sampling errors are caused by all survey operations other than sampling

• Keeping non-sampling errors under control can be a real challenge for statisticians

[1] Biemer, P.P. and Lyberg, L.E. (2003)

Non-sampling errors (with respect to source)

• Specification error (concept implied by the survey question and the concept that should be measured in the survey differ)

• Frame error (arises in the construction of the sampling frame for the survey: possible over- or undercoverage, misclassifications, duplications)

• Non-response error (arises from incomplete or completely missing data)

• Measurement error (the response differs from the true value because of the interviewer, respondent, questionnaire design, collection method, information system, etc.)

• Processing error (editing of data, data entry, coding, the assignment of survey weights, and the tabulation of survey data)

Non-sampling errors (with respect to nature)

• Stochastic - occur due to accidental sources of error

  • They do not significantly bias the parameter estimates in any specific direction, because they cancel out if a large enough sample is used

• Systematic - consistently affect data in one or more survey phases (data capturing, data entry, data coding, data processing…) and tend to accumulate over the entire sample

  • If not treated properly, they can cause serious bias in survey estimates
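The contrast between the two error types can be illustrated with a small simulation. This is only a sketch on made-up data: the value range, the noise level and the constant over-reporting of 5 units are assumptions chosen for illustration, not figures from any survey.

```python
import random

random.seed(2017)  # fixed seed so the sketch is reproducible
n = 100_000

# Hypothetical "true" values for n units.
true_values = [random.uniform(50, 150) for _ in range(n)]

# Stochastic error: zero-mean noise added independently to each unit.
with_stochastic = [y + random.gauss(0, 10) for y in true_values]

# Systematic error: every unit is shifted in the same direction.
with_systematic = [y + 5 for y in true_values]

def mean(xs):
    return sum(xs) / len(xs)

true_mean = mean(true_values)
stochastic_bias = mean(with_stochastic) - true_mean
systematic_bias = mean(with_systematic) - true_mean

print("bias from stochastic errors:", round(stochastic_bias, 3))  # close to 0
print("bias from systematic errors:", round(systematic_bias, 3))  # close to 5
```

With a large sample the random errors largely cancel in the mean, while the systematic shift stays in the estimate regardless of sample size.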

Non-sampling errors (with respect to impact or influence)

• Impact or influence depends on the definition of the target parameter, the applied estimator and the domain of interest

• Influential values are those that significantly affect estimates of the target parameter if not treated properly (or if included in or excluded from estimation)

• An influential value is not necessarily an extreme value: a value tied to a large sampling weight can also be influential

• E.g. an extreme value tied to a large sampling weight: a classification problem?
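A minimal sketch of why weight matters as much as size. The values and weights are invented, and influence is measured here simply as each unit's share of the weighted total, which is one possible definition among several and an assumption of this example.

```python
# Hypothetical sample: reported values and sampling weights (illustrative only).
values = [100.0, 110.0, 95.0, 900.0, 120.0]
weights = [5.0, 5.0, 5.0, 1.0, 40.0]

# Contribution of each unit to the weighted total estimate.
contributions = [w * y for w, y in zip(weights, values)]
total = sum(contributions)
shares = [c / total for c in contributions]

# Index of the largest reported value versus the largest share of the estimate.
most_extreme = max(range(len(values)), key=lambda i: values[i])
most_influential = max(range(len(values)), key=lambda i: shares[i])

print("most extreme value at index", most_extreme)
print("most influential unit at index", most_influential)
for i, share in enumerate(shares):
    print(i, values[i], weights[i], round(share, 3))
```

In this made-up sample the largest reported value carries a weight of 1, so it contributes less to the estimate than an unremarkable value that carries a weight of 40.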

Non-sampling errors (with respect to appearance)

• If we can detect them, they appear as:

• Missing values (data not collected for some reason)

• Outliers (values not coherent with expected model)

• Data inconsistencies (data not coherent with prespecified set of mathematical and/or logical rules)

• Can we detect all non-sampling errors? (inliers)

Outliers

• Fall outside of an overall trend or do not follow a model that is assumed for a specific phenomenon that is of interest

• Can be defined with respect to one variable (univariate) or more (bivariate or multivariate)

• Can be defined with respect to the whole population or specific domains of interest

• An outlier does not have to be an erroneous observation; it can reflect a true change in the phenomenon of interest. The distinction is of crucial importance

Outliers

• From a sampling perspective:

  • Representative outliers: observations identified as outliers that are not erroneous. They potentially represent other elements in the population and contain information about the (higher) variability of the investigated phenomenon

  • Non-representative outliers:

    • observations identified as outliers that are unique correct cases and do not represent any other element in the population (they should not be extrapolated to the population)

    • observations identified as outliers that are erroneous, and are unique cases whose unknown true values are not outliers (they must be treated)

Detection and treatment of outliers

• How different does the value have to be from the rest of the data to be an outlier?

• There is no simple answer: it is a balance between the objectives of the survey (estimates, domains of interest, precision) on the one hand and the available resources on the other

• Representative outliers are treated at the estimation phase

• Non-representative outliers are treated at the editing and imputation phase

• Problem of overediting!!!

Bivariate method of Hidiroglou and Berthelot for outlier detection

• Originally designed for detection of outliers in periodic surveys (M.A. Hidiroglou and J.M. Berthelot (1986))

• If a phenomenon of interest is measured at two (or more) different time points, then it is possible to identify observations with “biggest change”

• The method is applied in the context of comparing two related variables (Y1 and Y2) in the same survey iteration

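Bivariate method of Hidiroglou and Berthelot for outlier detection

• Business population variables usually have skewed distributions, so the ratios R12 should be transformed:

$$
s_i =
\begin{cases}
1 - \dfrac{\operatorname{median}(R_{12})}{R_{12,i}}, & \text{if } 0 < R_{12,i} < \operatorname{median}(R_{12}) \\[2ex]
\dfrac{R_{12,i}}{\operatorname{median}(R_{12})} - 1, & \text{if } R_{12,i} \ge \operatorname{median}(R_{12})
\end{cases}
$$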
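The transformation of the ratios can be sketched in a few lines. This is an illustrative Python version, not the R code used at the institute. The function name `hb_transform` is hypothetical, and the ratios are assumed to be strictly positive.

```python
from statistics import median

def hb_transform(ratios):
    """Centre strictly positive ratios around their median (the s_i values)."""
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be strictly positive")
    med = median(ratios)
    # Below the median: 1 - median/r (negative); at or above it: r/median - 1.
    return [1 - med / r if r < med else r / med - 1 for r in ratios]

# A ratio of half the median and a ratio of twice the median end up
# at the same distance from zero, on opposite sides.
ratios = [0.5, 1.0, 2.0]
print(hb_transform(ratios))  # [-1.0, 0.0, 1.0]
```

The point of the transformation is this symmetry: a halving and a doubling relative to the median are treated as equally large deviations, which the raw ratios of a skewed distribution would not do.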

Bivariate method of Hidiroglou and Berthelot for outlier detection

• To take into account the magnitude of the difference between the values of the two variables being analyzed, a further transformation of the s_i values is performed:

$$
E_i = s_i \cdot \left[ \max\left( y_{1,i},\, y_{2,i} \right) \right]^{U}
$$

• The parameter 0 ≤ U ≤ 1 controls the importance attached to the magnitude of the difference between the two variables
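A small sketch of the role of U, using made-up s_i values and magnitudes. The function name `hb_effect` is hypothetical.

```python
def hb_effect(s, y1, y2, U):
    """Scale each s_i by the unit's magnitude: E_i = s_i * max(y1_i, y2_i) ** U."""
    if not 0 <= U <= 1:
        raise ValueError("U must lie between 0 and 1")
    return [si * max(a, b) ** U for si, a, b in zip(s, y1, y2)]

# Two hypothetical units with the same relative deviation but different size.
s = [1.0, 1.0]
y1 = [10.0, 1000.0]
y2 = [20.0, 2000.0]

print(hb_effect(s, y1, y2, U=0))    # [1.0, 1.0]: size is ignored
print(hb_effect(s, y1, y2, U=1))    # [20.0, 2000.0]: size fully counts
print(hb_effect(s, y1, y2, U=0.5))  # in between
```

With U = 0 only the relative deviation matters; as U grows towards 1, the same relative deviation counts for more in a large unit than in a small one.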

Bivariate method of Hidiroglou and Berthelot for outlier detection

• The values that fall outside the following interval are classified as outliers:

$$
\left( E_{median} - C \cdot d_{Q1},\;\; E_{median} + C \cdot d_{Q3} \right)
$$

• Where:

$$
d_{Q1} = E_{median} - E_{Q25}, \qquad d_{Q3} = E_{Q75} - E_{median}
$$
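The acceptance interval can be sketched as follows. The function names are hypothetical, and the quartile definition (Python's "inclusive" method) is an assumption, since the slides do not say how E_Q25 and E_Q75 are computed. A different quantile definition would shift the bounds slightly.

```python
from statistics import median, quantiles

def hb_bounds(E, C):
    """Return (lower, upper) = (E_med - C * d_Q1, E_med + C * d_Q3)."""
    q25, _, q75 = quantiles(E, n=4, method="inclusive")  # assumed quartile definition
    e_med = median(E)
    d_q1 = e_med - q25
    d_q3 = q75 - e_med
    return e_med - C * d_q1, e_med + C * d_q3

def hb_outliers(E, C):
    """Indices of the E values that fall outside the interval."""
    lower, upper = hb_bounds(E, C)
    return [i for i, e in enumerate(E) if e < lower or e > upper]

# Made-up E values with one clearly deviating unit.
E = [-2.0, -1.0, 0.0, 1.0, 2.0, 30.0]
print(hb_bounds(E, C=4))    # (-4.5, 5.5)
print(hb_outliers(E, C=4))  # [5]
```

Because the two distances are measured separately on each side of the median, the interval does not have to be symmetric, which suits skewed data.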

Bivariate method of Hidiroglou and Berthelot for outlier detection

• There is no general recommendation for the choice of the parameters U and C

• The choice is guided by examination of scatter plots, the influence on the target estimate and the number of identified outliers

• Available resources and purpose of the survey (quality requirements) play significant role

• Subject matter knowledge crucial in decision making

• Once determined, the values of the parameters U and C can be reused (or minimally changed) in the next survey iteration, since the variables forming each ratio behave very similarly across survey iterations
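One of the diagnostics mentioned above, the number of identified outliers, can be tabulated for a grid of C values. The sketch below is a hedged illustration: the data are invented, the ratio is assumed to be y1 / y2, the helper `hb_flags` is hypothetical, and the quartiles use Python's "inclusive" method as an assumption.

```python
from statistics import median, quantiles

def hb_flags(y1, y2, U, C):
    """Return one True/False flag per unit for the ratio y1 / y2 (assumed direction)."""
    r = [a / b for a, b in zip(y1, y2)]
    med = median(r)
    s = [1 - med / x if x < med else x / med - 1 for x in r]
    E = [si * max(a, b) ** U for si, a, b in zip(s, y1, y2)]
    q25, _, q75 = quantiles(E, n=4, method="inclusive")  # assumed quartile definition
    e_med = median(E)
    lower = e_med - C * (e_med - q25)
    upper = e_med + C * (q75 - e_med)
    return [e < lower or e > upper for e in E]

# Made-up data: gross salary (y1) and number of employees (y2) for ten units.
salary = [100.0, 125.0, 78.0, 160.0, 190.0, 95.0, 400.0, 150.0, 310.0, 20.0]
employees = [10.0, 12.0, 8.0, 15.0, 20.0, 9.0, 11.0, 14.0, 30.0, 7.0]

c_grid = [1.0, 2.0, 4.0, 8.0, 16.0]
counts = [sum(hb_flags(salary, employees, U=0.3, C=c)) for c in c_grid]
for c, k in zip(c_grid, counts):
    print(f"C = {c:5.1f} -> {k} flagged units")
```

A larger C widens the interval, so the count can only stay the same or fall. Such a table shows how many units would go to manual review for each candidate value, to be weighed against available resources.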

Outliers in Labour Cost Survey

LCS 2014

| Ratio | Parameter U | Parameter C | Number of identified outliers |
|---|---|---|---|
| Total gross salary by total paid hours | 0.25 | 3.7 | 57 |
| Total gross salary by total hours actually worked | 0.29 | 4.0 | 52 |
| Total gross salary by total number of employees | 0.29 | 4.5 | 51 |
| Total hours actually worked by total number of employees | 0.38 | 6.5 | 39 |
| Total hours paid but not worked by total number of employees | 0.30 | 6.0 | 108 |
| Total paid hours by total number of employees | 0.17 | 11.5 | 88 |
| Total labour cost by total paid hours | 0.55 | 4.0 | 97 |
| Total labour cost by total hours actually worked | 0.55 | 4.0 | 96 |
| Total labour cost by total number of employees | 0.50 | 3.4 | 142 |

Treatment of outliers

• Outliers should always be analyzed interactively (manual review of the printed or electronic questionnaires, contacts with the reporting unit, use of auxiliary information)

• The responsible subject-matter methodologists decide how to treat the outliers

• If an outlier is confirmed to be an error, then it should be corrected or eliminated (either by recontacting the enterprise by telephone, or by imputation)

Implementation of the method

• The method is implemented in the R software environment for statistical computing (R Core Team (2016))

• Since the method is mathematically simple, it is implemented using only the built-in facilities of the standard libraries

• The inputs to the function are the variables of the survey dataset that form the ratio and the parameters U and C

• The output is a dataset containing the units with observations identified as outlying, together with scatter plots in which the outlying values are marked (possibly along with values that are influential because of a large sampling weight)

Implementation of the method

• The core function can be repeated automatically for each desired pair of variables forming a ratio, and also by domain of interest for the survey

• The output dataset is organized as a table whose rows are the unit records that have at least one identified outlier among the ratios under analysis

• This means that all outliers for one specific unit are treated at once
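The workflow described on these slides can be sketched as below. This is a Python illustration under stated assumptions, not the institute's R code: the function names, field names, data and parameter values are all hypothetical, the quartiles use Python's "inclusive" method, and scatter plots are left out.

```python
from statistics import median, quantiles

def hb_outlier_ids(units, num, den, U, C):
    """Ids of units whose ratio num/den falls outside the acceptance interval."""
    rows = [u for u in units if u[num] > 0 and u[den] > 0]
    if len(rows) < 2:  # too few units to compute quartiles
        return set()
    r = [u[num] / u[den] for u in rows]
    med = median(r)
    s = [1 - med / x if x < med else x / med - 1 for x in r]
    E = [si * max(u[num], u[den]) ** U for si, u in zip(s, rows)]
    q25, _, q75 = quantiles(E, n=4, method="inclusive")  # assumed definition
    e_med = median(E)
    lower = e_med - C * (e_med - q25)
    upper = e_med + C * (q75 - e_med)
    return {u["id"] for u, e in zip(rows, E) if e < lower or e > upper}

def outlier_table(units, ratio_specs, domain_key):
    """One row per unit with at least one flagged ratio; one column per ratio."""
    flagged = {name: set() for name in ratio_specs}
    for d in sorted({u[domain_key] for u in units}):
        subset = [u for u in units if u[domain_key] == d]
        for name, (num, den, U, C) in ratio_specs.items():
            flagged[name] |= hb_outlier_ids(subset, num, den, U, C)
    all_ids = sorted(set().union(*flagged.values()))
    return {uid: {name: uid in ids for name, ids in flagged.items()}
            for uid in all_ids}

def unit(uid, domain, employees, salary, hours):
    return {"id": uid, "domain": domain, "employees": employees,
            "salary": salary, "hours": hours}

# Made-up units in two domains; unit 6 reports an implausibly high salary.
units = [
    unit(1, "C", 10, 10000, 1600), unit(2, "C", 12, 12600, 1900),
    unit(3, "C", 8, 7600, 1300), unit(4, "C", 15, 15300, 2400),
    unit(5, "C", 20, 19000, 3250), unit(6, "C", 9, 90000, 1450),
    unit(7, "G", 5, 4000, 800), unit(8, "G", 6, 4920, 950),
    unit(9, "G", 7, 5530, 1130), unit(10, "G", 4, 3220, 644),
    unit(11, "G", 10, 8150, 1590),
]

# ratio name -> (numerator, denominator, U, C); parameter values are illustrative.
ratio_specs = {
    "salary_per_employee": ("salary", "employees", 0.3, 5.0),
    "hours_per_employee": ("hours", "employees", 0.3, 6.0),
}

table = outlier_table(units, ratio_specs, "domain")
print(table)
```

Keeping one row per unit, with a flag for every ratio, lets a reviewer see all suspicious ratios of the same unit side by side and settle them in one pass.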

Conclusion and future plans

• Before considering the possible treatment of identified outliers, we should try to understand why they occurred and whether it is likely that similar values will continue to appear

• Outliers are to be detected (together with influential errors) at the first stages of the editing and imputation process, and their treatment has to be particularly accurate

• Outliers should always be analyzed interactively

Conclusion and future plans

• RSIS applies the “Ratio method” of Hidiroglou and Berthelot in the context of long-term surveys

• The method was originally designed for detection of outliers in periodic surveys

• The plan is to extend its application to short-term surveys, for which the original idea is more appropriate

• Another idea is to use external data to form the ratios