
  • Efficiency of Classification Algorithms as an Alternative to Logistic Regression in Propensity Score Adjustment for Survey Weighting

    Ramón Ferri-García (1), María del Mar Rueda (1)

    (1) Department of Statistics and Operational Research, University of Granada

    Barcelona, 27th October 2018

  • Outline

    Introduction

    Methods

    Results
      Artificial data
      Real data

    Discussion

  • Selection bias in online surveys is a major concern:

    • Participants are often self-selected (non-probabilistic sampling)
    • Online sampling frames are available only for narrow populations (Schonlau and Couper, 2017)
    • Some calibration procedures (GREG) are ineffective in removing volunteering bias (Dever et al., 2008)

  • Coverage bias in online surveys is substantial and often associated with demographic variables

    [Figure: percentage of Spanish people who accessed the Internet in the last 3 months (%), 2006-2017; left panel by education level (no education, primary, lower secondary, upper secondary, post-secondary non-tertiary, tertiary), right panel by age group (16-24, 25-34, 35-44, 45-54, 55-64, 65-74)]

    Source: Survey of information technologies in households, National Statistics Institute (INE)

  • Propensity Score Adjustment (PSA) has been proposed as a method for removing selection bias.

    [Diagram: the non-probabilistic and probabilistic samples are pooled; a logistic model on the covariates x1, ..., xk predicts sample membership (Y = 1 for the non-probabilistic sample, Y = 0 for the probabilistic one), yielding propensity estimates π(x)]

    Logistic model:
    π(x) = 1 / (1 + exp{−(β0 + β1 x1 + ... + βk xk)})

    Horvitz-Thompson weights (Lee and Valliant, 2009), after partitioning π(x) into g classes, for unit i in class g:
    w_i = d_i^NP · (Σ_{j∈g} d_j^P / Σ_j d_j^P) / (Σ_{j∈g} d_j^NP / Σ_j d_j^NP)

    Hájek weights (Schonlau and Couper, 2017):
    w_i = (1 − π(x_i)) / π(x_i)
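    As a rough sketch of this workflow (not the authors' implementation; scikit-learn and the data-frame names below are assumptions), the propensities and Hájek weights could be computed as follows:

```python
# Minimal sketch of PSA with a logistic model and Hajek-type weights.
# volunteer_df / reference_df and the covariate list are hypothetical;
# covariates are assumed numeric (or already one-hot encoded).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def psa_hajek_weights(volunteer_df, reference_df, covariates):
    # Pool both samples: Y = 1 for non-probabilistic, Y = 0 for probabilistic
    X = pd.concat([volunteer_df[covariates], reference_df[covariates]])
    y = np.r_[np.ones(len(volunteer_df)), np.zeros(len(reference_df))]

    model = LogisticRegression(max_iter=1000).fit(X, y)
    # Estimated propensities pi(x) for the volunteer units only
    pi = model.predict_proba(volunteer_df[covariates])[:, 1]

    # Hajek-type weights (Schonlau and Couper, 2017): inverse odds
    return (1.0 - pi) / pi
```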

  • Propensity Score Adjustment (PSA) was originally developed for balancing experimental designs (Rosenbaum and Rubin, 1983), but it has been proposed as a method to remove selection bias in online surveys.

    • Its application requires a probabilistic sample on which the covariates have been measured.
    • It has proved effective in online surveys, at the cost of increased variance (Lee, 2006; Lee and Valliant, 2009).
    • Its efficacy depends strongly on the choice of covariates and on the relationship between Internet access and the target variables (Valliant and Dever, 2011).

  • Logistic regression is the standard model in the literature for propensity estimation. It provides robust and efficient weights, but it has several drawbacks:

    • It requires linearity assumptions on the risk
    • It is strongly dependent on the functional form and shape of the covariates
    • It does not capture interactions in its usual configuration for PSA (D'Agostino, 1998; Westreich et al., 2010)

    In addition, alternatives to logistic regression as a classifier that are competitive in terms of accuracy have arisen over time in the experimental design context.

  • Proposal

    [Diagram: the same workflow, with classification algorithms in place of the logistic model; the pooled samples, the covariates x1, ..., xk and the membership indicator Y yield propensity estimates π(x), from which the Horvitz-Thompson weights (Lee and Valliant, 2009) or Hájek weights (Schonlau and Couper, 2017) above are computed, partitioning π(x) into g classes where required]
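    Under the proposal, only the estimator changes: any classifier that outputs class probabilities can supply π(x). A minimal sketch, reusing the hypothetical layout above and assuming scikit-learn's RandomForestClassifier as one of the candidate algorithms:

```python
# Same workflow with a classification algorithm in place of the logistic
# model; RandomForestClassifier is one illustrative choice among several.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def psa_hajek_weights_ml(volunteer_df, reference_df, covariates, clf=None):
    clf = clf if clf is not None else RandomForestClassifier(n_estimators=100)
    X = pd.concat([volunteer_df[covariates], reference_df[covariates]])
    y = np.r_[np.ones(len(volunteer_df)), np.zeros(len(reference_df))]
    clf.fit(X, y)
    pi = clf.predict_proba(volunteer_df[covariates])[:, 1]
    return (1.0 - pi) / pi  # Hajek-type inverse-odds weights
```

    Note that tree ensembles can yield propensities of exactly 0 or 1, so in practice the estimates may need truncation before the odds are inverted.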

  • The use of Machine Learning (ML) classification algorithms for propensity score estimation has been studied in experimental design:

    • Decision trees (Setoguchi et al., 2008; Lee et al., 2010; Watkins et al., 2013; Wyss et al., 2014; Linden and Yarnold, 2017)
    • Neural networks (Cavuto et al., 2006; Glynn et al., 2006)
    • Boosting (McCaffrey et al., 2013; Pirracchio et al., 2014; Watkins et al., 2013; Zhao et al., 2016)

    However, research on PSA with ML in survey estimation is sparse.

  • Simulation with artificial data

    Simulation structure from Ferri-García and Rueda (2018), which is similar to the Bethlehem (2010) simulation with several modifications:
    • Population size of N = 50000, from which, in each simulation:
      • A probabilistic (reference) sample of n_rs = 500 is extracted
      • An Internet-population (volunteer) sample of n_vs = 500, 750, 1000, 2000, 5000, 7500, 10000 is extracted
    • 4 covariates (age, gender, nationality and education) and a target variable (voting intention) with 3 options:
      • Party 1 (dependent on gender)
      • Party 2 (dependent on age)
      • Party 3 (dependent on Internet access)
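    A minimal sketch of a population and sampling scheme of this shape follows; the actual dependence structures are those of Ferri-García and Rueda (2018) and are not given on the slide, so every distributional form below is an assumption:

```python
# Illustrative population and sampling scheme (all distributional forms
# here are assumptions; the real design follows Ferri-Garcia and Rueda
# (2018), building on Bethlehem (2010)).
import numpy as np

rng = np.random.default_rng(2018)
N = 50_000
age = rng.integers(18, 91, size=N)     # other covariates defined analogously
gender = rng.integers(0, 2, size=N)

# Assumed: Internet access (the volunteering mechanism) declines with age
p_internet = 1.0 / (1.0 + np.exp((age - 50.0) / 10.0))
has_internet = rng.random(N) < p_internet

# Probabilistic (reference) sample: SRS of n_rs = 500 from the population
ref_idx = rng.choice(N, size=500, replace=False)
# Volunteer sample: SRS from the Internet subpopulation (one n_vs value)
vol_idx = rng.choice(np.flatnonzero(has_internet), size=2000, replace=False)
```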

  • Simulation with artificial data

    Relationships between the variables can be observed in the following diagram:

    [Diagram: Gender, Age, Nationality and Education are the variables in the PSA model; Internet is the volunteering variable; Vote is the target variable]

  • Simulation with artificial data

    Reweighting with PSA (Hájek-type weights) was performed over 500 simulations with several algorithms and parameters (a sketch of the GBM part of the grid follows this list):
    • Decision trees: J48, C5.0 and CART
      • Min. number of observations per node: 0.5%, 1%, 5% of the dataset
      • Confidence for pruning: 0.1, 0.25, 0.5
    • k-nearest neighbours with k = 3, 5, 7, 9, 11, 13
    • Naïve Bayes with Laplace smoothing: k = 1, 2, 5, 10
    • Ensembles:
      • Random Forest with number of trees = 100, 200, 500
      • GBM with depth = 4, 6, 8 and learning rate = 0.1, 0.01, 0.001
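    As a sketch of how the GBM cells of this grid could be enumerated, assuming scikit-learn's GradientBoostingClassifier as a stand-in for the implementation actually used:

```python
# Sweeping the GBM cells of the parameter grid (depth x learning rate);
# GradientBoostingClassifier stands in for the authors' implementation.
from itertools import product
from sklearn.ensemble import GradientBoostingClassifier

gbm_grid = {
    (depth, lr): GradientBoostingClassifier(max_depth=depth, learning_rate=lr)
    for depth, lr in product([4, 6, 8], [0.1, 0.01, 0.001])
}
# In each of the 500 simulation runs, every estimator would be passed to
# psa_hajek_weights_ml(...) and the resulting weighted estimates stored.
```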

  • Simulation with real data

    Spanish Life Conditions Survey (2012 edition) dataset as pseudopopulation:
    • N = 28210 and 61 potential covariates after assessing for missing data
    • Owning a computer at home as the volunteering variable
    • Two target variables:
      • Self-reported health: bad / not bad (MAR)
      • More than 2 members in the household (NMAR)
    • Four groups of covariates:
      • G1: demographic variables
      • G2: G1 + health-related variables
      • G3: G1 + poverty-related variables
      • G4: all potential covariates

  • Simulation with real data

    Reweighting with PSA (Hájek-type weights) was performed over 500 simulations, with n_vs = 500, 750, 1000, 2000, 5000 and the following algorithms and parameters (the metric computation is sketched after this list):
    • Decision trees: J48, C5.0 and CART with default parameters
    • k-nearest neighbours with k = 3
    • Naïve Bayes without Laplace smoothing
    • Ensembles:
      • Random Forest with 100 trees
      • GBM with interaction depth (ID) = 8 and learning rate (LR) = 0.001
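    The two metrics on the results slides can be summarised from the replicate estimates roughly as below; "% of bias" is read here as the mean error in percentage points, which is an assumption about the axis label rather than a stated definition:

```python
# Summarising replicate estimates into the two reported metrics.
# `estimates` holds one weighted estimate of the target proportion per
# simulation run; `true_value` is the pseudopopulation proportion.
import numpy as np

def summarise(estimates, true_value):
    estimates = np.asarray(estimates, dtype=float)
    # % of bias (assumed: mean error in percentage points)
    pct_bias = 100.0 * (estimates.mean() - true_value)
    # std. dev. of estimates across the simulation runs
    sd = estimates.std(ddof=1)
    return pct_bias, sd
```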

  • PSA with J48 in artificial data

    [Figure: % of bias vs. n_vs for Party 1 (MCAR), Party 2 (MAR) and Party 3 (NMAR); one line per parameter setting (M = 0.5%, 1%, 5% of sample; CP = 0.1, 0.25, 0.5), plus logistic regression and no adjustment]

  • PSA with J48 in artificial data

    [Figure: std. dev. of estimates vs. n_vs for Party 1 (MCAR), Party 2 (MAR) and Party 3 (NMAR); one line per parameter setting (M = 0.5%, 1%, 5% of sample; CP = 0.1, 0.25, 0.5), plus logistic regression and no adjustment]

  • PSA with C5.0 in artificial data

    [Figure: % of bias vs. n_vs for Party 1 (MCAR), Party 2 (MAR) and Party 3 (NMAR); one line per parameter setting (M = 0.5%, 1%, 5% of sample; CP = 0.1, 0.25, 0.5), plus logistic regression and no adjustment]

  • PSA with C5.0 in artificial data

    [Figure: std. dev. of estimates vs. n_vs for Party 1 (MCAR), Party 2 (MAR) and Party 3 (NMAR); one line per parameter setting (M = 0.5%, 1%, 5% of sample; CP = 0.1, 0.25, 0.5), plus logistic regression and no adjustment]

  • PSA with CART in artificial data

    [Figure: % of bias vs. n_vs for Party 1 (MCAR), Party 2 (MAR) and Party 3 (NMAR); one line per parameter setting (M = 0.5%, 1%, 5% of sample; CP = 0.1, 0.25, 0.5), plus logistic regression and no adjustment]

  • PSA with CART in artificial data

    [Figure: std. dev. of estimates vs. n_vs for Party 1 (MCAR), Party 2 (MAR) and Party 3 (NMAR); one line per parameter setting (M = 0.5%, 1%, 5% of sample; CP = 0.1, 0.25, 0.5), plus logistic regression and no adjustment]

  • PSA with kNN in artificial data

    [Figure: % of bias vs. n_vs for Party 1 (MCAR), Party 2 (MAR) and Party 3 (NMAR); one line per k = 3, 5, 7, 9, 11, 13, plus logistic regression and no adjustment]

  • PSA with kNN in artificial data

    [Figure: std. dev. of estimates vs. n_vs for Party 1 (MCAR), Party 2 (MAR) and Party 3 (NMAR); one line per k = 3, 5, 7, 9, 11, 13, plus logistic regression and no adjustment]

  • PSA with Naïve Bayes in artificial data

    [Figure: % of bias vs. n_vs for Party 1 (MCAR), Party 2 (MAR) and Party 3 (NMAR); one line per Laplace smoothing value (laplace = 1, 2, 5, 10), plus logistic regression and no adjustment]

  • PSA with Naïve Bayes in artificial data

    [Figure: std. dev. of estimates vs. n_vs for Party 1 (MCAR), Party 2 (MAR) and Party 3 (NMAR); one line per Laplace smoothing value (laplace = 1, 2, 5, 10), plus logistic regression and no adjustment]

  • PSA with Random Forest in artificial data

    [Figure: % of bias vs. n_vs for Party 1 (MCAR), Party 2 (MAR) and Party 3 (NMAR); one line per number of trees (50, 100, 200, 500), plus logistic regression and no adjustment]

  • PSA with Random Forest in artificial data

    [Figure: std. dev. of estimates vs. n_vs for Party 1 (MCAR), Party 2 (MAR) and Party 3 (NMAR); one line per number of trees (50, 100, 200, 500), plus logistic regression and no adjustment]

  • PSA with GBM in artificial data

    [Figure: % of bias vs. n_vs for Party 1 (MCAR), Party 2 (MAR) and Party 3 (NMAR); one line per combination of depth (4, 6, 8) and learning rate (0.001, 0.01, 0.1), plus logistic regression and no adjustment]

  • PSA with GBM in artificial data

    [Figure: std. dev. of estimates vs. n_vs for Party 1 (MCAR), Party 2 (MAR) and Party 3 (NMAR); one line per combination of depth (4, 6, 8) and learning rate (0.001, 0.01, 0.1), plus logistic regression and no adjustment]

  • PSA in self-reported health with real data

    [Figure: % of bias vs. n_vs for covariate groups G1 (demographic) and G2 (G1 + health-related); one line per algorithm (3-NN, C5.0, CART, GBM (ID 8, LR 0.1), J48, logistic regression, Naïve Bayes, Random Forest)]

  • PSA in self-reported health with real data

    [Figure: std. dev. of estimates vs. n_vs for covariate groups G1 (demographic) and G2 (G1 + health-related); one line per algorithm (3-NN, C5.0, CART, GBM (ID 8, LR 0.1), J48, logistic regression, Naïve Bayes, Random Forest)]

  • PSA in self-reported health with real data

    [Figure: % of bias vs. n_vs for covariate groups G3 (G1 + poverty-related) and G4 (all eligible); one line per algorithm (3-NN, C5.0, CART, GBM (ID 8, LR 0.1), J48, logistic regression, Naïve Bayes, Random Forest)]

  • PSA in self-reported health with real data

    [Figure: std. dev. of estimates vs. n_vs for covariate groups G3 (G1 + poverty-related) and G4 (all eligible); one line per algorithm (3-NN, C5.0, CART, GBM (ID 8, LR 0.1), J48, logistic regression, Naïve Bayes, Random Forest)]

  • PSA in household members with real data

    [Figure: % of bias vs. n_vs for covariate groups G1 (demographic) and G2 (G1 + health-related); one line per algorithm (3-NN, C5.0, CART, GBM (ID 8, LR 0.1), J48, logistic regression, Naïve Bayes, Random Forest)]

  • PSA in household members with real data

    [Figure: std. dev. of estimates vs. n_vs for covariate groups G1 (demographic) and G2 (G1 + health-related); one line per algorithm (3-NN, C5.0, CART, GBM (ID 8, LR 0.1), J48, logistic regression, Naïve Bayes, Random Forest)]

  • PSA in household members with real data

    [Figure: % of bias vs. n_vs for covariate groups G3 (G1 + poverty-related) and G4 (all eligible); one line per algorithm (3-NN, C5.0, CART, GBM (ID 8, LR 0.1), J48, logistic regression, Naïve Bayes, Random Forest)]

  • PSA in household members with real data

    [Figure: std. dev. of estimates vs. n_vs for covariate groups G3 (G1 + poverty-related) and G4 (all eligible); one line per algorithm (3-NN, C5.0, CART, GBM (ID 8, LR 0.1), J48, logistic regression, Naïve Bayes, Random Forest)]

  • Decision trees can be an alternative if both samples are balanced. The increase in estimators' variance caused by this approach has been observed in the literature (Wyss et al., 2014).

  • Nearest neighbours provide better results in low-dimensional contexts. This is a well-known attribute of k-NN classifiers (see Beyer et al., 1999).

  • PSA with Naïve Bayes is unstable if covariates present rare classes (e.g. numeric variables) or if the dimensionality of the dataset increases. Naïve Bayes is generally unable to deal with numeric variables and relies on independence assumptions that are hardly met as the number of variables increases (Barber, 2012; García et al., 2015).

  • If overfitting is controlled, Random Forest is a solid alternative, especially if the target variable is influenced by Internet access. However, as in Pirracchio et al. (2014), its use may come with an increase in variance.

  • PSA with GBM presents very good properties, especially in terms of estimators' variance. These findings are in line with McCaffrey et al. (2013).

  • However, the choice of algorithm ultimately depends on the dataset and its attributes.

  • If non-volunteering is MCAR, CART and k-NN provide good results under any conditions, although most of the algorithms (except for Random Forests) perform well as n_vs increases.

  • If non-volunteering is MAR or NMAR, the best choices are k-NN and GBM when dimensionality is low. If the right covariates are selected, J48 and C5.0 are good alternatives when n_vs is small. Random Forests can achieve large bias reductions, but they tend to overfit.

  • Data preprocessing and sample balancing are key steps when estimating efficient propensity scores with ML algorithms. This point has been discussed in the literature (Linden and Yarnold, 2017).

  • Parameter tuning, except for GBM, has not proved to be a decisive factor in the success of PSA at removing bias. To the authors' knowledge, only Setoguchi et al. (2008) and Lee et al. (2010) addressed parameter tuning for PSA with ML.

  • The limitations of the experiment in which parameter tuning was considered (artificial data, low dimensionality, etc.) must be taken into account in further research.

  • References

    • Barber, D. (2012). Bayesian Reasoning and Machine Learning. Cambridge University Press, New York.
    • Bethlehem, J. (2010). Selection Bias in Web Surveys. Int. Stat. Rev., 78(2), 161-188.
    • Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U. (1999). When is "nearest neighbor" meaningful? In International Conference on Database Theory (pp. 217-235). Springer, Berlin, Heidelberg.
    • Cavuto, S., Bravi, F., Grassi, M. C., Apolone, G. (2006). Propensity score for the analysis of observational data: an introduction and an illustrative example. Drug Dev. Res., 67(3), 208-216.
    • D'Agostino, R. B. Jr. (1998). Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Stat. Med., 17, 2265-2281.
    • Dever, J. A., Rafferty, A., Valliant, R. (2008). Internet Surveys: Can Statistical Adjustments Eliminate Coverage Bias? Surv. Res. Methods, 2(2), 47-62.
    • Ferri-García, R., Rueda, M. M. (2018). Efficiency of Propensity Score Adjustment and calibration on the estimation from non-probabilistic online surveys. SORT-Stat. Oper. Res. T. (accepted).
    • García, S., Luengo, J., Herrera, F. (2015). Data Preprocessing in Data Mining. Springer International Publishing, Switzerland.
    • Glynn, R. J., Schneeweiss, S., Stürmer, T. (2006). Indications for propensity scores and review of their use in pharmacoepidemiology. Basic Clin. Pharmacol. Toxicol., 98(3), 253-259.
    • Lee, S. (2006). Propensity Score Adjustment as a Weighting Scheme for Volunteer Web Panel Surveys. J. Off. Stat., 22(2), 329-349.
    • Lee, B. K., Lessler, J., Stuart, E. A. (2010). Improving propensity score weighting using machine learning. Stat. Med., 29(3), 337-346.
    • Lee, S., Valliant, R. (2009). Estimation for Volunteer Panel Web Surveys Using Propensity Score Adjustment and Calibration Adjustment. Sociol. Method. Res., 37(3), 319-343.
    • Linden, A., Yarnold, P. R. (2017). Using classification tree analysis to generate propensity score weights. J. Eval. Clin. Pract., 23(4), 703-712.
    • McCaffrey, D. F., Griffin, B. A., Almirall, D., Slaughter, M. E., Ramchand, R., Burgette, L. F. (2013). A tutorial on propensity score estimation for multiple treatments using generalized boosted models. Stat. Med., 32(19), 3388-3414.
    • Pirracchio, R., Petersen, M. L., van der Laan, M. (2014). Improving propensity score estimators' robustness to model misspecification using super learner. Am. J. Epidemiol., 181(2), 108-119.
    • Rosenbaum, P. R., Rubin, D. B. (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Biometrika, 70(1), 41-55.
    • Schonlau, M., Couper, M. (2017). Options for Conducting Web Surveys. Stat. Sci., 32(2), 279-292.
    • Setoguchi, S., Schneeweiss, S., Brookhart, M. A., Glynn, R. J., Cook, E. F. (2008). Evaluating uses of data mining techniques in propensity score estimation: a simulation study. Pharmacoepidemiol. Drug Saf., 17(6), 546-555.
    • Valliant, R., Dever, J. A. (2011). Estimating Propensity Adjustments for Volunteer Web Surveys. Sociol. Method. Res., 40(1), 105-137.
    • Watkins, S., Jonsson-Funk, M., Brookhart, M. A., Rosenberg, S. A., O'Shea, T. M., Daniels, J. (2013). An Empirical Comparison of Tree-Based Methods for Propensity Score Estimation. Health Serv. Res., 48(5), 1798-1817.
    • Westreich, D., Lessler, J., Funk, M. J. (2010). Propensity score estimation: neural networks, support vector machines, decision trees (CART), and meta-classifiers as alternatives to logistic regression. J. Clin. Epidemiol., 63, 826-833.
    • Wyss, R., Ellis, A. R., Brookhart, M. A., Girman, C. J., Jonsson Funk, M., LoCasale, R., Stürmer, T. (2014). The role of prediction modeling in propensity score estimation: an evaluation of logistic regression, bCART, and the covariate-balancing propensity score. Am. J. Epidemiol., 180(6), 645-655.
    • Zhao, P., Su, X., Ge, T., Fan, J. (2016). Propensity score and proximity matching using random forest. Contemp. Clin. Trials Commun., 47, 85-92.
