Logistic Regression in Factor Identification of Covid-19 Vaccine Clinical Trials
Jorge Luis Romeu, Ph.D.
https://www.researchgate.net/profile/Jorge_Romeu
http://web.cortland.edu/romeu/; Email: [email protected]
Copyright. December 11, 2020
1.0 Introduction
We compare the Logistic Regression and Discriminant Analysis methods, using clinical trial data, to
identify key Covid-19 factors that affect vaccine use. This work is part of our struggle against Covid-19:
https://www.researchgate.net/publication/341282217_A_Proposal_for_Fighting_Covid-19_and_its_Economic_Fallout
Previous work includes:
ICU and hospital staffing using the Negative Binomial distribution:
https://www.researchgate.net/publication/345914205_Covid-19_ICU_Staff_and_Equipment_Requirements_using_the_Negative_Binomial
Screening DOEs:
https://www.researchgate.net/publication/344924536_Design_of_Experiments_DOE_in_Covid-19_Factor_Screening_and_Assessment
Statistical methods to establish a new Vaccine Life:
https://www.researchgate.net/publication/344495955_Survival_Analysis_Methods_Applied_to_Establishing_Covid-19_Vaccine_Life
Statistical methods to help accelerate vaccine testing:
https://www.researchgate.net/publication/344193195_Some_Statistical_Methods_to_Accelerate_Covid-19_Vaccine_Testing
A Markov model to study the problems of reopening colleges:
https://www.researchgate.net/publication/343825461_A_Markov_Model_to_Study_College_Re-opening_Under_Covid-19
A Markov model to study the effects of Herd Immunization:
https://www.researchgate.net/publication/343345908_A_Markov_Model_to_Study_Covid-19_Herd_Immunization?channel=doi&linkId=5f244905458515b729f78487&showFulltext=true
A Markov Chain Model for Covid-19 Survival Analysis (general survival):
https://www.researchgate.net/publication/343021113_A_Markov_Chain_Model_for_Covid-19_Survival_Analysis
Socio-economic and racial issues affected by Covid-19:
https://www.researchgate.net/publication/343700072_A_Digression_About_Race_Ethnicity_Class_and_Covid-19
An Example of Survival Analysis Applied to Covid-19 Data:
https://www.researchgate.net/publication/342583500_An_Example_of_Survival_Analysis_Data_Applied_to_Covid-19
Multivariate Statistics in the Analysis of Covid-19 Data, and More on Applying Multivariate Statistics to Covid-19 Data, both of which can also be found in:
https://www.researchgate.net/publication/341385856_Multivariate_Stats_PC_Discrimination_in_the_Analysis_of_Covid-19
The implementation of multivariate analysis methods:
https://www.researchgate.net/publication/342154667_More_on_Applying_Principal_Components_Discrimination_Analysis_to_Covid-19
Design of Experiments applied to the assessment of Covid-19:
https://www.researchgate.net/publication/341532612_Example_of_a_DOE_Application_to_Coronavarius_Data_Analysis
Offshoring:
https://www.researchgate.net/publication/341685776_Off-Shoring_Taxpayers_and_the_Coronavarus_Pandemic
Reliability methods in ICU assessment:
https://www.researchgate.net/publication/342449617_Example_of_the_Design_and_Operation_of_an_ICU_using_Reliability_Principles
Quality Control methods for monitoring Covid-19:
https://web.cortland.edu/matresearch/AplicatSPCtoCovid19MFE2020.pdf
A numerical example:
https://www.researchgate.net/publication/339936386_A_simple_numerical_example_that_illustrates_the_dangesrs_of_the_Coronavarus_epidemic
2.0 Problem Statement and Logistic Regression Analysis
This article starts by answering a question posed by some readers: why didn’t we use Logistic
Regression in our Covid-19 data analyses? The short answer is that the Logistic Regression and
Discriminant Function results are equivalent, as will be shown here. Each analyst has their own
preference. We are more familiar with, and experienced in, the Fisher Discriminant Function.
Below, we redo the Fisher Discriminant analysis of the data in Table #1 (originally in Section 3.0 of
https://www.researchgate.net/publication/344495955_Survival_Analysis_Methods_Applied_to_Establishing_Covid-19_Vaccine_Life?channel=doi&linkId=5f7c9ecba6fdccfd7b4c597d&showFulltext=true)
using Logistic Regression. We compare the two methods and verify that their statistical results are similar.
In the second part of this paper we analyze an illustrative, manufactured Clinical Trials dataset,
using both Logistic Regression and Discriminant Analysis procedures. We verify again how both
procedure results are equivalent. We start by briefly discussing Logistic Regression.
Logistic Regression (https://online.stat.psu.edu/stat504/node/150/) is related to the Odds Ratio
(OR) concept. Assume we have four identical cups, one of which is ours. The OR for randomly
selecting our own cup is one to three (1/3; one chance to win, and three to lose). The probability
of correct selection is: p = OR/(1+OR) = [1/3]/[1+(1/3)] = 1/4 = 0.25. Moving in
the opposite direction: OR = p / (1-p) = 0.25 / (1-0.25) = 0.25/0.75 = 1/3.
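The two conversions in the cup example can be checked numerically. The following is a minimal Python sketch; the function names `odds_to_prob` and `prob_to_odds` are ours, introduced only for illustration, and exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction

def odds_to_prob(odds):
    """Convert odds in favor of an event into its probability: p = OR / (1 + OR)."""
    return odds / (1 + odds)

def prob_to_odds(p):
    """Convert a probability into odds in favor of the event: OR = p / (1 - p)."""
    return p / (1 - p)

# Four cups, one of which is ours: one chance to win, three to lose.
cup_odds = Fraction(1, 3)
cup_prob = odds_to_prob(cup_odds)

print(cup_prob)                # 1/4
print(prob_to_odds(cup_prob))  # 1/3
```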
Let’s define Y, a dichotomous (0, 1) random variable, such that P{Y=1} = p and P{Y=0} = 1 − p.
The Event of Interest is {Y=1}, and we seek a vector of variables x such that p(Y=1|x) can be modeled.
The Logistic Regression Model (Logit) is then obtained by regressing Log (OR) on the data:
Log (OR) = Log [ p(Y|x) / (1 − p(Y|x)) ] = β0 + x·β
The Logit is the logarithm of the Odds Ratio, modeled as a linear function of x. We
obtain the regression coefficients β0 and the vector β, and use them to obtain the classification
probability p(Y|x; b) for each data point x. Solving for p in the above formula, we get:
p(Y|x; b) = e^(β0 + x·β) / [1 + e^(β0 + x·β)] = 1 / [1 + e^(−(β0 + x·β))]
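The step of solving the Logit equation for p can be written out explicitly. The shorthand z = β0 + x·β is ours, introduced only to keep the algebra compact:

```latex
\begin{aligned}
\log\frac{p}{1-p} &= z, \qquad z = \beta_0 + x\cdot\beta \\
\frac{p}{1-p} &= e^{z} \\
p &= e^{z}\,(1-p) \;\Longrightarrow\; p\,(1+e^{z}) = e^{z} \\
p &= \frac{e^{z}}{1+e^{z}} = \frac{1}{1+e^{-z}}
\end{aligned}
```

The last equality follows from dividing numerator and denominator by e^z.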
Applying Logistic Regression to the dataset in Section 3.0 of the above-mentioned paper:
Link Function: Logit
Variable  Value  Count
DscGrps   1         51
          -1        22
          Total     73
Logistic Regression Table
                                            Odds      95% CI
Predictor     Coef   StDev      Z      P   Ratio   Lower    Upper
Constant    -27.18   10.25  -2.65  0.008
SocioEcon
  1          0.484   1.639   0.30  0.768    1.62    0.07    40.30
Age          0.480   0.193   2.49  0.013    1.62    1.11     2.36
Co-Morbid    2.327   1.465   1.59  0.112   10.25    0.58   181.07
Gender
  1          0.085   1.336   0.06  0.949    1.09    0.08    14.93
Log-Likelihood = -8.607
Test that all slopes are zero: G = 72.141, DF = 4, P-Value = 0.000
The table shows the Logistic Regression coefficients, their p-values, the estimated OR of each
predictor, and its 95% CI. If the OR is smaller than one, the association between Y and Xi is negative.
If the OR is greater than one, the association is positive. If the OR equals one, there is no
association (Y and Xi are independent). The p-value of the G test (that all slopes are zero), which is
based on the Log-Likelihood, plays a role analogous to the p-value of the F test in Multiple Regression.
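The columns of the Logistic Regression table are linked by simple formulas. The sketch below assumes the usual Wald construction (Z = Coef / StDev, OR = e^Coef, and CI limits e^(Coef ± 1.96·StDev)); the output does not state how the software computed these columns, so this is our assumption, although it does reproduce the Age row. The function name `wald_summary` is hypothetical.

```python
import math

def wald_summary(coef, se, z_crit=1.96):
    """Return Z, two-sided p-value, odds ratio and its approximate 95% CI."""
    z = coef / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    odds_ratio = math.exp(coef)
    lower = math.exp(coef - z_crit * se)
    upper = math.exp(coef + z_crit * se)
    return z, p_value, odds_ratio, lower, upper

# Age row of the four-variable model: Coef = 0.480, StDev = 0.193.
z, p_value, odds_ratio, lower, upper = wald_summary(0.480, 0.193)
print(round(z, 2), round(p_value, 3))
print(round(odds_ratio, 2), round(lower, 2), round(upper, 2))
```

For the Age row, the printed values should agree with the reported Z = 2.49, P = 0.013, Odds Ratio = 1.62 and CI (1.11, 2.36).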
Our Logistic Regression Equation is:
Log [ p(Y|x)/(1 − p(Y|x)) ] = -27.18 + 0.484X1 + 0.480X2 + 2.327X3 + 0.085X4
Age (P = 0.013) is statistically significant, and Co-Morbid (P = 0.112) is close to significant
(these variables are shown in red in the original). The data are re-analyzed using only these two
mildly significant variables, Age and Co-Morbidities:
Link Function: Logit
Response Information
Variable  Value  Count
DscGrps   1         51
          -1        22
          Total     73
Logistic Regression Table
                                            Odds      95% CI
Predictor     Coef   StDev      Z      P   Ratio   Lower    Upper
Constant   -25.930   8.846  -2.93  0.003
Age          0.456   0.165   2.76  0.006    1.58    1.14     2.18
Co-Morbid    2.452   1.435   1.71  0.088   11.61    0.70   193.45
Log-Likelihood = -8.656
Test that all slopes are zero: G = 72.043, DF = 2, P-Value = 0.000
The p-values of the variables Age and Co-Morbidities are significant at α = 0.1. The test that all
slopes are zero is highly significant; hence at least some of the variables in the equation must be
significant. The Odds Ratio CI for Co-Morbidities is very wide and includes one. We obtain P{Y|xi}
by substituting each vector xi in:
p(Y|x; b) = e^(β0 + x·β) / [1 + e^(β0 + x·β)]
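The G statistic reported with each Logistic Regression output can be recovered from the Log-Likelihood. The sketch below assumes that G is the usual likelihood-ratio statistic, G = 2·(LL_model − LL_null), where LL_null is the log-likelihood of a model with a constant only; the output does not spell this out, so it is our assumption. The function names are hypothetical.

```python
import math

def null_log_likelihood(n_events, n_total):
    """Log-likelihood of the intercept-only model for a 0/1 response."""
    p = n_events / n_total
    return n_events * math.log(p) + (n_total - n_events) * math.log(1 - p)

def g_statistic(model_loglik, n_events, n_total):
    """Likelihood-ratio statistic G = 2 * (LL_model - LL_null)."""
    return 2 * (model_loglik - null_log_likelihood(n_events, n_total))

# Two-variable model: Log-Likelihood = -8.656, with 51 events out of 73.
g = g_statistic(-8.656, 51, 73)
# For DF = 2 the chi-square upper tail has the closed form exp(-G / 2).
p_value = math.exp(-g / 2)
print(round(g, 2), p_value)
```

Under this assumption the computed value comes out very close to the reported G = 72.043, and the p-value is far below 0.001, in line with the reported P-Value = 0.000.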
Fisher Discriminant Function for said data set (displayed at the end)
Below is the data re-analysis (DG v. Age, Co-Morb) using the Fisher Discriminant Function:
The regression equation is: DG = - 2.77 + 0.0459 Age + 0.286 Comorb
Predictor Coef SE Coef T P
Constant -2.7651 0.1984 -13.94 0.000
Age 0.045914 0.003785 12.13 0.000
Co-Morb 0.28564 0.08278 3.45 0.001
S = 0.500681 R-Sq = 75.3% R-Sq(adj) = 74.8% (explains ¾ of the problem)
That these two results are similar can be verified by plotting them:
[Figure: Scatterplot of FisherClass vs Logistic (x-axis: Logistic; y-axis: FisherClass)]
The straight-line pattern in the scatterplot of the patient classification scores from the two
statistical methods shows the equivalence of the two results: both classify patients in a similar way.
In addition, there were exactly four misses for the same patients, using both statistical methods:
Group Age Profile Fisher Logistic
-1 55 1 0.0455 1.59
1 52 0 -0.3924 -2.27
1 45 2 -0.1185 -0.53
1 51 1 -0.1397 -0.25
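The agreement on these four patients can be checked directly from the two fitted equations. The sketch below makes two assumptions of ours: that both scores are cut at zero to assign a group, and that the third column of the listing above holds the number of co-morbidities (it matches the Comorb values of the same patients in Table 1). Because the printed coefficients are rounded, the scores need not match the listed values to the last digit.

```python
def fisher_score(age, comorb):
    """Fisher discriminant score from the fitted regression equation."""
    return -2.7651 + 0.045914 * age + 0.28564 * comorb

def logistic_score(age, comorb):
    """Logit (log-odds) from the two-variable Logistic Regression."""
    return -25.930 + 0.456 * age + 2.452 * comorb

# (group, age, co-morbidities) for the four patients listed above.
patients = [(-1, 55, 1), (1, 52, 0), (1, 45, 2), (1, 51, 1)]

for group, age, comorb in patients:
    f = fisher_score(age, comorb)
    l = logistic_score(age, comorb)
    same_side = (f > 0) == (l > 0)      # do both methods agree?
    missed = (f > 0) != (group > 0)     # does the score contradict the true group?
    print(group, age, comorb, round(f, 3), round(l, 2), same_side, missed)
```

For each of the four patients, both scores fall on the same side of zero, and that side is the wrong one for the patient's actual group.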
This reinforces the evidence that both statistical procedures yield similar results. For
completeness, we reprint the original data, their classification values using both procedures, and
the probability of the Event {Y=1|x} (patient infected during the experiment):
Table 1. Data from Section 3.0 of the above-mentioned article.
No. SocioEcon Age Comorb Gender DG Discrim Logistic ProbEvent
1 1 45 1 0 0 -0.4175 -3.01 0.04974
2 0 50 1 1 0 -0.186 -0.71 0.3387
3 0 40 0 1 0 -0.948 -7.79 0.00046
4 0 43 0 0 0 -0.8091 -6.41 0.00181
5 1 47 1 1 0 -0.3249 -2.09 0.11531
6 0 48 1 0 0 -0.2786 -1.63 0.17059
7 1 49 1 0 0 -0.2323 -1.17 0.24503
8 0 42 0 0 0 -0.8554 -6.87 0.00115
9 0 36 0 1 0 -1.1332 -9.63 0.00007
10 0 39 0 0 0 -0.9943 -8.25 0.00029
11 0 46 1 0 0 -0.3712 -2.55 0.0763
12 1 44 1 1 0 -0.4638 -3.47 0.03211
13 0 42 1 1 0 -0.5564 -4.39 0.01315
14 0 51 1 0 0 -0.1397 -0.25 0.44696
15 0 49 0 0 0 -0.5313 -3.65 0.02719
16 0 55 1 0 0 0.0455 1.59 0.83365
17 0 45 1 0 0 -0.4175 -3.01 *
18 0 47 1 1 0 -0.3249 -2.09 *
19 0 42 0 1 0 -0.8554 -6.87 *
20 1 44 1 0 0 -0.4638 -3.47 *
21 0 47 1 1 0 -0.3249 -2.09 *
22 0 41 0 1 0 -0.9017 -7.33 0.00073
23 0 73 1 1 1 0.8789 9.87 0.99995
24 1 58 2 0 1 0.4834 5.45 0.99565
25 0 60 1 1 1 0.277 3.89 0.98001
26 0 52 0 0 1 -0.3924 -2.27 0.09895
27 0 65 2 0 1 0.8075 8.67 0.99982
28 0 72 1 0 1 0.8326 9.41 0.99991
29 0 66 1 0 1 0.5548 6.65 0.99868
30 0 61 1 1 1 0.3233 4.35 0.98724
31 1 55 2 0 1 0.3445 4.07 0.98311
32 0 63 2 0 1 0.7149 7.75 0.99955
33 0 78 1 1 1 1.1104 12.17 0.99999
34 0 73 2 1 1 1.1779 12.35 1
35 0 77 1 1 1 1.0641 11.71 0.99999
36 0 79 1 0 1 1.1567 12.63 1
37 0 82 1 0 1 1.2956 14.01 1
38 0 73 1 0 1 0.8789 9.87 *
39 0 78 2 0 1 1.4094 14.65 1
40 0 74 1 0 1 0.9252 10.33 0.99997
41 0 68 1 1 1 0.6474 7.57 0.99947
42 0 66 1 1 1 0.5548 6.65 *
43 0 69 2 0 1 0.9927 10.51 0.99997
44 0 77 0 1 1 0.7651 9.23 0.9999
45 0 85 2 0 1 1.7335 17.87 1
46 0 55 1 0 1 0.0455 1.59 *
47 1 45 2 1 1 -0.1185 -0.53 0.37805
48 0 49 2 0 1 0.0667 1.31 0.79032
49 0 57 1 1 1 0.1381 2.51 0.92581
50 0 51 1 0 1 -0.1397 -0.25 *
51 0 66 2 1 1 0.8538 9.13 0.99989
52 0 69 2 1 1 0.9927 10.51 *
53 0 59 1 1 1 0.2307 3.43 0.96882
54 1 55 2 1 1 0.3445 4.07 *
55 0 67 2 0 1 0.9001 9.59 0.99993
56 0 59 1 1 1 0.2307 3.43 *
57 0 68 2 0 1 0.9464 10.05 0.99995
58 0 72 1 1 1 0.8326 9.41 *
59 0 77 1 1 1 1.0641 11.71 *
60 0 73 1 1 1 0.8789 9.87 *
61 0 70 0 1 1 0.441 6.01 0.99753
62 0 79 1 0 1 1.1567 12.63 *
63 0 80 2 0 1 1.502 15.57 1
64 0 82 2 1 1 1.5946 16.49 1
65 0 81 0 0 1 0.9503 11.07 0.99998
66 0 84 1 1 1 1.3882 14.93 1
67 0 85 2 1 1 1.7335 17.87 *
68 0 72 1 1 1 0.8326 9.41 *
69 1 66 2 1 1 0.8538 9.13 *
70 0 69 2 1 1 0.9927 10.51 *
71 0 77 2 1 1 1.3631 14.19 1
72 0 79 0 0 1 0.8577 10.15 0.99996
73 0 84 0 0 1 1.0892 12.45 1
(In the original, the four misclassified rows, Nos. 16, 26, 47 and 50, are highlighted in yellow.)
We provide below histograms representing the corresponding Age and Co-Morbidity patterns:
The distributions of patient ages differ greatly between the two groups (Infected and Not Infected).
Patients in the Infected group tend to have more Co-Morbidities (i.e. two).
The identification of statistically significant variables in the Logistic Regression and Fisher
Discriminant Analysis models is supported by the differing distributions of the variables
Age and Co-Morbidities in the two subgroups (Patients Infected and Not Infected).
[Figure: Co-Morbidities. Histograms by group (panels: Profile, Profile_1; y-axis: Frequency)]
[Figure: Age Distribution. Histograms by group (panels: Age, Age_1; y-axis: Frequency)]
3.0 Clinical Trials Data Analysis using Logistic Regression and Discriminant Function
Again, we unsuccessfully tried to obtain Covid-19 patient data. Since it is important to show the
use of Logistic Regression techniques using appropriate data, we created a data set (Table #2),
built from the previous example, adapting it to suit the present one. Modifications included
changing several concomitant variables, for each individual, using our judgment and experience.
Its intent, as before, is to show how these two statistical procedures can be used to identify key
factors that affect the performance of the two patient groups analyzed (vaccinated and not).
Assume that Covid-19 vaccine clinical trials were implemented. But now only data from infected
participants were analyzed. Those infected participants who were given a placebo (denoted with
0 in column Infected) are numbered 1 to 31. Those infected participants who were given the real
vaccine (denoted with 1 in column Infected) are numbered 32 to 68. Our Event of Interest (Y=1)
is Infected Participants that received the real vaccine (as opposed to a placebo).
Description of the concomitant data recorded from each individual participant:
Co-morbid: 0 = None; 1 = Some
Gender: 0 = Male; 1 = Female
Infected: 0 = Placebo; 1 = Vaccine
Profile: Number
Participant Profile is numbered: Zero, if the participant seldom interacts with others; One, if there
is some cautious interaction with the outside world; Two, if there is extensive interaction.
The columns Discrim and LogEval correspond to the evaluation of each participant using the
Discriminant and Logistic functions, respectively. These outcomes are discussed later in
this paper, when we compare again the results of these two similar statistical procedures.
Table 2: modified data matrix, from the original, created data
No. Co-Morb Age Profile Gender Infected Discrim LogEval EventProb
1 1 45 1 0 0 0.0999 -2.3611 0.0862
2 0 50 1 1 0 0.2161 -1.7109 0.1530
3 1 60 0 1 0 0.3037 -1.3028 0.2137
4 0 43 0 0 0 -0.091 -3.5135 0.0289
5 1 47 2 1 0 0.2912 -1.2088 0.2299
6 0 68 1 0 0 0.6345 0.6298 0.6524
7 1 49 1 0 0 0.1929 -1.8410 0.1369
8 0 42 2 0 0 0.1750 -1.8590 0.1348
9 0 56 0 1 0 0.2108 -1.8230 0.1391
10 0 39 1 0 0 -0.039 -3.1414 0.0414
11 0 46 1 0 0 0.1232 -2.2311 0.0970
12 1 74 1 1 0 0.7739 1.4100 0.8038
13 0 42 1 1 0 0.0302 -2.7512 0.0600
14 0 51 1 0 0 0.2394 -1.5809 0.1707
15 0 59 1 0 0 0.4253 -0.5406 0.3681
16 0 45 1 0 0 0.0999 -2.3611 *
17 0 47 2 1 0 0.2912 -1.2088 *
18 1 72 0 1 0 0.5826 0.2577 0.5641
19 1 44 1 0 0 0.0767 -2.4912 0.0765
20 0 47 1 1 0 0.1464 -2.1010 0.1090
21 0 41 0 1 0 -0.137 -3.7735 0.0225
22 0 73 1 1 0 0.7507 1.2800 0.7824
23 1 58 2 0 0 0.5468 0.2216 0.5552
24 0 60 1 1 0 0.4485 -0.4105 0.3988
25 0 65 2 0 0 0.7095 1.1319 0.7562
26 1 72 1 0 0 0.7274 1.1499 0.7595
27 0 66 2 0 0 0.7328 1.2620 0.7794
28 0 61 1 1 0 0.4718 -0.2805 0.4303
29 1 55 2 0 0 0.4771 -0.1685 0.4580
30 0 63 2 0 0 0.6630 0.8718 0.7051
31 2 78 0 1 0 0.7221 1.0379 0.7384
32 0 73 1 1 1 0.7507 1.2800 *
33 0 77 1 1 1 0.8436 1.8001 0.8582
34 1 79 1 0 1 0.8901 2.0602 0.8870
35 1 82 1 0 1 0.9598 2.4503 0.9206
36 0 33 2 0 1 -0.034 -3.0293 0.0461
37 1 78 1 0 1 0.8669 1.9302 0.8733
38 0 74 1 0 1 0.7739 1.4100 *
39 0 68 1 1 1 0.6345 0.6298 *
40 0 66 1 1 1 0.5880 0.3697 0.5914
41 0 69 1 0 1 0.6577 0.7598 0.6813
42 0 77 0 1 1 0.6988 0.9079 0.7126
43 1 85 0 0 1 0.8848 1.9482 0.8752
44 0 55 2 0 1 0.4771 -0.1685 *
45 0 49 2 0 1 0.3377 -0.9487 0.2791
46 0 57 2 1 1 0.5236 0.0916 0.5229
47 0 66 2 1 1 0.7328 1.2620 *
48 1 69 1 1 1 0.6577 0.7598 *
49 0 59 1 1 1 0.4253 -0.5406 *
50 1 55 2 1 1 0.4771 -0.1685 *
51 0 67 2 0 1 0.7560 1.3920 0.8009
52 0 59 1 1 1 0.4253 -0.5406 *
53 1 68 1 0 1 0.6345 0.6298 *
54 0 72 1 1 1 0.7274 1.1499 *
55 1 77 0 1 1 0.6988 0.9079 *
56 0 73 1 1 1 0.7507 1.2800 *
57 0 70 0 1 1 0.5361 -0.0024 0.4994
58 1 79 1 0 1 0.8901 2.0602 *
59 1 80 1 0 1 0.9133 2.1902 0.8994
60 0 82 1 1 1 0.9598 2.4503 *
61 1 81 0 0 1 0.7918 1.4280 0.8066
62 1 84 1 1 1 1.0063 2.7104 0.9376
63 1 85 0 1 1 0.8848 1.9482 *
64 1 66 2 1 1 0.7328 1.2620 *
65 0 69 2 1 1 0.8025 1.6521 0.8392
66 0 77 1 1 1 0.8436 1.8001 *
67 1 79 0 0 1 0.7453 1.1680 0.7628
68 1 84 0 0 1 0.8615 1.8182 0.8603
Discriminant Regression Analysis: Vaccine versus Co-morb, Age, Profile, Gender
The regression equation is:
Infected = - 1.14 - 0.145 Comorb + 0.0248 Age + 0.132 Profile + 0.042 Gender
Predictor Coef SE Coef T P
Constant -1.1419 0.2952 -3.87 0.000
Comorb -0.1447 0.1057 -1.37 0.176
Age 0.024849 0.004110 6.05 0.000
Profile 0.13183 0.08040 1.64 0.106
Gender 0.0421 0.1029 0.41 0.684
S = 0.408134 R-Sq = 37.8% R-Sq(adj) = 33.8%
Analysis of Variance
Source DF SS MS F P
Regression 4 6.3735 1.5934 9.57 0.000
Residual Error 63 10.4941 0.1666
Total 67 16.8676
Notice that only the factors Age (P < 0.001) and Profile (P = 0.106, i.e. only at a level of about
α = 0.1) are statistically significant. They have an impact on the difference between infected patients
who were vaccinated and those who were not. The other two factors considered (presence of
co-morbidities and gender) are not significant, which means they do not seem to help differentiate
the two groups.
The model explains a little over 1/3 of the problem (R-Sq = 0.378). In the previous example, it
explained about 3/4 of the problem (R-Sq = 0.753). Such a low, but still realistic, Index of Fit means
that additional factors must be found in order to explain a larger portion of the differences between
the two groups.
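The Index of Fit and the other summary figures can be recomputed from the Analysis of Variance table above. This is a short Python sketch using the standard regression formulas; the variable names are ours.

```python
import math

n = 68                 # participants in Table 2
k = 4                  # predictors in the full model
ss_regression = 6.3735
ss_error = 10.4941
ss_total = ss_regression + ss_error

r_sq = ss_regression / ss_total
r_sq_adj = 1 - (1 - r_sq) * (n - 1) / (n - k - 1)
ms_regression = ss_regression / k
ms_error = ss_error / (n - k - 1)
f_stat = ms_regression / ms_error
s = math.sqrt(ms_error)   # residual standard deviation

print(round(100 * r_sq, 1), round(100 * r_sq_adj, 1))
print(round(f_stat, 2), round(s, 4))
```

These should reproduce the reported R-Sq = 37.8%, R-Sq(adj) = 33.8%, F = 9.57 and S = 0.408.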
We present next the regression model assumption graphs, which are self-explanatory.
Finally, we redo the Discriminant Function using only the two above-mentioned significant
factors: Age and Profile.
[Figure: Residual Plots for Vaccine (four panels): Normal Probability Plot of the Residuals (Percent vs Standardized Residual); Residuals Versus the Fitted Values (Standardized Residual vs Fitted Value); Histogram of the Residuals (Frequency vs Standardized Residual); Residuals Versus the Order of the Data (Standardized Residual vs Observation Order)]
Regression Analysis: Vaccine versus Age, Profile
The regression equation is:
Infected = - 1.09 + 0.0232 Age + 0.145 Profile
Predictor Coef SE Coef T P
Constant -1.0907 0.2924 -3.73 0.000
Age 0.023241 0.003897 5.96 0.000
Profile 0.14479 0.07872 1.84 0.070
S = 0.409384 R-Sq = 35.4% R-Sq(adj) = 33.4%
Analysis of Variance
Source DF SS MS F P
Regression 2 5.9739 2.9870 17.82 0.000
Residual Error 65 10.8937 0.1676
Total 67 16.8676
Notice how the significance levels have improved (we can now use α=0.07). The Index of Fit
remains practically the same. We still need to search for additional patient characteristics (i.e.
additional factors) that help better explain the differences between the two groups.
[Figure: Residual Plots for Vaccine (four panels): Normal Probability Plot of the Residuals (Percent vs Standardized Residual); Residuals Versus the Fitted Values (Standardized Residual vs Fitted Value); Histogram of the Residuals (Frequency vs Standardized Residual); Residuals Versus the Order of the Data (Standardized Residual vs Observation Order)]
We now implement the equivalent Logistic Regression with the same data set.
Binary Logistic Regression: Vaccine versus Comorb, Age, Profile, Gender
Link Function: Logit
Response Information
Variable  Value  Count
Vaccine   1         37  (Event)
          0         31
          Total     68
Logistic Regression Table
                                                 Odds      95% CI
Predictor       Coef    SE Coef      Z      P   Ratio   Lower   Upper
Constant    -9.91009    2.68358  -3.69  0.000
Comorb
  1        -0.417620   0.704333  -0.59  0.553    0.66    0.17    2.62
  2         -22.9869    24542.6  -0.00  0.999    0.00    0.00       *
Age         0.142119  0.0350788   4.05  0.000    1.15    1.08    1.23
Profile     0.835803   0.545221   1.53  0.125    2.31    0.79    6.72
Gender
  1         0.595450   0.667785   0.89  0.373    1.81    0.49    6.71
Log-Likelihood = -30.820
Test that all slopes are zero: G = 32.099, DF = 5, P-Value = 0.000
Compare with the above Discriminant Function, and verify that the same two factors (Age and
Profile) are significant at level α = 0.125, and that the same other two factors are not significant.
The analysis is redone below for the two significant variables (at α = 0.125):
Link Function: Logit
Response Information
Variable  Value  Count
Vaccine   1         37  (Event)
          0         31
          Total     68
Logistic Regression Table
                                                 Odds      95% CI
Predictor       Coef    SE Coef      Z      P   Ratio   Lower   Upper
Constant    -9.10513    2.42337  -3.76  0.000
Age         0.130039  0.0318680   4.08  0.000    1.14    1.07    1.21
Profile     0.892254   0.512681   1.74  0.082    2.44    0.89    6.67
Log-Likelihood = -32.976
Test that all slopes are zero: G = 27.786, DF = 2, P-Value = 0.000
Again, results are equivalent to the ones obtained with the Discriminant. To verify this, we
plot the two responses, evaluated with their respective Logistic and Discriminant functions.
The perfect straight line shows how these two results are, in fact, equivalent:
[Figure: Regression Eval Fits (significant variables) (x-axis: FITS2; y-axis: LogEval)]
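The near-linear relation between the two scores can also be checked numerically. The sketch below evaluates the two-variable Discriminant and Logistic equations from the text on the first ten participants of Table 2 and computes their Pearson correlation. The helper names are ours, and restricting the check to ten rows is only to keep the example short.

```python
import math

def discriminant_score(age, profile):
    """Two-variable discriminant (regression) fit from the text."""
    return -1.0907 + 0.023241 * age + 0.14479 * profile

def logit_score(age, profile):
    """Two-variable Logistic Regression fit from the text."""
    return -9.10513 + 0.130039 * age + 0.892254 * profile

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# (Age, Profile) of the first ten participants in Table 2.
rows = [(45, 1), (50, 1), (60, 0), (43, 0), (47, 2),
        (68, 1), (49, 1), (42, 2), (56, 0), (39, 1)]
fits = [discriminant_score(a, p) for a, p in rows]
logits = [logit_score(a, p) for a, p in rows]
print(round(pearson(fits, logits), 4))
```

A correlation very close to 1 is what a straight-line scatterplot of the two scores corresponds to.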
The probability (EventProb: last column of Table 2) of the ith patient inclusion in the group of
interest (Y=1), given its particular characteristics vector, denoted by Xi is:
P{Y=1 | Xi} = p(Y=1 | Xi; b) = e^(β0 + x·β) / [1 + e^(β0 + x·β)]
For example, for patient number one, in Table #2 we have: β0 = -9.11; β1 = 0.13; β2 =0.89
No. Co-Morb Age Profile Gender Infected FITS2 LogEval EventProb
1 1 45 1 0 0 0.0999 -2.3611 0.0862
P{Y=1 | Xi; b = (β0, β1, β2)} = e^(β0 + x·β) / [1 + e^(β0 + x·β)] = e^(−2.3611) / [1 + e^(−2.3611)] = 0.086
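The same calculation for patient number one can be sketched in Python, using the unrounded coefficients of the two-variable Logistic Regression table. The function name `event_probability` is ours.

```python
import math

def event_probability(age, profile, b0=-9.10513, b_age=0.130039, b_profile=0.892254):
    """Return (logit, probability) for the two-variable model fitted above."""
    logit = b0 + b_age * age + b_profile * profile
    return logit, 1.0 / (1.0 + math.exp(-logit))

# Participant No. 1 of Table 2: Age = 45, Profile = 1.
logit, prob = event_probability(45, 1)
print(round(logit, 4), round(prob, 4))  # -2.3611 0.0862
```

The two printed values correspond to the LogEval and EventProb columns of Table 2 for this participant. Calling the function with another participant's Age and Profile gives that row's values in the same way.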
Below, find the distributions of the two Age groups analyzed:
[Figure: Age (Vaccinated and Not). Histograms by group (panels: Age, Age_1; y-axis: Frequency)]
Notice how the age distributions of the two infected groups differ. This factor is highly significant (very small p-value).
Below, find the distribution of the Patient Profiles for the two groups analyzed:
[Figure: Profile (Vaccinated and Not). Histograms by group (panels: Profile, Profile_1; y-axis: Frequency)]
Notice how the distributions differ, although less than with Age. That is why the p-value (0.082) is
higher, and the OR 95% CI is wider and includes one. This factor is less reliable than the first one.
Below are the Descriptive Statistics: Age, Age_1, Profile, Profile_1
Variable    N   Mean  SE Mean  StDev    Min     Q1  Median     Q3  Maximum
Age        31  55.42     2.06  11.49  39.00  45.00   55.00  65.00    78.00
Age_1      37  70.89     1.86  11.31  33.00  66.00   73.00  79.00    85.00
Profile    31  1.065    0.122  0.680  0.000  1.000   1.000  2.000    2.000
Profile_1  37  1.027    0.113  0.687  0.000  1.000   1.000  1.500    2.000
Compare the five descriptive statistics (Min, Q1, Median, Q3, Maximum) for the two statistically
significant factors analyzed, Profile and Age, and verify how they differ (as in the graphs above).
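As a quick consistency check on the descriptive statistics, the SE Mean column should equal StDev divided by the square root of N. The following sketch recomputes it from the reported N and StDev; the dictionary layout is ours.

```python
import math

# (N, StDev, reported SE Mean) taken from the descriptive statistics above.
summary = {
    "Age":       (31, 11.49, 2.06),
    "Age_1":     (37, 11.31, 1.86),
    "Profile":   (31, 0.680, 0.122),
    "Profile_1": (37, 0.687, 0.113),
}

for name, (n, st_dev, reported) in summary.items():
    se_mean = st_dev / math.sqrt(n)
    print(name, round(se_mean, 3), reported)
```

The recomputed values agree with the reported ones up to rounding.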
5.0 Discussion
Again, the data used in this analysis was not collected; it was created by this author for
illustrative purposes. Thence, the results and discussion below are also only for illustrative
purposes. We hope, with this exercise, to encourage researchers in the Public Health and
medical environments to implement statistical procedures using their real Covid-19 data.
By implementing either Logistic Regression or Discriminant Analysis, we detect two factors
that differentiate the two groups of infected patients (those who have been vaccinated and those
who have not). Since the data are assumed to be random samples from these two groups,
we can infer that older patients, even when vaccinated, are still prone to become infected, and
that patient profile (i.e. the level at which they interact with the rest of the world) also has an
effect. The latter effect is weak; hence a larger sample should be drawn to confirm or reject it.
This approach can be reproduced by public health and medical researchers, using as responses
different pairs of groups: placebo v. vaccinated, infected v. not infected, deceased v. surviving,
Vaccine A v. Vaccine B, etc. Patient factors may include any characteristic of interest: weight,
age, gender, occupation, number (or specific types) of co-morbidities, level of interaction, etc.
Logistic Regression or Discriminant Analysis can then be implemented, and the statistically
significant factors will identify the key elements on which to undertake further research.
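For readers without access to a statistical package, the following is a minimal Python sketch of how a one-factor Logistic Regression can be fitted by Newton-Raphson iterations. It is not the software used for the outputs in this paper. The data are hypothetical, invented only for this sketch, and the function names are ours. A real analysis would normally rely on an established statistics package and include several factors.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(xs, ys, b0, b1):
    """Score vector (first derivatives of the log-likelihood)."""
    g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys))
    g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys))
    return g0, g1

def fit_logistic(xs, ys, iterations=25):
    """Newton-Raphson fit of Log(OR) = b0 + b1 * x for a 0/1 response."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0, g1 = gradient(xs, ys, b0, b1)
        # Entries of the 2x2 information matrix (negative Hessian).
        h00 = h01 = h11 = 0.0
        for x in xs:
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * delta = score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data, invented for this sketch: ages and a 0/1 group label.
ages = [45, 50, 55, 60, 65, 70, 75, 80]
group = [0, 0, 1, 0, 1, 0, 1, 1]
mean_age = sum(ages) / len(ages)
centered = [a - mean_age for a in ages]   # centering keeps the iterations well behaved

b0, b1 = fit_logistic(centered, group)
print(round(b0, 4), round(b1, 4), round(math.exp(b1), 3))
```

The last printed value, e^b1, is the estimated Odds Ratio per additional year of age in this made-up data set. Researchers would substitute their own group labels and factor values for the hypothetical lists.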
As we have said in our previous article, vaccine development, including clinical trials and release
decisions, must be based on science, not on politics. When a vaccine is released, it is because its
risk analysis has shown that the vaccine yields more benefit than harm. The early release of Covid-19
vaccines is due to the urgency created by more than a million deaths world-wide, including the
300 thousand that have already occurred in the USA, and counting.
6.0 Conclusions
This Covid-19 work stems from our proposal to the retired academic and research communities:
https://www.researchgate.net/publication/341282217_A_Proposal_for_Fighting_Covid-
19_and_its_Economic_Fallout which pursues one goal: to contribute to defeat Covid-19.
This paper is a tutorial on the uses of Logistic Regression to help identify key factors in Covid-
19 Clinical Trials. The data analyzed was created by this researcher, using his experience and
information. Thence, its numerical results have only illustrative value. However, public health
and medical researchers and practitioners can follow our Logistic Regression procedures, and
substitute their own data for ours, generating additional analyses, and including new factors, as
they become available.
We want to reach four audiences: (1) public health professionals and researchers, (2) medical
doctors, (3) statisticians and (4) the public in general.
We want to encourage public health and medical professionals to use more statistical procedures
and do more joint work with statisticians, not only after data have been collected, but also at the
time that experiments are being designed.
We want to encourage statisticians, especially those retired, who have the experience, financial
support (their pension), and the time to provide such assistance, to contribute by helping with the
planning, implementation and analysis of statistical procedures, or by writing about them.
We want to provide illustrative examples to doctors, public health researchers, and to the general
public, to help them better understand what the others do, fostering more efficient collaboration.
Finally, this series of papers on statistical analysis of Covid-19, listed in the initial section of this
article could become part of a biostatistics course in a public health or medical curriculum, or an
applications course in a statistics department.
About the Author:
Jorge Luis Romeu retired Emeritus from the State University of New York (SUNY). He was, for
sixteen years, a Research Professor at Syracuse University, where he is currently an Adjunct
Professor of Statistics. Romeu worked for many years as a Senior Research Engineer with the
Reliability Analysis Center (RAC), an Air Force Information and Analysis Center operated by
IIT Research Institute (IITRI). Romeu received seven Fulbright assignments: in Mexico (3), the
Dominican Republic (2), Ecuador, and Colombia. He holds a doctorate in Statistics/O.R., is a C.
Stat. Fellow of the Royal Statistical Society, a Senior Member of the American Society for
Quality (ASQ), and a Member of the American Statistical Association. He is a Past ASQ Regional
Director (and currently a Deputy Regional Director), and holds Reliability and Quality ASQ
Professional Certifications. Romeu created and directs the Juarez Lincoln Marti International Ed.
Project (JLM, https://web.cortland.edu/matresearch/), which (i) supports higher education in
Ibero-America and (ii) maintains the Quality, Reliability and Continuous Improvement Institute
(QR&CII, https://web.cortland.edu/romeu/QR&CII.htm) applied statistics web site.