
Practical Multivariate Analysis

Sixth Edition

Abdelmonem Afifi, Susanne May, Robin A. Donatello, Virginia A. Clark

Contents

Preface

Authors’ Biographies

I Preparation for Analysis

1 What is multivariate analysis?
1.1 Defining multivariate analysis
1.2 Examples of multivariate analyses
1.3 Exploratory versus confirmatory analyses
1.4 Multivariate analyses discussed in this book
1.5 Organization and content of the book

2 Characterizing data for analysis
2.1 Variables: their definition, classification, and use
2.2 Defining statistical variables
2.3 Stevens’s classification of variables
2.4 How variables are used in data analysis
2.5 Examples of classifying variables
2.6 Other characteristics of data
2.7 Summary
2.8 Problems

3 Preparing for data analysis
3.1 Processing data so they can be analyzed
3.2 Choice of a statistical package
3.3 Techniques for data entry
3.4 Organizing the data
3.5 Reproducible research and literate programming
3.6 Example: Depression study
3.7 Summary
3.8 Problems

4 Data Visualization
4.1 Introduction
4.2 Univariate Data
4.3 Bivariate Data
4.4 Multivariate Data
4.5 Discussion of computer programs
4.6 What to watch out for
4.7 Summary
4.8 Problems

5 Data screening and transformations
5.1 Transformations, assessing normality and independence
5.2 Common transformations
5.3 Selecting appropriate transformations
5.4 Assessing independence
5.5 Discussion of computer programs
5.6 Summary
5.7 Problems

6 Selecting appropriate analyses
6.1 Which analyses to perform?
6.2 Why selection is often difficult
6.3 Appropriate statistical measures
6.4 Selecting appropriate multivariate analyses
6.5 Summary
6.6 Problems

II Regression Analysis

7 Simple regression and correlation
7.1 Chapter outline
7.2 When are regression and correlation used?
7.3 Data example
7.4 Regression methods: fixed-X case
7.5 Regression and correlation: variable-X case
7.6 Interpretation: fixed-X case
7.7 Interpretation: variable-X case
7.8 Other available computer output
7.9 Robustness and transformations for regression
7.10 Other types of regression
7.11 Special applications of regression
7.12 Discussion of computer programs
7.13 What to watch out for
7.14 Summary
7.15 Problems

8 Multiple regression and correlation
8.1 Chapter outline
8.2 When are regression and correlation used?
8.3 Data example
8.4 Regression methods: fixed-X case
8.5 Regression and correlation: variable-X case
8.6 Interpretation: fixed-X case
8.7 Interpretation: variable-X case
8.8 Regression diagnostics and transformations
8.9 Other options in computer programs
8.10 Discussion of computer programs
8.11 What to watch out for
8.12 Summary
8.13 Problems

9 Variable selection in regression
9.1 Chapter outline
9.2 When are variable selection methods used?
9.3 Data example
9.4 Criteria for variable selection
9.5 A general F test
9.6 Stepwise regression
9.7 Lasso regression
9.8 Discussion of computer programs
9.9 Discussion of strategies
9.10 What to watch out for
9.11 Summary
9.12 Problems

10 Special regression topics
10.1 Chapter outline
10.2 Missing values in regression analysis
10.3 Dummy variables
10.4 Constraints on parameters
10.5 Regression analysis with multicollinearity
10.6 Ridge regression
10.7 Summary
10.8 Problems

11 Discriminant analysis
11.1 Chapter outline
11.2 When is discriminant analysis used?
11.3 Data example
11.4 Basic concepts of classification
11.5 Theoretical background
11.6 Interpretation
11.7 Adjusting the dividing point
11.8 How good is the discrimination?
11.9 Testing variable contributions
11.10 Variable selection
11.11 Discussion of computer programs
11.12 What to watch out for
11.13 Summary
11.14 Problems

12 Logistic regression
12.1 Chapter outline
12.2 When is logistic regression used?
12.3 Data example
12.4 Basic concepts of logistic regression
12.5 Interpretation: categorical variables
12.6 Interpretation: continuous variables
12.7 Interpretation: interactions
12.8 Refining and evaluating logistic regression
12.9 Nominal and ordinal logistic regression
12.10 Applications of logistic regression
12.11 Poisson regression
12.12 Discussion of computer programs
12.13 What to watch out for
12.14 Summary
12.15 Problems

13 Regression analysis with survival data
13.1 Chapter outline
13.2 When is survival analysis used?
13.3 Data examples
13.4 Survival functions
13.5 Common survival distributions
13.6 Comparing survival among groups
13.7 The log-linear regression model
13.8 The Cox regression model
13.9 Comparing regression models
13.10 Discussion of computer programs
13.11 What to watch out for
13.12 Summary
13.13 Problems

14 Principal components analysis
14.1 Chapter outline
14.2 When is principal components analysis used?
14.3 Data example
14.4 Basic concepts
14.5 Interpretation
14.6 Other uses
14.7 Discussion of computer programs
14.8 What to watch out for
14.9 Summary
14.10 Problems

15 Factor analysis
15.1 Chapter outline
15.2 When is factor analysis used?
15.3 Data example
15.4 Basic concepts
15.5 Initial extraction: principal components
15.6 Initial extraction: iterated components
15.7 Factor rotations
15.8 Assigning factor scores
15.9 Application of factor analysis
15.10 Discussion of computer programs
15.11 What to watch out for
15.12 Summary
15.13 Problems
15.14 Problems

16 Cluster analysis
16.1 Chapter outline
16.2 When is cluster analysis used?
16.3 Data example
16.4 Basic concepts: initial analysis
16.5 Analytical clustering techniques
16.6 Cluster analysis for financial data set
16.7 Discussion of computer programs
16.8 What to watch out for
16.9 Summary
16.10 Problems

17 Log-linear analysis
17.1 Chapter outline
17.2 When is log-linear analysis used?
17.3 Data example
17.4 Notation and sample considerations
17.5 Tests and models for two-way tables
17.6 Example of a two-way table
17.7 Models for multiway tables
17.8 Exploratory model building
17.9 Assessing specific models
17.10 Sample size issues
17.11 The logit model
17.12 Discussion of computer programs
17.13 What to watch out for
17.14 Summary
17.15 Problems

18 Correlated outcomes regression
18.1 Chapter outline
18.2 When is correlated outcomes regression used?
18.3 Data examples
18.4 Basic concepts
18.5 Regression of clustered data with continuous outcome
18.6 Regression of clustered data with binary outcomes
18.7 Regression of longitudinal data
18.8 Generalized estimating equations analysis of correlated data
18.9 Discussion of computer programs
18.10 What to watch out for
18.11 Summary
18.12 Problems

Appendix A
A.1 Data sets and how to obtain them
A.2 Chemical companies financial data
A.3 Depression study data
A.4 Financial performance cluster analysis data
A.5 Lung cancer survival data
A.6 Lung function data
A.7 Parental HIV data
A.8 Northridge earthquake data
A.9 School data
A.10 Mice data

Bibliography

Index

Preface

The first edition of this book appeared in 1984 under the title “Computer Aided Multivariate Analysis.” The title was chosen in order to distinguish it from other books that were more theoretically oriented. By the time we published the fifth edition in 2012, it was impossible to think of a book on multivariate analysis for scientists and applied researchers that is not computer oriented. We therefore decided at that time to change the title to Practical Multivariate Analysis to better characterize the nature of the book. Today, we are pleased to present the sixth edition.

We wrote this book for investigators, specifically behavioral scientists, biomedical scientists, and industrial or academic researchers, who wish to perform multivariate statistical analyses and understand the results. We expect the readers to be able to perform the analyses and understand the results, but also expect them to know when to ask for help from an expert on the subject. The book can either be used as a self-guided textbook or as a text in an applied course in multivariate analysis. In addition, we believe that the book can be helpful to many statisticians who have been trained in conventional mathematical statistics, who are now working as statistical consultants, and who need to explain multivariate statistical concepts to clients with a limited background in mathematics.

We do not present mathematical derivations of the techniques; rather we rely on geometric and graphical arguments and on examples to illustrate them. The mathematical level has been deliberately kept low. While the derivations of the techniques are referenced, we concentrate on applications to real-life problems, which we feel are the ‘fun’ part of multivariate analysis. To this end, we assume that the reader will use a packaged software program to perform the analysis. We discuss specifically how each of four popular and comprehensive software packages can be used for this purpose. These packages are R, SAS, SPSS, and STATA. The book can be used, however, in conjunction with all other software packages since our presentation explains the output of most standard statistical programs.

We assume that the reader has taken a basic course in statistics that includes tests of hypotheses and covers one-way analysis of variance.

Approach of this book

We wrote the book in a modular fashion. Part One, consisting of six chapters, provides examples of studies requiring multivariate analysis techniques, discusses characterizing data for analysis, computer programs, data entry, data management, data clean-up, missing values, and transformations. It also includes a new chapter on graphics and data visualization and presents a rough guide to assist in the choice of an appropriate multivariate analysis. We included these topics since many investigators have more difficulty with these preliminary steps than with running the multivariate analyses themselves. Also, if these steps are not done with care, the results of the statistical analysis can be faulty.

In the rest of the chapters, we follow a standard format. The first four sections of each chapter include a discussion of when the technique is used, a data example, and the basic assumptions and concepts of the technique. In subsequent sections, we present more detailed aspects of the analysis. At the end of each chapter, we give a summary table showing which features are available in the four software packages. We also include a section entitled ‘What to watch out for’ to warn the reader about common problems related to data analysis. In those sections, we rely on our own experiences in consulting and those detailed in the literature to supplement the formal treatment of the subject.


Part Two covers regression analysis. Chapter 7 deals with simple linear regression and is included for review purposes to introduce our notation and to provide a more complete discussion of outliers and diagnostics than is found in some elementary texts. Chapters 8-10 are concerned with multiple linear regression. Multiple linear regression is used very heavily in practice and provides the foundation for understanding many concepts relating to residual analysis, transformations, choice of variables, missing values, dummy variables, and multicollinearity. Since these concepts are essential to a good grasp of multivariate analysis, we thought it useful to include these chapters in the book.

Chapters 11-18 might be considered the heart of multivariate analysis. They include chapters on discriminant analysis, logistic regression analysis, survival analysis, principal components analysis, factor analysis, cluster analysis, log-linear analysis and correlated outcomes regression. The multivariate analyses have been discussed more as separate techniques than as special cases of some general framework. The advantage of this approach is that it allows us to concentrate on explaining how to analyze a certain type of data from readily available computer programs to answer realistic questions. It also enables the reader to approach each chapter independently. We did include interspersed discussions of how the different analyses relate to each other in an effort to describe the ‘big picture’ of multivariate analysis.

How to use the book

We have received many helpful suggestions from instructors and reviewers on how to order these chapters for reading or teaching purposes. For example, one instructor uses the following order in teaching: principal components, factor analysis, and then cluster analysis. Another prefers presenting a detailed treatment of multiple regression followed by logistic regression and survival analysis. Instructors and self-learning readers have a wide choice of other orderings of the material because the chapters are largely self-contained.

What’s new in the Sixth Edition

During the nearly thirty-six years since we wrote the first edition of this book, tremendous advances have taken place in the field of computing and software development. These advances have made it possible to quickly perform any of the multivariate analyses that were available only in theory at that time. They also spurred the invention of new multivariate analyses as well as new options for many of the standard methods. In this edition, we have taken advantage of these developments and made many changes as described below.

For each of the techniques discussed, we used the most recent software versions available and discussed the most modern ways of performing the analysis. In each chapter, we updated the references to today’s literature (while still including the fundamental original references). In terms of statistical software, we discontinued description of S-Plus because of the more widespread use of the similar package R. Also, we no longer include Statistica since it is largely not used by our intended readers.

In addition to the above-described modifications, we included comments to distinguish between exploratory and confirmatory analyses in Chapter 1 and throughout the book. We also expanded the discussion of missing values in Chapter 3 and added a discussion of literate programming and reproducible research.

As mentioned above, we added a new chapter (Chapter 4) on graphics and data visualization. In Chapter 9, we updated our discussion of variable selection and added a description of Lasso, a more recent method than the ones already included. In Chapter 10, we added a description of MICE, a multiple imputation approach for dealing with missing values. In Chapter 18, we added a description of the generalized estimating equations (GEE) method for handling correlated data and compared it to the mixed model approach. Finally, in each chapter we updated and/or expanded the summary table of the options available in the four statistical packages to make it consistent with the most recent software versions.


Data sets used for examples and problems are described throughout the book as needed and summarized in Appendix A. Two web sites are also available. The first one is the CRC web site: http://www.crcpress.com/product/isbn/9781138702226. From this site, you can download all the data sets used in the book by clicking on the Downloads/Updates tab. The other web site that is available to all readers is: https://stats.idre.ucla.edu/other/examples/pma6. This site, developed by the UCLA Institute for Digital Research and Education (IDRE), includes the data sets in the formats of various statistical software packages; these are available through the links in the Appendix A part of the table of contents on that web page. It also includes illustrations of examples in most chapters, complete with code for three of the four software packages used in the book. Please note that the current site was developed for the fifth edition, and it is hoped that it will be updated for the sixth edition. We encourage readers to obtain data from either web site and frequently refer to the solutions given in the UCLA web site for practice.

Acknowledgements

We would like to express our appreciation to our colleagues and former students and staff who helped us over the years, both in the planning and preparation of the various editions. These include our colleagues Drs. Carol Aneshensel, Roger Detels, Robert Elashoff, Ralph Frerichs, Mary Ann Hill, and Roberta Madison. Our former students include Drs. Stella Grosser, Luohua Jiang, Jack Lee, Steven Lewis, Tim Morgan, Leanne Streja and David Zhang. Our former staff includes Ms. Dorothy Breininger, Jackie Champion, and Anne Eiseman. In addition, we would like to thank Ms. Meike Jantzen and Mr. Jack Fogliasso for their help with the references and typesetting.

We also thank Rob Calver and Lara Spieker from CRC Press for their very capable assistance in the preparation of the sixth edition.

We especially appreciate the efforts of the staff of UCLA Institute for Digital Research and Education in putting together the UCLA web site of examples from the book (referenced above).

Our deep gratitude goes to our spouses Marianne Afifi, Bruce Jacobson, Ian Donatello and Welden Clark for their patience and encouragement throughout the stages of conception, writing, and production of the book. Special thanks go to Welden Clark for his expert assistance and troubleshooting of earlier electronic versions of the manuscript.

Abdelmonem Afifi
Susanne May
Robin Donatello
Virginia A. Clark

Authors’ Biographies

Abdelmonem Afifi, Ph.D., has been Professor of Biostatistics in the School of Public Health, University of California, Los Angeles (UCLA) since 1965, and served as the Dean of the School from 1985 until 2000. His research includes multivariate and multilevel data analysis, handling missing observations in regression and discriminant analyses, meta-analysis, and model selection. Over the years, he taught well-attended courses in biostatistics for Public Health students and clinical research physicians, and doctoral-level courses in multivariate statistics and multilevel modeling. He has authored many publications in statistics and health related fields, including two widely used books (with multiple editions) on multivariate analysis. He received several prestigious awards for excellence in teaching and research.

Susanne May, Ph.D., is a Professor in the Department of Biostatistics at the University of Washington in Seattle. Her areas of expertise and interest include clinical trials, survival analysis, and longitudinal data analysis. She has more than 20 years of experience as a statistical collaborator and consultant on health related research projects. In addition to a number of methodological and applied publications, she is a coauthor (with Drs. Hosmer and Lemeshow) of Applied Survival Analysis: Regression Modeling of Time-to-Event Data. Dr. May has taught courses on introductory statistics, clinical trials, and survival analysis.

Robin A. Donatello, Dr. P.H., is an Associate Professor in the Department of Mathematics and Statistics and the Developer of the Data Science Initiative at California State University, Chico. Her areas of interest include applied research in the Public Health and Natural Science fields. She has expertise in data visualization, techniques to address missing and erroneous data, implementing reproducible research workflows, computational statistics and Data Science. Dr. Donatello teaches undergraduate and graduate level courses in statistical programming, applied statistics, and data science.

Virginia A. Clark, Ph.D., was professor emerita of Biostatistics and Biomathematics at UCLA. For 27 years, she taught courses in multivariate analysis and survival analysis, among others. In addition to this book, she is coauthor of four books on survival analysis, linear models and analysis of variance, and survey research as well as an introductory book on biostatistics. She published extensively in statistical and health science journals.


Part I

Preparation for Analysis


Chapter 1

What is multivariate analysis?

1.1 Defining multivariate analysis

The expression multivariate analysis is used to describe analyses of data that are multivariate in the sense that numerous observations or variables are obtained for each individual or unit studied. In a typical survey 30 to 100 questions are asked of each respondent. In describing the financial status of a company, an investor may wish to examine five to ten measures of the company’s performance. Commonly, the answers to some of these measures are interrelated. The challenge of disentangling complicated interrelationships among various measures on the same individual or unit and of interpreting these results is what makes multivariate analysis a rewarding activity for the investigator. Often results are obtained that could not be attained without multivariate analysis.

In the next section of this chapter several studies are described in which the use of multivariate analysis is essential to understanding the underlying problem. Section 1.3 provides a rationale for making a distinction between confirmatory and exploratory analyses. Section 1.4 gives a listing and a very brief description of the multivariate analysis techniques discussed in this book. Section 1.5 then outlines the organization of the book.

1.2 Examples of multivariate analyses

The studies described in the following subsections illustrate various multivariate analysis techniques. These are used later in the book as examples.

Depression study example

The data for the depression study have been obtained from a complex, random, multiethnic sample of 1000 adult residents of Los Angeles County. The study was a panel or longitudinal design where the same respondents were interviewed four times between May 1979 and July 1980. About three-fourths of the respondents were re-interviewed for all four interviews. The field work for the survey was conducted by professional interviewers from the Institute for Social Science Research at the University of California in Los Angeles.

This research is an epidemiological study of depression and help-seeking behavior among free-living (noninstitutionalized) adults. The major objectives are to provide estimates of the prevalence and incidence of depression and to identify causal factors and outcomes associated with this condition. The factors examined include demographic variables, life events stressors, physical health status, health care use, medication use, lifestyle, and social support networks. The major instrument used for classifying depression is the Depression Index (CESD) of the National Institute of Mental Health, Center of Epidemiological Studies. A discussion of this index and the resulting prevalence of depression in this sample is given in Frerichs et al. (1981).

The longitudinal design of the study offers advantages for assessing causal priorities since the time sequence allows us to rule out certain potential causal links. Nonexperimental data of this type cannot directly be used to establish causal relationships, but models based on an explicit theoretical framework can be tested to determine if they are consistent with the data. An example of such model testing is given in Aneshensel and Frerichs (1982).

Data from the first time period of the depression study are described in Chapter 3. Only a subset of the factors measured on a subsample of the respondents is included in this book’s web site in order to keep the data set easily comprehensible. These data are used several times in subsequent chapters to illustrate some of the multivariate techniques presented in this book.

Parental HIV study

The data from the parental HIV study have been obtained from a clinical trial to evaluate an intervention given to increase coping skills (Rotheram-Borus et al., 2001). The purpose of the intervention was to improve behavioral, social, and health outcomes for parents with HIV/AIDS and their children. Parents and their adolescent children were recruited from the New York City Division of AIDS Services (DAS). Adolescents were eligible for the study if they were between the ages of 11 and 18 and if the parents and adolescents had given informed consent. Individual interviews were conducted every three months for the first two years and every six months thereafter. Information obtained in the interviews included background characteristics, sexual behavior, alcohol and drug use, medical and reproductive history, and a number of psychological scales.

A subset of the data from the study is available on this book’s web site. To protect the identity of the participating adolescents we used the following procedures. We randomly chose one adolescent per family. In addition, we reduced the sample further by choosing a random subset of the original sample. Adolescent case numbers were assigned randomly without regard to the original order or any other numbers in the original data set.
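The three de-identification steps just described can be sketched in a few lines of code. The following Python fragment is only an illustration under stated assumptions, not the procedure actually used for the study: the field names `family_id`, `adolescent_id`, and `case_id`, the function `deidentify`, the toy records, and the decision to drop the original identifiers are all hypothetical.

```python
import random

def deidentify(records, subset_size, seed=None):
    """Illustrative sketch: one adolescent per family, a random subset,
    and new case numbers unrelated to the original order or IDs."""
    rng = random.Random(seed)

    # Step 1: group adolescents by family and pick one at random per family.
    by_family = {}
    for rec in records:
        by_family.setdefault(rec["family_id"], []).append(rec)
    one_per_family = [rng.choice(members) for members in by_family.values()]

    # Step 2: keep only a random subset of the remaining sample.
    subset = rng.sample(one_per_family, min(subset_size, len(one_per_family)))

    # Step 3: assign new case numbers in random order.
    new_ids = list(range(1, len(subset) + 1))
    rng.shuffle(new_ids)

    released = []
    for new_id, rec in zip(new_ids, subset):
        # Assumption: the original identifiers are dropped from the released file.
        clean = {k: v for k, v in rec.items()
                 if k not in ("family_id", "adolescent_id")}
        clean["case_id"] = new_id
        released.append(clean)

    # Sort by the new case number so the file order carries no information.
    return sorted(released, key=lambda r: r["case_id"])

# Hypothetical toy records: three families, five adolescents.
records = [
    {"family_id": "F1", "adolescent_id": "A1", "age": 12},
    {"family_id": "F1", "adolescent_id": "A2", "age": 15},
    {"family_id": "F2", "adolescent_id": "A3", "age": 17},
    {"family_id": "F3", "adolescent_id": "A4", "age": 11},
    {"family_id": "F3", "adolescent_id": "A5", "age": 14},
]
released = deidentify(records, subset_size=2, seed=1)
```

Fixing the seed makes the sketch reproducible; in a real release the seed would not be published, since it would allow the selection to be reconstructed.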

Data from the baseline assessment will be used for problems as well as to illustrate various multivariate analysis techniques.

Northridge earthquake study

On the morning of January 17, 1994 a magnitude 6.7 earthquake centered in Northridge, CA awoke Los Angeles and Ventura County residents. Between August 1994 and May 1996, 1830 residents were interviewed about what happened to them in the earthquake. The study used a telephone survey lasting approximately 48 minutes to assess the residents’ experiences in and responses to the Northridge earthquake. Data from 506 residents are included in the data set posted on the book’s web site and described in Appendix A.

Subjects were asked where they were, how they reacted, where they obtained information, whether their property was damaged or whether they experienced injury, and what agencies they were in contact with. The questionnaire included the Brief Symptom Inventory (BSI), a measure of psychological functioning used in community studies, and questions on emotional distress. Subjects were also asked about the impact of the damage to the transportation system as a result of the earthquake. Investigators not only wanted to learn about the experiences of the Southern California residents in the Northridge earthquake, but also wished to compare their findings to similar studies of the Los Angeles residents surveyed after the Whittier Narrows earthquake on October 1, 1987, and Bay Area residents interviewed after the Loma Prieta earthquake on October 17, 1989.

The Northridge earthquake data set is used in problems at the end of several chapters of the book to illustrate a number of multivariate techniques. Multivariate analyses of these data include, for example, exploring pre- and post-earthquake preparedness activities as well as taking into account several factors relating to the subject and the property (Nguyen et al., 2006).

Bank loan study

The managers of a bank need some way to improve their prediction of which borrowers will successfully pay back a type of bank loan. They have data from the past on the characteristics of persons to whom the bank has lent money and the subsequent record of how well the person has repaid the loan. Loan payers can be classified into several types: those who met all of the terms of the loan, those who eventually repaid the loan but often did not meet deadlines, and those who simply defaulted. They also have information on age, sex, income, other indebtedness, length of residence, type of residence, family size, occupation, and the reason for the loan. The question is, can a simple rating system be devised that will help the bank personnel improve their prediction rate and lessen the time it takes to approve loans? The methods described in Chapter 12 and Chapter 13 can be used to answer this question.

Lung function study

The purpose of this lung function study of chronic respiratory disease is to determine the effects of various types of smog on lung function of children and adults in the Los Angeles area. Because they could not randomly assign people to live in areas that had different levels of pollutants, the investigators were very concerned about the interaction that might exist between the locations where persons chose to live and their values on various lung function tests. The investigators picked four areas of quite different types of air pollution and measured various demographic and other responses on all persons over seven years old who live there. These areas were chosen so that they are close to an air-monitoring station.

The researchers took measurements at two points in time and used the change in lung function over time as well as the levels at the two periods as outcome measures to assess the effects of air pollution. The investigators had to do the lung function tests by using a mobile unit in the field, and much effort went into problems of validating the accuracy of the field observations. A discussion of the particular lung function measurements used for one of the four areas can be found in Detels et al. (1975). In the analysis of the data, adjustments must be made for sex, age, height, and smoking status of each person.

Over 15,000 respondents have been examined and interviewed in this study. The data set is being used to answer numerous questions concerning effects of air pollution, smoking, occupation, etc. on different lung function measurements. For example, since the investigators obtained measurements on all family members seven years old and older, it is possible to assess the effects of having parents who smoke on the lung function of their children (Tashkin et al., 1984). Studies of this type require multivariate analyses so that investigators can arrive at plausible scientific conclusions that could explain the resulting lung function levels.

This data set is described in Appendix A. Lung function and associated data for nonsmoking families for the father, mother, and up to three children ages 7–17 are available from the book’s web site.

School data set

The school data set is a publicly available data set that is provided by the National Center for Educational Statistics. The data come from the National Education Longitudinal Study of 1988 (called NELS:88). The study collected data beginning with 8th graders and conducted initial interviews and four follow-up interviews which were performed every other year. The data used here contain only initial interview data. They represent a random subsample of 23 schools with 519 students out of more than a thousand schools with almost twenty-five thousand students. Extensive documentation of all aspects of the study is available at the following web site: http://nces.ed.gov/surveys/NELS88/. The longitudinal component of NELS:88 has been used to investigate change in students’ lives and school-related and other outcomes. The focus on the initial interview data provides the opportunity to examine associations between school and student-related factors and students’ academic performance in a cross-sectional manner. This type of analysis will be illustrated in Chapter 18.


1.3 Exploratory versus confirmatory analyses

A crucial component for most research studies and analyses is the testing of hypotheses. For some types of studies, hypotheses are specified in detail prior to study start (a priori) and then remain unchanged. This is typically the case, e.g., for clinical trials and other designed experiments. For other types of studies, some hypotheses might be specified in advance while others are generated only after study start and potentially after reviewing some or all of the study data. This is often the case for observational studies. In this section, we make a distinction between two conceptually different approaches to analysis and reporting based on whether the primary goal of a study is to confirm prespecified hypotheses or to explore hypotheses that have not been prespecified.

The following is a motivating example provided by Fleming (2010). He describes an experience where he walked into a maternity ward (when they still had such) while visiting a friend who had just given birth. He noticed that there were 22 babies, but only 2 of one gender while the other 20 were of the other gender. As a statistician, he dutifully calculated the p-value for the likelihood of seeing such (or worse) imbalance if in truth there are 50% of each. The two-sided p-value turns out to be 0.0001, indicating a very small likelihood (1 in 10,000) of such or more extreme imbalance to be observed if in truth there are 50% of each. This is an example of where the hypothesis was generated after seeing the data. We will call such hypotheses exploratory.

Following Fleming, researchers might want to go out and test an exploratory hypothesis in another setting or with new data. In the example above, one might want to go to another maternity ward to collect further evidence of a strong imbalance in gender distribution at birth. Imagine that in a second (confirmatory) maternity ward there might be exactly equal numbers for each gender (e.g. 11 boys and 11 girls). Testing the same hypothesis in this setting will not yield any statistically significant difference from the presumed 50%. Nevertheless, one might be tempted to simply combine the two studies. A corresponding two-sided p-value remains statistically significant (p-value < 0.01).
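The arithmetic in the maternity ward example can be checked with a short sketch. The text does not say which test was used, so we assume here an exact two-sided binomial test with a null proportion of 50%; the function name is ours, chosen only for illustration.

```python
from math import comb


def two_sided_binomial_p(k, n):
    """Exact two-sided p-value for k of one gender among n births, null p = 0.5.

    Binomial(n, 0.5) is symmetric, so the two-sided p-value is twice the
    smaller tail probability, capped at 1.
    """
    lower = min(k, n - k)
    tail = sum(comb(n, i) for i in range(lower + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_first = two_sided_binomial_p(2, 22)      # exploratory ward: 2 versus 20
p_second = two_sided_binomial_p(11, 22)    # confirmatory ward: 11 versus 11
p_combined = two_sided_binomial_p(13, 44)  # naive pooling: 13 versus 31

print(f"first ward:  {p_first:.5f}")
print(f"second ward: {p_second:.5f}")
print(f"combined:    {p_combined:.5f}")
```

Under this assumption the pooled p-value stays below 0.01 even though the second ward shows no imbalance at all, because the pooled data still carry the chance imbalance that prompted the hypothesis.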

The above example might appear silly, because few researchers will believe that the distribution of gender at birth (without human interference) is very different from 50%. Nevertheless, there are many published research articles which test and present the results for hypotheses that were generated by looking at data and noticing ‘unusual’ results. Without a clear distinction between whether hypotheses were specified a priori or not, it is difficult to interpret the p-values provided.

Results from confirmatory analyses provide much stronger evidence than results from exploratory analyses. Accordingly, interpretation of results from confirmatory analyses can be stated using much stronger language than interpretation of results from exploratory analyses. Furthermore, results from exploratory analyses should not be combined with results from confirmatory analyses (e.g. in meta-analyses), because the random high bias (Fleming, 2010) will remain (albeit attenuated). To avoid random high bias when combining data or estimates from multiple studies, only data/estimates from confirmatory analyses should be combined. However, this requires clear identification of whether confirmatory or exploratory analyses were performed for each individual study and/or analysis.

Many authors have pointed out that the medical literature is replete with studies that cannot be reproduced (Breslow, 1999; Munafo et al., 2017). As argued by Breslow (1999), reproducibility of studies, and in particular epidemiologic studies, can be improved if hypotheses are specified a priori and the nature of the study (exploratory versus confirmatory) is clearly specified.

Throughout this book, we distinguish between the two approaches to multivariate analyses and presentations of results and provide examples for each.

1.4 Multivariate analyses discussed in this book

In this section a brief description of the major multivariate techniques covered in this book is presented. To keep the statistical vocabulary to a minimum, we illustrate the descriptions by examples.


Simple linear regression

A nutritionist wishes to study the effects of early calcium intake on the bone density of postmenopausal women. She can measure the bone density of the arm (radial bone), in grams per square centimeter, by using a noninvasive device. Women who are at risk of hip fractures because of too low a bone density will tend to show low arm bone density also. The nutritionist intends to sample a group of elderly churchgoing women. For women over 65 years of age, she will plot calcium intake as a teenager (obtained by asking the women about their consumption of high-calcium foods during their teens) on the horizontal axis and arm bone density (measured) on the vertical axis. She expects the radial bone density to be lower in women who had a lower calcium intake. The nutritionist plans to fit a simple linear regression equation and test whether the slope of the regression line is zero. In this example a single outcome factor is being predicted by a single predictor factor.

Simple linear regression as used in this case would not be considered multivariate by some statisticians, but it is included in this book to introduce the topic of multiple regression.
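As a minimal sketch of what the nutritionist plans to do, the following code fits a least squares line and computes the t statistic for testing whether the slope is zero. The calcium and bone density values are invented purely for illustration, and the function names are ours; the statistic would be referred to a t distribution with n − 2 degrees of freedom.

```python
from math import sqrt


def fit_line(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def slope_t_statistic(x, y):
    """t statistic for H0: slope = 0 (compare with t on n - 2 df)."""
    n = len(x)
    intercept, slope = fit_line(x, y)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    standard_error = sqrt(rss / (n - 2) / sxx)
    return slope / standard_error


# Invented data: teenage calcium intake (mg/day) and radial bone density (g/cm^2).
calcium = [400, 600, 800, 1000, 1200, 1400]
density = [0.58, 0.61, 0.60, 0.66, 0.67, 0.71]

intercept, slope = fit_line(calcium, density)
print(f"intercept = {intercept:.4f}, slope = {slope:.6f}")
print(f"t = {slope_t_statistic(calcium, density):.2f} on {len(calcium) - 2} df")
```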

Multiple linear regression

A manager is interested in determining which factors predict the dollar value of sales of the firm’s personal computers. Aggregate data on population size, income, educational level, proportion of population living in metropolitan areas, etc. have been collected for 30 areas. As a first step, a multiple linear regression equation is computed, where dollar sales is the outcome variable and the other factors are considered as candidates for predictor variables. A linear combination of the predictors is used to predict the outcome or response variable.

Discriminant function analysis

A large sample of initially disease-free men over 50 years of age from a community has been followed to see who subsequently has a diagnosed heart attack. At the initial visit, blood was drawn from each man, and numerous other determinations were made, including body mass index, serum cholesterol, phospholipids, and blood glucose. The investigator would like to determine a linear function of these and possibly other measurements that would be useful in predicting who would and who would not get a heart attack within ten years. That is, the investigator wishes to derive a classification (discriminant) function that would help determine whether or not a middle-aged man is likely to have a heart attack.

Logistic regression

An online movie streaming service has classified movies into two distinct groups according to whether they have a high or low proportion of the viewing audience when shown. The company also records data on features such as the length of the movie, the genre, and the characteristics of the actors. An analyst would use logistic regression because some of the data do not meet the assumptions for statistical inference used in discriminant function analysis, but they do meet the assumptions for logistic regression. From logistic regression we derive an equation to estimate the probability of capturing a high proportion of the target audience.

Poisson regression

In a health survey, middle school students were asked how many visits they made to the dentist in the last year. The investigators are concerned that many students in this community are not receiving adequate dental care. They want to determine what characterizes how frequently students go to the dentist so that they can design a program to improve utilization of dental care. Visits per year are count data and Poisson regression analysis provides a good tool for analyzing this type of data. Poisson regression is covered in the logistic regression chapter.


Survival analysis

An administrator of a large health maintenance organization (HMO) has collected data for a number of years on length of employment in years for their physicians who are either family practitioners or internists. Some of the physicians are still employed, but many have left. For those still employed, the administrator can only know that their ultimate length of employment will be greater than their current length of employment. The administrator wishes to describe the distribution of length of employment for each type of physician, determine the possible effects of factors such as gender and location of work, and test whether or not the length of employment is the same for two specialties. Survival analysis, or event history analysis (as it is often called by behavioral scientists), can be used to analyze the distribution of time to an event such as quitting work, having a relapse of a disease, or dying of cancer.
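To show how physicians who are still employed contribute information without a known final length of employment, here is a small sketch of the Kaplan-Meier product-limit estimator. This is one common way to describe such a distribution; the passage above does not name a specific method, and the employment data and function name below are invented for illustration.

```python
def kaplan_meier(durations, left_job):
    """Return [(time, estimated probability of still being employed)].

    durations: years of employment observed so far for each physician
    left_job:  True if the physician left (event), False if still
               employed (censored at the current length of employment)
    """
    survival = 1.0
    curve = []
    event_times = sorted({t for t, left in zip(durations, left_job) if left})
    for t in event_times:
        # Everyone observed for at least t years is still at risk at time t.
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, left in zip(durations, left_job) if left and d == t)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Invented data: five physicians, two of whom are still employed.
years = [2, 3, 3, 5, 8]
left = [True, False, True, True, False]

for t, s in kaplan_meier(years, left):
    print(f"after {t} years: {s:.2f}")
```

Censored physicians stay in the at-risk count up to their current length of employment and then drop out without counting as departures, which is how the estimator uses the partial information described above.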

Principal components analysis

An investigator has made a number of measurements of lung function on a sample of adult males who do not smoke. In these tests each man is told to inhale deeply and then blow out as fast and as much as possible into a spirometer, which makes a trace of the volume of air expired over time. The maximum or forced vital capacity (FVC) is measured as the difference between maximum inspiration and maximum expiration. Also, the amount of air expired in the first second (FEV1), the forced mid-expiratory flow rate (FEF 25–75), the maximal expiratory flow rate at 50% of forced vital capacity (V50), and other measures of lung function are calculated from this trace. Since all these measures are made from the same flow–volume curve for each man, they are highly interrelated. From past experience it is known that some of these measures are more interrelated than others and that they measure airway resistance in different sections of the airway.

The investigator performs a principal components analysis to determine whether a new set of measurements called principal components can be obtained. These principal components will be linear functions of the original lung function measurements and will be uncorrelated with each other. It is hoped that the first two or three principal components will explain most of the variation in the original lung function measurements among the men. Also, it is anticipated that some operational meaning can be attached to these linear functions that will aid in their interpretation. The investigator may decide to do future analyses on these uncorrelated principal components rather than on the original data. One advantage of this method is that often fewer principal components are needed than original variables. Also, since the principal components are uncorrelated, future computations and explanations can be simplified.

Factor analysis

An investigator has asked each respondent in a survey whether he or she strongly agrees, agrees, is undecided, disagrees, or strongly disagrees with 15 statements concerning attitudes toward inflation. As a first step, the investigator will do a factor analysis on the resulting data to determine which statements belong together in sets that are uncorrelated with other sets. The particular statements that form a single set will be examined to obtain a better understanding of attitudes toward inflation. Scores derived from each set or factor will be used in subsequent analyses to predict consumer spending.

Cluster analysis

Investigators have made numerous measurements on a sample of patients who have been classified as being depressed. They wish to determine, on the basis of their measurements, whether these patients can be classified by type of depression. That is, is it possible to determine distinct types of depressed patients by performing a cluster analysis on patient scores on various tests?


Unlike the investigator studying men who do or do not get heart attacks, these investigators do not possess a set of individuals whose type of depression can be known before the analysis is performed. Nevertheless, the investigators want to separate the patients into unique groups and to examine the resulting groups to see whether distinct types do exist and, if so, what their characteristics are.

Log-linear analysis

An epidemiologist in a medical study wishes to examine the interrelationships among the use of substances that are thought to be risk factors for disease. These include four risk factors where the answers have been summarized into categories. The risk factors are smoking tobacco (yes at present, former smoker, never smoked), drinking (yes, no), marijuana use (yes, no), and other illicit drug use (yes, no). Previous studies have shown that people who drink are more apt than nondrinkers to smoke cigarettes, but the investigator wants to study the associations among the use of these four substances simultaneously.

Correlated outcomes regression

A health services researcher is interested in determining the hospital-related costs of appendectomy, the surgical removal of the appendix. Data are available for a number of patients in each of several hospitals. Such a sample is called a clustered sample since patients are clustered within hospitals. For each operation, the information includes the costs as well as the patient’s age, gender, health status and other characteristics. Information is also available on the hospital, such as its number of beds, location and staff size. A multiple linear regression equation is computed, where cost is the outcome variable and the other factors are considered as candidates for predictor variables. As in multiple linear regression, a linear combination of the predictors is used to predict the outcome or response variable. However, adjustments to the analysis must be made to account for the clustered nature of the sample, namely the possibility that patients within any one hospital may be more similar to each other than to patients in other hospitals. Since the outcomes within a given hospital are correlated, the researcher plans to use correlated outcomes regression to analyze the data.

1.5 Organization and content of the book

This book is organized into two major parts. Part One (Chapters 1–6) deals with data entry, preparation, visualization, screening, missing values, transformations, and decisions about likely choices for analysis. Part Two (Chapters 7–18) deals with regression analysis.

Chapters 2–6 are concerned with data preparation and the choice of what analysis to use. First, variables and how they are classified are discussed in Chapter 2. The next chapter concentrates on the practical problems of getting data into the computer, handling nonresponse, data management, getting rid of erroneous values, and preparing a useful codebook. Visualization techniques are discussed in Chapter 4. The next chapter deals with checking assumptions of normality and independence. The features of computer software packages used in this book are discussed. The choice of appropriate statistical analyses is discussed in Chapter 6.

Readers who are familiar with handling data sets on computers could skip some of these initial chapters and go directly to Chapter 7. However, formal course work in statistics often leaves an investigator unprepared for the complications and difficulties involved in real data sets. The material in Chapters 2–6 was deliberately included to fill this gap in preparing investigators for real world data problems.

For a course limited to multivariate analysis, Chapters 2–6 can be omitted if a carefully prepared data set is used for analysis. The depression data set, presented in Chapter 3, has been modified to make it directly usable for multivariate data analysis, but the user may wish to subtract one from the variables 2, 31, 33, and 34 to change the values to zeros and ones. Also, the lung function data, the lung cancer data, and the parental HIV data are briefly described in Appendix A. These data, along with the data in Table 9.1 and Table 16.1, are available on the web from the publisher. See Appendix A or the preface for the exact web site address.
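The recoding of variables 2, 31, 33, and 34 mentioned above can be sketched as follows. We assume, as the passage implies, that these variables are coded 1 and 2 in the file; the record layout (a dictionary keyed by variable number) and the function name are hypothetical and serve only to illustrate the step.

```python
def recode_to_zero_one(record, positions=(2, 31, 33, 34)):
    """Return a copy of `record` with 1 subtracted from the listed variables.

    `record` maps variable number to value. Missing values (None) are
    left untouched, and the original record is not modified.
    """
    recoded = dict(record)
    for position in positions:
        if recoded.get(position) is not None:
            recoded[position] -= 1
    return recoded


# Hypothetical partial record: variable number -> value.
respondent = {1: 17, 2: 2, 31: 1, 33: 2, 34: 1}
print(recode_to_zero_one(respondent))
```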

In Chapters 7–18 we follow a standard format. The topics discussed in each chapter are given, followed by a discussion of when the techniques are used. Then the basic concepts and formulas are explained. Further interpretation, and data examples with topics chosen that relate directly to the techniques, follow. Finally, a summary of the available computer output that may be obtained from four statistical software packages is presented. We conclude each chapter with a discussion of pitfalls to avoid and alternatives to consider when performing the analyses described.

As much as possible, we have tried to make each chapter self-contained. However, Chapters 11 and 12, on discriminant analysis and logistic regression, are somewhat interrelated, as are Chapters 14 and 15, covering principal components and factor analysis.

References for further information on each topic are given in each chapter. Most of the references do require more mathematics than this book, but special emphasis can be placed on references that include examples. If you wish primarily to learn the concepts involved in multivariate techniques and are not as interested in performing the analysis, then a conceptual introduction to multivariate analysis can be found in Kachigan (1991). Everitt and Dunn (2001) provide a highly readable introduction also. For a concise description of multivariate analysis see Manly (2016).

We believe that the best way to learn multivariate analysis is to do it on data that you are familiar with. No book can illustrate all the features found in computer output for a real-life data set. Learning multivariate analysis is similar to learning to swim: you can go to lectures, but the real learning occurs when you get into the water.

Chapter 2

Characterizing data for analysis

2.1 Variables: their definition, classification, and use

In performing multivariate analysis, the investigator deals with numerous variables. In this chapter, we define what a variable is in Section 2.2. Section 2.3 presents a method of classifying variables that is sometimes useful in multivariate analysis since it allows one to check that a commonly used analysis has not been missed. Section 2.4 explains how variables are used in analysis and gives the common terminology for distinguishing between the two major uses of variables. Section 2.5 includes some examples of classifying variables and Section 2.6 discusses other characteristics of data and references exploratory data analysis.

2.2 Defining statistical variables

The word variable is used in statistically oriented literature to indicate a characteristic or property that is possible to measure. When we measure something, we make a numerical model of the thing being measured. We follow some rule for assigning a number to each level of the particular characteristic being measured. For example, the height of a person is a variable. We assign a numerical value to correspond to each person’s height. Two people who are equally tall are assigned the same numeric value. On the other hand, two people of different heights are assigned two different values. Measurements of a variable gain their meaning from the fact that there exists a unique correspondence between the assigned numbers and the levels of the property being measured. Thus two people with different assigned heights are not equally tall. Conversely, if a variable has the same assigned value for all individuals in a group, then this variable does not convey useful information to differentiate individuals in the group.

Physical measurements, such as height and weight, can be measured directly by using physical instruments. On the other hand, properties such as reasoning ability or the state of depression of a person must be measured indirectly. We might choose a particular intelligence test and define the variable “intelligence” to be the score achieved on this test. Similarly, we may define the variable “depression” as the number of positive responses to a series of questions. Although what we wish to measure is the degree of depression, we end up with a count of yes answers to some questions. These examples point out a fundamental difference between direct physical measurements and abstract variables.

Often the question of how to measure a certain property can be perplexing. For example, if the property we wish to measure is the cost of keeping the air clean in a particular area, we may be able to come up with a reasonable estimate, although different analysts may produce different estimates. The problem becomes much more difficult if we wish to estimate the benefits of clean air.

On any given individual or thing we may measure several different characteristics. We would then be dealing with several variables, such as age, height, annual income, race, sex, and level of depression of a certain individual. Similarly, we can measure characteristics of a corporation, such as various financial measures. In this book we are concerned with analyzing data sets consisting of measurements on several variables for each individual in a given sample. We use the symbol P to denote the number of variables and the symbol N to denote the number of individuals, observations, cases, or sampling units.

2.3 Stevens’s classification of variables

In the determination of the appropriate statistical analysis for a given set of data, it is useful to classify variables by type. One method for classifying variables is by the degree of sophistication evident in the way they are measured. For example, we can measure the height of people according to whether the top of their head exceeds a mark on the wall; if yes, they are tall; and if no, they are short. On the other hand, we can also measure height in centimeters or inches. The latter technique is a more sophisticated way of measuring height. As a scientific discipline advances, the measurement of the variables used in it tends to become more sophisticated.

Various attempts have been made to formalize variable classification. A commonly accepted system is that proposed by Stevens (1955). In this system, measurements are classified as nominal, ordinal, interval, or ratio. In deriving his classification, Stevens characterized each of the four types by a transformation that would not change a measurement’s classification. In the subsections that follow, rather than discuss the mathematical details of these transformations, we present the practical implications for data analysis.

As with many classification schemes, Stevens’s system is useful for some purposes but not for others. It should be used as a general guide to assist in characterizing the data and to make sure that a useful analysis is not overlooked. However, it should not be used as a rigid rule that ignores the purpose of the analysis or limits its scope (Velleman and Wilkinson, 1993).

Nominal variables

With nominal variables each observation belongs to one of several distinct categories. The categories are not necessarily numerical, although numbers may be used to represent them. For example, “sex” is a nominal variable. An individual’s gender is either male or female. We may use any two symbols, such as M and F, to represent the two categories. In data analysis, numbers are used as the symbols since many computer programs are designed to handle only numerical symbols. Since the categories may be arranged in any desired order, any set of numbers can be used to represent them. For example, we may use 0 and 1 to represent males and females, respectively. We may also use 1 and 2 to avoid confusing zeros with blanks. Any two other numbers can be used as long as they are used consistently.

An investigator may rename the categories, thus performing a numerical operation. In doing so, the investigator must preserve the uniqueness of each category. Stevens expressed this last idea as a “basic empirical operation” that preserves the category to which the observation belongs. For example, two males must have the same value on the variable “sex,” regardless of the two numbers chosen for the categories. Table 2.1 summarizes these ideas and presents further examples. Nominal variables with more than two categories, such as race or religion, may present special challenges to the multivariate data analyst. Some ways of dealing with these variables are presented in Chapter 8.
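The requirement that renaming preserve each category can be written as a simple check: two codings of a nominal variable are equivalent when they group the observations in exactly the same way. The function name and the sample codings below are ours and serve only as an illustration.

```python
def same_categories(codes_a, codes_b):
    """True if the two codings are a one-to-one relabeling of each other."""
    pairs = set(zip(codes_a, codes_b))
    # One-to-one means each label in A is paired with exactly one label in B
    # and vice versa, so all three counts must agree.
    return len(pairs) == len(set(codes_a)) == len(set(codes_b))


sex_letters = ["M", "F", "F", "M"]
sex_zero_one = [0, 1, 1, 0]   # 0 = male, 1 = female
sex_one_two = [1, 2, 2, 1]    # 1 = male, 2 = female
inconsistent = [1, 1, 2, 1]   # the same code used for a male and a female

print(same_categories(sex_letters, sex_zero_one))   # True
print(same_categories(sex_zero_one, sex_one_two))   # True
print(same_categories(sex_letters, inconsistent))   # False
```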

Ordinal variables

Categories are used for ordinal variables as well, but there also exists a known order among them. For example, in the Mohs Hardness Scale, minerals and rocks are classified according to ten levels of hardness. The hardest mineral is diamond and the softest is talc (Pough, 1998).

Any ten numbers can be used to represent the categories, as long as they are ordered in magnitude. For instance, the integers 1–10 would be natural to use. On the other hand, any sequence of increasing numbers may also be used. Thus, the basic empirical operation defining ordinal variables is whether one observation is greater than another. For example, we must be able to determine whether one mineral is harder than another. Hardness can be tested easily by noting which mineral can scratch the other. Note that for most ordinal variables there is an underlying continuum being approximated by artificial categories. For example, in the above hardness scale fluorite is defined as having a hardness of 4, and calcite, 3. However, there is a range of hardness between these two numbers not accounted for by the scale.

Table 2.1: Stevens’s measurement system

Type of measurement    Basic empirical operation        Examples
Nominal                Determine equality of            Company names
                       categories                       Race
                                                        Religion
                                                        Soccer players’ numbers
Ordinal                Determine greater than           Hardness of minerals
                       or less than (ranking)           Socioeconomic status
                                                        Rankings of wines
Interval               Determine equality of            Temperature in degrees Fahrenheit
                       differences between levels       Calendar dates
Ratio                  Determine equality of            Height
                       ratios of levels                 Weight
                                                        Density
                                                        Difference in time

Often investigators classify people, or ask them to classify themselves, along some continuum (see Luce and Narens, 1987). For example, a physician may classify a patient’s disease status as none = 1, mild = 2, moderate = 3, and severe = 4. Clearly, increasing numbers indicate increasing severity, but it is not certain that the difference between not having an illness and having a mild case is the same as between having a mild case and a moderate case. Hence, according to Stevens’s classification system, this is an ordinal variable.
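The point that only the ordering of ordinal codes carries meaning can be illustrated with a short sketch. The alternative codes below are arbitrary numbers we made up; any increasing sequence would serve equally well.

```python
def same_ordering(codes_a, codes_b):
    """True if every pairwise comparison (<, =, >) agrees in both codings."""
    def sign(value):
        return (value > 0) - (value < 0)

    n = len(codes_a)
    return all(
        sign(codes_a[i] - codes_a[j]) == sign(codes_b[i] - codes_b[j])
        for i in range(n)
        for j in range(n)
    )


# Disease status for five patients: none = 1, mild = 2, moderate = 3, severe = 4.
status = [1, 2, 3, 4, 2]
# An arbitrary increasing recoding: none = 0, mild = 10, moderate = 11, severe = 50.
recoded = [0, 10, 11, 50, 10]

print(same_ordering(status, recoded))   # True: the ranking is unchanged

# The spacing between codes is not preserved, so differences are not meaningful.
print(status[1] - status[0], status[2] - status[1])       # 1 1
print(recoded[1] - recoded[0], recoded[2] - recoded[1])   # 10 1
```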

Interval variables

An interval variable is a variable in which the differences between successive values are always the same. For example, the variable “temperature,” in degrees Fahrenheit, is measured on the interval scale since the difference between 12° and 13° is the same as the difference between 13° and 14° or the difference between any two successive temperatures. In contrast, the Mohs Hardness Scale does not satisfy this condition since the intervals between successive categories are not necessarily the same. The scale must satisfy the basic empirical operation of preserving the equality of intervals.

Ratio variables

Ratio variables are interval variables with a natural point representing the origin of measurement, i.e., a natural zero point. For instance, height is a ratio variable since zero height is a naturally defined point on the scale. We may change the unit of measurement (e.g., centimeters to inches), but we would still preserve the zero point and also the ratio of any two values of height. Temperature in degrees Fahrenheit is not a ratio variable since its zero point is chosen arbitrarily, so ratios are not preserved.

There is an interesting relationship between interval and ratio variables. The difference between two values of an interval variable is a ratio variable. For example, although time of day is measured on the interval scale, the length of a time period is a ratio variable since it has a natural zero point.
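A brief numerical sketch may help separate the two scales. We compare ratios before and after a change of unit, using the standard conversions between inches and centimeters and between Fahrenheit and Celsius; the particular heights and temperatures are arbitrary.

```python
def fahrenheit_to_celsius(f):
    return (f - 32) * 5 / 9


def inches_to_cm(x):
    return x * 2.54


# Ratio variable: the ratio of two heights survives a change of unit.
height_ratio_in = 70 / 35
height_ratio_cm = inches_to_cm(70) / inches_to_cm(35)

# Interval variable: the ratio of two temperatures does not.
temp_ratio_f = 80 / 40
temp_ratio_c = fahrenheit_to_celsius(80) / fahrenheit_to_celsius(40)

# Differences between temperatures behave like a ratio variable:
# the ratio of two differences is the same in either unit.
diff_ratio_f = (80 - 40) / (60 - 40)
diff_ratio_c = (fahrenheit_to_celsius(80) - fahrenheit_to_celsius(40)) / (
    fahrenheit_to_celsius(60) - fahrenheit_to_celsius(40)
)

print(f"height ratio:      {height_ratio_in:.2f} (in)  {height_ratio_cm:.2f} (cm)")
print(f"temperature ratio: {temp_ratio_f:.2f} (F)   {temp_ratio_c:.2f} (C)")
print(f"difference ratio:  {diff_ratio_f:.2f} (F)   {diff_ratio_c:.2f} (C)")
```

Saying that 80° is "twice as warm" as 40° therefore depends on the unit chosen, whereas saying that one temperature difference is twice another does not.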


Other classifications

Other methods of classifying variables have also been proposed. Many authors use the term categorical to refer to nominal and ordinal variables where categories are used.

We mention, in addition, that variables may be classified as discrete or continuous. A variable is called continuous if it can take on any value in a specified range. Thus the height of an individual may be 70 or 70.4539 inches. Any numerical value in a certain range is a conceivable height.

A variable that is not continuous is called discrete. A discrete variable may take on only certain specified values. For example, counts are discrete variables since only zero or positive integers are allowed. In fact, all nominal and ordinal variables are discrete. Interval and ratio variables can be continuous or discrete. This latter classification carries over to the possible distributions assumed in the analysis. For instance, the normal distribution is often used to describe the distribution of continuous variables.

Statistical analyses have been developed for various types of variables. In Chapter 6 a guide to selecting the appropriate descriptive measures and multivariate analyses will be presented. The choice depends on how the variables are used in the analysis, a topic that is discussed next.

2.4 How variables are used in data analysis

The type of data analysis required in a specific situation is also related to the way in which each variable in the data set is used. Variables may be used to measure outcomes or to explain why a particular outcome resulted. For example, in the treatment of a given disease a specific drug may be used. The outcome variable may be a discrete variable classified as “cured” or “not cured.” The outcome variable may depend on several characteristics of the patient such as age, genetic background, and severity of the disease. These characteristics are sometimes called explanatory or predictor variables. Equivalently, we may call the outcome the dependent variable and the characteristics the independent variables. The latter terminology is very common in statistical literature. This choice of terminology is unfortunate in that the “independent” variables do not have to be statistically independent of each other. Indeed, these independent variables are usually interrelated in a complex way. Another disadvantage of this terminology is that the common connotation of the words implies a causal model, an assumption not needed for the multivariate analyses described in this book. In spite of these drawbacks, the widespread use of these terms forces us to adopt them.

In other situations the dependent or outcome variable may be treated as a continuous variable. For example, in household survey data we may wish to relate monthly expenditure on cosmetics per household to several explanatory or independent variables such as the number of individuals in the household, their gender, and the household income.

In some situations the roles that the various variables play are not obvious and may also change, depending on the question being addressed. Thus a data set for a certain group of people may contain observations on their sex, age, height, diet, weight, and blood pressure. In one analysis, we may use weight as a dependent or outcome variable with height, sex, age, and diet as the independent or predictor variables. In another analysis, blood pressure might be the dependent or outcome variable, with weight and other variables considered as independent or predictor variables.

In certain exploratory analyses all the variables may be used as one set with no regard to whether they are dependent or independent. For example, in the social sciences a large number of variables may be defined initially, followed by attempts to combine them into a smaller number of summary variables. In such an analysis the original variables are not classified as dependent or independent. The summary variables may later be used either as outcome or predictor variables. In Chapter 6 multivariate analyses described in this book will be characterized by the situations in which they apply according to the types of variables analyzed and the roles they play in the analysis.


2.5 Examples of classifying variables

In the depression data example several variables are measured on the nominal scale: sex, marital status, employment, and religion. The general health scale is an example of an ordinal variable. Income and age are both ratio variables. No interval variable is included in the data set. A partial listing and a codebook for this data set are given in Chapter 3.

One of the questions that may be addressed in analyzing these data is “Which factors are relatedto the degree of psychological depression of a person?” The variable “cases” may be used as thedependent or outcome variable since an individual is considered a case if his or her score on thedepression scale exceeds a certain level. “Cases” is an ordinal variable, although it can be considerednominal because it has only two categories. The independent or predictor variable could be any orall of the other variables (except ID and measures of depression). Examples of analyses withoutregard to variable roles are given in Chapters 14 and 15 using the variables C1 to C20 in an attemptto summarize them into a small number of components or factors.

Sometimes, Stevens’s classification system is difficult to apply, and two investigators could dis-agree on a given variable. For example, there may be disagreement about the ordering of the cate-gories of a socioeconomic status variable. Thus the status of blue-collar occupations with respect tothe status of certain white-collar occupations might change over time or from culture to culture. Sosuch a variable might be difficult to justify as an ordinal variable, but we would be throwing awayvaluable information if we used it as a nominal variable. Despite these difficulties, Stevens’s systemis useful in making decisions on appropriate statistical analysis, as will be discussed in Chapter 6.

2.6 Other characteristics of data

Data are often characterized by whether the measurements are accurately taken and are relatively error free, and by whether they meet the assumptions that were used in deriving statistical tests and confidence intervals. Often, an investigator knows that some of the variables are likely to have observations that have errors. If the effect of an error causes the numerical value of an observation to not be in line with the numerical values of most of the other observations, these extreme values may be called outliers and should be considered for removal from the analysis. But other observations may not be accurate and still be within the range of most of the observations. Data sets that contain a sizeable portion of inaccurate data or errors are called “dirty” data sets.

Special statistical methods have been developed that are resistant to the effects of dirty data. Other statistical methods, called robust methods, are insensitive to departures from underlying model assumptions. In this book, we do not present these methods but discuss finding outliers and give methods of determining if the data meet the assumptions. For further information on statistical methods that are well suited for dirty data or require few assumptions, see Hoaglin et al. (2000), Schwaiger and Opitz (2003), or Fox and Long (1990).

2.7 Summary

In this chapter statistical variables were defined. Their types and the roles they play in data analysis were discussed. Stevens’s classification system was described. These concepts can affect the choice of analyses to be performed, as will be discussed in Chapter 6.

2.8 Problems

2.1 Classify the following types of data by using Stevens’s measurement system: decibels of noise level, father’s occupation, parts per million of an impurity in water, density of a piece of bone, rating of a wine by one judge, net profit of a firm, and score on an aptitude test.

2.2 In a survey of users of a walk-in mental health clinic, data have been obtained on sex, age, household roster, race, education level (number of years in school), family income, reason for coming to the clinic, symptoms, and scores on screening examination. The investigator wishes to determine what variables affect whether or not coercion by the family, friends, or a governmental agency was used to get the patient to the clinic. Classify the data according to Stevens’s measurement system. What would you consider to be possible independent variables? Dependent variables? Do you expect the dependent variables to be independent of each other?

2.3 For the chronic respiratory study data described in Appendix A, classify each variable according to Stevens’s scale and according to whether it is discrete or continuous. Pose two possible research questions and decide on the appropriate dependent and independent variables.

2.4 Repeat problem 2.3 for the lung cancer data set described in Table 13.1.

2.5 From a field of statistical application (perhaps your own field of specialty), describe a data set and repeat the procedures described in Problem 2.3.

2.6 If the RELIG variable described in Table 3.4 of this text was recoded 1 = Catholic, 2 = Protestant, 3 = Jewish, 4 = none, and 5 = other, would this meet the basic empirical operation as defined by Stevens for an ordinal variable?

2.7 Give an example of nominal, ordinal, interval, and ratio variables from a field of application you are familiar with.

2.8 Data that are ordinal are often analyzed by methods that Stevens reserved for interval data. Give reasons why thoughtful investigators often do this.

2.9 The Parental HIV data set described in Appendix A includes the following variables: job status of mother (JOBMO, 1=employed, 2=unemployed, and 3=retired/disabled) and mother’s education (EDUMO, 1=did not complete high school, 2=high school diploma/GED, and 3=more than high school). Classify these two variables using Stevens’s measurement system.

2.10 Give an example from a field that you are familiar with of an increased sophistication of measuring that has resulted in a measurement that used to be ordinal now being interval.

Chapter 3

Preparing for data analysis

3.1 Processing data so they can be analyzed

Once the data are available from a study there are still a number of steps that must be undertaken to get them into shape for analysis. This is particularly true when multivariate analyses are planned since these analyses are often done on large data sets. In this chapter we provide information on topics related to data processing.

Section 3.2 describes the statistical software packages used in this book. Note that several other statistical packages offer an extensive selection of multivariate analyses. In addition, almost all statistical packages and even some of the spreadsheet programs include at least multiple regression as an option.

The next topic discussed is data entry (Section 3.3). Data collection is often performed using computers directly via Computer Assisted Personal Interviewing (CAPI), Audio Computer Assisted Self Interviewing (ACASI), via the Internet, or via phone apps. For example, SurveyMonkey and Google Forms are free and commercially available programs that facilitate sending and collecting surveys via the Internet. Nonetheless, paper and pencil interviews or mailed questionnaires are still a form of data collection. The methods that need to be used to enter the information obtained from paper and pencil interviews into a computer depend on the size of the data set. For a small data set there are a variety of options since cost and efficiency are not important factors. Also, in that case the data can be easily screened for errors simply by visual inspection. But for large data sets, careful planning of data entry is necessary since costs are an important consideration along with getting a data set for analysis that is as error-free as possible. Here we summarize the data input options available in the statistical software packages used in this book and discuss some important options.

Section 3.4 covers combining and updating data sets. The operations used and the options available in the various packages are described. Initial discussion of missing values, outliers, and transformations is given and the need to save results is stressed.

Section 3.5 discusses methods to conduct research in a reproducible manner and the importance of documenting steps taken during data preparation and analysis in a manner that is human-readable. Finally, in Section 3.6 we introduce a multivariate data set that will be widely used in this book and summarize the data in a codebook.

We want to stress that the procedures discussed in this chapter can be time consuming and frustrating to perform when large data sets are involved. Often the amount of time used for data entry, editing, and screening can far exceed that used on statistical analyses. It is very helpful to either have computer expertise yourself or have access to someone you can get advice from occasionally. Of note, our definition of large data sets includes data sets such as those publicly available from the Centers for Disease Control and Prevention (CDC). More complicated issues arise when data sets are much larger (on the order of terabytes). These arise, e.g., with genetic data or Internet databases. Such sets do not fall within the scope of this book.


3.2 Choice of a statistical package

There is a wide choice of statistical software packages available. Many packages, however, are quite specialized and do not include many of the multivariate analyses given in this book. For example, there are statistical packages that are aimed at particular areas of application or give tests for exact statistics that are more useful for other types of work. In choosing a package for multivariate analysis, we recommend that you consider the statistical analyses listed in Table 6.2 and check whether the package includes them.

In some cases the statistical package is sold as a single unit and in others you purchase a basic package, but you have a choice of additional programs so you can buy what you need. Some programs require yearly license fees, others are free.

Ease of use

Some packages are easier to use than others, although many of us find this difficult to judge; we like what we are familiar with. In general, the packages that are simplest to use have two characteristics. First, they have fewer options to choose from and these options are provided automatically by the program with little need for programming by the user. Second, they use the “point and click” method known as graphical user interface (GUI) for choosing what is done rather than requiring the user to write out statements. However, many current point and click programs do not leave the user with an audit trail of what choices have been made.

On the other hand, software programs with extensive options have obvious advantages. Also, the use of written statements (or commands) allows you to have a written record of what you have done. Such a record makes it easier to re-run programs and to facilitate reproducibility. The record of the commands used can be particularly useful in large-scale data analyses that extend over a considerable period of time and involve numerous investigators. Still other programs provide the user with a programming language that allows the users great freedom in what output they can obtain.

Packages used in this book

In this book, we make specific reference to four general-purpose statistical software packages (listed in alphabetical order): R v3.5, SAS v9.4, SPSS v25, and Stata v15.

R is a language created for performing statistical analyses and provides rich data visualization capabilities. The user writes the language expressions that are read and immediately executed by the program. This process allows the user to write a function, run it, see what happens, and then use the result of the function in a second function. R is a free and open source program where much of the added functionality comes from external contributed packages that must be installed. The CRAN task views aim to provide some guidance on which packages are relevant for tasks related to a certain topic. It is important to note that R is typically used through an integrated development environment (IDE) program called RStudio (RStudio Team, 2015). There are numerous books written on writing programs in R for different areas of application; for example, see Matloff (2011), Hothorn and Everitt (2014), Cotton (2013), Maindonald and Braun (2010), Muenchen (2011) or any of the topic-specialized books in The R Series of textbooks from CRC. Because it is a free program the developers point out that it “comes with ABSOLUTELY NO WARRANTY”.

The SAS philosophy is that the user should string together a sequence of procedures to perform the desired analysis. Some data management and analysis features are available via point and click operations (SAS/ASSIST and SAS/EG). In addition to large volumes of manuals for SAS, numerous texts have been written on using SAS; for example, see Khattree and Naik (1999), Der and Everitt (2014), Delwiche and Slaughter (2012), Marasinghe and Koehler (2018), or Freund and Littell (2000).

SPSS was originally written for survey applications. It offers a number of comprehensive programs and users can choose specific options that they desire. It provides excellent data entry programs and data manipulation procedures. It can be used either by clicking through the file menu system or by writing and executing commands. In addition to the manuals, books such as the ones by Abu-Bader (2010) or Green and Salkind (2016) are available.

Stata is similar to SAS and R in that an analysis consists of a sequence of commands with their own options. Analyses can also be performed via the file menu system with the option to save the commands generated. Features are available to easily log and rerun an analysis. Alternatively, a GUI can be used to select and run analyses. It also includes numerous data management features, a very rich set of graphics options, and a growing set of community-contributed commands for specialized tasks. These are presented through publications such as the Stata Journal. Several books are available which discuss statistical analysis using Stata; see Lalanne and Mesbah (2016), Hamilton (2012), Hills and Stavola (2012), or Cleves et al. (2016) among others.

Since R, SAS and Stata primarily are used by writing and executing a series of commands, effort is required to learn these languages. However, doing so provides the user with a highly versatile programming tool for statistical computing.

When you are learning to use a package for the first time, there is no substitute for reading the on-line HELP, manuals, or texts that present examples. However, at times the sheer number of options presented in these programs may seem confusing, and advice from an experienced user may save you time. Many programs offer default options, and it often helps to use these when you run a program for the first time. In this book, we frequently recommend which options to use. On-line HELP is especially useful when it is programmed to offer information needed for the part of the program you are currently using (context sensitive). Links to the websites for the four software programs discussed in this book can be found in the UCLA web site cited in the preface. Despite the fact that we are providing tables which summarize the commands for the discussed statistical analysis techniques and comment on some in the text, this book is not intended to provide instructions on how to use the statistical software packages. For software specific instructions, the reader is referred to the printed or on-line manuals and books dedicated to that purpose.

There are numerous statistical packages and programming languages that offer statistical packages or modules that are not included in this book. We have tried to choose those that offer a wide range of multivariate techniques.

For information on other packages, you can refer to the statistical computing software review sections of The American Statistician or journals in your own field of interest.

3.3 Techniques for data entry

Appropriate techniques for entering data for analysis depend mainly on the size of the data set and the form in which the data set is stored. As discussed below, all statistical packages can use data in a spreadsheet (or rectangular) format. Each column represents a specific variable and each row has the data record for a case or observation. The variables are in the same order for each case. For example, for the depression data set given later in this chapter, looking only at the first three variables and four cases, we have

ID  Sex  Age
1   2    68
2   1    58
3   2    45
4   2    50

where for the variable “sex,” 1 = male and 2 = female, and “age” is given in years.

Typically each row represents an individual case. What is needed in each row depends on the unit of analysis for the study. By unit of analysis, we mean what is being studied in the analysis. If the individual is the unit of analysis, as it usually is, then the data set just given is in a form suitable for analysis. Another situation is when the individuals belong to one household, and the unit of analysis is the household but data have been obtained from several individuals in the household. Alternatively, for a company, the unit of analysis may be a sales district and sales made by different salespersons in each district are recorded. Data sets given in the last two examples are called hierarchical or clustered data sets and their form can get to be quite complex. Some statistical packages have limited capacity to handle hierarchical data sets. In other cases, the investigator may have to use a relational database package such as Access to first get the data set into the rectangular or spreadsheet form used in the statistical package.
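As a minimal sketch of moving from a hierarchical layout to the rectangular form, the following Python fragment collapses person-level records into one row per household. Python is not one of the four packages used in this book, and the records, variable names, and summary measures here are hypothetical, chosen only to show the idea.

```python
from collections import defaultdict

# Hypothetical person-level records: several individuals per household.
persons = [
    {"household": 1, "person": 1, "age": 68},
    {"household": 1, "person": 2, "age": 58},
    {"household": 2, "person": 1, "age": 45},
]


def to_household_level(rows):
    """Return one rectangular row per household (the unit of analysis)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["household"]].append(row)
    table = []
    for household, members in sorted(groups.items()):
        ages = [member["age"] for member in members]
        table.append({
            "household": household,
            "n_persons": len(members),
            "mean_age": sum(ages) / len(ages),
        })
    return table


households = to_household_level(persons)
for row in households:
    print(row)
```

Which summaries to keep for each household depends on the research question; the count and mean used here are only placeholders.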

As discussed below, either one or two steps are involved in data entry. The first one is entering the data into the computer if data have been collected on paper. Another typical step is to transfer the data to the desired statistical package.

Data entry

Before entering data in most statistical, spreadsheet, or database management packages, the investigator first names the file where the data are stored, states how many variables will be entered, names the variables, and provides information on these variables. Note that in the example just given we listed three variables which were named for easy use later. The file could be called “depress.” Statistical packages commonly allow the user to designate the format and type of variable, e.g., numeric or alphabetic, calendar date, or categorical. They allow you to specify missing value codes, the length of each variable, and the placement of the decimal points. Each program has slightly different features so it is critical to read the appropriate online HELP statements or manual, particularly if a large data set is being entered.

The two commonly used formats for data entry are the spreadsheet and the form. By spreadsheet, we mean the format given previously where the columns are the variables and the rows the cases. This method of entry allows the user to see the input from previous records, which often gives useful clues if an error in entry is made. The spreadsheet method is very commonly used, particularly when all the variables can be seen on the screen without scrolling.

With the form method, only one record, the one being currently entered, is on view on the screen. There are several reasons for using the form method. An entry form can be made to look like the original data collection form so that the data entry person sees data in the same place on the screen as it is in the collection form. A large number of variables for each case can be seen on a computer monitor screen and they can be arranged in a two-dimensional array, instead of just the one-dimensional array available for each case in the spreadsheet format. Flipping pages (screens) in a display may be simpler than scrolling left or right for data entry. Short coding comments can be included on the screen to assist in data entry. Also, if the data set includes alphabetical information such as short answers to open-ended questions, then the form method is preferred.

The choice between these two formats can be a matter of personal preference, but in general the spreadsheet is used for data sets with a small or medium number of variables and the form is used for a larger number of variables and for studies requiring detailed records of data entry and potential data changes. In some cases a scanner can be used to enter the data and then an optical character recognition program converts the image to the desired text and numbers.

To make the discussion more concrete, we present the features given in a specific data entry package. The SPSS data entry program provides a good mix of features that are useful in entering large data sets. It allows either spreadsheet or form entry and switching back and forth between the two modes. In addition to the features already mentioned, SPSS provides what is called “skip and fill.” In medical studies and surveys, it is common that if the answer to a certain question is no, a series of additional questions can then be skipped. For example, subjects might be asked if they ever smoked, and if the answer is yes they are asked a series of questions on smoking history. But if they answer no, these questions are not asked and the interviewer skips to the next section of the interview. The skip-and-fill option allows the investigator to specify that if a person answers no, the smoking history questions are automatically filled in with specified values and the entry cursor moves to the start of the next section. This saves a lot of entry time and possible errors.
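The logic behind skip and fill can be sketched in a few lines. The Python fragment below is only an illustration of the idea, not the SPSS feature itself; the field names and the fill value (`None`, standing for “not asked”) are assumptions made for this example.

```python
# Hypothetical follow-up items asked only of people who have ever smoked.
SMOKING_HISTORY = ["age_started", "cigs_per_day", "years_smoked"]
FILL_VALUE = None  # assumed fill value meaning "question skipped"


def enter_record(answers):
    """Build a complete record, filling the skipped smoking-history items."""
    record = {"id": answers["id"], "ever_smoked": answers["ever_smoked"]}
    if answers["ever_smoked"] == "no":
        # Skip: the follow-up questions are not asked, so fill them in.
        for field in SMOKING_HISTORY:
            record[field] = FILL_VALUE
    else:
        # The follow-up questions were asked, so copy the answers.
        for field in SMOKING_HISTORY:
            record[field] = answers[field]
    return record


never = enter_record({"id": 1, "ever_smoked": "no"})
smoker = enter_record({"id": 2, "ever_smoked": "yes", "age_started": 17,
                       "cigs_per_day": 10, "years_smoked": 12})
print(never)
print(smoker)
```

Both records end up with the same set of variables, which keeps the data set rectangular whether or not the follow-up section was asked.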

Another feature available in many packages is range checking. Here the investigator can enter upper and lower values for each variable. If the data entry person enters a value that is either lower than the low value or higher than the high value, the data entry program provides a warning. For example, for the variable “sex,” if an investigator specifies 1 and 2 as possible values and the data entry person hits a 3 by mistake, the program issues a warning. This feature, along with input by forms or spreadsheet, is available also in SAS.
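A range check of this kind can be sketched as follows. This is a minimal Python illustration rather than the mechanism of any particular package; the limits for “sex” follow the example in the text, while the age limits are assumed for the illustration.

```python
# Lower and upper limits per variable (the age limits are hypothetical).
RANGES = {"sex": (1, 2), "age": (18, 110)}


def range_warnings(record):
    """Return one warning for each value outside its allowed range."""
    warnings = []
    for name, (low, high) in RANGES.items():
        value = record[name]
        if value < low or value > high:
            warnings.append(f"{name}={value} is outside {low}-{high}")
    return warnings


print(range_warnings({"id": 3, "sex": 3, "age": 45}))  # sex was mistyped as 3
print(range_warnings({"id": 3, "sex": 2, "age": 45}))  # no warnings
```

A check like this flags only values outside the limits; a mistyped value that still falls inside the range passes unnoticed.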

Each software package has its own set of features, and the reader is encouraged to examine them before entering medium or large data sets in order to take advantage of them.

Mechanisms of entering data

Data can be entered for statistical computation from different sources. We will discuss four of them.

1. entering the data along with the program or procedure statements for a batch-process run;

2. using the data entry features of the statistical package you intend to use;

3. entering the data from an outside file which is constructed without the use of the statistical package;

4. importing the data from another package using the operating system such as Windows or MAC OS.

Of note, the above data entry approaches are typically considered acceptable for observational data and/or for small studies. However, some studies, such as most clinical trials, require much more stringent data quality controls and procedures to track, over the course of the study, when and by whom data are entered or changed. For initial data entry for such studies data are often entered twice, potentially by two different individuals. This is referred to as double data entry and the resulting files are compared to identify data entry errors. Further details can be found in many clinical trial books and articles such as Friedman et al. (2015) and Piantadosi (2017).
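The comparison step in double data entry can be sketched as below. This Python fragment is an illustration only; the function name, the record layout, and the deliberately mistyped age in the second file are assumptions made for the example, and it presumes that each case has a unique ID.

```python
def compare_entries(first, second, key="id"):
    """List (id, variable, first value, second value) for each disagreement."""
    first_by_key = {row[key]: row for row in first}
    second_by_key = {row[key]: row for row in second}
    differences = []
    for case_id, row in first_by_key.items():
        other = second_by_key.get(case_id)
        if other is None:
            differences.append((case_id, None, "present", "absent"))
            continue
        for name, value in row.items():
            if other.get(name) != value:
                differences.append((case_id, name, value, other.get(name)))
    # Cases keyed in only during the second pass.
    for case_id in second_by_key:
        if case_id not in first_by_key:
            differences.append((case_id, None, "absent", "present"))
    return differences


entry_1 = [{"id": 1, "sex": 2, "age": 68}, {"id": 3, "sex": 2, "age": 45}]
entry_2 = [{"id": 1, "sex": 2, "age": 68}, {"id": 3, "sex": 2, "age": 15}]
print(compare_entries(entry_1, entry_2))
```

Each reported disagreement still has to be resolved by going back to the original form, since the comparison cannot tell which of the two entries is correct.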

The first of the four methods listed above can only be used with a limited number of programs which use program or procedure statements, for example R, SAS or Stata. It is only recommended for very small data sets that are not going to be used very many times. For example, a SAS data set called “depress” could be made by stating:

data depress;
input id sex age;
cards;
1 2 68
2 1 58
3 2 45
4 2 50
:
run;

Similar types of statements can be used for the other programs which use the spreadsheet type of format.
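To show the same idea outside SAS, the following is a minimal Python sketch in which the data are typed directly into the program. Python is not one of the packages used in this book; the rows are the four cases listed above, the remaining cases are omitted, and the variable names follow the SAS example.

```python
# The four cases listed in the text, one case per line: id, sex, age.
RAW = """\
1 2 68
2 1 58
3 2 45
4 2 50
"""
NAMES = ["id", "sex", "age"]

# Split each line on blanks and pair the values with the variable names.
depress = [dict(zip(NAMES, map(int, line.split())))
           for line in RAW.splitlines()]

for case in depress:
    print(case)
```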

The disadvantage of this type of data entry is that there are only limited editing features available to the person entering the data. No checks are made as to whether or not the data are within reasonable ranges for this data set. For example, all respondents were supposed to be 18 years old or older, but there is no automatic check to verify that the age of the third person, who was 45 years old, was not erroneously entered as 15 years. Another disadvantage is that the data set disappears after the program is run unless additional statements are made. In small data sets, the ability to save the data set, edit typing, and have range checks performed is not as important as in larger data sets.

The second strategy is to use the data entry package or system provided by the statistical program you wish to use. This is always a safe choice as it means that the data set is in the form required by the program and no data transfer problems will arise. Table 3.1 summarizes the built-in data entry features of the four statistical packages used in this book. Note that for SAS, PROC COMPARE can be used to verify data obtained by double data entry. In general, as can be seen in Table 3.1, SPSS and SAS have extensive data entry features.

Table 3.1: Built-in data entry features of the statistical packages

                   R     SAS      SPSS  Stata
Spreadsheet entry  Yes   Yes      Yes   Yes
Form entry         No    Yes      Yes   No
Range check        User  Yes      Yes   No
Logical check      User  Yes      Yes   No
Skip and fill      User  Use SCL  Yes   No
Verify mode        No    No       Yes   No

The third method is to enter the data into a secondary program such as a spreadsheet or data management program, or a form-based data entry program, and then import it into your statistical software package of choice.

The advantage of this method is that an available program that you are familiar with can be used to enter the data. Excel and Google Sheets provide entry in the form of a spreadsheet and are widely available. Access allows entry using forms and provides the ability to combine different data sets. Once the data sets are combined in Access, it is straightforward to transfer them to Excel. Google Forms and other survey software such as Qualtrics and SurveyMonkey allow data entry using forms that can be distributed online or through email.

Many of the statistical software packages import Excel files. In addition, many of the statistical packages also allow the user to import data from other statistical packages. For example, R will import SAS, SPSS, and Stata data files, but SPSS will import only SAS and Stata data files. Many software packages also allow you to export data in a format designed for use in another statistical package. For example, SPSS cannot read R data files directly, but R can export SPSS data files directly, or it can write a plain text ASCII file (mentioned next) which can be read into any statistical software program. One suggestion is to first check the manual or HELP for the statistical package you wish to use to see which types of data files it can import.

A widely used transfer method is to create an ASCII file from the data set. ASCII (American Standard Code for Information Interchange) files are more commonly known as plain text files and can be created by almost any spreadsheet, data management, or word processing program. Instructions for reading ASCII files are given in the statistical packages. The disadvantage of transferring ASCII files is that typically only the data are transferred, and variable labels and information concerning the variables have to be reentered into the statistical package. This is a minor problem if there are not too many variables. If this process appears to be difficult, or if the investigators wish to retain the variable labels, then they can run a special-purpose program such as STAT/TRANSFER that will copy data files created by a wide range of spreadsheet, database and statistical software programs and put them into the right format for access by other spreadsheet, database and statistical programs.

Finally, if the data entry program and the statistical package both use the Windows operating system, then three methods of transferring data may be considered depending on what is implemented in the programs. First, the data in the data entry program may be highlighted and moved to the statistical package using the usual copy and paste options. Second, dynamic data exchange (DDE) can be used to transfer data. Here the data set in the statistical package is dynamically linked to the data set in the entry program. If you correct a variable for a particular case in the entry program, the identical change is made in the data set in the statistical package. Third, object linking and embedding (OLE) can be used to share data between a program used for data entry and statistical analysis. Here also the data entry program can be used to edit the data in the statistical program. The investigator can activate the data entry program from within the statistical package.

If you have a very large data set to enter, it is often sensible to use a professional data entering service. A good service can be very fast and can offer different levels of data checking and advice on which data entry method to use. But whether or not a professional service is used, the following suggestions may be helpful for data entry.

1. Whenever possible, code information in numbers not letters.

2. Code information in the most detailed form you will ever need. You can use the statistical program to aggregate the data into coarser groupings later. For example, it is better to record age as the exact age at the last birthday rather than to record the ten-year age interval into which it falls.

3. The use of range checks or maximum and minimum values can eliminate the entry of extreme values but they do not guard against an entry error that falls within the range. If minimizing errors is crucial then the data can be entered twice into separate data files. One data file can be subtracted from the other and the resulting nonzeros examined. Alternatively, some data entry programs have a verify mode where the user is warned if the first entry does not agree with the second one or have special commands to allow for comparison of two data sets.

4. If the data are stored on a personal computer, then backup copies should be made on an external storage device, such as an external hard drive, or on a cloud storage system such as Box, Dropbox or Google Drive or similar systems. Backups should be updated regularly as changes are made in the data set.

5. For each variable, use a code to indicate missing values. Avoid using potential observed values (such as -9 or 99) for coding missing data. Most programs have their own way to indicate missing values (such as “.” or “NA”). The manuals or HELP statements should be consulted so that you can match what they require with what you do.

6. When entering data representing calendar dates, be consistent across all records and use a standard representation such as ISO 8601. This looks like YYYY-MM-DD, which means a four-digit year, two-digit month, and two-digit day, in that order, separated by hyphens. Example: 2018-07-01 is July 1st, 2018. Times have a similar ISO 8601 convention of hh:mm:ss, which can be read as the two-digit hour using the 24-hour clock system, two-digit minutes, and two-digit seconds, separated by colons. Consult the HELP manual for your software program for more information on how the program handles date and time formats.
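As an illustration of the last suggestion, the short Python sketch below reads dates and times written in the ISO 8601 forms just described and rejects a date written in another style. It relies on the standard library functions `date.fromisoformat` and `time.fromisoformat`; treating an empty string as a missing value is an assumption made for this example, not a rule of any of the packages.

```python
from datetime import date, time


def parse_date(text):
    """Read a YYYY-MM-DD date; an empty string is treated as missing."""
    return date.fromisoformat(text) if text else None


def parse_time(text):
    """Read an hh:mm:ss time; an empty string is treated as missing."""
    return time.fromisoformat(text) if text else None


visit = parse_date("2018-07-01")
print(visit.year, visit.month, visit.day)
print(parse_time("14:05:30"))

try:
    parse_date("07/01/2018")  # not in ISO 8601 form
except ValueError as err:
    print("rejected:", err)
```

Rejecting nonstandard dates at entry time avoids later ambiguity about whether 07/01 means July 1st or January 7th.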

To summarize, there are three important considerations in data entry: accuracy, cost, and ease of use of the data file. Whichever system is used, the investigator should ensure that the data file is free of typing errors, that time and money are not wasted, and that the data file is readily available for future data management and statistical analysis.

3.4 Organizing the data

Prior to statistical analysis, it is often necessary to make some changes in the data set. Table 3.2 summarizes the common options in the programs described in this book.

Combining data sets

Combining data sets is an operation that is commonly performed. For example, in biomedical studies, data may be taken from medical history forms, a questionnaire, and laboratory results for each patient. These data for a group of patients need to be combined into a single rectangular data set where the rows are the different patients and the columns are the combined history, questionnaire, and laboratory variables. In longitudinal studies of voting intentions, the questionnaire results for each respondent must be combined across time periods in order to analyze change in voting intentions of an individual over time. There are essentially two steps in this operation. The first is sorting on some key variable (given different names in different packages) which must be included in both of the separate data sets to be merged. Usually this key variable is an identification or ID variable (case number). The second step is combining the separate data sets side-by-side, matching the correct records with the correct person using the key variable. Sometimes one or more of the data items are missing for an individual. For example, in a longitudinal study it may not be possible to locate a respondent for one or more of the interviews. In such a case, a symbol or symbols indicating missing values will be inserted into the spaces for the missing data items by the program. This is done so that you will end up with a rectangular data set or file, in which information for an individual is put into the proper row, and missing data are so identified.
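The two steps just described (sorting on a key variable, then matching records side-by-side and marking missing items) can be sketched in a few lines of Python. This is only an illustration under our own assumptions: Python is not one of the packages discussed in this book, the function name merge_by_key and the example records are hypothetical, each ID is assumed to appear at most once in each data set, and the two data sets are assumed to share no variable names other than the key.

```python
# Hypothetical records keyed by a patient ID; variable names are illustrative.
history = [
    {"id": 3, "smoker": "no"},
    {"id": 1, "smoker": "yes"},
    {"id": 2, "smoker": "no"},
]
lab = [
    {"id": 2, "glucose": 98},
    {"id": 1, "glucose": 110},
]

def merge_by_key(left, right, key):
    """Combine two lists of records side by side, matching on `key`.

    Every key value found in either data set gets one output row. Items that
    are absent from one of the data sets are filled with None, which plays
    the role of the missing-value symbol here.
    """
    left_vars = sorted({name for row in left for name in row if name != key})
    right_vars = sorted({name for row in right for name in row if name != key})
    left_index = {row[key]: row for row in left}
    right_index = {row[key]: row for row in right}
    merged = []
    # Step 1: sort on the key variable. Step 2: match records with equal keys.
    for k in sorted(set(left_index) | set(right_index)):
        row = {key: k}
        for name in left_vars:
            row[name] = left_index.get(k, {}).get(name)
        for name in right_vars:
            row[name] = right_index.get(k, {}).get(name)
        merged.append(row)
    return merged

combined = merge_by_key(history, lab, "id")
for row in combined:
    print(row)
```

The result is rectangular: patient 3 has no laboratory record, so the glucose item in that row is marked as missing rather than the row being dropped.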

Data sets can be combined in the manner described above in R by using the merge function with the by argument to specify the name of the matching variable (such as ID). This function has additional by.x and by.y arguments that can be used for more complex situations (see the help file). A popular user-written package called dplyr contains join functions that behave similarly to merge. The function cbind is available to “bind columns” of data, but it can yield unexpected results if the numbers of data points or data types are different in the data sets to be merged. It is also left up to the analyst to be certain that the rows in each data set are listed in the same order.

Combining data sets in SAS can be done by using the MERGE statement followed by a BY statement and the variable(s) to be used to match the records. The data must first be sorted by the values of the matching variable, say ID. An UPDATE statement can also be used to add variables to a master file. In SPSS, you use the JOIN MATCH command followed by the data files to be merged if you are certain that the cases are already listed in precisely the same order and each case is present in all the data files. Otherwise, you first sort the separate data files on the key variable and use the JOIN MATCH command followed by the BY key variable.

In Stata, you use the first data file and then use a merge m:m key variable using the second data file statement. The m:m component specifies that either 1 or many records with the same key variable value are merged.

If you have knowledge of the Structured Query Language (SQL) programming language, it is useful to know that both SAS and R have the ability to process SQL queries. Consult your chosen package’s help documentation to learn more about these methods.

In any case, it is highly desirable to list (view) the individual data records to determine that the merging was done in the manner that you intended. If the data set is large, then only the first and last 25 or so cases need to be listed to see that the results are correct. If the separate data sets are expected to have missing values, you need to list sufficient cases so you can see that missing records are correctly handled.

Another common way of combining data sets is to put one data set at the end of another data set. This process is referred to as concatenation. For example, an investigator may have data sets that are collected at different places and then combined together. In an education study, student records could be combined from two high schools, with one simply placed at the bottom of the other set.

Concatenation is done using the rbind function in R, and PROC APPEND in SAS. In SPSS the JOIN command with the keyword ADD can be used to combine cases from two to five data files, and in Stata the append command is used.
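For readers who find it helpful to see concatenation spelled out, the following Python sketch stacks two hypothetical school files. Python is not one of the packages discussed in this book; the function name concatenate, the added source variable, and the records are our own assumptions. The source variable is a design choice that keeps track of which file each case came from, since ID numbers may repeat across schools.

```python
# Hypothetical student records from two high schools.
school_a = [{"id": 1, "grade": 10}, {"id": 2, "grade": 11}]
school_b = [{"id": 1, "grade": 12, "gpa": 3.4}]

def concatenate(*data_sets):
    """Place data sets one below another, keeping a rectangular layout.

    Variables that exist in only some of the data sets are filled with None
    for the cases that lack them.
    """
    # Collect every variable name, in order of first appearance.
    columns = []
    for data in data_sets:
        for row in data:
            for name in row:
                if name not in columns:
                    columns.append(name)
    stacked = []
    for source, data in enumerate(data_sets, start=1):
        for row in data:
            new_row = {"source": source}  # which file the case came from
            new_row.update({name: row.get(name) for name in columns})
            stacked.append(new_row)
    return stacked

students = concatenate(school_a, school_b)
for row in students:
    print(row)
```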

It is also possible to update the data files with later information using the editing functions of the package. Thus, a single data file can be obtained that contains the latest information. This option can also be used to replace data that were originally entered incorrectly.

When using a statistical package that does not have provision for merging data sets, it is recommended that a spreadsheet program be used to perform the merging and then, after a rectangular data file is obtained, the resulting data file can be transferred to the desired statistical package. In general, the newer spreadsheet programs have excellent facilities for combining data sets side-by-side or for adding new cases.


Missing values

There are two types of missing data. The first type occurs when no information is obtained from a case, individual, or sampling unit. This type is called unit nonresponse. For example, in a survey it may be impossible to reach a potential respondent or the subject may refuse to answer. In a biomedical study, records may be lost or a laboratory animal may die of unrelated causes prior to measuring the outcome. The second type of nonresponse occurs when the case, individual, or sampling unit is available but yields incomplete information. For example, in a survey the respondent may refuse to answer questions on income or only fill out the first page of a questionnaire. Busy physicians may not completely fill in a medical record. This type of nonresponse is called item nonresponse. In general, the more control the investigator has of the sampling units, the less apt unit nonresponse or item nonresponse is to occur. In surveys the investigator often has little or no control over the respondent, so both types of nonresponse are apt to happen. For this reason, much of the research on handling nonresponse has been done in the survey field and the terminology used reflects this emphasis.

The seriousness of either unit nonresponse or item nonresponse depends mainly on the magnitude of the nonresponse and on the characteristics of the nonresponders. If the proportion of nonresponse is very small, it is seldom a problem and if the nonresponders can be considered to be a random sample of the population then it can be ignored (see Section 10.2 for a more complete classification of nonresponse). Also, if the units sampled are highly homogeneous then most statisticians would not be too concerned. For example, some laboratory animals have been bred for decades to be quite similar in their genetic background. In contrast, people in most major countries have very different backgrounds and their opinions and genetic makeup can vary greatly.

When only unit nonresponse occurs, the data gathered will look complete in that information is available on all the variables for each case. Suppose in a survey of students 80% of the females respond and 60% of the males respond and the investigator expects males and females to respond differently to a question (X). If in the population 55% are males and 45% are females, then instead of simply getting an overall average of responses for all the students, a weighted average could be reported. For males $w_1 = .55$ and for females $w_2 = .45$. If $\bar{X}_1$ is the mean for males and $\bar{X}_2$ is the mean for females, then a weighted average could be computed as

$$\bar{X} = \frac{\sum w_i \bar{X}_i}{\sum w_i} = \frac{w_1 \bar{X}_1 + w_2 \bar{X}_2}{w_1 + w_2}$$

Another common technique is to assign each observation a weight, which is entered into the data set as if it were a variable. Observations are weighted more if they come from subgroups that have a low response rate. This weight may be adjusted so that the sum of the weights equals the sample size. When weighting data, the investigator is assuming that the responders and nonresponders in a subgroup are similar.
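To make the arithmetic concrete, the sketch below works through the student survey example with the population shares (55% males, 45% females) and response rates (60% and 80%) given above. The subgroup means, the assumed sample of 1000 students drawn in population proportions, and all names in the code are hypothetical, and Python is not one of the packages discussed in this book. The per-observation weight used here, population share divided by respondent share, is one simple choice among several; it happens to make the weights sum to the number of respondents.

```python
# Population shares and response rates from the survey example in the text.
population_share = {"male": 0.55, "female": 0.45}
response_rate = {"male": 0.60, "female": 0.80}

# Hypothetical subgroup means of the responses to question X.
group_mean = {"male": 3.0, "female": 4.0}

def weighted_mean(values, weights):
    """Weighted average: sum of w_i * x_i divided by the sum of w_i."""
    total_weight = sum(weights[g] for g in values)
    return sum(weights[g] * values[g] for g in values) / total_weight

overall = weighted_mean(group_mean, population_share)
print("weighted average:", round(overall, 4))

# Hypothetical sample of 1000 students drawn in population proportions.
sampled = 1000
respondents = {
    g: round(sampled * population_share[g] * response_rate[g])
    for g in population_share
}
n = sum(respondents.values())

# Weight per respondent: population share divided by share among respondents.
# Groups with a lower response rate receive the larger weight.
case_weight = {g: population_share[g] / (respondents[g] / n) for g in respondents}
print("respondents:", respondents)
print("case weights:", {g: round(w, 4) for g, w in case_weight.items()})

# Unweighted mean of the respondents, for comparison with the weighted one.
unweighted = sum(respondents[g] * group_mean[g] for g in respondents) / n
print("unweighted average:", round(unweighted, 4))
```

With these numbers the unweighted average is pulled toward the females, who responded at the higher rate, while the weighted average restores the population proportions.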

In this book, we do not discuss such weighted analyses in detail. A more complete discussion of using weights for adjustment of unit nonresponse can be found in Groves et al. (2002) or Little and Rubin (2002). Several types of weights can be used and it is recommended that the reader consider the various options before proceeding. The investigator would need to obtain information on the units in the population to check whether the units in the sample are proportional to the units in the population. For example, in a survey of professionals taken from a listing of society members, if the sex, years since graduation, and current employment information is available from both the listing of the members and the results of the survey, these variables could be used to compute subgroup weights.

The data set should also be screened for item nonresponse. As will be discussed in Section 10.2, most statistical analyses require complete data on all the variables used in the analysis. If even one variable has a missing value for a case, that case will not be used. Most statistical packages provide programs that indicate how many cases were used in computing common univariate statistics such as means and standard deviations (or report how many cases were missing). Thus it is simple to find which variables have few or numerous missing values.

Some programs can also indicate how many missing values there are for each case. Other programs allow you to transpose or flip your data file so the rows become the columns and the columns become the rows (Table 3.2). Thus the cases and variables are switched as far as the statistical package is concerned. The number of missing values by case can then be found by computing the univariate statistics on the transposed data. Examination of the pattern of missing values is important since it allows the investigator to see if the missing values appear to be distributed randomly or occur only in some variables. Also, they may have occurred only at the start of the study or close to the end.
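Counting missing values by variable and by case amounts to tallying along the columns and along the rows of the rectangular data set. The following Python sketch shows both counts on a small hypothetical data set in which None marks a missing value; Python is not one of the packages discussed in this book and the variable names are our own.

```python
# Hypothetical data set; None marks a missing value.
data = [
    {"id": 1, "age": 34, "income": 15, "weight": None},
    {"id": 2, "age": None, "income": None, "weight": None},
    {"id": 3, "age": 51, "income": 28, "weight": 160},
]
variables = ["age", "income", "weight"]

# Number of missing values for each variable (counting down the columns).
missing_by_variable = {
    name: sum(row[name] is None for row in data) for name in variables
}

# Number of missing values for each case (counting across the rows).
missing_by_case = {
    row["id"]: sum(row[name] is None for name in variables) for row in data
}

print(missing_by_variable)
print(missing_by_case)
```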

Once the pattern of missing data is determined, a decision must be made on how to obtain a complete data set for analysis. For a first step, most statisticians agree on the following guidelines.

1. If a variable is missing in a very high proportion of cases, then that variable could be deleted, but this could represent a limitation to the study that might need to be noted.

2. If a case is missing many variables that are crucial to your analysis, then that case could be deleted. If a substantial proportion of cases have this issue, this could be a problem with the generalizability of the analysis results.

You should also carefully check if there is anything special about the cases that have numerous missing data as this might give you insight into problems in data collection. It might also give some insight into the population to which the results actually apply. Likewise, a variable that is missing in a high proportion of the respondents may be an indication of a special problem. Following the guidelines listed previously can reduce the problems in data analysis but it will not eliminate the problems of reduced efficiency due to discarded data or potential bias due to differences between the data that are complete and the grossly incomplete data. For example, this process may result in a data set that is too small or that is not representative of the total data set. That is, the missing data may not be missing completely at random (see Section 10.2). In such cases, you should consider methods of imputing (or filling-in) the missing data (see Section 10.2 and the books by Rubin (2004); Little and Rubin (2002); Schafer (1997); Molenberghs and Kenward (2007), or Laaksonen (2018)).

Item nonresponse can occur in two ways. First, the data may be missing from the start. In this case, the investigator enters a code for missing values at the time the data are entered into the computer. One option is to enter a symbol that the statistical package being used will automatically recognize as a missing value. For example, a period, an asterisk (*), or a blank space may be recognized as a missing value by some programs. Commonly, a numerical value is used that is outside the range of possible values. For example, for the variable “sex” (with 1 = male and 2 = female) a missing code could be 9. A string of 9s is often used; thus, for the weight of a person 999 could be used as a missing code.

That value should then be replaced within the software program, using an appropriate command, by that program’s specific missing code. For example, using SAS one could state

if sex = 9 then sex = . ;

Similar statements are used for the other programs. The reader should check the manual for the precise statement. We recommend against the use of missing value codes that could potentially be observed values. Otherwise results of the statistical analyses can be misleading or nonsensical because some actual observations could be considered missing.

If the data have been entered into a spreadsheet program, then it is recommended to leave the cells with missing data blank. Most statistical packages will recognize a blank value as missing.

The second way in which values can be considered missing is if the data values are beyond the range of the stated maximum or minimum values. For example, if the age of a respondent is entered as 167 and it is not possible to determine the correct value, then the 167 should be replaced with a missing value code so an obviously incorrect value is not used.
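Both ways in which a value comes to be treated as missing, an entered numerical code and a value outside the stated limits, can be handled by one cleaning step. The Python sketch below illustrates the idea; Python is not one of the packages discussed in this book, and the function name clean_value, the age limits of 0 to 110, and the example records are hypothetical. The codes 9 for sex and 999 for weight, and the age of 167, are the examples used above.

```python
# Numerical codes that stand for "missing", by variable (examples from the text).
MISSING_CODES = {"sex": {9}, "weight": {999}}

# Stated minimum and maximum values, by variable (hypothetical limits).
VALID_RANGE = {"age": (0, 110)}

def clean_value(name, value):
    """Return None (missing) for coded-missing or out-of-range values."""
    if value is None or value in MISSING_CODES.get(name, ()):
        return None
    low, high = VALID_RANGE.get(name, (float("-inf"), float("inf")))
    if not low <= value <= high:
        return None
    return value

# Hypothetical raw records as entered.
raw = [
    {"id": 1, "sex": 1, "age": 34, "weight": 150},
    {"id": 2, "sex": 9, "age": 167, "weight": 999},
    {"id": 3, "sex": 2, "age": 58, "weight": 132},
]

cleaned = [
    {name: clean_value(name, value) for name, value in row.items()}
    for row in raw
]
for row in cleaned:
    print(row)
```

Keeping the codes and limits in one place, rather than scattered through the analysis instructions, makes it easier to document how missing values were defined.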

Further discussion of the types of missing values and of ways of handling item nonresponse in data analysis is given in Section 10.2. Here, we will briefly mention one simple method.

The replacement of missing values with the mean value of that variable is a common option in statistical software packages and is the simplest method of imputation. This method results in underestimation of the variances and covariances that are subsequently used in many analyses. Thus we do not recommend the use of this method.
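A small numerical illustration shows why mean imputation understates the variance. In the Python sketch below (Python is not one of the packages discussed in this book, and the data values are hypothetical), four observed values are supplemented by four imputed values equal to the observed mean. The imputed values add nothing to the sum of squared deviations, but they increase the divisor, so the sample variance shrinks.

```python
from statistics import mean, variance

# Hypothetical observed values; four further values are missing.
observed = [2.0, 4.0, 6.0, 8.0]
n_missing = 4

# Mean imputation: every missing value is replaced by the observed mean.
imputed = observed + [mean(observed)] * n_missing

var_observed = variance(observed)  # sample variance of the observed values
var_imputed = variance(imputed)    # sample variance after mean imputation

print("mean before and after:", mean(observed), mean(imputed))
print("variance from observed values:", round(var_observed, 3))
print("variance after mean imputation:", round(var_imputed, 3))
```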

Detection of outliers

Outliers are observations that appear inconsistent with the remainder of the data set (Barnett and Lewis, 1994). One method for determining outliers has already been discussed, namely, setting minimum and maximum values. By applying these limits, extreme or unreasonable outliers are prevented from entering the data set.

Often, observations are obtained that seem quite high or low but are not impossible. These values are the most difficult ones to cope with. Should they be removed or not? Statisticians differ in their opinions, from “if in doubt, throw it out” to the point of view that it is unethical to remove an outlier for fear of biasing the results. The investigator may wish to eliminate these outliers from the analyses but report them along with the statistical analysis. Another possibility is to run the analyses twice, both with the outliers and without them, to see if they make an appreciable difference in the results. Most investigators would hesitate, for example, to report rejecting a null hypothesis if the removal of an outlier would result in the hypothesis not being rejected. We recommend that whatever decision is made regarding outliers, the decision and potentially its consequences be made transparent and be justified in any report or publication.

A review of formal tests for detection of outliers is given in Barnett and Lewis (1994). To make the formal tests you usually must assume normality of the data. Some of the formal tests are known to be quite sensitive to nonnormality and should only be used when you are convinced that this assumption is reasonable. Often an alpha level of 0.10 or 0.15 is used for testing if it is suspected that outliers are not extremely unusual. Smaller values of alpha can be used if outliers are thought to be rare.

The data can be examined one variable at a time by using histograms and box plots if the variable is measured on the interval or ratio scale. A questionable value would be one that is separated from the remaining observations. For nominal or ordinal data, the frequency of each outcome can be noted. If a recorded outcome is impossible, it can be declared missing. If a particular outcome occurs only once or twice, the investigator may wish to consolidate that outcome with a similar one. We will return to the subject of outliers in connection with the statistical analyses starting in Chapter 7, but mainly the discussion in this book is not based on formal tests.
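The informal screening described above can also be done numerically. The Python sketch below flags values that lie beyond the fences conventionally drawn in box plots (1.5 times the interquartile range beyond the quartiles) and tabulates the frequency of each outcome of a nominal variable. The 1.5 multiplier is the usual box plot convention rather than a rule stated in this chapter, the data are hypothetical, and Python is not one of the packages discussed in this book. Flagged values are candidates for inspection, not for automatic removal.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical interval-scale measurements (body weight in kilograms).
weights_kg = [62, 65, 66, 68, 70, 71, 73, 75, 140]

# Quartiles and the conventional box plot fences.
q1, _, q3 = quantiles(weights_kg, n=4)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

# Values separated from the remaining observations.
questionable = [x for x in weights_kg if x < low_fence or x > high_fence]
print("questionable values:", questionable)

# Hypothetical nominal variable: note the frequency of each outcome.
marital = ["married"] * 5 + ["single"] * 4 + ["widowed"]
counts = Counter(marital)
rare = [outcome for outcome, count in counts.items() if count <= 2]
print("frequencies:", dict(counts))
print("outcomes occurring only once or twice:", rare)
```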

Transformations of the data

Transformations are commonly made either to create new variables with a form more suitable for analysis or to achieve an approximate normal distribution. Here we discuss the first possibility. Transformations to achieve approximate normality are discussed in Chapter 5.

Transformations to create new variables can either be performed as a step in organizing the data or can be included later when the analyses are being performed. It is recommended that they be done as a part of organizing the data. The advantage of this is that the new variables are created once and for all, and sets of instructions for running data analysis from then on do not have to include the data transformation statements. This results in shorter sets of instructions with less repetition and chance for errors when the data are being analyzed. This is almost essential if several investigators are analyzing the same data set.

One common use of transformations occurs in the analysis of questionnaire data. Often the results from several questions are combined to form a new variable. For example, in studying the effects of smoking on lung function it is common to ask first a question such as:


Have you ever smoked cigarettes? yes or no

If the subjects answer no, they skip a set of questions and go on to another topic. If they answer yes, they are questioned further about the amount in terms of packs per day and length of time they smoked (in years). From this information, a new pack–year variable is created that is the number of years times the average number of packs. For the person who has never smoked, the answer is zero. Transformation statements are used to create the new variable.

Each package offers a slightly different set of transformation statements, but some general options exist. The programs allow you to select cases that meet certain specifications using IF statements. Here for instance, if the response is no to whether the person has ever smoked, the new variable should be set to zero. If the response is yes, then pack–years is computed by multiplying the average amount smoked by the length of time smoked. This sort of arithmetic operation is provided for and the new variable is added to the end of the data set. Variables that are generated or calculated based on other originally collected information are often called calculated variables.
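The IF logic for the pack–year variable can be written out as follows. This Python sketch is only an illustration: Python is not one of the packages discussed in this book, and the function and variable names are hypothetical. Returning a missing value when a smoker did not report the amount or the duration is our own assumption about how such cases would be treated.

```python
def pack_years(ever_smoked, packs_per_day=None, years_smoked=None):
    """Compute pack-years from the questionnaire answers.

    ever_smoked is "yes" or "no". Never-smokers skip the follow-up
    questions, so their pack-years value is set to zero.
    """
    if ever_smoked == "no":
        return 0.0
    if packs_per_day is None or years_smoked is None:
        return None  # smoker with an unanswered follow-up question
    return packs_per_day * years_smoked

# Hypothetical respondents.
respondents = [
    {"id": 1, "ever_smoked": "no"},
    {"id": 2, "ever_smoked": "yes", "packs_per_day": 1.5, "years_smoked": 20},
    {"id": 3, "ever_smoked": "yes", "packs_per_day": None, "years_smoked": 12},
]

# The calculated variable is added to the end of each record.
for person in respondents:
    person["pack_years"] = pack_years(
        person["ever_smoked"],
        person.get("packs_per_day"),
        person.get("years_smoked"),
    )
    print(person["id"], person["pack_years"])
```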

Additional options include taking means of a set of variables or the maximum value of a set of variables. For example, a survey may ask about the total number of times in the past month an individual has used marijuana, cocaine or LSD in separate questions, but a researcher may be more interested in simply the total number of times in the past month that an individual has used any of these drugs.

Another common arithmetic transformation involves simply changing the numerical values coded for a nominal or ordinal variable. For example, for the depression data set, sex was coded male = 1 and female = 2. In some of the analyses used in this book, we have recoded that to male = 0 and female = 1 by simply subtracting one from the given value.
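Both of these transformations are one-line arithmetic operations. The Python sketch below (Python is not one of the packages discussed in this book; the record and the new variable names any_drug_total and female are hypothetical) sums the three drug-use counts and recodes sex from 1/2 to 0/1.

```python
# Hypothetical respondent: times each drug was used in the past month,
# and sex coded male = 1, female = 2.
respondent = {"marijuana": 3, "cocaine": 0, "lsd": 1, "sex": 2}

# Total number of times any of the three drugs was used.
respondent["any_drug_total"] = sum(
    respondent[name] for name in ("marijuana", "cocaine", "lsd")
)

# Recode sex to male = 0, female = 1 by subtracting one.
respondent["female"] = respondent["sex"] - 1

print(respondent)
```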

Saving the results

After the data have been screened for missing values and outliers, and transformations made to form new variables, the results are saved in a master file that can be used for analysis. We recommend that a copy or copies of this master file be made on an external storage device such as a CD or USB drive, or in a cloud storage service so that it can be stored outside the computer. A summary of decisions made in data screening and transformations used should be stored with the master file. Enough information should be stored so that the investigator can later describe what steps were taken in organizing the data.

If the steps taken in organizing and preparing the data were performed by typing commands in a statistical programming language, it is recommended that a copy of these commands be recorded in a file and stored along with the data sets. The file containing these commands is commonly referred to as a code file or script file. Then, should the need arise, the manipulation can be redone by simply editing the code file instructions rather than completely recreating them. We discuss the importance of this process further in the next section on reproducibility.

If results are saved interactively (point and click), then it is recommended that multiple copies be saved along the way until you are perfectly satisfied with the results and that a memo facility or other program be used to document your steps. Some packages such as SPSS and Stata do give you the code that is a result of executing a series of file menu commands. Figure 3.1 summarizes the steps taken in data entry and data management.

3.5 Reproducible research and literate programming

Reproducibility is the ability for any researcher to take the same data set and run the same set of software program instructions as another researcher and achieve the same results (Patil et al., 2016). This process allows others to verify existing findings and build upon them. As data sets become larger and more complex, requiring more sophisticated and computationally intensive analysis methods, the need to provide sufficient information to enable reproducibility of results becomes more important.

Figure 3.1: Preparing Data for Statistical Analysis. [Flowchart: Data Entry (assigning attributes to data, entering data, screening out-of-range values), then Data Management (combining data sets, identifying patterns of missing data, detecting outliers, transformations of data, checking results), then Saving working data set, then Creating a codebook.]

The goal is to create an exact record of what was done to a data set to produce a specific result. To achieve reproducibility, we believe that three things must be present:

1. The un-processed data are connected directly to software code file(s) that perform data preparation techniques such as the ones discussed in this chapter and in Chapter 5.

2. The processed data are connected directly to other software code file(s) that perform the analyses.

3. All data and code files are self-contained such that they could be given to another researcher to execute the code commands on a separate computer and achieve the same results as the original author.

Incorporating reproducible research techniques into your workflow not only provides a benefit to the analysts and their collaborators, but also the scientific community in general. In addition, some scientific journals are requesting that authors publish their code and data along with the manuscript (Loder and Groves, 2015; Sturges et al., 2015; Piwowar et al., 2007; Gandrud, 2015).

It is important to note that sometimes making research data available to the scientific community or beyond needs to be done in such a way that the confidentiality of participants is protected. Procedures to protect confidentiality might interfere with exact replication. For more information on methods to protect biological and social data we refer readers to Hauser et al. (2010), and on the ethics of sharing medical data see Hollis (2016).

Literate programming is the programming paradigm introduced by Knuth (1984) intended to present an explanation of the code being written in a natural language such as English. The scientist explains the logic of the program or analysis process in the natural language, with small code snippets included at each step. The code snippets create a full set of instructions that can be executed or compiled to produce a result, such as a data analysis report or a series of data pre-processing steps.

Imagine you are tasked with analyzing a large data set that includes a lot of data pre-processing, statistical analyses, and creating graphics. You could process the data using a combination of manual and menu driven edits and produce tables and figures individually and copy them into a word processing program where you write up the results in paragraph form.

Now imagine that you find out that there were additional errors in the original data, or that additional data records are now available and need to be included in the analysis. You are now faced with completely repeating all the effort to process the data, conduct statistical analyses and create the report.

Practicing reproducible research techniques using literate programming tools allows such major updates to be a simple matter of re-compiling all coded instructions using the updated data set. The effort then is reduced to a careful review and update of any written results.

Literate programming tools such as those listed in Table 3.3 use additional markup languages such as Markdown or LaTeX to create formatted documents with section headers, bold and italicized words, tables and graphics with built-in captions in a streamlined manner that is fully synchronized with the code itself. The author writes the text explanations, interpretations, and code in the statistical software program itself, and the program will execute all commands and combine the text, code and output all together into a final dynamic document.

For details on how to use literate programming tools such as those listed in Table 3.3 we refer the reader to references such as Xie (2015) and Leisch and R-Core (2017) for programming in R, Lenth and Højsgaard (2007) for programming in SAS, and Haghish (2016a) and Haghish (2016b) for programming in Stata.

For a more general discussion on tools, practices and guidelines and platforms to conduct reproducible research see Stodden et al. (2014). For a detailed and open source guide to enhancing reproducibility in scientific results and writing, see Martinez et al. located at https://ropensci.github.io/reproducibility-guide/

3.6 Example: Depression study

In this section we discuss a data set that will be used in several succeeding chapters to illustrate multivariate analyses. The depression study itself is described in Chapter 1.

The data given here are from a subset of 294 observations randomly chosen from the original 1000 respondents sampled in Los Angeles. This subset of observations is large enough to provide a good illustration of the statistical techniques but small enough to be manageable. Only data from the first time period are included. Variables are chosen so that they would be easily understood and would be sensible to use in the multivariate statistical analyses described in Chapters 7–18.

The codebook, the variables used, and the data set are described below.

Codebook

In multivariate analysis, the investigator often works with a data set that has numerous variables, perhaps hundreds of them or more. An important step in making the data set understandable is to create a written codebook that can be given to all the users. The codebook should contain a description of each variable and the variable name given to each variable for use in the statistical package. Some statistical packages have limits on the length of the variable names so that abbreviations are used. Often blank spaces are not allowed, so dashes or underscores are included. Some statistical packages reserve certain words that may not be used as variable names. The variables should be listed in the same order as they are in the data file. The codebook serves as a guide and record for all users of the data set and as documentation needed to interpret results.

Table 3.4 contains a codebook for the depression data set. In the first column the variable number is listed, since that is often the simplest way to refer to the variables in the computer. A variable name is given next, and this name is used in later data analysis. These names were chosen to be eight characters or fewer so that they could be used by all the statistical programs at the time when the data set was created (early 1980s). This eight character limitation on variable names is no longer a restriction. It is helpful to choose variable names that are easy to remember and are descriptive of the variables, but short to reduce space in the display.

Finally a description of each variable is given in the last column of Table 3.4. For nominal or ordinal data, the numbers used to code each answer are listed. For interval or ratio data, the units used are included. Note that income is given in thousands of dollars per year for the household; thus an income of 15 would be $15,000 per year. Additional information that is sometimes given includes the number of cases that have missing values, how missing values are coded, the largest and smallest value for that variable, simple descriptive statistics such as frequencies for each answer for nominal or ordinal data, and means and standard deviations for interval or ratio data. Additional columns could be used to add information regarding the variables (e.g., if there were changes in the variables for different versions of the data collection instrument), or to indicate whether the variable is recorded as collected or whether and how it was calculated based on other variables. We note that one package (Stata) can produce a codebook for its users that includes much of the information just described.

Depression variables

The 20 items used in the depression scale are variables 9–28 and are named C1, C2, ..., C20. (The wording of each item is given later in the text, in 14.2.) Each item was written on a card and the respondent was asked to tell the interviewer the number that best describes how often he or she felt or behaved this way during the past week. Thus respondents who answered item C2, “I felt depressed,” could respond 0–3, depending on whether this particular item applied to them rarely or none of the time (less than 1 day: 0), some or little of the time (1–2 days: 1), occasionally or a moderate amount of the time (3–4 days: 2), or most or all of the time (5–7 days: 3).

Most of the items are worded in a negative fashion, but items C8–C11 are positively worded. For example, C8 is “I felt that I was as good as other people.” For positively worded items the scores are reversed: that is, a score of 3 is changed to be 0, 2 is changed to 1, 1 is changed to 2, and 0 is changed to 3. In this way, when the total score of all 20 items is obtained by summation of variables C1–C20, a large score indicates a person who is depressed. This sum is the 29th variable, named CESD (short for: Center for Epidemiological Studies Depression).

Persons whose CESD score is greater than or equal to 16 are classified as depressed since this value is the common cutoff point used in the literature (Aneshensel and Frerichs, 1982). These persons are given a score of 1 in variable 30, the CASES variable. The particular depression scale employed here was developed for use in community surveys of noninstitutionalized respondents (Comstock and Helsing, 1977; Radloff, 1977).
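The scoring rule just described (reverse the positively worded items C8–C11, sum all 20 items, and classify scores of 16 or more as depressed) can be sketched as follows. Python is not one of the packages discussed in this book, and the function names and example responses are hypothetical. The sketch assumes that the responses supplied are the answers as given by the respondent, before any reversal; if the items in a data file have already been reversed, the reversal step must be omitted.

```python
POSITIVE_ITEMS = (8, 9, 10, 11)  # positively worded items C8-C11
CUTOFF = 16                      # scores at or above this value are cases

def cesd_total(raw_responses):
    """Sum the 20 items after reversing the positively worded ones.

    raw_responses is a list of 20 answers (0-3), in the order C1, ..., C20,
    as given by the respondent before any reversal.
    """
    if len(raw_responses) != 20:
        raise ValueError("expected 20 item responses")
    if any(score not in (0, 1, 2, 3) for score in raw_responses):
        raise ValueError("each response must be 0, 1, 2, or 3")
    total = 0
    for item_number, score in enumerate(raw_responses, start=1):
        if item_number in POSITIVE_ITEMS:
            total += 3 - score  # 3 becomes 0, 2 becomes 1, and so on
        else:
            total += score
    return total

def cases(total):
    """Return 1 if the CESD total meets the cutoff, otherwise 0."""
    return 1 if total >= CUTOFF else 0

# Hypothetical respondent who answered 1 to every item.
all_ones = [1] * 20
score = cesd_total(all_ones)
print(score, cases(score))

# Hypothetical respondent who answered 0 to every item.
all_zeros = [0] * 20
print(cesd_total(all_zeros), cases(cesd_total(all_zeros)))
```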

Data set

As can be seen by examining the codebook given in Table 3.4, demographic data (variables 2–8), depression data (variables 9–30), and general health data (variables 32–37) are included in this data set. Variable 31, DRINK, was included so that it would be possible to determine if an association exists between drinking and depression.

The actual data for the first 30 of the 294 respondents included here are listed in Table 3.5. The rest of the data set, along with the other data sets used in this book, is available on the CRC Press and UCLA web sites (see Appendix A).


3.7 Summary

In this chapter we discussed the steps necessary before statistical analysis can begin. The first of these is the decision of what computer and software packages to use. Once this decision is made, data entry and organizing the data can be started.

Note that investigators often alter the order of these operations. For example, some prefer to check for missing data and outliers and to make transformations prior to combining the data sets. This is particularly true in analyzing longitudinal data when the first data set may be available well before the others. This may also be an iterative process in that finding errors may lead to entering new data to replace erroneous values. Again, we stress saving the results on some other external or cloud storage device after each set of changes.

Four statistical packages (R, SAS, SPSS, and Stata) were noted as the packages used in this book. In evaluating a package it is often helpful to examine the data entry and data manipulation features it offers. The tasks performed in data entry and organization are often much more difficult and time consuming than running the statistical analyses, so a package that is easy and intuitive to use for these operations is a real help. If the package available to you lacks needed features, then you may wish to perform these operations in one of the spreadsheet or relational database packages and then transfer the results to your statistical package.

3.8 Problems

3.1 Enter the data set given in Table 9.1, Chemical companies’ financial performance (Section 9.3), using a data entry program of your choice. Make a codebook for this data set.

3.2 Using the data set entered in the previous problem, delete the P/E variable for the Dow Chemical company and D/E for Stauffer Chemical and Nalco Chemical in a way appropriate for the statistical package you are using. Then, use the missing value features in your statistical package to find the missing values and replace them with an imputed value.

3.3 Transfer or read in a data set that was entered into a spreadsheet program into your statistical software package.

3.4 Describe the person in the depression data set who has the highest total CESD score.

3.5 For the statistical package you intend to use, describe how you would add data from three more time periods for the same subjects to the depression data set.

3.6 Combine the results from the following two questions into a single variable that measures the total number of days the individual has been sick during the time period. This would allow one variable to be used for analysis involving this data.

a. Have you been sick during the last two weeks?
Yes, go to b.
No

b. How many days were you sick?

3.7 Consistency checks are sometimes performed to detect possible errors in the data. If a data set included information on sex, age, and use of contraceptive pill, describe a consistency check that could be used for this data set.

3.8 In the Parental HIV data set, the variable LIVWITH (who the adolescent was living with) was coded 1=both parents, 2=one parent, and 3=other. Transform the data so it is coded 1=one parent, 2=two parents, and 3=other using the features available in the statistical package or spreadsheet program you are using.

3.9 From the variables ACUTEILL and BEDDAYS described in Table 3.4, create a single variable that takes on the value 1 if the person has been both bedridden and acutely ill in the last two months and that takes on the value 0 otherwise.


Table 3.2: Data management features of the statistical packages.

| Task | R* | SAS | SPSS | Stata |
|---|---|---|---|---|
| Merging data sets | merge, join | MERGE | MATCH FILES | merge |
| Adding data sets | rbind, cbind | PROC APPEND, SET | ADD FILES | append |
| Hierarchical data sets | reshape, reshape2, tidyr | Write multiple OUTPUT statements, RETAIN | CASESTOVARS | reshape, frlink |
| Transpose data | t | PROC TRANSPOSE | FLIP | xpose |
| Missing value imputation | mice | PROC MI, PROC MIANALYZE | MULTIPLE IMPUTATION | mi |
| Calendar dates | as.Date, chron, lubridate | INFORMAT | FORMATS | dates |

*Monospace font denotes the function name. Normal font denotes a user written package containing functions to perform the specified task.

Table 3.3: Literate Programming Tools

| Software | Addons or Packages |
|---|---|
| R | RMarkdown, Sweave, knitr |
| SAS | SASWeave |
| SPSS | |
| Stata | MarkDoc, Ketchup, Weaver, Dyndoc, Markdown |


Table 3.4: Codebook for depression data

| Variable number | Variable name | Description |
|---|---|---|
| 1 | ID | Identification number from 1 to 294 |
| 2 | SEX | 1 = male; 2 = female |
| 3 | AGE | Age in years at last birthday |
| 4 | MARITAL | 1 = never married; 2 = married; 3 = divorced; 4 = separated; 5 = widowed |
| 5 | EDUCAT | 1 = less than high school; 2 = some high school; 3 = finished high school; 4 = some college; 5 = finished bachelor’s degree; 6 = finished master’s degree; 7 = finished doctorate |
| 6 | EMPLOY | 1 = full time; 2 = part time; 3 = unemployed; 4 = retired; 5 = houseperson; 6 = in school; 7 = other |
| 7 | INCOME | Thousands of dollars per year |
| 8 | RELIG | 1 = Protestant; 2 = Catholic; 3 = Jewish; 4 = none; 5 = other |
| 9–28 | C1–C20 | “Please look at this card and tell me the number that best describes how often you felt or behaved this way during the past week.” 20 items from depression scale (already reflected; see text). 0 = rarely or none of the time (less than 1 day); 1 = some or a little of the time (1–2 days); 2 = occasionally or a moderate amount of the time (3–4 days); 3 = most or all of the time (5–7 days) |
| 29 | CESD | Sum of C1–20; 0 = lowest level possible; 60 = highest level possible |
| 30 | CASES | 0 = normal; 1 = depressed, where depressed is CESD≥16 |
| 31 | DRINK | Regular drinker? 1 = yes; 2 = no |
| 32 | HEALTH | General health? 1 = excellent; 2 = good; 3 = fair; 4 = poor |
| 33 | REGDOC | Have a regular doctor? 1 = yes; 2 = no |
| 34 | TREAT | Has a doctor prescribed or recommended that you take medicine, medical treatments, or change your way of living in such areas as smoking, special diet, exercise, or drinking? 1 = yes; 2 = no |
| 35 | BEDDAYS | Spent entire day(s) in bed in last two months? 0 = no; 1 = yes |
| 36 | ACUTEILL | Any acute illness in last two months? 0 = no; 1 = yes |
| 37 | CHRONILL | Any chronic illness in last year? 0 = no; 1 = yes |


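Table 3.5: Depression data for the first 30 respondents

| OBS | ID | SEX | AGE | MARITAL | EDUCAT | EMPLOY | INCOME | RELIG | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11 | C12 | C13 | C14 | C15 | C16 | C17 | C18 | C19 | C20 | CESD | CASES | DRINK | HEALTH | REGDOC | TREAT | BEDDAYS | ACUTEILL | CHRONILL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 2 | 68 | 5 | 2 | 4 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 1 | 1 | 0 | 0 | 1 |
| 2 | 2 | 1 | 58 | 3 | 4 | 1 | 15 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 4 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| 3 | 3 | 2 | 45 | 2 | 3 | 1 | 28 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 1 | 2 | 1 | 1 | 0 | 0 | 0 |
| 4 | 4 | 2 | 50 | 3 | 3 | 3 | 9 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 2 | 1 | 1 | 2 | 0 | 0 | 1 |
| 5 | 5 | 2 | 33 | 4 | 3 | 1 | 35 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| 6 | 6 | 1 | 24 | 2 | 3 | 1 | 11 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 2 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 7 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
| 7 | 7 | 2 | 58 | 2 | 2 | 5 | 11 | 1 | 2 | 1 | 1 | 2 | 1 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 1 | 15 | 0 | 2 | 3 | 1 | 1 | 0 | 1 | 1 |
| 8 | 8 | 1 | 22 | 1 | 3 | 1 | 9 | 1 | 0 | 1 | 2 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 10 | 0 | 2 | 1 | 2 | 2 | 0 | 1 | 0 |
| 9 | 9 | 2 | 47 | 2 | 3 | 4 | 23 | 2 | 0 | 1 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 3 | 2 | 3 | 0 | 0 | 0 | 0 | 16 | 1 | 1 | 4 | 1 | 1 | 1 | 0 | 1 |
| 10 | 10 | 1 | 30 | 2 | 2 | 1 | 35 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 2 | 0 | 0 | 0 |
| 11 | 11 | 2 | 20 | 1 | 2 | 3 | 25 | 4 | 0 | 0 | 1 | 0 | 1 | 2 | 1 | 0 | 0 | 1 | 0 | 1 | 2 | 2 | 1 | 1 | 2 | 3 | 0 | 0 | 18 | 1 | 1 | 2 | 1 | 2 | 0 | 0 | 0 |
| 12 | 12 | 2 | 57 | 2 | 3 | 2 | 24 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 4 | 0 | 2 | 2 | 1 | 1 | 1 | 1 | 1 |
| 13 | 13 | 1 | 39 | 2 | 2 | 1 | 28 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 1 | 0 | 0 | 1 | 1 | 8 | 0 | 1 | 3 | 1 | 1 | 0 | 1 | 0 |
| 14 | 14 | 2 | 61 | 5 | 3 | 4 | 13 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 |
| 15 | 15 | 2 | 23 | 2 | 3 | 1 | 15 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 1 | 0 | 2 | 1 | 8 | 0 | 1 | 1 | 1 | 2 | 0 | 0 | 0 |
| 16 | 16 | 2 | 21 | 1 | 2 | 1 | 6 | 1 | 1 | 1 | 2 | 0 | 1 | 1 | 1 | 1 | 2 | 2 | 0 | 1 | 1 | 2 | 1 | 1 | 1 | 2 | 0 | 0 | 21 | 1 | 1 | 3 | 1 | 1 | 1 | 0 | 1 |
| 17 | 17 | 2 | 23 | 1 | 4 | 1 | 8 | 1 | 3 | 3 | 2 | 3 | 3 | 3 | 2 | 2 | 3 | 2 | 2 | 2 | 1 | 2 | 3 | 2 | 0 | 1 | 0 | 3 | 42 | 1 | 1 | 1 | 2 | 2 | 1 | 1 | 0 |
| 18 | 18 | 2 | 55 | 4 | 2 | 3 | 19 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 2 | 3 | 1 | 1 | 1 | 1 | 1 |
| 19 | 19 | 2 | 26 | 1 | 6 | 1 | 15 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 2 | 2 | 1 | 1 | 0 |
| 20 | 20 | 1 | 64 | 5 | 2 | 4 | 9 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 1 | 2 | 1 | 2 | 0 | 0 | 0 |
| 21 | 21 | 2 | 44 | 1 | 3 | 1 | 6 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| 22 | 22 | 2 | 25 | 2 | 3 | 1 | 35 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 4 | 0 | 1 | 2 | 1 | 1 | 0 | 1 | 1 |
| 23 | 23 | 2 | 72 | 5 | 3 | 4 | 7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 2 | 1 | 1 | 0 | 0 | 1 |
| 24 | 24 | 2 | 61 | 2 | 3 | 1 | 19 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 2 | 3 | 1 | 1 | 0 | 0 | 1 |
| 25 | 25 | 2 | 43 | 3 | 3 | 1 | 6 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 2 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 2 | 0 | 0 | 0 | 0 | 10 | 0 | 2 | 3 | 1 | 1 | 0 | 0 | 1 |
| 26 | 26 | 2 | 52 | 2 | 2 | 5 | 19 | 2 | 1 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 3 | 2 | 0 | 0 | 0 | 0 | 12 | 0 | 1 | 3 | 1 | 1 | 0 | 0 | 0 |
| 27 | 27 | 2 | 23 | 2 | 3 | 5 | 13 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 6 | 0 | 2 | 2 | 1 | 2 | 0 | 1 | 0 |
| 28 | 28 | 1 | 73 | 4 | 2 | 4 | 5 | 2 | 0 | 1 | 2 | 0 | 2 | 2 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9 | 0 | 1 | 3 | 1 | 1 | 0 | 0 | 1 |
| 29 | 29 | 2 | 34 | 2 | 3 | 2 | 19 | 2 | 0 | 2 | 2 | 0 | 1 | 0 | 2 | 1 | 1 | 1 | 1 | 2 | 3 | 2 | 3 | 3 | 2 | 0 | 0 | 2 | 28 | 1 | 1 | 2 | 1 | 2 | 0 | 0 | 0 |
| 30 | 30 | 2 | 34 | 2 | 3 | 1 | 20 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 2 | 1 | 2 | 0 | 0 | 1 |
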

Chapter 4

Data Visualization

4.1 Introduction

Visualizing data is one of the most important things we can do to become familiar with the data. There are often features and patterns in the data that cannot be uncovered with summary statistics alone. Data tend to be presented in two forms: summary tables, used for example to compare exact values between groups, and plots, used to convey trends and patterns when exact numbers are not always necessary to tell the story. This chapter introduces a series of plot types for both categorical and continuous data. We start with visualizations for a single variable only (univariate), then combinations of two variables (bivariate), and lastly a few examples and discussion of methods for exploring relationships between more than two variables (multivariate). Additional graphs designed for a specific analysis setting are introduced as needed in other chapters of this book.

This chapter uses several data sets described in Appendix A. Specifically, we use the parental HIV and the depression data sets to demonstrate different visualization techniques. Almost all graphics in this chapter are made using R, with Section 4.5 containing a discussion of graphical capabilities to create these graphs in other statistical software programs.

There are three levels of visualizations that can be created, with examples shown in Figure 4.1a, b, and c.

• For your eyes only (4.1a): Made by the analyst, for the analyst, these plots are quick and easy to create, using the default options without any annotation or context. These graphs are meant to be looked at once or twice for exploratory analysis in order to better understand the data.

• For an internal report (4.1b): Some chosen plots are then cleaned up to be shared with others, for example in a weekly team meeting or to be sent to co-investigators participating in the study. These plots need to be capable of standing on their own, but can be slightly less than perfect. Axis labels, titles, colors, annotations and other captions are provided as needed to put the graph in context.

• For publication or external report (4.1c): These are meant to be shared with other stakeholders such as the public, your collaborator(s) or administration. Very few plots make it this far. These plots should have all the “bells and whistles” as they appear in formal reports, and are often saved to an external file of a specific size or file type, with high resolution. For publication in most printed journals and books, figures typically need to be in black and white (possibly grayscale).

Along with having the audience in mind, it is important to give thought to the purpose of the chart. “The effectiveness of any visualization can be measured according to how well it fulfills the tasks it was designed for.” (A. Cairo, personal communication, Aug 9, 2018).

[Figure 4.1, three panels. (a) For your eyes only: default histogram titled “Histogram of parhiv$bsi_overall”, horizontal axis parhiv$bsi_overall, vertical axis Frequency. (b) For an internal report: histogram with horizontal axis bsi_overall and vertical axis count. (c) For publication: histogram titled “Brief Symptom Inventory Score”, horizontal axis BSI score, vertical axis Frequency.]

Figure 4.1: Three levels of graphic quality and completeness using histograms as examples. The size of each plot above is the same; the inclusion and formatting of titles and axes affect the size of the plotting region.

4.2 Univariate Data

This section covers how to visualize a single variable or characteristic. We start with plots for categorical data, then cover plots for continuous data. Visualization is one of the best methods to identify univariate outliers, skewness, low frequencies in certain categories and/or other oddities in the distribution of the data.

4.2.1 Categorical Data

Categorical (nominal or ordinal) data are summarized by reporting the count, or frequency, of records in the data set that take on the value for each category of the variable of interest. (See Section 2.3 for a review of data type classifications.) Common methods to display the counts of categorical data include tables, bar charts, dot plots, and pie charts. The subsections below discuss and demonstrate each of these types.

Tables

A table is the most common way to organize and display summary statistics of a categorical variable using just numbers. Tables should show both the frequency (N) and the percent for each category. That way readers can compare relative group sizes and the overall magnitude of data at the same time. Some software packages, such as SPSS, will automatically display percentages and generate a total row for frequency tables; other packages, such as R, require follow-up commands such as prop.table for percentages or addmargins for the total.

Table 4.1: Education level among mothers with HIV

| Education Level | N | Percent |
|---|---|---|
| Less than High School | 79 | 43.2% |
| High School Graduate/GED* | 57 | 31.1% |
| Post Secondary | 47 | 25.7% |

*GED: General Education Development, an alternative to a High School Diploma.

Table 4.1 shows that about a quarter (47, 25.7%) of mothers in the parental HIV data set have a post-secondary education level.
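As a rough illustration of how a frequency table with percentages and a total row is assembled, the Python sketch below rebuilds the records from the counts in Table 4.1 and tabulates them. The function name frequency_table is our own, and this is not the R code used to produce the table.

```python
from collections import Counter

# Rebuild the raw records from the counts shown in Table 4.1.
education = (
    ["Less than High School"] * 79
    + ["High School Graduate/GED"] * 57
    + ["Post Secondary"] * 47
)


def frequency_table(values):
    """Return (category, count, percent) rows followed by a total row."""
    counts = Counter(values)
    n = sum(counts.values())
    rows = [(cat, k, round(100 * k / n, 1)) for cat, k in counts.items()]
    rows.append(("Total", n, 100.0))
    return rows


for category, count, percent in frequency_table(education):
    print(f"{category:<26}{count:>5}{percent:>8.1f}%")
```

Reporting both columns lets a reader see that the 25.7% in the last category corresponds to 47 of 183 mothers, not to a handful of records.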

[Figure 4.2, two panels. (a) A bar chart comparing frequencies, titled “Frequency of educational level”: Less than High School 79, High School Graduate/GED 57, Post Secondary 47; horizontal axis Highest educational level attained, vertical axis count. (b) A bar chart comparing percentages, titled “Percent of mothers with each educational level”: 43.2%, 31.1%, 25.7%; horizontal axis Highest educational level attained, vertical axis from 0% to 100%.]

Figure 4.2: Two bar charts showing the distribution of highest level of education attained

Bar Charts

A bar chart (Figure 4.2) takes these frequencies and draws bars for each category (shown along the horizontal axis) where the heights of the bars are determined by the frequencies seen in the table (Figure 4.2a). A reasonable modification is to put the percentages on the vertical axis (Figure 4.2b). This is a place to be cautious, however. Some programs by default will exclude the missing data before calculating percentages, so the percentages shown are for available data. Other programs will display a bar for the missing category and display the percentages out of the full data set. For either choice it is advised that the analyst understand what the denominator is.

The ordering of categories is important for readability. Nearly all statistical software packages will set the automatic factor ordering to alphabetical, or according to the numerical value that is assigned to each category. If the data are ordinal, tables and plots should read left to right along with that ordering, such as the educational level example in Figure 4.2. Sometimes there is a partial ordering such as years of high school education and then different degrees that are not necessarily easy to summarize in years (can one year of vocational school be considered equivalent to one year of college or one year of community college?). In these situations it is left to the researcher to decide on how to define and justify the order of the categories using subject matter expertise.
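To see how much the choice of denominator can matter, consider the following small Python sketch. The data are made up, and the function name percents and its include_missing flag are hypothetical; they stand in for whatever default a given program applies.

```python
from collections import Counter


def percents(values, include_missing):
    """Percent per category; None marks a missing value."""
    if not include_missing:
        values = [v for v in values if v is not None]
    counts = Counter("Missing" if v is None else v for v in values)
    n = len(values)
    return {cat: round(100 * k / n, 1) for cat, k in counts.items()}


# Made-up data: 6 "yes", 2 "no", and 2 missing responses.
answers = ["yes"] * 6 + ["no"] * 2 + [None] * 2

print(percents(answers, include_missing=False))  # denominator is 8
print(percents(answers, include_missing=True))   # denominator is 10
```

The same six "yes" responses appear as 75% of the available data but only 60% of the full data set, so a bar chart of percentages should always state which denominator was used.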

Cleveland dot plot

Bars use a lot of ink, and the width of the bar is typically meaningless. Cleveland dot plots (Cleveland, 1993) provide an alternative method to display the frequencies using less ink and space than bar charts (Figure 4.3). This is especially helpful when there are a large number of categories to visualize. We use marital status as an example. Because it is a nominal variable (in contrast to the previous ordinal variable examples), these summary data are best displayed in descending order of frequency.

An important note is that Cleveland dot plots plot summary data, not raw data. After summarizing the data, such as calculating the frequency of records per category, we now only have one data point per category to plot. Examples of plots that include each individual data point are shown later in this chapter. There are numerous ways to depict data using dots on a graph. We attempt to be clear in our explanation of each plot discussed in this chapter; however, naming conventions of various “dot plots” are not universally consistent. For example, the type of plot shown in Figure 4.3 is referred to as a dot plot with example R code using the function dotchart by Kabacoff (2015).
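The two ideas above, one summary point per category and descending order of frequency, can be sketched as a text-only dot plot in Python. The counts are those labeled in Figures 4.3 and 4.4; the function name dot_plot_rows and the scale of two records per character are arbitrary choices made for this illustration.

```python
# Counts by marital status as labeled in Figures 4.3 and 4.4.
marital_counts = {
    "Never Married": 73,
    "Married": 127,
    "Divorced": 43,
    "Separated": 13,
    "Widowed": 38,
}


def dot_plot_rows(counts, scale=2):
    """Return one text row per category, sorted by descending frequency.

    Each row is a dotted reference line ending in a single "o" whose
    position is the count divided by `scale`.
    """
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    width = max(len(name) for name in counts)
    return [
        f"{name:<{width}} |{'.' * (count // scale)}o {count}"
        for name, count in ordered
    ]


for row in dot_plot_rows(marital_counts):
    print(row)
```

Only five points are drawn, one per category, even though they summarize 294 records.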


[Figure 4.3: Cleveland dot plot titled “Frequency of marital status”. Vertical axis: Marital Status; horizontal axis: Frequency (0 to 100 and beyond). Labeled counts: Married 127, Never Married 73, Divorced 43, Widowed 38, Separated 13.]

Figure 4.3: A Cleveland dot plot of marital status

[Figure 4.4: pie chart with wedge labels Divorced (14.6%), Married (43.2%), Never Married (24.8%), Separated (4.4%), Widowed (12.9%).]

Figure 4.4: A fully labeled pie chart of marital status. Note that percentages may not always add up to 100% due to rounding.

Pie Charts

Each wedge of a pie chart (Figure 4.4) contains an internal angle indicating the relative proportion of records in that category. However, human eyes cannot distinguish between angles that are close in size as well as they can distinguish between heights of bars (or lines or dots). As the number of categories increases, a necessary component to make a pie chart interpretable is having labels with names and percentages for each wedge. Depending on the defaults of the software program, the segments may start either at the 12 o’clock position or the 3 o’clock position.

4.2.2 Continuous Data

Continuous data by definition can take on infinitely many possible values, so the above plots that display frequencies of records within a finite number of categories do not apply here unless the continuous data are categorized into distinct groups (e.g. income brackets). To visualize continuous data, we need to display the actual value or the distribution of the data points directly. Common plot types include: stem-leaf plots, stripcharts, histograms, density graphs, boxplots, and violin plots. So, what do these plots depict and how are they generated?


Stem-leaf plots

The stem-leaf plot (Tukey, 1972) demonstrates how numbers placed on a line can describe the shape of the distribution of observed values and provide a listing of all individual observations in the same plot. Since this type of graphic includes each individual data point, the usefulness and readability diminishes as the number of data points increases. Figure 4.5 displays the values of age for individuals in the depression data set.

    1 | 8888899999
    2 | 00000011111122222222233333333333444444444
    2 | 5555556666666667777888889999
    3 | 00000011111222222222233333444444444
    3 | 555566666677777889
    4 | 000001222222222333333344
    4 | 555566677777788889999
    5 | 00000111111222233444
    5 | 55556667777778888888999999999
    6 | 000000011111222233444
    6 | 555556667788889
    7 | 000001112233444
    7 | 5778899
    8 | 011233333
    8 | 9

Figure 4.5: A stem-leaf plot of the individual age in the depression data set

Because stem-leaf plots display the value of every observation in the data set, the data values can be read directly. The first row displays data from 15 to 19 years of age, or, the second half of the 10s place. Note that this study enrolled only adults, so the youngest possible age is 18. There are five 18 year olds and five 19 year olds in the data set. From this plot one can get an idea of how the data are distributed and know the actual values (of ages in this example). The second row displays data on ages between 20 and 24, or, the first half of the 20s. The third row displays data on ages between 25 and 29, or, the second half of the 20s, and so forth.
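The row layout just described (tens digit as the stem, with leaves 0–4 on the first row of a stem and 5–9 on the second) can be sketched in Python as follows. The function name stem_leaf and the ages are made up for illustration; this is not the routine that produced Figure 4.5.

```python
from collections import defaultdict


def stem_leaf(values):
    """Build a stem-leaf display with each tens stem split into two rows.

    Leaves 0-4 go on the first row of a stem and leaves 5-9 on the second,
    matching the layout of Figure 4.5. Assumes non-negative integers.
    """
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        rows[(stem, leaf >= 5)].append(str(leaf))
    return [
        f"{stem} | {''.join(leaves)}"
        for (stem, _), leaves in sorted(rows.items())
    ]


# A few made-up ages, not the depression data.
ages = [18, 19, 19, 20, 23, 24, 25, 27, 31, 36, 36, 42]
print("\n".join(stem_leaf(ages)))
```

Every observation survives as one digit in the display, which is why the plot becomes hard to read as the number of data points grows.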

Stripcharts

Another type of plot where the value of every observation in the data set is represented on the graph is called a stripchart. Figure 4.6 depicts the age of each individual in the depression data set as a single dot. The points here have been jittered (where equal values are moved slightly apart from each other) to avoid plotting symbols on top of each other and thus making them difficult or impossible to identify.

[Figure 4.6: stripchart of jittered points, one per respondent. Horizontal axis: Age (ticks at 20, 40, 60, 80).]

Figure 4.6: A stripchart of the individual age in the depression data set
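Jittering can be sketched in a few lines of Python. In this illustration the data value stays on the horizontal axis and a small random offset is added in the other direction, which is one common way to separate tied points; the function name jitter, the amount of noise, and the seed are illustrative assumptions, not settings taken from the software used for Figure 4.6.

```python
import random


def jitter(values, amount=0.3, seed=1):
    """Return (x, y) pairs: x is the data value, y is small random noise.

    `amount` and `seed` are illustrative choices, not values from the text.
    """
    rng = random.Random(seed)
    return [(x, rng.uniform(-amount, amount)) for x in values]


ages = [18, 18, 18, 19, 19, 23, 23]      # made-up ages with ties
points = jitter(ages)

print(len(set(ages)), "distinct positions before jittering")
print(len(set(points)), "distinct positions after jittering")
```

Because the noise is applied off the data axis, each age can still be read exactly from the horizontal position of its dot.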

Here we offer more cautionary words due to similar sounding plot names. Some authors may refer to this type of plot as a dotplot, a one-way dot plot, a one-dimensional scatterplot, or a stripplot. In some programs, dotplots differ from stripcharts in that the width of the dot is determined by a binning algorithm similar to the ones used to calculate the widths of the bars in a histogram (Wilkinson, 1999). As the defaults of each software program, command, or function may vary, we encourage the reader to refer to the help manual for their chosen program for more information.

[Figure 4.7, two panels, each titled “Distribution of age” with Age on the horizontal axis and Frequency on the vertical axis. (a) A histogram with 8 bins. (b) Using bins of width range/30.]

Figure 4.7: Histograms displaying the distribution of age in the depression data set

Often we are not interested in the individual values of each data point; rather we want to examine the distribution of the data or summary measures of the distribution. Example questions might be: where is the majority of the data? Does the distribution look symmetric around some central point? Around what values do the bulk of the data lie? For example, the distribution of ages of individuals in the depression data set ranges from 18 to 89, is slightly skewed with a right tail, unimodal, with a mean around 45.

Histograms

Rather than showing the value of each observation, we often prefer to think of the value as belonging to a bin, or an interval. The heights of the bars in a histogram display the frequencies of values that fall into those bins. For example, if we grouped the ages of individuals in the depression data set into 10 year age bins, the frequency table looks like this:

| (15,25] | (25,35] | (35,45] | (45,55] | (55,65] | (65,75] | (75,85] | (85,95] |
|---|---|---|---|---|---|---|---|
| 57 | 61 | 42 | 41 | 51 | 26 | 15 | 1 |

In this table, the notation (15,25] indicates that this bin includes values that are between 15 and 25, including 25 but excluding 15, and so forth.

To create a histogram, the values of a continuous variable are plotted on the horizontal axis, with the height of the bar for each bin equal to the frequency of data within that bin (Figure 4.7a). The main difference between a histogram and a bar chart is that bar charts plot a categorical variable on the horizontal axis, so the vertical bars are separated. The horizontal axis of a histogram is continuous, so the bars touch each other, and thus there is no gap between bins because the bins represent directly adjacent categories.
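The right-closed interval notation can be made concrete with a short Python sketch that assigns a value to its bin. The function name bin_label and the handful of ages are hypothetical; the starting point of 15 and the width of 10 follow the table above.

```python
import math
from collections import Counter


def bin_label(value, start=15, width=10):
    """Return the right-closed bin (lower, upper] that contains `value`."""
    index = math.ceil((value - start) / width)   # 25 -> 1, 26 -> 2
    upper = start + index * width
    return f"({upper - width},{upper}]"


# Made-up ages chosen to sit on and around the bin edges.
ages = [18, 25, 26, 35, 36, 89]
table = Counter(bin_label(age) for age in ages)
print(table)
```

An age of exactly 25 lands in (15,25], while 26 starts the next bin; a program that closes its bins on the left would place the boundary values differently and produce slightly different bar heights.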

The choice of the size of each bin can highlight or hide some features in the data. If there is no scientifically motivated or otherwise prespecified bin size, we might start with the default value for the chosen statistical software package, as we did in Figure 4.7b, and then adjust as necessary. Figure 4.7b displays the same data on ages in the depression data set using the default value of range/30 chosen by the ggplot2 package in R. For this example the range of ages is 18 to 89, thus the binwidth is $(89 - 18)/30 \approx 2.4$. This choice of bin width shows the most frequent age at around 23, unlike Figure 4.7a where it appears to be over 25.


[Figure 4.8: density curve titled “Distribution of age”. Horizontal axis: Age (20 to 90); vertical axis: Density (0.00 to 0.04).]

Figure 4.8: Density plot for the distribution of age in the depression data set

Kernel density plots

Instead of plotting bars for each bin, we can sometimes get a better (or different) idea of the true shape of the distribution by creating a kernel density plot. The kernel density is a function, f(x), that is generated from the data set, similar to a histogram. Density plots differ from histograms in that this function is a smooth continuous function, not a stepwise discrete function that creates bars with flat tops. See Everitt and Skrondal (2010) for more information on how the kernel density is calculated.

Figure 4.8 shows that the density line smooths out the multitude of peaks and valleys in the histogram, providing a better idea of the general shape of the data. Notice that the vertical axis on a density plot is no longer the frequency or count, but the value of the kernel density. While the density curve in Figure 4.8 is overlaid on top of the histogram with the smaller bin width, the first peak of the density plot is around 25, which is more representative of Figure 4.7a which uses the wider bin size. This highlights the importance of looking at multiple types of graphics to fully understand the distribution of the data.
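For readers who want a feel for what such a function looks like, the Python sketch below builds one with a Gaussian kernel: each observation contributes a small bell-shaped bump, and f(x) is the average of those bumps. The text does not specify the kernel or bandwidth used for Figure 4.8, so the Gaussian kernel, the bandwidth of 5 years, the ages, and the function name gaussian_kde are all assumptions made for illustration.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return f(x), the average of Gaussian bumps centered on each data point."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def f(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return f


# Made-up ages and an arbitrary bandwidth of 5 years.
ages = [18, 19, 22, 23, 23, 25, 31, 40, 44, 58, 61, 72]
density = gaussian_kde(ages, bandwidth=5.0)

# Check with a simple Riemann sum on a grid from -20 to 120 that the
# area under the curve is close to 1.
step, n_steps = 0.1, 1400
area = sum(density(-20 + i * step) * step for i in range(n_steps))
print(round(density(23), 4), round(area, 3))
```

A larger bandwidth gives a smoother curve and a smaller one follows the data more closely, which plays the same role as the bin width in a histogram.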

Boxplots and Violin plots

A boxplot (also called box-whisker plot) displays the five number summary (Min, Q(0.25), Median, Q(0.75), Max) in graphical format, where Q(0.25) indicates that 25% of the data are equal to or below this value. The data are arranged in ascending order and then separated into four equal sized groups, i.e., the same number of data points are in each of the four sections of a boxplot (Figure 4.9a).

The box outlines the middle 50%, or the interquartile range (IQR = Q(0.75) − Q(0.25)), of the data, and the lines (whiskers) extend from the first quartile Q(0.25) down to the minimum value, and upwards from the third quartile Q(0.75) to the maximum value. This means that in Figure 4.9a, the same number of individuals in the depression data set fall in the 10 year span of ages from 18 to 28 as in the 30 year span of ages from 59 to 89.

Some statistical packages plot the modified boxplot by default. We first define the fences, i.e., Q(0.25) − 1.5 ∗ IQR and Q(0.75) + 1.5 ∗ IQR. Then, in the modified boxplot, the whiskers do not extend all the way out to the maximum and minimum, but out to the data points that are just inside the fences as calculated by the 1.5 ∗ IQR rule. In the modified boxplots, outliers are typically denoted as points or dots outside the fences. For example, consider the continuous measure of depression, variable CESD (Center for Epidemiological Studies Depression) in Figure 4.9b. Here the upper whisker extends to 25, the maximum value inside 1.5 ∗ IQR. The points above 25 are considered potential outliers. Some researchers choose to extend whiskers out to Q(0.05) and Q(0.95). See section for more information on how these values are calculated and used.

[Figure 4.9, two panels. (a) An unadorned boxplot of age: Min 18, Q(0.25) 28, Median 42.5, Q(0.75) 59, Max 89. (b) A modified boxplot of CESD: Min 0, Q(0.25) 3, Median 7, Q(0.75) 12, largest value within fence 25, with individual points plotted above the upper whisker.]

Figure 4.9: An unadorned, and a modified boxplot

Additions that make boxplots much more informative are displayed in Figure 4.10:

1) adding the mean as a point (Figure 4.10a).

2) adding a violin plot to show the density (reflected around the mid-line of the boxplot, Figure 4.10a). Violin plots are not commonly used, but they can be very informative in that they can display the shape of the kernel density in the same graph as the boxplot.

3) adding the data points directly as jittered dots (Figure 4.10b).

[Figure 4.10, two panels. (a) Add a violin plot & the mean (vertical axis: Age). (b) Add a violin plot & jittered points (vertical axis: CESD).]

Figure 4.10: Boxplot enhancements
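The 1.5 ∗ IQR rule for the modified boxplot can be sketched in Python as follows. The quartiles are the CESD values labeled in Figure 4.9b; the individual scores and the function names are made up for illustration, and quartile definitions themselves differ slightly between packages, so the quartiles are passed in as inputs here.

```python
def fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def whisker_ends(data, q1, q3):
    """Return the most extreme data values that lie inside the fences."""
    lower, upper = fences(q1, q3)
    inside = [x for x in data if lower <= x <= upper]
    return min(inside), max(inside)


def outliers(data, q1, q3):
    """Return the values outside the fences, drawn as separate points."""
    lower, upper = fences(q1, q3)
    return sorted(x for x in data if x < lower or x > upper)


# Quartiles of CESD as labeled in Figure 4.9b; the scores are made up.
q1, q3 = 3, 12
scores = [0, 1, 3, 5, 7, 9, 12, 18, 25, 26, 31, 42]

print(fences(q1, q3))
print(whisker_ends(scores, q1, q3))
print(outliers(scores, q1, q3))
```

With Q(0.25) = 3 and Q(0.75) = 12, the IQR is 9 and the upper fence is 12 + 13.5 = 25.5, so a score of 25 is the largest value a whisker can reach, in agreement with Figure 4.9b.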

Numerous other modifications of this type of plot and other methods to visualize univariate continuous data have been proposed (among others, box-percentile plots by Esty and Banfield, 2003, extending whiskers to specified quantiles by Cleveland, 1985; Reimann et al., 2008, and “mountain” plots by Goldstein, 1996).

4.3 Bivariate Data

Next we introduce graphical methods to explore relationships between two variables. Many of the same plotting types such as boxplots and histograms introduced for univariate exploration will be used again here.

4.3.1 Categorical versus Categorical Data

To compare the distribution of one categorical variable across levels of another categorical variable, primarily tables are created. Tables come under several names including cross-tabulation, contingency tables and two-way tables. Table 4.2 displays the frequency of gender by education level for individuals in the depression data set. The value in each cell is the number of records in the data set with that combination of factor levels. For example, there are 4 males with less than a high-school (HS) degree, 26 females who have completed a Bachelor’s (BS) degree, and 8 males who have completed a master’s (MS) degree.

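Table 4.2: Two-way frequency table of gender by educational level

| | <HS | Some HS | HS Grad | Some college | BS | MS | PhD | Total |
|---|---|---|---|---|---|---|---|---|
| Male | 4 | 19 | 39 | 18 | 17 | 8 | 6 | 111 |
| Female | 1 | 42 | 75 | 30 | 26 | 6 | 3 | 183 |
| Total | 5 | 61 | 114 | 48 | 43 | 14 | 9 | 294 |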

When group sizes are not comparable, it is more informative to compare percents instead of frequencies. There are three types of percentages that can be calculated, with each one having its own purpose. Table 4.3 displays a table of cell percents, where the denominator is the entire sample. There are 13.3% of all respondents in this data set who are male and have graduated high school. Table 4.4 displays the row percents, where the denominator is the row total. Relatively more males than females completed a four year degree: 15.3% of males completed a BS degree, compared to 14.2% of females. Table 4.5 displays the column percents, where the denominator is the column total. The majority of PhD graduates were male; 66.7% of the PhD graduates were male and 33.3% female.

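Table 4.3: Cell percents: Percent out of the entire data set

| | <HS | Some HS | HS Grad | Some college | BS | MS | PhD | Total |
|---|---|---|---|---|---|---|---|---|
| Male | 1.4 | 6.5 | 13.3 | 6.1 | 5.8 | 2.7 | 2.0 | 37.8 |
| Female | 0.3 | 14.3 | 25.5 | 10.2 | 8.8 | 2.0 | 1.0 | 62.2 |
| Total | 1.7 | 20.7 | 38.8 | 16.3 | 14.6 | 4.8 | 3.1 | 100.0 |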

Table 4.4: Row percents: Percent of educational level within each gender

| | <HS | Some HS | HS Grad | Some college | BS | MS | PhD | Total |
|---|---|---|---|---|---|---|---|---|
| Male | 3.6 | 17.1 | 35.1 | 16.2 | 15.3 | 7.2 | 5.4 | 100.0 |
| Female | 0.5 | 23.0 | 41.0 | 16.4 | 14.2 | 3.3 | 1.6 | 100.0 |


Table 4.5: Column percents: Percent of gender within each educational level

| | <HS | Some HS | HS Grad | Some college | BS | MS | PhD |
|---|---|---|---|---|---|---|---|
| Male | 80.0 | 31.1 | 34.2 | 37.5 | 39.5 | 57.1 | 66.7 |
| Female | 20.0 | 68.9 | 65.8 | 62.5 | 60.5 | 42.9 | 33.3 |
| Total | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
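The three kinds of percentages differ only in their denominator, which the following Python sketch makes explicit using the counts from Table 4.2. The function names are our own, and the sketch is meant as an illustration of the arithmetic, not as the code used to build Tables 4.3 to 4.5.

```python
# Counts from Table 4.2 (rows: gender, columns: educational level).
levels = ["<HS", "Some HS", "HS Grad", "Some college", "BS", "MS", "PhD"]
counts = {
    "Male":   [4, 19, 39, 18, 17, 8, 6],
    "Female": [1, 42, 75, 30, 26, 6, 3],
}


def cell_percents(table):
    """Each cell divided by the grand total."""
    total = sum(sum(row) for row in table.values())
    return {g: [round(100 * x / total, 1) for x in row] for g, row in table.items()}


def row_percents(table):
    """Each cell divided by its row total."""
    return {g: [round(100 * x / sum(row), 1) for x in row] for g, row in table.items()}


def column_percents(table):
    """Each cell divided by its column total."""
    col_totals = [sum(col) for col in zip(*table.values())]
    return {
        g: [round(100 * x / t, 1) for x, t in zip(row, col_totals)]
        for g, row in table.items()
    }


hs, bs, phd = levels.index("HS Grad"), levels.index("BS"), levels.index("PhD")
print(cell_percents(counts)["Male"][hs])     # male HS graduates, of all respondents
print(row_percents(counts)["Male"][bs])      # BS degrees, of all males
print(column_percents(counts)["Male"][phd])  # males, of all PhD graduates
```

The same cell can therefore be reported with three different percentages, so a table should always state which total was used as the denominator.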

[Figure 4.11, two panels of stacked bar charts by highest education attained, with bars filled by Job Status (Employed, Unemployed, Retired/Disabled). (a) Frequency on the vertical axis. (b) Percents on the vertical axis, from 0% to 100%.]

Figure 4.11: Distribution of current job status within highest education attained in the parental HIV data set

Bar Charts

To visually compare the distribution of one categorical variable within levels of another categoricalvariable, we return to bar charts. Figure 4.11 compares the distribution of job status within thehighest educational level attained using the parental HIV data set.

Stacked bar charts can be informative when plotting percentages instead of counts. Figure 4.11bshows how the proportion of observations in each job status category compare across each level ofhighest educational level attained. This plot is created by plotting column percentages, so that allpercents within a column add up to 100%. The group of respondents whose highest education levelis post secondary has the highest proportion of employed respondents compared to the other twoeducational level groups.

The default for some software programs, such as R, is a stacked bar chart as shown in Figure 4.11. For few categories this option could be acceptable. However, consider the proportion of those with HS/GED who are unemployed: is it bigger or smaller than the percent of those with post secondary degrees who are currently unemployed? It is difficult to tell in a stacked bar chart, but much easier to see the difference with the bars placed side by side (Figure 4.12).

Alternatively, stacked bar charts could be displayed in a tabular manner with spaces between the bars. This spacing allows for easier comparison within and across categories. Figure 4.13 provides an example of this method using the community-contributed Stata command tabplot (Cox, 2016).

The Cleveland dot plot can also be done across groups. Figure 4.14 demonstrates a slight variation where the dot is placed at the end of the solid line, instead of on a reference line as in Figure 4.6. In some fields this type of plot is referred to as a lollipop plot. This is also the first demonstration of paneling, where the data for each level of the grouping variable are set apart from the other levels using a rectangular border or frame. This method helps to visually separate the groups.


Figure 4.12: Side by side bar chart depicting the percent of current job status within highest education level attained in the parental HIV data set. Vertical axis: Percent. Legend: Job Status (Employed, Unemployed, Retired/Disabled).

Figure 4.13: Tabular bar chart comparing job status and educational level. Horizontal axis: Education Status (Did not complete HS, HS diploma/GED, More than HS diploma). Vertical axis: Job Status. Cell labels give the % of Education category: Employed 9.6, 12.5, 20.0; Unemployed 58.9, 44.6, 42.2; Retired/Disabled 31.5, 42.9, 37.8.

Another way the Cleveland dot plot can be used is to highlight differences in frequencies between two groups. Figure 4.15 shows the difference in the frequency of males and females from the parental HIV data, within each of the mother's job status categories. There are more males than females with unemployed mothers, but the difference in counts between genders is less than 10. There are about 20 more females than males with mothers who are retired or disabled.

Mosaic plot

Bar plots and dot plots show either the row or the column percents of a bivariate comparison. They compare the distribution of one categorical variable within levels of a second categorical variable. Mosaic plots provide a graphical method to compare the association between two categorical variables.

Figure 4.14: Cleveland dot plot demonstrating the frequency of job status within highest education attained. Panels: Less than High School; High School Graduate/GED; Post Secondary. Horizontal axis: Frequency. Vertical axis: Job Status (Employed, Retired/Disabled, Unemployed).

Figure 4.15: Cleveland dot plot of the differences in frequency of males and females within the mother's job status. Horizontal axis: Frequency. Vertical axis: Mother's Job Status (Employed, Retired/Disabled, Unemployed). Legend: Gender of teen (Male, Female).

Figure 4.16 compares job status to educational level by visualizing the cell proportions as areas within a square. The heights of the boxes correspond to the marginal distribution of educational level, and the widths of the boxes correspond to the marginal distribution of job status. The area of each smaller rectangle is proportional to the percent of data with that combination of levels. Using Table 4.6 as a numerical reference, 4% of responses in the parental HIV data set have a GED and are employed, whereas 24.7% have less than a HS education and are currently unemployed. This may seem like a high proportion of unemployment, but recall that these data were collected in the early nineties, when there were very limited treatment options for HIV positive individuals.

Table 4.6: Cell percentages for the combination of educational level and job status

                            Employed   Unemployed   Retired/Disabled
Less than High School            4.0         24.7               13.2
High School Graduate/GED         4.0         14.4               13.8
Post Secondary                   5.2         10.9                9.8
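As a rough numerical companion to the mosaic plot, the sketch below starts from the cell percentages in Table 4.6 and derives the two marginal distributions and the percent of each job status within an education level. The variable names are illustrative. Because the inputs are already rounded to one decimal, the derived within-education percentages agree with those displayed in Figure 4.13 only to within rounding.

```python
# Cell percentages copied from Table 4.6 (rows: education, columns: job status).
JOBS = ["Employed", "Unemployed", "Retired/Disabled"]
cell_pct = {
    "Less than High School":    [4.0, 24.7, 13.2],
    "High School Graduate/GED": [4.0, 14.4, 13.8],
    "Post Secondary":           [5.2, 10.9, 9.8],
}

# Marginal distribution of education: each row total.
edu_margin = {edu: sum(row) for edu, row in cell_pct.items()}

# Marginal distribution of job status: each column total.
job_margin = {
    job: sum(row[j] for row in cell_pct.values()) for j, job in enumerate(JOBS)
}

# Percent of each job status within an education level (cell / row total).
within_edu = {
    edu: {job: 100 * row[j] / edu_margin[edu] for j, job in enumerate(JOBS)}
    for edu, row in cell_pct.items()
}

for edu, dist in within_edu.items():
    print(edu, {job: round(p, 1) for job, p in dist.items()})
```

Read this way, the question posed for the stacked bar chart has a direct answer: about 45% of the HS/GED group is unemployed, compared with about 42% of the post secondary group.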

4.3.2 Continuous versus Categorical Data

When comparing the distribution of a continuous variable across levels of a categorical variable, the same types of plots seen for a single continuous variable can be used, including histograms, density plots, boxplots and violin plots.


Figure 4.16: Mosaic plot comparing job status and educational level. Horizontal axis: Job Status (Employed, Unemployed, Retired/Disabled). Vertical axis: Highest Education Level.

Figure 4.17 demonstrates three options: plotting the distribution within each group side by side (a), overlaying the plots on the same plotting grid (b), or creating a grid of panels (c) with one group per panel. It is very important to use a shared or common axis when comparing conditional distributions across groups.
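One way to enforce a shared axis for paneled histograms is to compute the bin edges once from the pooled data and reuse them for every group. The following Python sketch illustrates the idea; the scores are made-up illustrative values (not the depression data), and the function names are hypothetical.

```python
# Hypothetical CESD scores for two groups (illustrative values, not study data).
scores = {
    "Male":   [0, 2, 3, 5, 8, 12, 15, 21],
    "Female": [1, 4, 4, 7, 9, 16, 24, 33, 47],
}


def shared_edges(groups, width):
    """Bin edges spanning the pooled range, so every panel uses the same axis."""
    pooled = [x for values in groups.values() for x in values]
    lo = (min(pooled) // width) * width
    hi = (max(pooled) // width + 1) * width
    return list(range(lo, hi + width, width))


def bin_counts(values, edges):
    """Count the values falling in each half-open bin [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for x in values:
        for i in range(len(counts)):
            if edges[i] <= x < edges[i + 1]:
                counts[i] += 1
                break
    return counts


edges = shared_edges(scores, width=10)
histograms = {group: bin_counts(values, edges) for group, values in scores.items()}

for group, counts in histograms.items():
    print(group, counts)
```

If each group were binned over its own range instead, the male panel would stop near 20 and the long right tail in the female panel would not be visible as a difference between the groups.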

Figure 4.17: Three methods to compare the distribution of the continuous variable CESD across levels of the categorical variable gender. Panels: (a) Side by side; (b) Overlapping; (c) Paneled.

4.3.3 Continuous versus Continuous Data

The most common method of visualizing the relationship between two continuous variables is the scatterplot (Figure 4.18a). Lines are often added to help see the trend in the data points. The two most common best fit methods are the straight line (shown as a solid line) and the lowess smoother line (shown as a dashed line). In this plot the points have been colored grey to place the emphasis on the lines. These two methods are both regression techniques discussed in Chapter 7.
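For readers who want to see what the straight best fit line involves, here is a minimal Python sketch of an ordinary least squares line computed from the usual sums of squares. The data values are hypothetical, and the lowess smoother is not implemented here, since it requires locally weighted fits that a statistical package normally supplies.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) of the ordinary least squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical ages and CESD scores (illustrative only).
age = [20, 30, 40, 50, 60, 70]
cesd = [14, 11, 9, 8, 8, 6]

intercept, slope = least_squares_line(age, cesd)
print(f"fitted line: CESD = {intercept:.2f} + ({slope:.3f}) * age")
```

The fitted line always passes through the point of means, which is why the intercept is obtained from the two means once the slope is known.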


Line plots

Line plots connect the points with a line. This is typical for time series and for profile plots, where the goal is to track the data on an individual or a population over time. One line is plotted per individual. For data sets with a large number of individuals this process can create an unreadable plot. We suggest plotting data on a random subset of individuals to explore the data, or creating multiple such plots for subsets if feasible.
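The suggestion to plot a random subset can be sketched as follows. The key point is to sample individuals, not rows, so that every selected line remains complete. The records below are synthetic and the function name `sample_subjects` is hypothetical; a fixed seed is used only to make the selection reproducible.

```python
import random

# Synthetic long-format records: (subject id, day, weight in grams).
records = [
    (mouse, day, 200 + 30 * day + 5 * mouse)
    for mouse in range(1, 15)
    for day in (1, 4, 8, 12, 16, 20)
]


def sample_subjects(rows, k, seed):
    """Keep all rows belonging to k randomly chosen subjects."""
    ids = sorted({row[0] for row in rows})
    rng = random.Random(seed)          # local generator, reproducible
    chosen = set(rng.sample(ids, k))   # sample subjects, not rows
    return [row for row in rows if row[0] in chosen]


subset = sample_subjects(records, k=5, seed=2024)
print(sorted({row[0] for row in subset}))
```

Each sampled subject keeps all of its time points, so the resulting profile plot shows five complete trajectories.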

Figure 4.18b uses the mice data set described in Appendix A, where the weights of mice were measured periodically for about a month. The mice grew at almost the same rate until about 8 days, and then started to separate due to individual and treatment characteristics. This particular type of plot is also known as a spaghetti plot or a growth curve because it typically presents a measure of growth over time. We use this plot again in Section 18.7 when discussing how to analyze longitudinal data.

Figure 4.18: Two types of scatterplots for examining the relationship between two continuous variables. Points in the scatterplot (a) are not connected to other points whereas in the line plot (b) points from the same mouse are connected with a line. Panels: (a) Scatterplot of age against CESD, with a dashed lowess line and solid best fit linear line; (b) Weight (g) over time (days) for 14 mice.

4.4 Multivariate Data

The techniques of applying colors, shadings, positioning and paneling of data from multiple groups to visualize bivariate relationships can be extended to visualize relationships among more than two variables simultaneously.

Figure 4.19 demonstrates how a third dimension can be added onto a bivariate scatterplot by changing the (a) color or (b) shape of the points according to the level of a third categorical variable, or by changing the (c) size or (d) fill shade of the points according to a continuous variable. For example, plot (a) allows us to see that the points in the low range of age with high CESD score are primarily female (grey dots), and plot (d) that those in the higher income levels (darker shades) tend to have a CESD score below 10.
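Mapping a continuous variable to point size or fill shade amounts to a linear rescaling of that variable onto the range of the visual property. Plotting software does this internally; the Python sketch below only illustrates the arithmetic, with hypothetical income values and an illustrative function name.

```python
def rescale(values, out_min, out_max):
    """Linearly map values onto [out_min, out_max] (which may be reversed)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: place every point at the midpoint of the range.
        return [(out_min + out_max) / 2 for _ in values]
    return [out_min + (v - lo) * (out_max - out_min) / (hi - lo) for v in values]


# Hypothetical incomes (illustrative only).
income = [5, 15, 25, 45, 65]

# Larger income -> larger point.
point_size = rescale(income, 2.0, 10.0)

# Larger income -> darker point (grey level 1.0 is white, 0.0 is black).
grey_level = rescale(income, 0.9, 0.1)

for inc, size, grey in zip(income, point_size, grey_level):
    print(inc, round(size, 2), round(grey, 2))
```

Reversing the output range, as done for the grey level, is what makes higher values appear darker.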

There are many other ways to examine a multivariate relationship. Even on each of the plots just discussed a fourth layer could be added, such as changing the size of the point by income in plots (a) and (b), or the shape of the point by gender in plots (c) and (d). Another method to examine a multivariate relationship is to use paneling in two dimensions. Figure 4.20 demonstrates how we can examine the histogram of overall BSI (brief symptom inventory) score for each combination of employment status and highest educational level attained.

Figure 4.19: Scatterplot of CESD as a function of age, with a third characteristic included using different methods. Panels: (a) color (gender); (b) shape (gender); (c) size (income); (d) shading (income).

A scatterplot matrix is a common tool to examine the bivariate relationships between multiple continuous variables simultaneously. Figure 4.21 demonstrates a publication-ready version of a scatterplot matrix that has many features added, including the pairwise correlation (see Chapter 7 for details), univariate histograms and lowess lines on the scatterplots. Each diagonal rectangle displays the histogram of a particular variable. This single plot lets us identify characteristics of the data such as (1) the distribution of BSI is skewed with a long tail to the right, as demonstrated by the tall bar representing the frequency of responses for low values of BSI and a few very short bars for high values of BSI; (2) the parent bonding care sub-scale is also skewed, since the bars are short for low values of the sub-scale and increase in height as the scale increases; (3) there is a moderate positive correlation (r = 0.37) between the age at which a youth starts smoking and the age at which they start drinking; and (4) none of the other three variables seems to be correlated with BSI, since the lowess lines through the scatterplots are all approximately horizontal.

We remind the reader that the example of a scatterplot matrix shown in Figure 4.21 was created with some customization applied. Each software program has different defaults, such as background coloring, axis labels and tick marks, that the user may want to modify. For example, a researcher may want to change Figure 4.21 to show all axis labels on the left and bottom for familiar positioning.

Figure 4.20: A histogram of overall BSI score paneled on the combination of two other variables: employment status and highest educational level attained. Columns: Employed, Unemployed, Retired/Disabled. Rows: Less than High School, High School Graduate/GED, Post Secondary. Horizontal axis: Overall BSI score. Vertical axis: Frequency.

Another way to visualize the relationship between many continuous variables is by examining the correlation between all pairs of variables. The correlation is a numeric summary statistic that quantifies the direction and strength of a relationship between two continuous measures. The calculation and interpretation of the correlation can be found in Section 8.7. Figure 4.22 demonstrates one approach, where the circle sizes and shading represent the direction and magnitude of the correlation. Here the shading gradient goes from white (+1) to black (-1), which is not actually recommended for a diverging scale such as the correlation. For more information on choosing appropriate color gradients see Zeileis et al. (2009). This approach to visualizing the correlation matrix is useful in some multivariate analyses such as principal components analysis, discussed in Chapter 14.
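The numbers behind such a display are simply the pairwise Pearson correlations. A minimal Python sketch, assuming complete data with no missing values, is given below; the variable names and values are hypothetical and are not taken from the parental HIV data set.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(columns):
    """Nested dict holding the correlation for every pair of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical values (illustrative only).
data = {
    "age_smoke": [10, 12, 13, 14, 15, 16],
    "age_drink": [11, 12, 12, 15, 14, 17],
    "bsi":       [1.2, 0.3, 0.8, 0.2, 1.5, 0.6],
}

corr = correlation_matrix(data)
for name, row in corr.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

Each entry lies between -1 and +1 and the matrix is symmetric, so a display only needs to show one triangle of it.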

4.5 Discussion of computer programs

Each general statistical software package has commands or procedures to produce many, if not all, of the plots or visualizations we describe in this chapter. Table 4.7 shows which command can be used to produce a particular plot using the four major packages discussed in this book. The full R code for all tables and plots in this chapter is available on the CRC Press and UCLA web sites (see Appendix A).

Additional notes for Table 4.7 and most other software command tables in the book:

• R: Entries that are in monospace font are functions within Base R. Entries in normal font are packages that contain functions (not specifically listed here) that are used to create the selected plot. All packages in R are user written and must be installed prior to use.

• SAS: All entries are individual procedures, called PROCs. Not all are part of BASE SAS. PROC GPLOT, GCHART, and GTL are part of SAS/GRAPH. PROC TEMPLATE is listed here as part of the Graph Template Language, which provides full customization of SAS Graphics.

• SPSS: With the exception of creating tables, all available graphics are best built using the Chart Builder. Table entries provide guidance for the reader to find the appropriate selection. The Chart Builder also has tools to easily change the color and shape of the point (or marker).

• Stata: Options within commands are written in (italics). Entries marked with a dagger † are community-contributed commands.

Table 4.7: Software commands for plotting

Visualizations          R*                            SAS                   SPSS                                 Stata

Univariate
Categorical
Table                   table                         FREQ                  FREQUENCIES                          table, tabulate
Bar Chart               plot, ggplot2                 GCHART                Bar                                  graph bar, catplot†, tabplot†
Cleveland Dot Plot      dotchart, ggplot2             SGPLOT, FREQ          Scatter/Dot - Summary Point Plot     graph dot
Pie Chart               pie                           GCHART                Pie                                  graph pie
Continuous
Stem-Leaf               stem                          SGPLOT, UNIVARIATE    EXAMINE                              stem
Stripchart/Dotplot      stripchart, ggplot2           SGPLOT                Scatter/Dot                          dotplot, stripplot†
Histograms              hist, ggplot2                 SGPLOT, UNIVARIATE    Histogram                            histogram
Kernel Density          plot, ggplot2                 SGPLOT, UNIVARIATE    Histogram                            kdensity
Boxplot                 boxplot, ggplot2              BOXPLOT, SGPLOT       Boxplot                              graph box
Violin Plots            ggplot2                       TEMPLATE              -                                    vioplot†

Bivariate
Cat v Cat
Two-way table           table                         FREQ                  CROSSTABS                            table, tabulate
Bar Chart               plot, ggplot2                 GCHART                Bar                                  graph bar, catplot†, tabplot†
Mosaic Plot             mosaicplot, vcd               TEMPLATE              -                                    spineplot† (a)
Cont v. cat grouping    plot, ggplot2                 BOXPLOT, SGPLOT       Histogram, Boxplot                   histogram, graph box
Cont v Cont
Scatterplot             plot, ggplot2                 GPLOT, SGPLOT         Scatter                              scatter
Line Plot               plot, ggplot2                 GPLOT, SGPLOT         Line                                 line

Multivariate
paneling                ggplot2, lattice              SGPANEL               Groups tab                           by
colors, size            plot, ggplot2                 GPLOT, GCHART         -                                    color, (msize, weight)
scatterplot matrix      psych, lattice, pairs, car    SGSCATTER             Scatter/Dot - Scatterplot Matrix     graph matrix
correlation matrix      corrplot                                                                                 twoway scatter, corrtable†

(a) Technically not a mosaic plot (Hummel, 1996)
* Monospace font denotes the function name. Normal font denotes a user written package containing functions to perform the specified task
† Community contributed command
Italic text denotes options available within several commands


Figure 4.21: Scatterplot matrix with histograms and density plots along the diagonal, and pairwise correlation values above the diagonal. Variables: BSI, Parental care subscale, Age started smoking, Age started drinking.

4.6 What to watch out for

• Avoid complexity. We advise against using too many enhancements on a single plot since doing so can confuse the reader instead of providing a better understanding. The purpose of most graphics is to understand distributional patterns, and to identify odd data points. Not all layers provide illumination. For example, in Figure 4.19c there is much over-plotting in the lower left so that it is difficult to see if there is a pattern emerging. In this case coloring the points for different income levels may be more helpful than changing the size, although barely.

• Choose colors mindfully. All plots in this textbook are either in black and white or shaded using a grayscale. This is a necessary adjustment for black and white printing, but also is a consideration for colorblind readers. We recommend using a colorblind-friendly color palette for publications involving color.

• Do not add extra dimensions. We do not demonstrate plots such as 3D pie charts or 3D bar charts in this text. In these cases the third dimension does not provide true information, and is considered “chart junk” that can be very misleading. This is not a global recommendation; there are circumstances in which a third dimension does contain additional information that may be useful to include in the visualization.

Figure 4.22: Visual representation of a correlation matrix, with circle sizes and shading representing the direction and magnitude of the correlation between all pairs of variables. Variables shown: BSI, Parental care subscale, Age started smoking. Shading scale runs from −1 to 1.

• Be truthful with the scaling. Be mindful of the scaling of the vertical axis. For example, Figure 4.2b plots a percentage on the vertical axis with a high value of near 50%. We scale the vertical axis to 100% here to put the difference in percentages in the context of the overall range. A difference of 2 percentage points between categories can appear huge if the vertical axis only has a total range of, say, 5%. Similarly Figure specifically has a zero mark for the frequency. Displaying a vertical axis that is too large or too small relative to the data is one of the most common ways in which graphics can be misleading.

• Check publishing guidelines. If the goal is to publish using graphics, then be sure to check the rules carefully. Some publications have rules regarding features such as whether there is a box around the plot versus only showing the horizontal and vertical axes.

• Be consistent with selected themes. For example, if the first plot has a clear background and a box outlining the edge, all subsequent plots should have that same theme. If a second categorical variable controls the color or shape of the points for one plot, then all subsequent plots that also use that same categorical variable should have the same color and/or shape scheme applied.

• Do not over-interpret. One should not judge statistical significance based on graphs unless they are specifically designed for that purpose. Even if they are (seemingly) designed for it (e.g., graphing confidence intervals for group comparisons, or 95% pointwise confidence intervals or confidence bands for graphs), they can produce different and potentially misleading results and interpretations as compared to appropriate hypothesis tests.

• Plotting with missing data. Each software program has slightly different default values for how missing data are handled in different plots. For example, the table function in R does not automatically include a column for missing data, but a bar chart created using ggplot2 will show a bar for the missing category. In all software packages, values that are missing will be omitted from continuous data plots such as histograms and scatterplots, typically with no mention of the sample size used to create the plot shown.
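Because plotting functions usually drop missing values silently, it can help to count them explicitly and report the sample size alongside the plot. The Python sketch below shows one way to do this for a scatterplot of two variables; the data are hypothetical, and both `None` and NaN are treated as missing by assumption.

```python
import math


def complete_cases(x, y):
    """Return the (x, y) pairs with no missing value, and the number dropped."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    kept = [(a, b) for a, b in zip(x, y) if not is_missing(a) and not is_missing(b)]
    return kept, len(x) - len(kept)


# Hypothetical values; None and NaN both stand for missing observations.
age = [34, None, 51, 28, 45]
cesd = [12, 7, float("nan"), 3, 20]

pairs, n_dropped = complete_cases(age, cesd)
caption = f"n = {len(pairs)} plotted; {n_dropped} omitted because of missing values"
print(caption)
```

A note of this kind in the caption tells the reader how many observations the plot is actually based on.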


4.7 Summary

Data visualization is a powerful tool that can be used to explore and understand your data. Data values that need to be recoded can easily be identified and trends in the data can be uncovered. Graphics can be used to confirm that the data meet certain criteria or assumptions that are needed for further statistical analysis, such as the specialized analysis methods presented in the remainder of the book. Data visualizations can also enhance understanding of statistical analysis results, but do not serve as a substitute. Even though the saying “a picture is worth a thousand words” may be true, and a graph can provide more information than a block of text or a table of numbers, when it comes to more than a few variables, the options for graphically representing the data are limited.

We have demonstrated a wide variety of visualizations in this chapter. Some plots require detailed written explanations and are more suitable for reports or publications that do not have length restrictions. Authors should attempt to strike a balance between complexity and interpretability of graphics, yet always aim to elucidate characteristics of data relevant to the question being asked or answered.

There are many other types of graphics that we do not discuss, such as heatmaps, ridgelines, choropleth maps and word clouds. These are typically considered specialized graphics for specific analyses. We present some specialized plots in the appropriate chapters of this book but do not attempt to cover all possible ways to display information visually. We recommend looking at Edward Tufte’s pioneering work for historical overviews and inspiration (Tufte, 2001, 2006), and Yau (2011) for how to tell stories with real world data. Additional handbooks that include practical advice on how to choose and create effective visualizations include Munzner (2014), Cairo (2016), Kirk (2019), and Holtz and Conor (2018).

For the programming language details on how to make the graphics shown in this chapter and others, we refer readers to this book’s supplemental webpage and reference books such as the R Graphics Cookbook by Chang (2013), R Graphics by Murrell (2011), ggplot2: Elegant Graphics for Data Analysis by Wickham (2016), R for Data Science: Import, Tidy, Transform, Visualize, and Model Data by Wickham and Grolemund (2017), Statistical Graphics in SAS by Kuhfeld (2010), A Visual Guide to Creating Graphs Interactively by Matange and Bottitta (2016), Handbook of Statistical Graphics using SAS by Der and Everitt (2014), the IBM SPSS Statistics 24 Brief Guide (IBM, 2016), Building SPSS Graphs to Understand Data by Aldrich and Rodriguez (2012), Speaking Stata Graphics by Cox (2014), and A Visual Guide to Stata Graphics by Mitchell (2012).

4.8 Problems

Descriptions of data sets, how to obtain them, and the codebooks can be found in Section 1.2 and Appendix A.

4.1 From the lung function data set, determine how many families have one child, two children, and three children between the ages of 7 and 18.

4.2 For the depression data set, determine if any of the variables have observations that do not fall within the ranges given in Table 3.4, codebook for depression data.

4.3 For the lung function data set, create a new variable called AGEDIFF = (age of child 1) – (age of child 2) for families with at least two children. Produce a frequency count of this variable. Are there any negative values? Comment.

4.4 Construct histograms for mothers’ and fathers’ heights and weights from the lung function data set. Describe cases that you consider to be outliers.

4.5 For the lung cancer data set,
a) construct a histogram of the variable Days
b) for every other variable produce a frequency table of all possible values.

4.6 For the lung function data set, produce a two-way table of gender of child 1 versus gender of child 2 (for families with at least two children). Describe the distribution of genders for these families.

4.7 For the lung cancer data set,
a) produce a separate histogram of the variable Days for small and large tumor sizes (0 and 1 values of the variable Staget)
b) compute a two-way frequency table of the variable Staget versus the variable Death
c) comment on the results of (a) and (b).

4.8 Construct a scatterplot of income versus employment status from the depression data set. From the data in this table, decide if there are any adults whose incomes are unusual considering their employment status. Are there any adults in the data set whom you think are unusual?

4.9 Using the lung function data explore and describe the following relationships:
a) How does the residential area affect the lung function of the parents?
b) For the oldest child, plot the relationship between FEV1 and (i) age; (ii) height; (iii) weight using a scatterplot and lowess line.

4.10 Using the parental HIV data set create a few visualizations to explore the relationship between the variables of interest listed below. In a few sentences describe the information about the given relationship you learn from each graph, and what specific features of the graph led you to that conclusion.
a) The relationship between the age when a child starts smoking and when they start drinking.
b) Ethnicity, a choice of one neighborhood characteristic, and financial situation of the household.
c) Attendance of religious services and level of religiousness/spiritualism.

4.11 Using a scatterplot matrix, repeat the problem 4.8 for fathers’ measurements instead of those of the oldest child. Did you find the same pattern of relationships between body measurements and FEV1 in fathers as you did for the oldest child?

4.12 Using the mice data, create a profile plot for the average weight of mice per group over time.

Chapter 5

Data screening and transformations

5.1 Transformations, assessing normality and independence

In Section 3.4 we discussed the use of transformations to create new variables. In this chapter wediscuss transforming the data to obtain a distribution that is approximately normal. This is of partic-ular interest for exploratory data analysis. For confirmatory data analysis (as described in Chapter1) one should choose any appropriate transformations of variables prior to performing any analy-ses. Section 5.2 shows how transformations change the shape of distributions. Section 5.3 discussesseveral methods for deciding when a transformation should be made and how to find a suitabletransformation. An iterative scheme is proposed that helps to zero in on a good transformation andstatistical tests for normality are evaluated. Section 5.4 presents simple graphical methods for deter-mining if the data are independent. Section 5.5 provides an overview of the methods used in the fourstatistical software packages for topics discussed in this chapter. In this chapter, we rely heavily ongraphical methods: see Cook and Weisberg (1994) and Tufte (2001).

Each computer software package offers users information to help decide if their data are normally distributed. The packages provide convenient methods for transforming the data to achieve approximate normality. They also include some output for checking the independence of the observations. Hence the assumption of independent, normally distributed data that is made in many statistical tests can be assessed, at least approximately. Note that it has been shown that inference can be robust in many research settings even with highly non-normal data (see Lumley et al., 2002). Additionally, many investigators may try to discard the most obvious outliers prior to assessing normality because such outliers can grossly distort the distribution, but discarding outliers is generally not advised unless it is concluded that an error has occurred in the measurement, recording, or entry of these observations. Some researchers also consider removing inconsistent or extreme observations (see Osborne and Overbay, 2004), but such a decision depends heavily on the circumstances surrounding the research topic and should always be documented.
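As a rough illustration of how a transformation can move a distribution toward approximate normality, the sketch below compares a moment-based skewness before and after taking base-10 logarithms. It is only a sketch under stated assumptions: the data values are made up for illustration and do not come from any data set in this book, the helper name `sample_skewness` is hypothetical, and skewness is used here as just one informal indicator of asymmetry, not as the method any particular package applies.

```python
import math
import statistics


def sample_skewness(values):
    """Moment-based skewness: mean cubed deviation divided by the cubed
    population standard deviation. Values near 0 suggest symmetry."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


# Made-up, right-skewed values used purely for illustration.
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]

# Base-10 log transform (all values must be positive).
logged = [math.log10(v) for v in raw]

skew_raw = sample_skewness(raw)
skew_log = sample_skewness(logged)

print(f"skewness before log10: {skew_raw:.2f}")
print(f"skewness after  log10: {skew_log:.2f}")
```

For these illustrative values the skewness after the transformation is smaller than before, which is the kind of change one looks for when a long right tail is pulled in. A single summary number does not replace the graphical checks this chapter relies on.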

5.2 Common transformations

For exploratory analysis of data it may be useful to transform certain variables before performing the analyses. Examples are found in the next section and in Chapter 7. In this section we present some common transformations. If you are familiar with this subject, you may wish to skip to the next section.

To develop a feel for transformations, let us examine a plot of transformed values versus the original values of the variable. To begin with, a plot of values of a variable X against itself produces a 45° diagonal line going through the origin, as shown in Figure 5.1.
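The idea of plotting transformed values against original values can be sketched in a few lines. The code below is a minimal illustration, not the procedure used to draw the book's figures: the helper `transformed_pairs` and the sample values are hypothetical. It builds the (original, transformed) points for the identity transformation and checks numerically that they lie on a line of slope 1 through the origin, that is, at 45° to the horizontal axis.

```python
import math


def transformed_pairs(values, transform):
    """Return (original, transformed) points: the coordinates one would plot."""
    return [(x, transform(x)) for x in values]


# Illustrative values only.
xs = [0, 1, 2, 5, 10]

# Identity transformation: each value is plotted against itself.
identity_points = transformed_pairs(xs, lambda x: x)

# Slope between the first and last point, and the angle of that line
# relative to the horizontal axis.
(x0, y0), (x1, y1) = identity_points[0], identity_points[-1]
slope = (y1 - y0) / (x1 - x0)
angle_degrees = math.degrees(math.atan(slope))

print(identity_points)
print(f"slope = {slope}, angle = {angle_degrees} degrees")
```

Passing a different function as `transform` yields the points for any other transformation, so the shape of the resulting curve can be compared with this straight reference line.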

One of the most commonly performed transformations is taking the logarithm (log) to base 10. Recall that the logarithm Y of X is the number that satisfies the relationship $X = 10^{Y}$. That is, the logarithm of X is the power Y to which 10 must be raised in order to produce X. As shown in plot a of Figure 5.2, the logarithm of 10 is 1 since $10 = 10^{1}$. Similarly, the logarithm of 1 is 0 since $1 = 10^{0}$, and the logarithm of 100 is 2 since $100 = 10^{2}$. Other values of logarithms can be obtained from tables of common logarithms, from a hand calculator with a log function, or from statistical packages by
