
CHAPTER THREE: DESIGN, DATA COLLECTION AND DATA ANALYSIS

In Chapter Two we outlined the steps in the development and implementation of an evaluation. Another name for that chapter could be “the Soup to Nuts” of evaluation because of its broad-based coverage of issues. In this chapter we focus more closely on selected technical issues, the “Nuts and Bolts” of evaluation, issues that generally fall into the categories of design, data collection, and analysis.

In selecting these technical issues, we were guided by two priorities:

• We devoted most attention to topics relevant to quantitative evaluations because, as emphasized in the introduction, in order to be responsive to executive and congressional decisionmakers, NSF is usually required to furnish outcome information based on quantitative measurement.

• We have given the most extensive coverage to topics for which we have located few concise reference materials suitable for NSF/EHR project evaluators. But for all topics, we urge project staff who plan to undertake comprehensive evaluations to make use of the reference materials mentioned in this chapter and in the annotated bibliography.

The chapter is organized into four sections:

• How do you design an evaluation?

• How do you choose a specific data collection technique?

• What are some major concerns when collecting data?

• How do you analyze the data you have collected?

How Do You Design an Evaluation?

Once you have decided the goals for your study and the questions you want to address, it is time to design the study. What does this mean? According to Scriven (1991), design means:


“The process of stipulating the investigatory procedures to be followed in doing a certain evaluation.”

Designing an evaluation is one of those “good news — bad news” stories. The good news is that there are many different ways to develop a good design. The bad news is that there are many ways to develop bad designs. There is no formula or simple algorithm that can be relied upon in moving from questions to an actual study design. Thoughtful analysis, sensitivity, common sense, and creativity are all needed to make sure that the actual evaluation provides information that is useful and credible.

This section examines some issues to consider in developing designs that are both useful and methodologically sound. They are:

• Choosing an approach

• Selecting a sample

• Deciding how many times to measure

Choosing an Approach

Since there are no hard and fast rules about designing the study, how should the evaluator go about choosing the procedures to be followed? This is usually a two-step process. In step 1, the evaluator makes a judgment about the main purpose of the evaluation, and about the overall approach which will provide the best framework for this purpose. This judgment will lead to a decision whether the methodology will be essentially qualitative (relying on case studies, observations, and descriptive materials), whether it should rely on statistical analyses, or whether a combined approach would be best. Will control or comparison groups be part of the design? If so, how should these groups be selected?

While some evaluation experts feel that qualitative evaluations should not be treated as a technical, scientific process (Guba and Lincoln, 1989), others (for example, Yin, 1989) have adopted design strategies which satisfy rigorous scientific requirements. Conversely, competently executed quantitative studies will have qualitative components. The experienced evaluator will want to see a project in action and conduct observations and informational interviews before designing instruments for quantitative evaluation; he or she will also provide opportunities for “open-ended” responses and comments during data collection.

There is a useful discussion about choosing the general evaluation approach in Herman, Morris, and Fitz-Gibbon (1987), which concludes with the following observation:

“There is no single correct approach to all evaluation problems. The message is this: some will need a quantitative approach; some will need a qualitative approach; probably most will benefit from a combination of the two.”

In all cases, once fundamental design decisions have been made, the design task generally follows the same course in step 2. The evaluator:

• Lists the questions which were raised by stakeholders and classifies them as requiring an Implementation, Progress, or Summative Evaluation.

• Identifies procedures which might be used to answer these questions. Some of these procedures probably can be used to answer several questions; clearly, these will have priority.

• Looks at possible alternative methods, taking into account the strength of the findings yielded by each approach (quality) as well as practical considerations, especially time and cost constraints, staff availability, access to participants, etc.

An important consideration at this point is minimizing interference with project functioning: making as few demands as possible on project personnel and participants, and avoiding procedures which may be perceived as threatening or critical.

All in all, the evaluator will need to use a great deal of judgment in making choices and adjusting designs, and will seldom be in a position to fully implement textbook recommendations. Some of the examples detailed in Chapter Six illustrate this point.


When and How to Sample

It is sometimes assumed that an evaluation must include all of the persons who participate in a project. Thus in teacher enhancement programs, all teachers need to be surveyed or observed; in studies of instructional practices, all students need to be tested; and in studies of reform, all legislators need to be interviewed. This is not the case.

Sampling may be appropriate or even necessary for both qualitative and quantitative studies. For example, if a project is carried out in a large number of sites, the evaluator may decide to carry out a qualitative study in only one or a few of them. When planning a survey of project participants, the investigator may decide to sample the participant population if it is large. Of course, if the project involves few participants, sampling is unnecessary and inappropriate.

For qualitative studies, purposeful sampling is often most appropriate. Purposeful sampling means that the evaluator will seek out the case or cases which are most likely to provide maximum information, rather than a "typical" or "representative" case. The goal of the qualitative evaluation is to obtain rich, in-depth information, rather than information from which generalizations about the entire project can be derived. For the latter goal, a quantitative evaluation is needed.

For quantitative studies, some form of random sampling is the appropriate method. The easiest way of drawing random samples is to use a list of participants (or teachers, or classrooms, or sites) and select every 2nd or 5th or 10th name, depending on the size of the population and the desired sample size. A stratified sample may be drawn to ensure sufficient numbers of rare units (for example, minority members, or schools serving low-income students).
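As an illustration of these two selection methods, the sketch below draws a systematic sample (every kth name from a list) and a stratified sample from a hypothetical participant roster. The roster, the stratum variable, and the sampling fraction are invented for the example; they are not taken from the handbook.

```python
import random

# Hypothetical roster of 400 participants; each record notes whether the
# participant attends a low-income school (the stratum of interest).
roster = [{"id": i, "low_income_school": i % 5 == 0} for i in range(1, 401)]

def systematic_sample(units, k, start=None):
    """Select every k-th unit from the list, beginning at a random start."""
    if start is None:
        start = random.randrange(k)
    return units[start::k]

def stratified_sample(units, key, fraction):
    """Draw a simple random sample of the given fraction within each stratum."""
    strata = {}
    for unit in units:
        strata.setdefault(key(unit), []).append(unit)
    sample = []
    for members in strata.values():
        n = max(1, round(len(members) * fraction))
        sample.extend(random.sample(members, n))
    return sample

every_tenth = systematic_sample(roster, k=10)    # 40 participants
by_stratum = stratified_sample(roster, key=lambda u: u["low_income_school"],
                               fraction=0.10)    # 10% from each stratum
print(len(every_tenth), len(by_stratum))
```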

The most common misconception about sampling is that large samples are the best way of obtaining accurate findings. While it is true that larger samples will reduce sampling error (the probability that if another sample of the same size were drawn, different results might be obtained), sampling error is the smallest of the three components of error which affect the soundness of sample designs. Two other errors—sample bias (primarily due to loss of sample units) and response bias (responses or observations which do not reflect “true” behavior, characteristics, or attitudes)—are much more likely to jeopardize validity of findings (Sudman, 1976). When planning allocation of resources, evaluators should give priority to procedures which will reduce sample bias and response bias, rather than to the selection of larger samples.

Let’s talk a little more about sample and response bias. Sample bias occurs most often because of non-response (selected respondents or units are not available or refuse to participate, or some answers and observations are incomplete). Response bias occurs because questions are misunderstood or poorly formulated, or because respondents deliberately equivocate (for example, to protect the project being evaluated). In observations, the observer may misinterpret or miss what is happening. Exhibit 4 describes each type of bias and suggests some simple ways of minimizing them.

Exhibit 4
Three Types of Errors and Their Remedies

Type: Sampling Error
Cause: Using a sample, not the entire population to be studied.
Remedies: Larger samples—these reduce but do not eliminate sampling error.

Type: Sample Bias
Cause: Some of those selected to participate did not do so or provided incomplete information.
Remedies: Repeated attempts to reach non-respondents. Prompt and careful editing of completed instruments to obtain missing data; comparison of characteristics of non-respondents with those of respondents to describe any suspected differences that may exist.

Type: Response Bias
Cause: Responses do not reflect "true" opinions or behaviors because questions were misunderstood or respondents chose not to tell the truth.
Remedies: Careful pretesting of instruments to revise misunderstood, leading, or threatening questions. No remedy exists for deliberate equivocation in self-administered interviews, but it can be spotted by careful editing. In personal interviews, this bias can be reduced by a skilled interviewer.


Determining an adequate sample size sounds threatening, but is not as difficult as it might seem to be at first. Statisticians have computed recommended sample sizes for various populations (see Fitz-Gibbon and Morris, 1987). For practical purposes, however, in project evaluations, sample size is primarily determined by available resources, by the planned analyses, and by the need for credibility.
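To show where such recommended sample sizes come from, here is a minimal sketch using the standard formula for estimating a proportion within a given margin of error, with a finite-population correction. The 95 percent confidence level, the 5 percent margin of error, and the population of 1,200 are arbitrary illustrative choices, not recommendations from the handbook.

```python
import math

def sample_size(population, margin=0.05, confidence_z=1.96, p=0.5):
    """Approximate sample size for estimating a proportion.

    Uses n0 = z^2 * p(1-p) / e^2, then adjusts for a finite population:
    n = n0 / (1 + (n0 - 1) / N).  p = 0.5 gives the most conservative n.
    """
    n0 = (confidence_z ** 2) * p * (1 - p) / margin ** 2
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

# Example: a project with 1,200 participants, +/- 5% at 95% confidence.
print(sample_size(1200))   # -> 292 for this example
```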

In making sampling decisions, the overriding consideration is that the actual selection must be done by random methods, which usually means selecting every nth case from listings of units (students, instructors, classrooms). Sudman (1976) emphasizes that there are many scientifically sound sampling methods which can be tailored to all budgets:

“In far too many cases, researchers are aware that powerful sampling methods are available, but believe they cannot use them because these methods are too difficult and expensive. Instead incredibly sloppy ad hoc procedures are invented, often with disastrous results.”

Deciding How Many Times to Measure

For all types of evaluations (Implementation, Progress, and Summative), the evaluator must decide the frequency of data collection and the method to be used if multiple observations are needed.

For many purposes, it will be sufficient to collect data at one point in time; for others, one-time data collection may not be adequate. Implementation Evaluations may utilize either multiple or one-time data collections, depending on the length of the project and any problems that may be uncovered along the way. For Summative Evaluations, a one-time data collection may be adequate to answer some evaluation questions: How many students enrolled in the project? How many were persisters versus dropouts? What were the most popular project activities? Usually, such data can be obtained from records. But impact measures are almost always measures of change. Has the project resulted in higher test scores? Have teachers adopted different teaching styles? Have students become more interested in considering science-related careers? In each of these cases, at a minimum two observations are needed: baseline (at project initiation) and at a later point, when the project has been operational long enough for possible change to occur.

Quantitative studies using data collected from the same population at different points in time are called longitudinal studies. They often present a dilemma for the evaluator. Conventional wisdom suggests that the correct way to measure change is the “panel method,” by which data are obtained from the same individuals (students, teachers, parents, etc.) at different points in time. While longitudinal designs which require interviewing the same students or observing the same teachers at several points in time are best, they are often difficult and expensive to carry out because students move, teachers are re-assigned, and testing programs are changed. Furthermore, loss of respondents due to failure to locate or to obtain cooperation from some segment of the original sample is often a major problem. Depending on the nature of the evaluation, it may be possible to obtain good results with successive cross-sectional designs, which means drawing new samples for successive data collections from the treatment population. (See Love, 1991 for a fuller discussion of logistics problems in longitudinal designs.)

For example, to evaluate the impact of a program of field trips to museums and science centers for 300 high school students, baseline interviews can be conducted with a random sample of 100 students before the project start. Interviewing another random sample of 100 students after the project has been operational for one year is an acceptable technique for measuring project effectiveness, provided that at both times samples were randomly selected to adequately represent the entire group of students involved in the project. In other cases, this may be impossible.
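A minimal sketch of this successive cross-sectional design, under invented assumptions: a roster of the 300 participating students, a fresh random sample of 100 drawn at baseline and again a year later, and made-up interest scores for each wave. Only the sampling logic reflects the text; the numbers are illustrative.

```python
import random
import statistics

random.seed(1)   # for a reproducible illustration

# Hypothetical roster of the 300 participating students.
roster = list(range(1, 301))

# Wave 1: baseline sample of 100, drawn before the project starts.
baseline_ids = random.sample(roster, 100)

# Wave 2: a fresh, independent sample of 100 drawn a year later.
followup_ids = random.sample(roster, 100)

# Invented interest-in-science scores for each wave (1-5 scale).
baseline_scores = [random.gauss(3.0, 0.8) for _ in baseline_ids]
followup_scores = [random.gauss(3.4, 0.8) for _ in followup_ids]

print("Baseline mean:", round(statistics.mean(baseline_scores), 2))
print("Follow-up mean:", round(statistics.mean(followup_scores), 2))
```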

Designs that involve repeated data collection usually require that the data be collected using identical survey instruments at all times. Changing question wording or formats or observation schedules between time 1 and time 2 impairs the validity of the time comparison. At times, evaluators find after the first round of data collection that their instruments would be improved by making some changes, but they do so at the risk of not being able to use altered items for measuring change. Depending on the particular circumstances, it may be difficult to sort out whether a changed response is a treatment effect or the effect of the modified wording. There is no hard and fast rule for deciding when changes should or should not be made; in the end, technical concerns must be balanced with common sense.

How Do You Choose a Specific Data Collection Technique?

In Chapter Two we provided an overview of ways in which evaluators can go about collecting data. As shown in that chapter, there are many different ways to go about answering the same questions. However, the great majority of evaluation designs for projects supported by NSF/EHR rely at least in part on quantitative methods using one or several of the following techniques:

• Surveys based on self-administered questionnaires or interviewer-administered instruments

• Focus groups

• Results from tests given to students

• Observations (most often carried out in classrooms)

• Review of records and data bases (not created primarily for the evaluation needs of the project).

The discussion in this section focuses on these techniques. Evaluators who are interested in using techniques not discussed here (for example, designs using unobtrusive measures or videotaped observations) will find relevant information in some of the reference books cited in the bibliography.

Surveys

Surveys are a popular tool for project evaluation. They are especially useful for obtaining information about opinions and attitudes of participants or other relevant informants, but they are also useful for the collection of descriptive data, for example personal and background characteristics (race, gender, socio-economic status) of participants. Survey findings usually lend themselves to quantitative analysis; as in opinion polls, the results can be expressed in easily understood percentages or means. As compared to some other data collection methods (for example, in-depth interviews or observations), surveys usually provide wider ranging but less detailed data, and some data may be biased if respondents are not truthful. However, much has been learned in recent years about improving survey quality and coverage, and compared to more intensive methods, surveys are relatively inexpensive and easier to analyze using statistical software.

The cheapest surveys are self-administered: a questionnaire is distributed (in person or by mail) to eligible respondents. Relatively short and simple questionnaires lend themselves best to this treatment. The main problem is usually non-response: persons not present when the questionnaire is distributed are often excluded, and mail questionnaires will yield relatively low response rates unless a great deal of careful preparation and follow-up work is done.

When answers to more numerous and more complex questions are needed, it is best to avoid self-administered questionnaires and to employ interviewers to ask questions either in a face-to-face situation or over the telephone. Survey researchers often differentiate between questionnaires, where a series of precisely worded questions are asked, and interviews, which are usually more open-ended, based on an interview guide or protocol, and yield richer and often more interesting data. The trade-off is that interviews take longer, are best done face-to-face, and yield data which are often difficult to analyze. A good compromise is a structured questionnaire which provides some opportunity for open-ended answers and comments.

The choice between telephone and personal interviews depends largely on the nature of the projects being evaluated and the characteristics of respondents. For example, as a rule children should be interviewed in person, as should respondents who do not speak English, even if the interview is conducted by a bilingual interviewer.

Creating a good questionnaire or interview instrument requires considerable knowledge and skill. Question wording and sequencing are very important in obtaining valid results, as shown by many studies. For a fuller discussion, see Fowler (1993, ch. 6) and Love (1991, ch. 2).

Focus Groups

Focus groups have become an increasingly popular information gathering technique. Prior to designing survey instruments, a number of persons from the population to be surveyed are brought together to discuss, with the help of a leader, the topics which are relevant to the evaluation and should be included in developing questionnaires. Terminology, comprehension, and recall problems will surface, which should be taken into account when questionnaires or interview guides are constructed. This is the main role for focus groups in Summative Evaluations. However, there may be a more substantive role for focus groups in Progress Evaluations, which are more descriptive in nature and often do not rely on statistical analyses. (See Stewart and Shamdasani, 1990 for a full discussion of focus groups.)

The usefulness of focus groups depends heavily on the skills of the moderator, the method of participant selection, and, last but not least, the understanding of evaluators that focus groups are essentially an exercise in group dynamics. Their popularity is high because they are a relatively inexpensive and quick information tool, but while they are very helpful in the survey design phase, they are no substitute for systematic evaluation procedures.

Test Scores

Many evaluators and program managers feel that if a project has been funded to improve the academic skills of students so that they are prepared to enter scientific and technical occupations, improvements in test scores are the best indicator of a project’s success. Test scores are often considered “hard” and therefore presumably objective data, more valid than other types of measurements such as opinion and attitude data, or grades obtained by students. But these views are not unanimous, since some students and adults are poor test-takers, and because some tests are poorly designed and measure the skills of some groups, especially White males, better than those of women and minorities.

Until recently, most achievement tests were either norm-referenced (measuring how a given student performed compared to a previously tested population) or criterion-referenced (measuring if a student had mastered specific instructional objectives and thus acquired specific knowledge and skills). Most school systems use these types of tests, and it has frequently been possible for evaluators to use data routinely collected in the schools as the basis for their summative studies.


Because of the many criticisms which have been directed at tests currently in use, there is now a great deal of interest in making radical changes. Experiments with performance assessment are under way in many states and communities. Performance tests are designed to measure problem-solving behaviors, rather than factual knowledge. Instead of answering true/false or multiple-choice questions, students are asked to solve more complex problems and to explain how they go about arriving at answers and solving these problems. Testing may involve group as well as individual activities, and may appear more like a project than a traditional “test.” While many educators and researchers are enthusiastic about these new assessments, it is not likely that valid and inexpensive versions of these tests will be ready for widespread use in the near future.

A good source of information about test vendors and the use of currently available tests in evaluation is Morris, Fitz-Gibbon, and Lindheim (1987). An extensive discussion of performance-based assessment by Linn, Baker, and Dunbar can be found in Educational Researcher (Nov. 1991).

Whatever type of test is used, there are two critical questions that must be considered before selecting a test and using its results:

• Is there a match between what the test measures and what the project intends to teach? If a science curriculum is oriented toward teaching process skills, does the test measure these skills or more concrete scientific facts?

• Has the program been in place long enough for there to be an impact on test scores? With most projects, there is a start-up period during which the intervention is not fully in place. Looking for test score improvements before a project is fully established can lead to erroneous conclusions.

A final note on testing and test selection. Evaluators may be tempted to develop their own test instruments rather than relying on ones that exist. While this may at times be the best choice, it is not an option to be undertaken lightly. Test development is more than writing down a series of questions, and there are some strict standards formulated by the American Psychological Association that need to be met in developing instruments that will be credible in an evaluation. If at all possible, use of a reliable and validated, established test is best.

Observations

Surveys and tests can provide good measurements of the opinions, attitudes, skills, and knowledge of individuals; surveys can also provide information about individual behavior (how often do you go to your local library? what did you eat for breakfast this morning?), but behavioral information is often inaccurate due to faulty recall or the desire to present oneself in a favorable light. When it comes to measuring group behavior (did most children ask questions during the science lesson? did they work cooperatively? at which museum exhibits did the students spend most of their time?), systematic observations are the best method for obtaining good data.

Evaluation experts distinguish between three observation procedures: (1) systematic observations, (2) anecdotal records (semi-structured), and (3) observation by experts (unstructured). For NSF/EHR project evaluations, the first and second are most frequently used, with the second used as a planning step for the development of systematic observation instruments.

Procedure one yields quantitative information, which can be analyzed by statistical methods. To carry out such quantifiable observations, subject-specific instruments will need to be created by the evaluator to fit the specific evaluation. A good source of information about observation procedures, including suggestions for instrument development, can be found in Henerson, Morris, and Fitz-Gibbon (1987, ch. 9).

The main disadvantage of the observation technique is that behaviors may change when observed. This may be especially true when it comes to teachers and others who feel that the observation is in effect carried out for the purpose of evaluating their performance, rather than the project’s general functioning. But behavior changes for other reasons as well, as noted a long time ago when the “Hawthorne effect” was first reported. Techniques have been developed to deal with the biasing effect of the presence of observers: for example, studies have used participant observers, but such techniques can only be used if the study does not call for systematically recording observations as events occur. Another possible drawback is that, perhaps more than any other data collection method, the observation method is heavily dependent on the training and skills of data collectors. This topic is more fully discussed later in this chapter.

Review of Records and Data Bases

Most agencies and funded projects maintain systematic records of some kind about the population they serve and the services they provide, but the extent of available information and its accessibility differ widely. The existence of a comprehensive Management Information System or data base is of enormous help in answering certain evaluation questions which, in the absence of such a system, may require special surveys. For example, simply by looking at personal characteristics of project participants, such as sex, ethnicity, family status, etc., evaluators can judge the extent to which the project recruited the target populations described in the project application. As mentioned earlier, detailed project records will greatly facilitate the drawing of samples for various evaluation procedures. Project records can also identify problem situations or events (for example, exceptionally high drop-out rates at one site of a multi-site project, or high staff turnover) which might point the evaluator in new directions.

Existing data bases which were originally set up for other purposes can also play a very important role in conducting evaluations. For example, if the project involves students enrolled in public or private institutions which keep comprehensive and/or computerized files, this would greatly facilitate the selection of “matched” control or comparison groups for complex outcome designs. However, gaining access to such information may at times be difficult because of rules designed to protect data confidentiality.
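As one hedged illustration of how institutional files might support the selection of a matched comparison group, the sketch below performs simple exact matching on two characteristics (grade and gender). Real matching procedures are usually more elaborate, and the student records shown are invented.

```python
# Hypothetical student records, as they might come from a school database.
participants = [
    {"id": 1, "grade": 9, "gender": "F"},
    {"id": 2, "grade": 10, "gender": "M"},
]
non_participants = [
    {"id": 101, "grade": 9, "gender": "F"},
    {"id": 102, "grade": 9, "gender": "M"},
    {"id": 103, "grade": 10, "gender": "M"},
]

def find_match(student, pool, used):
    """Return the first unused non-participant with the same grade and gender."""
    for candidate in pool:
        if (candidate["id"] not in used
                and candidate["grade"] == student["grade"]
                and candidate["gender"] == student["gender"]):
            return candidate
    return None   # no exact match available

used = set()
comparison_pairs = []
for student in participants:
    match = find_match(student, non_participants, used)
    if match:
        used.add(match["id"])
        comparison_pairs.append((student["id"], match["id"]))

print(comparison_pairs)   # [(1, 101), (2, 103)]
```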

Exhibit 5 summarizes the advantages and drawbacks of the various data collection procedures.

Exhibit 5
Advantages and Drawbacks of Various Data Collection Procedures

Procedure: Self-administered questionnaire
Advantages: Inexpensive. Can be quickly administered if distributed to a group. Well suited for simple and short questionnaires.
Disadvantages: No control for misunderstood questions, missing data, or untruthful responses. Not suited for exploration of complex issues.

Procedure: Interviewer-administered questionnaires (by telephone)
Advantages: Relatively inexpensive. Avoids sending staff to unsafe neighborhoods or difficulties gaining access to buildings with security arrangements. Best suited for relatively short and non-sensitive topics.
Disadvantages: Proportion of respondents without a private telephone may be high in some populations. As a rule not suitable for children, older people, and non-English speaking persons. Not suitable for lengthy questionnaires and sensitive topics. Respondents may lack privacy.

Procedure: Interviewer-administered questionnaires (in person)
Advantages: Interviewer controls situation, can probe irrelevant or evasive answers; with good rapport, may obtain useful open-ended comments.
Disadvantages: Expensive. May present logistics problems (time, place, privacy, access, safety). Often requires lengthy data collection period unless project employs large interviewer staff.

Procedure: Open-ended interviews (in person)
Advantages: Usually yields richest data, details, new insights. Best if in-depth information is wanted.
Disadvantages: Same as above (interviewer-administered questionnaires in person); also often difficult to analyze.

Procedure: Focus groups
Advantages: Useful to gather ideas, different viewpoints, new insights; improving question design.
Disadvantages: Not suited for generalizations about the population being studied.

Procedure: Tests
Advantages: Provide "hard" data which administrators and funding agencies often prefer; relatively easy to administer; good instruments may be available from vendors.
Disadvantages: Available instruments may be unsuitable for treatment population; developing and validating new, project-specific tests may be expensive and time consuming. Objections may be raised because of test unfairness or bias.

Procedure: Observations
Advantages: If well executed, best for obtaining data about behavior of individuals and groups.
Disadvantages: Usually expensive. Needs well qualified staff. Observation may affect behavior being studied.

What are Some Major Concerns When Collecting Data?

It is not possible to discuss in one brief chapter the nitty-gritty of all data collection procedures. The reader will want to consult one or more of the texts recommended in the bibliography before attacking any one specific task. Before concluding this chapter, however, we want to address two issues which affect all data collections and deserve special mention here: the selection, training, and supervision of data collectors, and the pretesting of evaluation instruments.

Selection, Training and Supervision of Data Collectors

Selection

All too often, project administrators, and even evaluators, believe that anybody can be a data collector and typically base the selection on convenience factors: an available research assistant, an instructor or clerk willing to work overtime, college students available for part-time or sporadic work assignments. All of these may be suitable candidates, but it is unlikely that they will be right for all data collection tasks.

Most data collection assignments fall into one of three categories:

• Clerical tasks (abstracting records, compiling data from existing lists or data bases, keeping track of self-administered surveys)

• Personal interviewing (face-to-face or by telephone) and test administration

• Observing and recording observations.

There are some common requirements for the successful completion of all of these tasks: a good understanding of the project, and the ability and discipline to follow instructions consistently and to give punctilious and detailed attention to all aspects of the data collection. Equally important is lack of bias, and lack of vested interest in the outcome of the evaluation. For this reason, as previously mentioned (Chapter Two), it is usually unwise to use volunteers or regular project staff as data collectors.

Interviewers need additional qualities: a pleasant voice, a tactful personal manner, and the ability to establish rapport with respondents. For some data collections, it may be advisable to attempt a match between interviewer and respondent (for example, with respect to ethnicity or age). Fluency in a language other than English (usually Spanish) may also be needed; in this case it is important that the interviewer be bilingual, with U.S. work experience, so that instructions and expected performance standards are well understood.

Observers need to be highly skilled and competent professionals. Although they too will need to follow instructions and complete structured schedules, it is often important that they alert the evaluator to unanticipated developments. Depending on the nature of the evaluation, their role in generating information may be crucial: often they are the eyes and ears of the evaluator. They should also be familiar with the setting in which the observations take place, so that they know what to look for. For example, teachers (or former teachers or aides) can make good classroom observers, although they should not be used in schools with which they are or were affiliated.

Training

In all cases, sufficient time must be allocated to training. Training sessions should include performing the actual task (extracting information from a data base, conducting an interview, performing an observation). Training techniques might include role-playing (for interviews) or comparing recorded observations of the same event by different observers. When the project enters a new phase (for example, when a second round of data collection starts), it is usually advisable to schedule another training session and to check inter-rater reliability again.
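One simple way to check inter-rater reliability when two observers code the same event is percent agreement across the items of the observation schedule, sketched below with invented codes. (Chance-corrected statistics such as Cohen's kappa are a common refinement, but they are not required by the handbook.)

```python
# Codes recorded by two observers for the same lesson, item by item
# (invented data; in practice these come from the observation schedule).
observer_a = ["on-task", "on-task", "off-task", "on-task", "question"]
observer_b = ["on-task", "off-task", "off-task", "on-task", "question"]

def percent_agreement(codes_a, codes_b):
    """Share of items on which the two observers recorded the same code."""
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)

print(f"Agreement: {percent_agreement(observer_a, observer_b):.0%}")   # 80%
```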

If funds and technical resources are available, other techniques (for example, videotaping of personal interviews or recording of telephone interviews) can also be used for training and quality control after permission has been obtained from participants.

Supervision

Only constant supervision will ensure quality control of the data collection. The biggest problem is not cheating by interviewers or observers (although this can never be ruled out), but gradual burnout: more transcription errors, more missing data, fewer probes or follow-ups, fewer open-ended comments on observation schedules.

The project evaluator should not wait to review completed work until the end of the data collection, but should do so at least once a week. See Fowler (1991) and Henerson, Morris, and Fitz-Gibbon (1987) for further suggestions on interviewer and observer recruitment and training.

Pretest of Instruments

When the evaluator is satisfied with the instruments designed for the evaluation, and before starting any data collection in the field, all instruments should be pre-tested to see if they work well under field conditions. The pre-test also reveals if questions are understood by respondents and if they capture the information sought by the evaluator. Pre-testing is a step that many evaluators “skip” because of time pressures. However, as has been shown many times, they may do so at their own peril. The time taken up front to pre-test instruments can result in enormous savings in time (and misery) later on.

The usual procedure consists of using instruments with a small number of cases (for example, abstracting data from 10 records, asking 10-20 project participants to fill out questionnaires, conducting interviews with 5 to 10 subjects, or completing half a dozen classroom observations). Some of the shortcomings of the instruments will be obvious as the completed forms are reviewed, but most important is a debriefing session with data collectors and in some instances with the respondents themselves, so that they can recommend to the evaluator possible modifications of procedures and instruments. It is especially important to pre-test self-administered instruments, where the respondent cannot ask an interviewer for help in understanding questions. Such pre-tests are best done by bringing together a group of respondents, asking them first to complete the questionnaire, and then leading a discussion about clarity of instructions and understanding of the questions and expected answers.

Data Analysis: Qualitative Data

Analyzing the plethora of data yielded by comprehensive qualitative evaluations is a difficult task, and there are many instances of failure to fully analyze the results of long and costly data collections. While lengthy descriptive case studies are extremely useful in furthering the understanding of social phenomena and the implementation and functioning of innovative projects, they are ill-suited to outcome evaluation studies for program managers and funding agencies. However, more recently, methods have been devised to classify qualitative findings through the use of a special software program (Ethnograph) and diverse thematic codes. This approach may enable investigators to analyze qualitative data quantitatively without sacrificing the richness and character of qualitative analysis. Content analysis, which can be used for the analysis of unstructured verbal data, is another available technique for dealing quantitatively with qualitative data. Other approaches, including some which also seek to quantify the descriptive elements of case studies, and others which address issues of validation and verification, also suggest that the gap between qualitative and quantitative analyses is narrowing. Specific techniques for the analysis of qualitative data can be found in some of the texts referenced at the end of this chapter.

Data Analysis: Quantitative Data

In Chapter Two, we outlined the major steps required for the analysis of quantitative data:

• Check the raw data and prepare data for analysis

• Conduct initial analysis based on evaluation plan

• Conduct additional analyses based on initial results

• Integrate and synthesize findings.

In this chapter, we provide some additional advice on carrying out these steps.

Check the Raw Data and Prepare Data for Analysis

In almost all instances, the evaluator will conduct the data analysis with the help of a computer. Even if the number of cases is small, the volume of data collected and the need for accuracy, together with the availability of PCs and user-friendly software, make it unlikely that evaluators will do without computer assistance.

The process of preparing data for computer analysis involves data checking, data reduction, and data cleaning.

Data checking can be done as a first step by visual inspection of the raw data; this check may turn up responses which are out-of-line, unlikely, or inconsistent, or suggest that a respondent answered questions mechanically (for example, always chose the third response category in a self-administered questionnaire).
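A minimal sketch of how such checks might be automated for a small questionnaire file. The item names, the 1-5 valid range, and the records themselves are invented; the two flags simply illustrate out-of-range values and mechanical ("straight-line") answering.

```python
# Invented raw survey records: five items, each expected to be 1-5.
records = [
    {"id": 1, "q1": 4, "q2": 5, "q3": 3, "q4": 4, "q5": 2},
    {"id": 2, "q1": 3, "q2": 3, "q3": 3, "q4": 3, "q5": 3},   # mechanical answers?
    {"id": 3, "q1": 6, "q2": 2, "q3": 1, "q4": 2, "q5": 9},   # out of range
]
ITEMS = ["q1", "q2", "q3", "q4", "q5"]
VALID = range(1, 6)

for rec in records:
    answers = [rec[item] for item in ITEMS]
    if any(a not in VALID for a in answers):
        print(f"Case {rec['id']}: response outside the 1-5 range")
    if len(set(answers)) == 1:
        print(f"Case {rec['id']}: identical answer to every item -- review")
```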

Data reduction consists of the following steps:

• Deciding on a file format. (This is usually determined by the software to be used.)

• Designing codes (the categories used to classify the data so that they can be processed by machine) and coding the data. If instruments are "pre-coded," for example if respondents were asked to select an item from a checklist, coding is not necessary. It is needed for "open-ended" answers and comments by respondents and observers.

• Data entry (keying the data onto tapes or disks so that the computer can read them).

Many quality control procedures for coding open-ended data and data entry have been devised. They include careful training of coders, frequent checking of their work, and verification of data entry by a second clerk.

Data cleaning consists of a final check on the data file for accuracy, completeness, and consistency. At this point, coding and keying errors will be detected. (For a fuller discussion of data preparation procedures, see Fowler, 1991.)

If these data preparation procedures have been carefully carried out, chances are good that the data sets will be error-free from a technical standpoint and that the evaluator will have avoided the “GIGO” (garbage in, garbage out) problem, which is far from uncommon in analyses based on computer output.

Conduct Initial Analysis Based on the Evaluation Plan

The evaluator is now ready to start generating information which will answer the evaluation questions. To do so, it is usually necessary to deal with statistical concepts and measurements, a prospect which some evaluators or principal investigators may find terrifying. In fact, much can be learned from fairly uncomplicated techniques easily mastered by persons without a strong background in mathematics or statistics. Many evaluation questions can be answered through the use of descriptive statistical measures, such as frequency distributions (how many cases fall into a given category) and measures of central tendency (such as the mean or median, which are statistical measures that seek to locate the "average" or the center of a distribution).

For frequency distributions, the question is most often a matter of presenting such data in the most useful form for project managers and stakeholders. Often the evaluator will look at detailed distributions and then decide on a summary presentation, using tables or graphics. An example is the best way of illustrating these various issues.

Let us assume that a project had recruited 200 high school students to meet with a mentor once a week over a one-year period. One of the evaluation questions was: "How long did the original participants remain in the program?" Let us also assume that the data were entered in weeks. If we ask the computer to give us a frequency distribution, we get a long list (if every week at least one participant dropped out, we may end up with 52 entries for 200 cases). Eyeballing this unwieldy table, the evaluator noticed several interesting features: only 50 participants (one-fourth of the total) stayed for the entire length of the program; a few people never showed up or stayed only for one session. To answer the evaluation question in a meaningful way, the evaluator decided to ask the computer to group the data into a shorter table, as follows:

Length of Participation

Time                  No. of Participants
1 week or less                 10
2-15 weeks                     30
16-30 weeks                    66
31-51 weeks                    44
52 weeks                       50
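The same grouping can be produced by a short program instead of by hand. In the sketch below the raw list of weeks is invented to reproduce the grouped counts above, since only the grouped figures appear in the text.

```python
from collections import Counter

# Invented raw data: weeks of participation for each of the 200 students.
weeks = [1] * 10 + [8] * 30 + [22] * 66 + [40] * 44 + [52] * 50

def group(week):
    """Assign a raw number of weeks to one of the reporting categories."""
    if week <= 1:
        return "1 week or less"
    if week <= 15:
        return "2-15 weeks"
    if week <= 30:
        return "16-30 weeks"
    if week <= 51:
        return "31-51 weeks"
    return "52 weeks"

table = Counter(group(w) for w in weeks)
for category in ["1 week or less", "2-15 weeks", "16-30 weeks",
                 "31-51 weeks", "52 weeks"]:
    print(f"{category:<15} {table[category]:>4}")
```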

A bar chart might be another way of presenting these data, as shown in Exhibit 6.

Exhibit 6 (bar chart): Length of Participation in Mentoring Project (200 High School Students); number of participants in each length-of-participation category.

Let us now assume that the evaluator would like a single figure which would provide some indication of the length of time during which participants remained in the project. There are three measures of central tendency which provide this answer: the mean (or arithmetic average), the median (the point at which half the cases fall below and half above), and the mode, which is the category with the largest number of cases. Each of these requires that the data meet specific conditions, and each has advantages and drawbacks. (See glossary for details.)

In the above example, the only way of computing the mean, median, and mode exactly would be from the raw data, prior to grouping the data as shown in Exhibit 6. However, to simplify the discussion we will just deal with the mean and median (usually the most meaningful measures for evaluation purposes), which can be computed from grouped data. The mean would be slightly above 30 weeks; the median would be slightly above 28 weeks. The mean is higher because of the impact of the last two categories (31-51 weeks and 52 weeks). Both measures are "correct," but they tell slightly different things about the length of time participants remained in the project: the average was 30 weeks, which may be a useful figure for estimating future project costs; half of all participants stayed for 28 weeks or less, which may be a useful figure for deciding how to time retention efforts. Exhibit 7 illustrates differences in the relative position of the median, mean, and mode depending on the nature of the data, such as a positively skewed distribution of test scores (more test scores at the lower end of the distribution) and a negatively skewed distribution (more scores at the higher end).

Exhibit 7 (figure): Relationship of Central Tendency Measures in Skewed Score Distributions. Source: Jaeger, R. M. (1990). Statistics—A Spectator Sport, pp. 42-43. Newbury Park, CA: Sage.

In many evaluation studies, the median is the preferred measure of central tendency because, for most analyses, it describes the distribution of the data better than the mode or the mean. For a useful discussion of these issues, see Jaeger (1990).
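For readers who want to reproduce the grouped-data calculation, the sketch below computes an approximate mean (weighted by group midpoints) and an approximate median (by interpolation within the group containing the middle case). The midpoints and class boundaries are assumptions, so the results only approximate the figures quoted above, which were computed from the raw data.

```python
# Grouped frequencies from the participation table, with assumed boundaries
# and midpoints; the handbook's figures come from the raw data, so the
# values computed here are only approximations.
groups = [
    # (lower boundary, upper boundary, midpoint, count)
    (0.0, 1.5, 1.0, 10),
    (1.5, 15.5, 8.5, 30),
    (15.5, 30.5, 23.0, 66),
    (30.5, 51.5, 41.0, 44),
    (51.5, 52.5, 52.0, 50),
]
n = sum(count for *_, count in groups)

# Weighted mean of the group midpoints.
mean = sum(mid * count for _, _, mid, count in groups) / n

# Median by linear interpolation within the group containing case n/2.
cumulative = 0
for lower, upper, _, count in groups:
    if cumulative + count >= n / 2:
        median = lower + (n / 2 - cumulative) / count * (upper - lower)
        break
    cumulative += count

print(round(mean, 1), round(median, 1))   # roughly 31 and 29 weeks
```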

Conduct Additional Analyses Based on the Initial Results

The initial analysis may give the evaluator a good feel for project operations, levels of participation, project activities, and the opinions and attitudes of participants, staff, and others involved in the project, but it often raises new questions. These may be answered by additional analyses which examine the findings in greater detail. For example, as discussed in some of the earlier examples, it may be of interest to compare more and less experienced teachers' assessment of the effectiveness of new teaching materials, or to compare the opinions of men and women who participated in a mentoring program. Or it might be useful to compare the opinions of women who had female mentors with those of women who had male mentors. These more detailed analyses are often based on cross-tabulations, which, unlike frequency distributions, deal with more than one variable. If, in the earlier example about length of participation in mentoring programs, the evaluator wants to compare men and women, the cross-tabulation would look as follows:

Length of Participation by Sex

Time                  All Students    Men    Women
1 week or less              10          10       0
2-15 weeks                  30          20      10
16-30 weeks                 66          40      26
31-51 weeks                 44          10      34
52 weeks                    50          20      30
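A minimal sketch of building such a cross-tabulation by program: person-level records (invented here to match the counts in the table) are tallied into cells by participation category and sex.

```python
from collections import Counter

# Invented person-level records; in practice these come from the project file.
# Each record: (length-of-participation category, sex).
records = (
    [("1 week or less", "M")] * 10 +
    [("2-15 weeks", "M")] * 20 + [("2-15 weeks", "F")] * 10 +
    [("16-30 weeks", "M")] * 40 + [("16-30 weeks", "F")] * 26 +
    [("31-51 weeks", "M")] * 10 + [("31-51 weeks", "F")] * 34 +
    [("52 weeks", "M")] * 20 + [("52 weeks", "F")] * 30
)

crosstab = Counter(records)   # count of cases in each (category, sex) cell

categories = ["1 week or less", "2-15 weeks", "16-30 weeks",
              "31-51 weeks", "52 weeks"]
print(f"{'Time':<15} {'Men':>5} {'Women':>7}")
for cat in categories:
    print(f"{cat:<15} {crosstab[(cat, 'M')]:>5} {crosstab[(cat, 'F')]:>7}")
```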

Exhibit 8, a bar graph, is a better way of showing the same data. Because the table and graph show that on the whole women dropped out later than men, but that most of them also did not complete the entire program, the evaluator may want to re-group the data, for example break down the 31-51 week group further to see if most women stayed close to the end of the program.

Cross-tabulations are a convenient technique for examining several variables simultaneously; however, they are often inappropriate because sub-groups become too small. One rule of thumb is that a minimum of 20 cases are needed in each subgroup for analysis and for the use of statistical tests to judge the extent to which observed differences are "real" or due to sampling error. In the above example, it might have been of interest to look further at men and women in different ethnic groups (African American men, African American women, White men, and White women), but among the 200 participants there might not have been a sufficient number of African American men or White women to carry out the analysis.

There are other techniques for examining differences between groups and testing the findings to see if the observed differences are likely to be "true" ones. To use any one of them, the data must meet specific conditions. Correlation, t-tests, chi-square, and variance analysis are among the most frequently used and have been incorporated in many standard statistical packages. More complex procedures designed to examine a large number of variables and measure their respective importance, such as factor analysis, regression analysis, and analysis of co-variance, are powerful statistical tools, but their use requires a higher level of statistical knowledge. There are special techniques for the analysis of longitudinal (panel) data. Many excellent sources are available for deciding about the appropriateness and usefulness of the various statistical methods (Jaeger, 1990; Fitz-Gibbon and Morris, 1987).
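As one hedged illustration, the sketch below applies a chi-square test of independence to the men/women cross-tabulation shown earlier. It assumes the scipy library is available, and, as the surrounding discussion stresses, a small p-value only addresses sampling error, not the substantive question of what caused the difference.

```python
from scipy.stats import chi2_contingency

# Men and women by length-of-participation category, from the earlier table.
men =   [10, 20, 40, 10, 20]
women = [ 0, 10, 26, 34, 30]

chi2, p_value, dof, expected = chi2_contingency([men, women])
print(f"chi-square = {chi2:.1f}, df = {dof}, p = {p_value:.4f}")
# A small p-value suggests the difference in participation patterns between
# men and women is unlikely to be due to sampling error alone.
```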

Exploring the data by various statistical procedures in order to detect new relationships and unanticipated findings is perhaps the most exciting and gratifying evaluation task. It is often rewarding and useful to keep exploring new leads, but the evaluator must not lose track of time and money constraints and needs to recognize when the point of diminishing returns has been reached.

Exhibit 8 (bar chart): Length of Participation in Mentoring Project (200 High School Students, 100 Men and 100 Women); number of participants in each length-of-participation category, shown separately for men and women.


By following the suggestions made so far in this chapter, the evaluator will be able to answer many questions about the project asked by stakeholders concerned about implementation, progress, and some outcomes. But the questions most often asked by funding agencies, planners, and policy makers who might want to replicate a project in a new setting are: Did the program achieve its objectives? Did it work? What feature(s) of the project were responsible for its success or failure? Outcome evaluation is the evaluator's most difficult task. It is especially difficult for an evaluator who is familiar with the conceptual and statistical pitfalls associated with program evaluation. To quote what is considered by many the classic text in the field of evaluation (Rossi and Freeman, 1993):

"The choice (of designs) always involves trade-offs, there is no single, always-best design that can be used as the 'gold standard'."

Why is outcome evaluation or impact assessment so difficult? The answer is simply that educational projects do not operate in a laboratory setting, where "pure" experiments can yield reliable findings pointing to cause and effect. If two mice from the same litter are fed different vitamins, and one grows faster than the other, it is easy to conclude that vitamin x affected growth more than vitamin y. Some projects will try to measure the impact of educational innovations by using this scientific model: observing and measuring outcomes for a treatment group and a matched comparison group. While such designs are best in theory, they are by no means fool-proof: the literature abounds in stories about "contaminated" control groups. For example, there are many stories about teachers whose students were to be controls for an innovative program, and who made special efforts with their students so that their traditional teaching style would yield exceptionally good outcomes. In other cases, students in a control group were subsequently enrolled in another experimental project. But even if the control group is not contaminated, there are innumerable questions about attributing favorable outcomes to a given project. The list of possible impediments is formidable. Most often cited is the fallacy of equating high correlation with causality. If attendance in the mentoring program correlated with higher test scores, was it because the program stimulated the students to study harder and helped them to understand scientific concepts better? Or was it because those who chose to participate were more interested in science than their peers? Or was it because the school changed its academic curriculum requirements? Besides poor design and measurements, the list of factors which might lead to spurious outcome assessments includes invalid outcome measures as well as competing explanations, such as changes in the environment in which the project operated and Hawthorne effects. On the basis of their many years of experience in evaluation work, Rossi and Freeman (1993) formulated 'The Iron Law of Evaluation Studies':

"The better an evaluation study is technically, the less likely it is to show positive program effects."

There is no formula which can guarantee a flawless and definitive outcome assessment. Together with a command of analytic and statistical methods, the evaluator needs the ability to view the project in its larger context (the real world of real people) in order to make informed judgments about outcomes which can be attributed to project activities. And, at the risk of disappointing stakeholders and funding agencies, the evaluator must stick to his guns if he feels that available data do not enable him to give an unqualified or positive outcome assessment. This issue is further discussed in Chapter Four.

Integrate and Synthesize Findings

When the data analysis has been completed, the final task is to select and integrate the tables, graphs, and figures which constitute the salient findings and will provide the basis for the final report. Usually the evaluator must deal with several dilemmas:

• How much data must be presented to support a conclusion?

• Should data be included which are interesting or provocative, but do not answer the original evaluation questions?

• What to do about inconsistent or contradictory findings?

Here again, there are no hard and fast rules. Because usually the evaluator will have much more information than can be presented, judicious selection should guide the process. It is usually unnecessary to belabor a point by showing all the data on which the conclusion is based: just show the strongest indicator. On the other hand, "interesting" data which do not answer one of the original evaluation questions should be shown if they will help stakeholders to understand or seek to address issues of which they may not have been aware. A narrow focus of the evaluation may fulfill contractual or formal obligations, but it deprives the evaluator of the opportunity to demonstrate substantive expertise and the stakeholders of the full benefit of the evaluator's work. Finally, inconsistent or contradictory findings should be carefully examined to make sure that they are not due to data collection or analytic errors. If this is not the case, they should be put on the table as pointing to issues which may need further thought or examination.

REFERENCES

American Psychological Association, American Educational Research Association, and National Council on Measurement in Education (1974). Standards for Educational and Psychological Tests. Washington, DC: American Psychological Association.

Fitz-Gibbon, C. T. & Morris, L. L. (1987). How to Design a Program Evaluation. Newbury Park, CA: Sage.

Fowler, F. J. (1993). Survey Research Methods. Newbury Park, CA: Sage.

Guba, E. G. & Lincoln, Y. S. (1989). Fourth Generation Evaluation. Newbury Park, CA: Sage.

Henerson, M. E., Morris, L. L., & Fitz-Gibbon, C. T. (1987). How to Measure Attitudes. Newbury Park, CA: Sage.

Herman, J. L., Morris, L. L., & Fitz-Gibbon, C. T. (1987). Evaluator's Handbook. Newbury Park, CA: Sage.

Jaeger, R. M. (1990). Statistics—A Spectator Sport. Newbury Park, CA: Sage.

Linn, R. L., Baker, E. L., & Dunbar, S. B. (1991). Complex performance-based assessment: Expectations and validation criteria. Educational Researcher, 20(8).

Love, A. J. (Ed.) (1991). Evaluation Methods Sourcebook. Ottawa, Canada: Canadian Evaluation Society.


Morris, L. L., Fitz-Gibbon, C. T., & Lindheim, E. (1987). How to Measure Performance and Use Tests. Newbury Park, CA: Sage.

Rossi, P. H. & Freeman, H. E. (1993). Evaluation: A Systematic Approach (5th Edition). Newbury Park, CA: Sage.

Scriven, M. (1991). Evaluation Thesaurus. Newbury Park, CA: Sage.

Seidel, J. V., Kjolseth, R., & Clark, J. A. (1988). The Ethnograph. Littleton, CO: Qualis Research Associates.

Stewart, D. W. & Shamdasani, P. N. (1990). Focus Groups. Newbury Park, CA: Sage.

Sudman, S. (1976). Applied Sampling. New York: Academic Press.

Yin, R. (1989). Case Study Research. Newbury Park, CA: Sage.
