underlying theory & basic issues
TRANSCRIPT
382C Empirical Studies in Software Engineering Lecture 5
© 2000-present, Dewayne E Perry 1© 2000-present, Dewayne E Perry
Underlying Theory &Basic Issues
Dewayne E PerryENS 623
382C Empirical Studies in Software Engineering Lecture 5
© 2000-present, Dewayne E Perry 2
All Too True
382C Empirical Studies in Software Engineering Lecture 5
© 2000-present, Dewayne E Perry 3
ValidityIn software engineering, we worry about various issues:
E-Type systems: Usefulness – is it doing what is neededIs it doing it in an acceptable or appropriate way
S-Type programs: correctness of functionality – is it doing what it is supposed to doAre the structures consistent with the way it should perform
In empirical work, worried about similar kinds of thingsAre we testing what we mean to testAre the results due solely to our manipulationsAre our conclusions justifiedWhat are the results applicable to
The questions correspond to different validity concernsConcerned about the logic of evidence
382C Empirical Studies in Software Engineering Lecture 5
© 2000-present, Dewayne E Perry 4
Validity4 primary types of validity
Construct ValidityInternal ValidityStatistical ConclusionExternal Validity
CommentsThis organization differs somewhat from R&REach sequentially dependent on preceding
382C Empirical Studies in Software Engineering Lecture 5
© 2000-present, Dewayne E Perry 5
Construct ValidityAre we measuring what we intend to measure
Akin to the requirements problem: are we building the right systemIf we don’t get this right, the rest doesn’t matter
Constructs: abstract conceptsTheoretical constructionsMust be operationalized in the experiment
Necessary condition for successful experimentDivide construct validity into three parts:
Intentional ValidityRepresentation ValidityObservation Validity
382C Empirical Studies in Software Engineering Lecture 5
© 2000-present, Dewayne E Perry 6
Construct ValidityIntentional Validity
Do the constructs we chose adequately represent what we intend to studyAkin to the requirements problem where our intent is fair scheduling but out requirement is FIFOAre our constructs specific enoughDo they focus in the right directionEg, is it intelligence or cunningness
382C Empirical Studies in Software Engineering Lecture 5
© 2000-present, Dewayne E Perry 7
Construct ValidityRepresentation Validity
How well do the constructs or abstractions translate into observable measuresTwo primary questions:
Do the sub-constructs properly define the constructsDo the observations properly interpret, measure or test the constructs
2 ways to argue for representation validityFace validity
Claim: on the face of it, seems like a good translationVery weak argumentStrengthened by consensus of experts
Content validityCheck the operationalization against the domain for the constructThe extent to which the tests measure the content of the domain being tested - ie, cover the domainThe more it covers the relevant areas, the more content valid
Both are nonquantitative judgments
382C Empirical Studies in Software Engineering Lecture 5
© 2000-present, Dewayne E Perry 8
Construct ValidityObservation Validity
How good are the measures themselvesDifferent aspects illuminated by
Predictive validityCriterion validityConcurrent validityConvergent validityDiscriminant validity
382C Empirical Studies in Software Engineering Lecture 5
© 2000-present, Dewayne E Perry 9
Construct ValidityPredictive Validity
Observed measure predicts what it should predict and nothing elseEg, college aptitude tests are assessed for their ability to predict success in college
Criterion ValidityDegree to which the results of a measure agree with those of an independent standardEg, for college aptitude, GPA or successful first year
Concurrent ValidityThe observed measure correlates highly with an established set of measuresEg, shorter forms of tests against longer forms
382C Empirical Studies in Software Engineering Lecture 5
© 2000-present, Dewayne E Perry 10
Construct ValidityConvergent Validity
Observed measure correlates highly with other observable measures for the same constructUtility is not that it duplicates a measure but is a new way of distinguishing a particular trait while correlating with similar measures
Discriminant ValidityThe observable measure distinguishes between two groups that differ on the trait in questionLack of divergence argues for poor discriminant validity
R&R discuss various interesting correlations on convergent and discriminant validity among various psychological tests
In terms of validity, reliability and stability
382C Empirical Studies in Software Engineering Lecture 5
© 2000-present, Dewayne E Perry 11
Internal ValidityAre the values of the dependent variables solely the result of the manipulations of the independent variablesHave we ruled out rival hypothesesHave we eliminated confounding variables
Participant variablesExperimenter variablesStimulus, procedural and situational variablesInstrumentationNuisance variables
382C Empirical Studies in Software Engineering Lecture 5
© 2000-present, Dewayne E Perry 12
Internal ValidityNever completely satisfied (Systems never error free either)Campbell, Stanley and Cook
Standardize choice of control groupsTry to isolate potential invalidity sources
Confounding effectsTreatment effect and some other effect cannot be separated
Confounding sources of internal invalidityH: History
takes place between pre and post testMay contaminate post test results
M: Maturationolder/wiser/better between pre/post
I: Instrumentationchange due to test instrument
S: Selectionnature of participantsControl over assignment may have effects
382C Empirical Studies in Software Engineering Lecture 5
© 2000-present, Dewayne E Perry 13
Statistical Conclusion ValidityAre the presumed causal variable X and its effect Y statistically related
Ie, do they covaryIf unrelated then the one cannot be the cause of the other
3 questions (sequentially dependent)Is the study sufficiently sensitiveWhat is the evidence that they covaryHow strongly do they covary
382C Empirical Studies in Software Engineering Lecture 5
© 2000-present, Dewayne E Perry 14
External ValidityTwo positions
The generalizability of the causal relationship beyond that studied/observed
Eg, do studies of very large reliable real-time systems generalize to small .COM companies
The extent to which the results support the claims of generalizability
Eg, do the studies of 5ESS support the claim that they are representative of real-time ultra reliable systems
382C Empirical Studies in Software Engineering Lecture 5
© 2000-present, Dewayne E Perry 15
External Validity (EV)SWE: lab studies tend to be with students and that restricts EV and generalizationEcological validity: representativeness of the real worldEfficacy vs effectiveness studies
Former very rigorous, latter more openFormer needs to be more concerned with EVLatter with internal validity
EV: the demonstrated validity of the generalizations that the researcher intended the research to make at the outset and the validity of the generalized inferences that the researcher offers at the end
382C Empirical Studies in Software Engineering Lecture 5
© 2000-present, Dewayne E Perry 16
GeneralizabilityGeneralizability considerations
PeopleResearchersPlaces, environments, settingsTimeTreatments, levels of treatmentsProcedures, conditions and measurementsTechnology
ReproducibilityKey to generalizability is whether the study can be reproducedReplication is an exact as possible repeat
Same procedure, different sampleLook for congruent results
Replications on a broader set of subjects under additional conditions further strengthens generalizability
382C Empirical Studies in Software Engineering Lecture 5
© 2000-present, Dewayne E Perry 17
CausationPhilosophical issues
Aristotle: formal, material, efficient, final causesAlternatives: concomitant variation, invariable sequenceIssue of power, causal efficacy -> invariably joinedNecessity versus invariable sequenceSufficiencyTemporal precedence
382C Empirical Studies in Software Engineering Lecture 5
© 2000-present, Dewayne E Perry 18
CausationPlurality of causes
In behavioral sciences more in terms of one of the causes than a single causeExperiment while holding everything else constantCausality reserved for experimental resultsWorking definition of cause
Forget infinite regressive trail of reasonNon-philosophical working definition:
A proximal antecedent agent or agency that initiates a sequence of events that re necessary and sufficient in bringing about the observed effects
Proximal since it is occurs at a time near the resultAntecedent as it clearly precedes the effectAgent is set up intentionallyExperimenter exercises the control leverSufficient: effect not seen in absence of treatment
382C Empirical Studies in Software Engineering Lecture 5
© 2000-present, Dewayne E Perry 19
Lack of CausationCorrelation and causation
Correlations show a relationshipWith path analysis, multiple regression analysis, one can begin to make causal inferences, or build causal modelsBut demonstration of causality is a logical and experimental, rather than a statistical problem
Enabling versus causingPermits but does not causeEg: marriage is primary cause of divorce - NOT
OtherNot sufficientMay not be necessary (eg, high IQ and good living)May be probabilistic
Some unsubstantiated causal claims:Drinking wine prevents arteriosclerosisEating broccoli prevents colon cancerPost hoc ergo propter hoc – necessity not demonstrated
382C Empirical Studies in Software Engineering Lecture 5
© 2000-present, Dewayne E Perry 20
Hume’s Classical RulesC&E must be contiguous in space and timeC must be prior to EA constant union between C and EThe same C always produces the same E and the same E never arises but from the same CLike Es imply like CsLike Cs produce like EsCs may have multiple components, ie subCsSome Cs are not complete in themselves
382C Empirical Studies in Software Engineering Lecture 5
© 2000-present, Dewayne E Perry 21
DistillationCo-variation Rule : cause is positively correlated with effectTemporal Precedence Rule : causes must precede effectsInternal Validity Rule :all plausible alternative explanations must be ruled outReality : must settle for best available evidence
382C Empirical Studies in Software Engineering Lecture 5
© 2000-present, Dewayne E Perry 22
ControlConstancy of conditions
Maintaining extraneous conditions that might affect variablesCalibrating various elements in an experiment
Control seriesExpectation controlBehavior control
Control conditionOf primary concern to usControl + treatment
382C Empirical Studies in Software Engineering Lecture 5
© 2000-present, Dewayne E Perry 23
Mill’s MethodMethod of agreement
If X then YIf several instances of this and only X presentThen X is a sufficient condition for Y
Method of DifferenceIf ~X then ~Y
IfY does not occur when X is absentThen X is a necessary condition of Y
Joint MethodShould lead to better, more highly justified conclusions than either method separately
Method of Concomitant VariationRelates changes in the amount or degree of changeY = f(x)
Y is functionally related to variations in XEg, stronger treatments show larger effects
382C Empirical Studies in Software Engineering Lecture 5
© 2000-present, Dewayne E Perry 24
Practical ApplicationPractical problems:
Rare to find perfect covariance, r = 1.0Rare to rule out every plausible alternative hypothesisExperience needed to determine adequate control conditions
Two group designTreatment: if X then YControl: if ~X then ~Y
382C Empirical Studies in Software Engineering Lecture 5
© 2000-present, Dewayne E Perry 25
RandomnessRandomness is fundamental in experiments
Quasi-experimental otherwiseMust compensate for it otherwise
Must have a clear view of its roleIt is “the reasoned basis for inference” in experimentsThe basis for statistical methods
RA Fischer, The Design of Experiments, Edinburgh: Oliver & Boyd, 1935, 1949
Credited with the invention of randomizationIntroduced the formal properties of randomization
382C Empirical Studies in Software Engineering Lecture 5
© 2000-present, Dewayne E Perry 26
The Lady Tasting Tea“A lady declares that by tasting a cup of tea made with milk she can discriminate whether the milk or the tea infusion was first added to the cup.”Experiment
8 cups of tea4 made each wayPresented in random order
often determined by a random number tableSubject knows the experimental designHer task is to determine two sets of 4
Agreeing if possible with the treatments received
382C Empirical Studies in Software Engineering Lecture 5
© 2000-present, Dewayne E Perry 27
The Lady Tasting TeaWhat would be expected if the Lady was “without any faculty of discrimination”
Ie, if she made no changes in her judgments in response to changes in the order of presentationThere are 70 possible divisions of 8 into 4
Randomization has insured that all orderings are equally probablyThe chance of accidentally choosing the right ordering is 1/70
Ie, probability of random ordering agreeing with the Lady’s fixed judgments is 1/70 or
0.014 is the significance level for testing the null hypothesis of having no judgment
7048
=⎟⎟⎠
⎞⎜⎜⎝
⎛
382C Empirical Studies in Software Engineering Lecture 5
© 2000-present, Dewayne E Perry 28
The Lady Tasting TeaExample serves well
The Lady is not a sample from a population of ladies –concerns her aloneHer eight judgments are not independent observations (rule of 4 each)Later cups differ from earlier ones
Inferences are justified because the only probability distribution used in the inference is the one created by the experimenter
382C Empirical Studies in Software Engineering Lecture 5
© 2000-present, Dewayne E Perry 29
Fischer’s argumentExperiments do not require
That experimental units be homogeneousThat experimental units be a random sample from a population of unit
It is sufficient to require that treatments be allocated at random to experimental units for valid inferencesProbability enters the experiment only thru the random assignment of treatments