

    TECHNOLOGICAL FORECASTING AND SOCIAL CHANGE 40, 131-150 (1991)

    An Evaluation of Delphi

    FRED WOUDENBERG

    ABSTRACT

The literature concerning quantitative applications of the Delphi method is reviewed. No evidence was found to support the view that Delphi is more accurate than other judgment methods or that consensus in a Delphi is achieved by dissemination of information to all participants. Existing data suggest that consensus is achieved mainly by group pressure to conformity, mediated by the statistical group response that is fed back to all participants.

Introduction

Human judgment is necessary in situations of uncertainty and, because these situations abound, is very much relied upon. However, numerous reports about flaws in human judgment and its inferiority to more formal methods of judgment have appeared [1-6]. Research since the beginning of this century has sought to discover the factors responsible for the shortcomings in human judgment, with the aim of developing more accurate judgment methods. In reviewing these studies, several authors [7-12] draw the same three conclusions:

1. A statistical aggregate of several individual judgments is more accurate than the judgment of a random individual. Some authors [10, 12] refine this by noting that this holds only for tasks of simple to intermediate difficulty.

2. Judgments resulting from interacting groups are more accurate than a statistically aggregated judgment. Also, interaction leads to stronger agreement.

3. Unstructured, direct interaction still has disadvantages that lead to suboptimal accuracy of judgments.
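Point 1 above can be illustrated with a small simulation (a hypothetical sketch, not taken from any of the reviewed studies; the true value, noise level, and group size below are arbitrary assumptions): the median of several noisy individual estimates is, on average, closer to the true value than the estimate of a randomly chosen individual.

```python
import random

random.seed(1)

TRUE_VALUE = 100.0
N_PANELS = 500      # hypothetical repetitions of the judgment task
N_JUDGES = 9        # judges per panel

median_errors, individual_errors = [], []
for _ in range(N_PANELS):
    # each judge's estimate: the true value plus independent noise
    estimates = [random.gauss(TRUE_VALUE, 20.0) for _ in range(N_JUDGES)]
    group_median = sorted(estimates)[N_JUDGES // 2]
    median_errors.append(abs(group_median - TRUE_VALUE))
    # error of a randomly chosen individual from the same panel
    individual_errors.append(abs(random.choice(estimates) - TRUE_VALUE))

avg = lambda xs: sum(xs) / len(xs)
print(f"mean |error|, group median:      {avg(median_errors):.2f}")
print(f"mean |error|, random individual: {avg(individual_errors):.2f}")
```

With independent, unbiased noise the statistical aggregate nearly always wins; the refinement by [10, 12] concerns tasks where individual errors are correlated or systematically biased, which this sketch does not model.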

Starting from these three premises, the most logical next step is to develop judgment methods that possess all the advantages, but not the disadvantages, of unstructured, direct interaction. Many of these methods have been developed. In most, one tries to overcome the disadvantages of unstructured, direct interaction by structuring the interaction between participants. The nominal group technique, developed by van de Ven and Delbecq [13], is the most widely known structured, direct interaction method. In other procedures, interaction is also structured, but no allowance is made for direct interaction. The best-

FRED WOUDENBERG received a PhD in experimental psychology in 1989. Address reprint requests to Dr. Fred Woudenberg, Scientific Department, Environmental Health Service Rotterdam, PO Box 70032, 3000 LP Rotterdam, The Netherlands.

© 1991 by Elsevier Science Publishing Co., Inc. 0040-1625/91/$3.50


TABLE 1
Judgment Methods in Order of Increasing Accuracy, Based on Literature Review or Expectation

Traditional methods: random individual → staticized group → unstructured, direct interaction
Newly developed methods: structured, direct interaction → Delphi

(Accuracy increases from left to right. Solid line, literature review; dashed line, expectation.)

known structured, indirect interaction method is the Delphi technique, which is the focus of the present article.

Based on the above-mentioned literature reviews and the expectations derived from them [see, e.g., 14], the different judgment methods can be put on a scale of increasing accuracy (Table 1). In this article the expected high relative accuracy of Delphi is evaluated. Both the accuracy of the Delphi method as a whole and the contribution of the separate Delphi characteristics to its accuracy will be discussed. In addition, the reliability of Delphi and, briefly, its capacity to induce consensus are evaluated.

History of Delphi

The first experiment using a Delphi methodology was performed in 1948 to improve betting scores at horse races [15]. The name Delphi was coined by Kaplan [see 16], a philosopher working for the Rand Corporation who headed a research effort directed at improving the use of expert predictions in policy-making. Kaplan et al. [17] found that unstructured, direct interaction did not lead to more accurate predictions than statistical aggregation of individual predictions. To make better use of the potential of group interaction, Gordon, Helmer, and Dalkey, also at the Rand Corporation, developed the Delphi method in the 1950s. Between 1950 and 1963 they conducted 14 Delphi experiments, but, as a consequence of its military character, all this work was secret. The first article describing some of this research was published in 1963 [18]. In 1964, Gordon and Helmer published an article that roused worldwide interest in Delphi [19].

Delphi was developed as a method to increase the accuracy of forecasts. Many other types of Delphi have been derived from the original method:

• Delphi to estimate unknown parameters [20]
• Policy Delphi [21]
• Decision Delphi [22]

A Delphi has even been used to compile an epidemiological dictionary [23]. In the 1950s and 1960s, Delphi was used mainly to make quantitative assessments (forecasting dates and estimating unknown parameters). In the 1970s, stress was more and more put on the educational and communicational possibilities of Delphi [24-28], although Dalkey [29] had mentioned these possibilities in 1967. Some authors [30, 31] began to call Delphi a communication device and measured its success qualitatively as the satisfaction of the participants with the method instead of quantitatively as accuracy [32]. The evaluation of Delphi in the present article is restricted to the quantitative Delphi, and the conclusions pertain only to this form.


TABLE 2
Comparison of the Accuracy of Judgment Methods with Delphi as Reference

Study                      | Staticized group | Unstructured, direct interaction | Structured, direct interaction | Delphi
Campbell [41]              |                  | 2                                |                                | 1
Pfeiffer [43]              | 2                |                                  |                                | 1
Dalkey [20]                | 2                |                                  |                                | 1
Farquhar [51]              |                  | 1                                |                                | 2
Gustafson et al. [52]      | 2                | 2                                | 1                              | 3
Sack [44]                  |                  | 2                                |                                | 1
Brockhoff [59]             |                  | 1                                |                                | 1
Ford [49]                  | 1                |                                  |                                | 2
Seaver [55]                | 1                | 2                                | 2                              | 2
Miner [53]                 |                  |                                  | 1                              | 1/2
Moskowitz and Bajgier [54] |                  | 1                                |                                | 2
Rohrbaugh [12]             |                  |                                  | 1                              | 1
Fischer [57]               | 1                | 1                                | 1                              | 1
Boje and Murnighan [48]    | 1                | 3                                |                                | 2
Riggs [33]                 |                  | 2                                |                                | 1
Parente et al. [42]        | 2                |                                  |                                | 1
Erffmeyer and Lane [14]    |                  | 2                                | 3                              | 1

Methods are ranked from most accurate (1) to less accurate (2, 3).
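Pairwise tallies of the kind shown in Table 3 can be derived mechanically from rank rows like those in Table 2. The sketch below uses only three rows as an illustration; the column abbreviations and the `tally` helper are my own, not part of the paper.

```python
# Ranks per study: lower number = more accurate; None = method not compared.
# Three rows from Table 2, used here purely as an illustration.
TABLE2 = {
    "Campbell [41]": {"static": None, "unstruct": 2,    "struct": None, "delphi": 1},
    "Ford [49]":     {"static": 1,    "unstruct": None, "struct": None, "delphi": 2},
    "Fischer [57]":  {"static": 1,    "unstruct": 1,    "struct": 1,    "delphi": 1},
}

def tally(method):
    """Count studies where Delphi was more, equally, or less accurate than `method`."""
    more = equal = less = 0
    for ranks in TABLE2.values():
        other, delphi = ranks[method], ranks["delphi"]
        if other is None:
            continue  # method not included in this study
        if delphi < other:
            more += 1
        elif delphi == other:
            equal += 1
        else:
            less += 1
    return more, equal, less

for method in ("static", "unstruct", "struct"):
    print(method, tally(method))
```

Running the same procedure over all seventeen rows would reproduce the counts reported in Table 3.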

be seen as a new measuring instrument. Evaluation of the accuracy and reliability of Delphi, being a judgment method, is therefore seriously hampered by the possible influence of person- and situation-specific biases.

Accuracy of the Delphi Method as a Whole

Because it is difficult to evaluate the accuracy of a judgment method, it is not surprising that the accuracy of the Delphi has been falsely inferred from other criteria, such as consensus [38], the log-normality of first estimates [39], and the relation between remoteness and precision of a forecast [40]. The most feasible way to evaluate the accuracy of Delphi is to compare it directly to other judgment methods in the same situation. The main sources of bias that remain are order effects when the same participants are used with the different methods and person-specific effects when different groups of participants are used. These effects can be minimized with proper randomization.

A number of studies have been reported in which accuracy was evaluated by comparing Delphi directly to other methods (see Table 2). In most studies no statistical comparison between methods was made. Therefore, the accuracy of the investigated methods was rank ordered from most to least accurate. Table 3 lists all 26 possible pairwise comparisons between methods of the 17 mentioned studies. A slight, but not unequivocal, indication for Delphi's expected higher accuracy as compared to unstructured, direct interaction can be observed. A similarly equivocal suggestion can be found in Delphi's lower accuracy as compared to the staticized group. The two suggestions taken together (Delphi being more accurate than unstructured, direct interaction, but less accurate than a staticized group) are not easy to interpret and can even be called anomalous.

Note that not only is the evaluation of the accuracy and reliability of the Delphi method hampered by person- and situation-specific biases, but also its validation and standardization.


TABLE 3
Pairwise Comparisons of the Accuracy of Judgment Methods Based on the Results of Table 2

                                           | Delphi more accurate | Equally accurate | Delphi less accurate
Staticized group vs Delphi                 | 2                    | 1                | 5
Unstructured, direct interaction vs Delphi | 5                    | 3                | 2
Structured, direct interaction vs Delphi   | 2                    | 4                | 2
All methods vs Delphi                      | 9                    | 8                | 9

In general (see the Introduction), it has been found that unstructured, direct interaction gives more accurate results than does a staticized group. The present suggestions imply that a staticized group is more accurate than unstructured, direct interaction. Unfortunately, very few direct comparisons between unstructured, direct interaction and a staticized group were made in the reviewed studies. In three comparisons, a staticized group was once more accurate and twice equally accurate, slightly supporting the anomalous conclusion. The comparison between Delphi and structured, direct interaction suggests that there is no difference in accuracy. Also, the comparisons between Delphi and all other methods (meaningful because Delphi has been proposed as the most accurate judgment method available) show no difference.

Taken together, the reviewed studies do not offer easily interpretable conclusions or unequivocal outcomes of comparisons. A closer scrutiny of the separate studies does not offer a clue as to the factors determining these equivocal results. A criticism pertaining to all the studies is that experiments were conducted in a laboratory, while the quiescence of the private environment has been mentioned as a distinct advantage of Delphi. Also, in almost no study were expert participants used. Furthermore, in most studies no arguments of deviating judges were asked for and fed back to the entire group.

More specific criticisms can be leveled against the studies, including those seven [14, 20, 33, 41-44] that found Delphi to be the most accurate method. The experiments Campbell [41] describes in his thesis are often cited as evidence of Delphi's higher accuracy as compared to unstructured, direct interaction. Sackman [45], although calling Campbell's experiments well conducted, criticizes them because the unstructured, direct interacting control group was not really unstructured, but force-fitted into a Delphi-type format. Although Campbell did this to make comparison with Delphi feasible, Sackman concludes that it obstructs comparison because the participants in Campbell's interacting group did not get enough opportunity for spontaneous interaction, and therefore this group has to be regarded as seriously disadvantaged in comparison to Delphi.

Pfeiffer [43] reported Delphi to be more accurate than what seems to be a staticized group response for 13 of 16 short-term economic indicators. This result was not obtained by Pfeiffer himself. In his book, Pfeiffer describes these results in two short paragraphs. He only writes that the experiments were conducted at the University of California in Los Angeles. He seems to refer to the experiments performed by Campbell. But Campbell's research concerned the comparison between Delphi and unstructured, direct interaction. It is not clear whether Pfeiffer is wrongly referring to this research or rightly referring to a separate part of this research that has not been described elsewhere.

The experiments by Dalkey [20, 46, 47] probably most strongly contributed to the view that the accuracy of judgments can be increased by using the Delphi method. In a series of 11 experiments, Dalkey asked students to answer almanac-type questions, such


as "In what year was nylon invented?" and "What was the circulation of Playboy magazine as of January 1, 1966?" Dalkey's experiments are strongly criticized for several reasons. First, it should be noted that the judgments in Dalkey's experiments did not always become more accurate over rounds. Improvement occurred in two-thirds of the questions, and deterioration was found in the remaining one-third. The tasks Dalkey used were very simple, and some authors [48] considered them unrepresentative of most judgmental problems encountered in real life. Also, the nonparametric statistical tests Dalkey used to evaluate his results are criticized for their simplicity [49]. Dalkey only recorded whether improvement over rounds did or did not occur, independent of the absolute change in accuracy. As a result, very small changes in accuracy can attain much weight. A striking example of this is found in a study performed by Brown and Helmer [50], using Dalkey's method of analysis. Here, increased accuracy over rounds was also found for two-thirds (13 of 20) and decreased accuracy for one-third (7 of 20) of the questions. Of the 13 questions improving in accuracy over rounds, the improvement was 0-0.1 standard scores for five questions, 0.1-0.2 standard scores for four questions, 0.2-0.3 standard scores for three questions, and almost 0.5 standard score for only one question. That these increases are very modest can be seen in the comparison of last-round Delphi scores with the first-round scores of a random individual judge, which can be regarded as an absolute minimal baseline. Delphi scores were more accurate on a median of 12.5 of 20 questions, while the random judge was more accurate on the remaining 7.5 questions. An even less favorable picture appears when interquartile distances are considered. In the first round, 13 of the 20 interquartile distances covered the true value. Because of convergence, only seven values fell in the interquartile distance on the last round.
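The weight that direction-only scoring gives to negligible changes can be made concrete. The sketch below uses the Brown and Helmer bins quoted above; the per-question magnitudes, including those of the seven deteriorating questions, are illustrative assumptions, since the paper does not give the exact values.

```python
# Illustrative per-question accuracy changes in standard scores (+ = improvement),
# chosen to be consistent with the Brown and Helmer [50] bins quoted in the text:
# 5 questions improving by 0-0.1, 4 by 0.1-0.2, 3 by 0.2-0.3, 1 by ~0.5,
# and 7 deteriorating questions (their magnitudes are assumed here).
changes = ([0.05] * 5 + [0.15] * 4 + [0.25] * 3 + [0.5]   # 13 improvements
           + [-0.15] * 7)                                  # 7 deteriorations (assumed)

improved = sum(1 for c in changes if c > 0)
worsened = sum(1 for c in changes if c < 0)
net_shift = sum(changes) / len(changes)

# Dalkey-style direction count: looks like a clear 2:1 success ratio ...
print(f"improved {improved} / worsened {worsened}")
# ... but the magnitude-weighted picture is far more modest
print(f"mean change: {net_shift:+.3f} standard scores")
```

A sign count of 13 versus 7 reads as a success, while the average movement under these assumptions stays well below a tenth of a standard score.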

An unpublished report by Sack [44] is cited by Riggs [33] in two sentences. The only information given is that Delphi was more accurate in short-term forecasting than an unstructured, direct interacting group, but not at the stated level of statistical significance. Riggs's own study [33], in which Delphi was found to be more accurate than unstructured, direct interaction, made use of a modified Delphi. Participants were not anonymous and only two rounds were run.

In the study by Parente et al. [42], Delphi was more accurate than a staticized group, but the comparison between Delphi and the staticized group was done with different groups of participants in two different situations. First, for ten events, a group of 300 students forecasted if and (if so) when the events would occur during the next months. Three groups of 100 students each made forecasts weekly during a time span of one, two, or three months. A staticized group result was calculated based on two events that occurred in the designated time period. Following this, a Delphi experiment was conducted. This time, 80 new students made forecasts about 30 new events. Again, students had to make a forecast every week, all for a time span of two months. Of the 30 events, six occurred in these two months and these six events were used in the analysis. It is clear that the different groups of participants and events, as well as the different number of participants and events, seriously hinder the comparison between the staticized group and the Delphi.

The study by Erffmeyer and Lane [14] compared four procedures using the NASA "lost on the moon" exercise: (a) unstructured, direct interaction, (b) consensus group, (c) structured, direct interaction, and (d) Delphi. Participants in the consensus group engaged in unstructured direct interaction, but followed guidelines to resolve conflicts. Participants had to rank 15 items of equipment in terms of importance for the survival of a shipwrecked crew on the moon. The results of four groups, each using one of the four different procedures, were compared to the correct rank order assigned by NASA


experts. The Delphi group was the most accurate, followed by the consensus group and the unstructured, direct interacting group; the structured, direct interacting group was the least accurate. These results can, of course, be questioned because the correct rank order is unknown. There is no way to tell whether the NASA experts' rank order is correct. It would be interesting to know if the experts themselves would rank order the items identically under the four different procedures.

The above-mentioned criticisms of studies reporting Delphi to be more accurate than one or more other judgment procedures can be and have been used against Delphi. The seven studies [48, 49, 51-55] in which Delphi was found to be less accurate than at least one other method can, however, likewise be criticized.

In the study by Farquhar [51], Delphi was found to be less accurate than unstructured, direct interaction in two separate experiments. But the interpretation of these results is hampered by the small and unequal number of participants. The Delphi groups had nine and four participants, and the unstructured, direct interacting groups had five and three. Group size can have a profound effect on the accuracy of judgments [20, 56].

Gustafson et al. [52] compared Delphi with a staticized group, with unstructured, direct interaction, and with structured, direct interaction. Delphi was the least accurate method. These results have been questioned on two grounds. Fischer [57] contends that the dependent variable used by Gustafson et al. exaggerates small differences, and he reanalyzes the results with another dependent variable. The outcome of this analysis is that no differences between methods are found. Van de Ven and Delbecq [58] ascribe Gustafson et al.'s negative findings with Delphi to the specific variant of Delphi used. This was not a real Delphi, but a derived method called estimate-feedback-estimate. Small groups of four participants sat together and received written feedback. The unnaturalness of written feedback in a gathered group with the prohibition of direct contact could, according to van de Ven and Delbecq, have induced negative social affiliation with detrimental effect on the results.

In the study by Ford [49], a staticized group was more accurate than two versions of Delphi. But the absolute differences in accuracy were small. The only striking result was that over four rounds Delphi estimates became slightly less accurate and staticized group estimates slightly more accurate. Ford's study can be criticized for his use of a within-subjects design in which four groups of subjects all participated in four different judgment methods. Although Ford neatly randomized the order in which the four experimental groups participated in the four methods by making use of a Latin square, his results do suggest order effects. Randomizing order allows for testing order effects, but it is no guarantee that order effects will not occur. For instance, having a Delphi (in which feedback is provided) before a staticized group method (iteration of individual judgments in which feedback is not provided) cannot be expected to be counterbalanced by giving another group of subjects the staticized group method before the Delphi.

The study by Seaver [55] compared five methods (no interaction; unstructured, direct interaction; two methods using structured, direct interaction; and Delphi) under six aggregation procedures. Averaged over all aggregation procedures, all kinds of interaction tended to decrease the accuracy of probabilistic answers to two-alternative general knowledge questions. The design and methodology of the study were quite complex. Ten groups of four persons each answered questions before and after interaction (except in the no-interaction condition). Five sets of questions and the five procedures were balanced using a Greco-Latin square design. Just as in the study by Ford [49], the randomization of order does not imply that order effects do not occur. It cannot be determined whether Seaver's data, like those of Ford, suggest order effects, because Seaver's paper does not


give the worked-out order of procedures. In sharp contrast to the statistical sophistication of the study, the interaction procedures Seaver used were of rudimentary form. Delphi was barely Delphi-like. Only two rounds were used. Before the second round the experimenter personally informed each group member about the other members' assessed probabilities and the reasons for them. Therefore, individual scores, but not a group score, were fed back to all participants.

The study by Miner [53] reports Delphi to be significantly less accurate than a structured, direct interaction method that was the focus of the article. But there was no difference between Delphi and the most widely known structured, direct interaction method, the nominal group technique. Erffmeyer and Lane [14] criticize the type of Delphi used by Miner, primarily for the lack of physical separation of participants.

Moskowitz and Bajgier [54] found Delphi to be less accurate than unstructured, direct interaction. The comparison of both methods was, however, not the primary object of the study. Little attention was paid to the characteristics of the Delphi. Participants were not anonymous and feedback of individual scores instead of group scores was given.

In the study by Boje and Murnighan [48], Delphi was more accurate than unstructured, direct interaction, but less accurate than a staticized group. However, the difference between Delphi and the staticized group was small. A confounding factor in this study was the big difference in the first-round estimates between the groups. They were most accurate for the Delphi and subsequently became less accurate over rounds. The unstructured, direct interacting group was least accurate on the first round and remained so. The accuracy of the staticized group was between those of the other two groups, but subsequently improved over rounds.

Three studies did not find a difference between Delphi and one or more other methods. These studies can also be criticized. In the study by Brockhoff [59], no fewer than seven hypotheses were tested at the same time, of which the difference between Delphi and unstructured, direct interaction was only one. Not surprisingly, a confounded result was found. Delphi was slightly more accurate with respect to forecasts and slightly less accurate with respect to fact-finding questions. As the author himself remarks, it cannot be excluded that still other factors, for instance group size, influenced these results.

In the study by Rohrbaugh [12], both Delphi estimates and those produced in a structured, direct interaction method became more accurate over rounds, but only the latter increase was significant. Rohrbaugh found no difference in last-round accuracy between both groups; this could be the result of the lower accuracy of first-round estimates in the structured, direct interacting group. A notable feature of this experiment was that participants were not instructed to give their most accurate estimates, but to reduce the existing differences within their group. The influence of these instructions on the end result cannot be ascertained, but setting a goal other than optimal accuracy can hardly be expected to optimize accuracy.

In the study by Fischer [57], all four methods mentioned in Table 2 were compared. Amazingly, no difference between any pair of methods was found. Seaver [55] partly holds Fischer's method of analysis responsible for this. The proper scoring rules Fischer used are, according to Seaver, relatively insensitive to differences between estimates. The Delphi variant used could be called atypical. There were eight groups of only three persons each. Only two rounds were run. The minimal requirements for sharing information in a Delphi do not seem to be met in such small groups with very little opportunity for feedback.

Scrutiny of the separate studies does not lead to more easily interpretable conclusions or clues for the reasons why no comparison between methods gives an unequivocal result.


The only justified conclusion seems to be that factors other than the specific method used (capability of the group leader, motivation of the participants, quality of the instructions, etc.) to a large extent determine the accuracy of an application of a judgment method. In accordance with this, one of the most consistent findings is that the method which was the primary focus of an article, and which can be expected to be preferred by the author(s), was almost always found to rank highest in accuracy [14, 20, 33, 41, 43, 49, 52, 53]. The only two exceptions are the studies by Brockhoff [59] and Rohrbaugh [12]. Brockhoff was unable to show Delphi's greater accuracy as compared to unstructured, direct interaction, and Rohrbaugh could not show his self-developed structured, direct interaction method (social judgment analysis) to be more accurate than Delphi. Both authors try to find methodological explanations for this and also mention additional advantages of their preferred method. The predominantly positive results with the method of preference suggest that the unidentified factors in judgment methods responsible for high accuracy are best taken care of when the investigator has confidence in the method being used. For this reason, it is very difficult to evaluate the greater accuracy, as compared to Delphi, of a few highly idiosyncratic methods, tested only once by their developer and main proponent [10, 49, 52, 60, 61].

    Contribution of the Delphi Characteristics to Its Accuracy

    ANONYMITY

Anonymity in a Delphi is meant to exclude group interaction processes that decrease the accuracy of group judgment, while preserving positive influences [56, 62]. Anonymity has been criticized [63] for its intrinsic negative effects (lack of a feeling of responsibility for the end result) and because it would obviate some intrinsic positive effects of unstructured, direct group interaction (flexibility and the richness of nonverbal communication). A conclusion regarding the (dis)advantages of anonymity is difficult to draw. The only data giving a rough indication concern the satisfaction of participants with the Delphi method. This satisfaction varies strongly [48, 53, 58, 61, 64]. A possible problem with anonymity is low compliance. In one study [22], partial anonymity led to a higher response.

    USE AND SELECTION OF EXPERTS

It seems obvious to use experts in situations of high uncertainty. According to many authors [45, 65-68], however, the lack of directly relevant information in uncertain situations determines judgments more than the information that is available, and consequently experts are not more accurate than nonexperts. One author [28] even contends experts perform worse than nonexperts, because the former are more strongly influenced by the desirability of an answer. Within the context of Delphi, expertise has more specifically been investigated by means of the relation between accuracy and self-ratings. Negative [42, 48, 65, 67-70] as well as positive [35, 47, 50, 71, 72] correlations have been found. Dalkey [72] tried to reconcile these contradictory results by positing that no substantial correlation exists, either positive or negative, for individuals, but that a positive correlation holds for groups from approximately seven persons. Dalkey even argues that selection of a subgroup with high self-ratings leads to the same increase in accuracy as he finds with the Delphi method. He proposes a combination of both: subgroups for questions having enough participants with high self-ratings and a Delphi for the remaining questions. A disadvantage of selecting a subgroup is that group size is reduced, with the possibility of a decrease in accuracy due to a larger random error. Research with other judgment methods has shown that selection of a subgroup increases accuracy only if the


group shows large variability in accuracy, if the probability of making a wrong judgment is great, and if the most accurate judges can be selected with high certainty [73-75].

    ITERATION

The original purpose of iteration in Delphi is to have only the least-informed participants change their minds over rounds [56, 62]. Critics [45] assert iteration only leads to boredom. In a great many studies, a slight improvement in accuracy over rounds is found (for review, see Armstrong [7], Dietz [69], and Erffmeyer et al. [76]), although there are also studies finding no improvement [48, 52, 60]. In nearly all studies finding improvement, most or all improvement takes place between the first and second estimation rounds [7, 10, 20, 28, 69, 77]. In a few studies accuracy further increased after the second estimation round [60, 76]. Iteration cannot be considered independent from feedback in a Delphi. Only a few studies investigated the effects of iteration and feedback on accuracy independently from each other. In three studies [42, 48, 49], a slight increase in accuracy was caused completely by iteration, and in only one [47] by feedback.

    FEEDBACK

The idea behind providing feedback is to share the total information available to a group of individual experts. Those experts who find the composite judgment of the group or the arguments of deviating experts more compelling than their own should subsequently modify their judgment [56, 62]. In this way, group pressure to conformity is prevented and any change in judgment is caused by new information only. Feedback in a Delphi can consist of the statistical summary of the group response as well as arguments of deviating judges. There is limited evidence that arguments of deviating judges lead to an increase in accuracy. Three studies [61, 71, 78] report a slight increase in accuracy due to arguments over the increase caused by statistical feedback. In one study [20] feedback of arguments decreased accuracy. In nearly all studies, the largest increase in accuracy in a Delphi is found between the first and second estimation rounds. Only the statistical group response has then been fed back.

A host of support can be found for the assertion that statistical feedback induces conformity. Numerous studies [12, 20, 28, 38, 49, 64, 79-83] show that changes over rounds are in the direction of the group response that has been fed back. For most of these studies, Ford's [49] remark holds: "Delphi methods induce change toward the median, but rarely cause the median to change." Change toward the fed-back value also occurs when this value is false [33, 80, 83, 84]. In two studies [20, 35] a "pull to the mean" has been compared directly to a "pull to the true." By far the greatest increase in accuracy was caused by the pull to the mean. In these two studies and in a later one [77], a relation was found between the distance of an individual's judgment to the group response and the percentage of judges subsequently changing their estimate (see Figure 1). This relation suggests an explanation for the slight increase in accuracy over rounds found in many Delphis. First-round estimates have been found to be distributed lognormally (see Figure 2). In such a positively skewed distribution, the mode lies before the mean and median. Consequently, the distance to the median is on average smaller for values to the left than for values to the right of the median. Two processes may cause a subsequent change toward the median:

    1. The relation between distance to the group response and subsequent change is asymptotic. Because values to the left of the median are on average closer to it than those to the right, newly estimated values coming from the left will lie even closer to the old median than those coming from the right.



    Fig. 1. Proportion of participants changing opinion versus deviation from the median. From Dalkey [20].

    2. The relation between distance to the group response and subsequent change is not perfect: values on and close to the median will also be changed. Because values to the left of the median are on average closer and, upon new estimation, come even closer to the median than the values to the right, there will be more crossings from left to right than from right to left.

    The net result is a small increase in the median. If the initial group response underestimates the true value, the result is a slight increase in accuracy. If the initial group response overestimates the true value, the result is a slight decrease in accuracy. Support for this has been found in several studies [17, 78, 85].
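The two processes above can be illustrated with a toy simulation. Everything in it (the lognormal distribution, the change probabilities, the noise level) is an assumption chosen for illustration, not taken from Dalkey's data:

```python
# Illustrative sketch (not from the paper): first-round estimates are drawn
# from a lognormal distribution; on revision, a judge changes with a
# probability that grows with distance from the fed-back median (cf. Fig. 1),
# moving halfway toward it plus a little noise. No true value or new
# information enters the process, yet the group median creeps upward.
import random
import statistics

def simulate_round(n=500, rng=random):
    """Return the shift of the group median after one feedback round."""
    estimates = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]
    median = statistics.median(estimates)
    revised = []
    for x in estimates:
        distance = abs(x - median)
        if rng.random() < min(1.0, distance):  # farther judges change more often
            x = x + 0.5 * (median - x) + rng.gauss(0.0, 0.15)
        revised.append(x)
    return statistics.median(revised) - median

random.seed(1)
drifts = [simulate_round() for _ in range(300)]
print(statistics.mean(drifts))  # small positive drift of the median
```

Averaged over many runs, the drift is positive: because the density of estimates is higher just left of the median, noisy revisions cross from left to right more often than the reverse, exactly the artifact described in the text.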

    A definite conclusion regarding feedback in a Delphi is now possible. The small increase in accuracy over rounds found in many Delphis can be ascribed partly to the mere iteration of judgment (see above) and partly to an artifactual by-product of the pressure to conformity caused by the statistical feedback. In any case, changes in estimates caused by feedback, whether or not associated with an increase in accuracy, are not induced by dissemination of information to all participants, but are the result of group pressure to conformity. This is supported by the finding that feedback of arguments, which can especially be thought to disseminate information, has only a negligible effect on the accuracy of estimates.

    Fig. 2. Distribution of first-round estimates (observed density versus a lognormal fit, plotted against the normalized estimate). From Dalkey [20].


    TABLE 4
    Reliability of Delphi in Studies Conducting Two or More Delphis with the Same Questions

    Study                        Reliability
    Helmer [86]                  0.87 (a)
    Bender et al. [88]           0.32 (-0.06 to 0.73) (a)
    Ament [87]                   0.57 (a)
    Dalkey [20]                  0.4 to 0.75 (b)
    Dalkey et al. [72]           0.86 (0.5 to 0.98) (b)
    Sahr [89]                    Intuitively good
    Welty [90]                   Nonsignificant difference
    Welty [67]                   0.82 (a)
    Amara and Salancik [93]      0.77 (b)
    Grabbe and Pyke [94]         Intuitively good
    Martino [see 45]             Intuitively good
    Huckfeldt and Judd [95]      Intuitively good
    Sackman [45]                 0.77 (c)
    Dagenais [96]                0.52 (0.24 to 1.00) (d)

    (a) Pearson product-moment correlation calculated by the present author.
    (b) Unspecified correlation coefficient reported by the original author.
    (c) Rank-order coefficient reported by the original author.
    (d) Kappa reliabilities reported by the original author.

    Reliability

    Because of person- and situation-specific factors, as described in the section on evaluation of Delphi, it would appear difficult to standardize Delphi and to evaluate its reliability. Reliability has exclusively been evaluated by comparing the results of two groups of participants in the same Delphi. Such a procedure is acceptable if it is assumed that both groups have the same bias. With proper randomization, this assumption does not appear unreasonable. Unfortunately, the studies evaluating the reliability of Delphi suffer more serious shortcomings. A drawback of many studies is that comparisons were made between groups whose forecasts were obtained in different years. Experiments were from one [86] to eight [87] years apart. Clearly, forecasting the date of a future event in 1964 is not the same as forecasting that date in 1972, when circumstances will have changed considerably. As with accuracy, many studies used no statistical criterion. Because reliability can only be inferred from absolute values, and not from comparisons to other methods as with accuracy, this is a serious shortcoming here. Therefore, if raw scores were reported, Pearson product-moment correlation coefficients were calculated. An additional shortcoming of many studies is that no variances were indicated, making it unclear whether the intuitively judged high reliability was caused by genuine similarity or by overlap of unduly wide confidence intervals. An overview of studies assessing the reliability of Delphi is given in Table 4. They all report a relatively high reliability.
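The correlation computation referred to here is straightforward. As a sketch, with invented median forecast dates rather than data from any of the reviewed studies:

```python
# Sketch of the reliability check described above: given the median forecast
# dates reported by two independent Delphi groups for the same questions,
# compute a Pearson product-moment correlation. The dates are hypothetical.
import math

group_a = [1975, 1980, 1984, 1990, 2000, 2010]  # hypothetical medians, group 1
group_b = [1977, 1979, 1988, 1992, 1995, 2015]  # hypothetical medians, group 2

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = pearson(group_a, group_b)
print(round(r, 2))  # prints 0.96
```

Note that, as the text stresses, a high r between two sets of medians says nothing about the width of the underlying confidence intervals.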

    Helmer [86] puts the median results of three experiments next to each other. Variances were not considered. One experiment was conducted in 1963 with a traditional Delphi method [19]. Another was a pretest conducted in 1966 with 23 Rand Corporation employees. The last experiment was done at a conference using 100 conference delegates (not physically separated from one another), of whom 23 were selected to finish the exercise. The article gives no indication as to how this selection was performed. The 1966 pretest and 1967 conference Delphi could be compared on seven forecasts. One event was forecasted to occur in the same year by both groups. The other six forecasts differed by 2-9 years. The correlation coefficient calculated from the raw data presented by the author was 0.87. Conference delegates remarked that serious biases existed in the wording of


    the questions, and they did not expect their forecasts to be particularly reliable. The 1963 study contained four questions related, but not identical, to those in the other two studies. Differences ranged from one to 21 years. Because of the lack of similarity and the small number of questions, computation of coefficients is pointless.

    In the study by Bender et al. [88], three experiments were compared, also only with respect to median values. One was the above-mentioned study by Gordon and Helmer [19]. The other two were a pretest and the experiment proper, both conducted by the authors. The three studies could be compared on 11 forecasts. Correlation coefficients for the three pairwise comparisons, calculated from the raw data and excluding questions with imprecise answers (never, >x), were -0.06, 0.30, and 0.73.

    In the study by Ament [87], two experiments, one the study conducted by Gordon and Helmer [19] and the other conducted by the author in 1969, were compared on 31 forecasts. Medians as well as interquartile distances were given. The phrasing of the questions was not the same in both studies. For 13 questions, a possible bias in phrasing was indicated by the author. Limiting the analysis to the 18 questions for which a possible bias was not indicated, interquartile distances overlapped for 15. Although this number seems very large, the interquartile distances were also large. Excluding indefinite values, the means of the interquartile distances were 22.2 years for the 1963 experiment and 13.8 years for the 1969 experiment. The corresponding means of years to forecasted occurrence were 23.7 and 20, respectively. The correlation coefficient calculated from the 15 median values for which precise values were given in both studies was 0.57.

    Dalkey [20] reports correlation coefficients on the basis of the median responses to 20 almanac-type questions for seven pairs of groups of different sizes. The coefficients increased monotonically with group size, from almost 0.4 for groups of two respondents to 0.75 for groups of 11 respondents.

    Dalkey et al. [72] continued the above-mentioned study with an experiment involving eight pairs of groups of 15-20 respondents. Coefficients were calculated from first-round estimates. First-round scores in a Delphi are in fact staticized group responses; Dalkey et al.'s experiment therefore does not refer to Delphi at all. Another criticism of this experiment is that the estimates were not used directly. Because correct answers were available, group errors were calculated by taking the natural logarithm of the median estimate of the group divided by the true answer. Dalkey's method of analysis suggests that it is instructive to look at the relation between accuracy and reliability. Textbooks often remark that reliability is a necessary condition for accuracy. But the reverse relation is also interesting. High accuracy is a sufficient condition for high reliability. Very easy questions, having high accuracy, automatically lead to high reliability. This high reliability could be called spurious. Although Dalkey did not check for this possibility, the high coefficients found in his study could merely indicate that he asked simple questions.
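The group-error statistic just described can be written as ln(group median / true answer): it is 0 for an exact group median, and the logarithm makes over- and underestimation by the same factor symmetric. A minimal sketch with invented numbers:

```python
# Group error as used by Dalkey et al.: natural log of the group's median
# estimate divided by the true answer. The estimates below are illustrative.
import math
import statistics

estimates = [120.0, 150.0, 200.0, 90.0, 160.0]  # one group's answers to a question
true_answer = 140.0

group_error = math.log(statistics.median(estimates) / true_answer)
print(round(group_error, 3))  # prints 0.069 (median 150 slightly overestimates)
```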

    Sahr [89] compared three Delphi studies, but not directly with respect to reliability. Sackman [45] criticizes this study for its presentation of "some fifty pages filled with descriptive quantitative data, without reporting a single statistic indicating variances, standard errors of measurement or product-moment reliability coefficients."

    The two studies by Welty [67, 90] report intuitively high reliability. However, these studies were meant to discredit Delphi. Of the two groups compared, one consisted of experts and the other of laypeople. By showing that both groups performed equally well, Welty intended to prove that Delphi's strong reliance on experts is misplaced. The two experiments Welty performed were partial replications, with lay participants, of a study by Rescher [91, 92] with experts. Participants had to give estimates on a five-point scale about the importance of a number of American cultural values (14 in Welty's first study


    and 17 in the second) in the year 2000. The topic makes one wonder who is to be counted as an expert and who is not. For the first study [90], Welty only reports the results of an overall sign test. This test showed that the two groups were not significantly different (p = .18). In the second study [67], an overall value could not be computed. Of the 16 questions that could be compared, only two were significantly different by means of an F-test. However, error variances were quite high. The mean of the 16 standard deviations was 0.84 for the experts and 1.13 for the lay judges. The corresponding 99% confidence intervals that go along with a one-tailed test at the 1% level (used by Welty) are 1.96 and 2.63, respectively. This leaves very little room for obtaining a difference between two groups making judgments on a five-point scale, where, in fact, 31 of 32 group scores fell between points two and four. The correlation coefficient based on 15 of the 16 pairs of means, for which values in both groups were given, was 0.82.
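The half-widths just quoted can be reproduced from the one-tailed 1% critical value of the standard normal distribution (about 2.33) times each group's mean standard deviation:

```python
# Reproducing the interval half-widths quoted above: a one-tailed test at the
# 1% level uses the standard normal critical value z ≈ 2.33, so the half-width
# is roughly 2.33 times the mean standard deviation of each group.
z = 2.33  # P(Z > 2.33) ≈ 0.01 for a standard normal variable

for sd in (0.84, 1.13):  # mean standard deviations: experts, lay judges
    print(round(z * sd, 2))  # prints 1.96, then 2.63
```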

    Amara and Salancik [93] also used an ordinal five-point scale on which two groups of participants had to rate the likelihood of 89 social developments in the 1980s. A correlation coefficient of 0.77 was found. The same criticisms as those leveled against Welty's study are valid: most participants gave answers in the middle range of the scale, leaving very little room for a difference of more than one scale point.

    In the study by Grabbe and Pyke [94], six experiments, conducted from 1964 through 1972, were compared on median forecasts. From the six studies, 15 pairwise comparisons can be extracted. The number of more-or-less comparable questions between studies ranged from zero to eight, with a mean of 2.6. Only eight comparisons involved identical questions: three concerning three questions, two concerning two questions, and three concerning one question. The small number of identical questions makes it impossible to calculate correlation coefficients for any pair of studies; therefore, nothing can be said quantitatively about the authors' claim that there appears to be reasonably good agreement among the dates.

    An attempt to demonstrate the reliability of Delphi by Martino is mentioned, and at the same time criticized, by Sackman [45]. In a number of independent studies, Martino noticed several similar questions that resulted in similar forecasts. No statistical indexes were reported.

    Huckfeldt and Judd [95] compared the responses of five respondents answering two questions twice, separated by two weeks. Respondents had to indicate on a seven-point scale the likelihood of the occurrence of an event. The authors indicate only the percentage of responses differing one, two, three, or more scale points on the two occasions. It can be read from their data that 23% of the changes were more than one scale point and only 10% more than two points. No rank correlation is reported and, because no raw data are given, none could be calculated.

    Along the lines followed by Welty [67, 90], and also with the aim of disproving Delphi's strong reliance on experts, Sackman [45] replicated the much-referred-to Gordon and Helmer study [19] several times with students. Spearman rank coefficients were calculated for the order of years in which events were forecasted. The average rank correlation between the several replications was 0.77.

    Dagenais [96] calculated kappa reliabilities for 16 pairs of groups of students asked to identify successful vocational education programs from a list of programs. Kappa values varied greatly. As with Dalkey's data, a relation with accuracy was not investigated, but could be expected. The highest kappa value Dagenais reports is 1, a value not to be expected with even the most sophisticated judgment method if something more complicated than a simple arithmetic question is asked.
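Kappa corrects raw agreement for the agreement expected by chance. A minimal sketch of Cohen's kappa on invented classifications (not Dagenais's data):

```python
# Cohen's kappa on hypothetical data: two groups each classify the same list
# of programs as successful (1) or not (0). kappa = (p_observed - p_chance) /
# (1 - p_chance), so 0 means chance-level agreement and 1 means perfect.
from collections import Counter

group_a = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
group_b = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

n = len(group_a)
observed = sum(a == b for a, b in zip(group_a, group_b)) / n  # raw agreement
counts_a, counts_b = Counter(group_a), Counter(group_b)
expected = sum(counts_a[c] * counts_b[c] for c in (0, 1)) / n ** 2  # chance agreement
kappa = (observed - expected) / (1 - expected)
print(round(kappa, 2))  # prints 0.4
```

The example also shows why very easy questions inflate kappa: if nearly everyone gives the same (correct) answer, observed agreement is high regardless of the judgment method.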

    The correlations reported in or derived from the above-mentioned studies range from


    very low to very high. In some cases, the intuitively found high reliability could not be substantiated quantitatively. Next to the great variation in correlations and the inability to quantify some results, the discussed data show many limitations and shortcomings (relation to accuracy, bias in phrasing of questions, different years of obtaining forecasts, large or unindicated error variance). A methodology to evaluate the reliability of Delphi more thoroughly has been suggested by Hill and Fowles [65]. They suggest that reliability should be ascertained in several ways, including test-retest evaluation and investigation of the effects of procedural variations on the results. Their proposal has not been adopted. A definite conclusion regarding the reliability of the Delphi method must therefore be postponed. The present data, however, do suggest that high reliability of the Delphi method can hardly be expected. Because of person- and situation-specific biases, a new measuring instrument is created with every new application of the method. These person- and situation-specific biases inevitably arise, as every new set of questions is accompanied by a new group of experts giving judgments. Person-specific biases can only partly be removed by random assignment of participants to groups. Differences between groups can also occur through selective attrition of participants during execution of a Delphi. Attrition in Delphis is very variable [45, 53, 58, 80, 95, 97]. Some data suggest attrition in a Delphi can be selective. It has, for instance, been found that dropouts are further removed from the group median on the first round than holdouts [77], implying that deviating participants drop out more often.

    Consensus

    In the introduction it was mentioned that, in addition to the general view that interacting groups give more accurate judgments than a staticized group, interacting groups also show stronger agreement. A Delphi is extremely efficient in achieving consensus [12, 20, 28, 38, 49, 64, 79, 81-83]. Several studies report stronger consensus with a Delphi than with unstructured, direct interaction [33, 57]. As with accuracy, consensus is almost always at its maximum after the second estimation round [16, 18, 80, 95]. Although consensus can be important, it can never be the primary goal of a Delphi. High consensus is neither a necessary nor a sufficient condition for high accuracy. In most Delphis a slight increase in accuracy over rounds is found (see the section on iteration). In contrast, consensus increases very strongly. Also, a direct comparison shows that consensus increases much more strongly than accuracy [20]. Together with the indications that group pressure to conformity is very strong in a Delphi, this makes consensus in a Delphi suspect and in no way related to genuine agreement. Consensus in a Delphi is therefore not a good criterion, not even a secondary one. Most probably, in situations of uncertainty a lack of consensus is inevitable. As Stewart and Glantz [98] remark, "The same lack of knowledge that produced the need for a study that relied on expert judgment virtually assures that a group of diverse experts will disagree."

    Conclusions

    The data discussed in the present article leave no other possibility open than a negative evaluation of quantitative Delphi. The main claim of Delphi, to remove the negative effects of unstructured, direct interaction, cannot be substantiated. In many Delphis a slight increase in accuracy over rounds is found. But this increase can be ascribed partly to mere repetition of judgment (possibly by giving judges the opportunity to contemplate their judgments) and partly to an artifactual consequence of group pressure to conformity. A Delphi is extremely efficient in obtaining consensus, but this consensus


    is not based on genuine agreement; rather, it is the result of the same strong group pressure to conformity.

    In view of the manifold applications of Delphi all over the world, the negative conclusion drawn in the present article may seem surprising. But negative evaluations of Delphi have been appearing since the 1960s. In the Rand Corporation study that aroused worldwide interest in Delphi [19], Gordon and Helmer used forecasting as a weaker term for prediction, to indicate the tentative nature of their and related investigations. The fame of the study seems to be based more on the quality of the participants (Isaac Asimov, Arthur C. Clarke, Carl G. Hempel, and Stephen Toulmin, among others) than on the quality of the study itself or its results [99]. Dalkey, next to Gordon and Helmer the main developer of Delphi, summed up most of the negative aspects of Delphi, including the strong group pressure to conformity induced by statistical feedback of the group response, in two articles published in 1968 [16] and 1969 [20]. In 1971, two review articles appeared [28, 31] that for the first time prudently concluded that Delphi may not be fit for quantitative application. Not in the least prudent, this conclusion is repeated by Sackman in his well-known Delphi Critique [45]. Although several authors [24, 32, 84, 100] tried to rebut Sackman, critical articles about Delphi have not ceased to appear since his vehement attack [11, 77]. A few review articles favorable to Delphi have also appeared. Riggs [33] reviewed three studies and described one experiment in which Delphi was compared to other judgment methods. Delphi was the most accurate method in one of the reviewed studies, the least accurate in another, and more accurate, but not statistically significantly so, in the third. After concluding his own study, finding a nontraditional variant of Delphi to be more accurate than unstructured, direct interaction, Riggs concludes that, "in spite of the obvious limitations of the study, the evidence lends credibility to the statement that Delphi procedures are superior to conference methods for long-range forecasting." Another favorable review was written by Shneiderman [61]. After reviewing five studies, he concludes that the Delphi "on the whole produces somewhat more accurate final results than the traditional committee." This conclusion does not seem to be based on the reviewed studies, because of the five he reviewed, Delphi was the most accurate in two, the least accurate in two, and equally accurate to another method in the fifth.

    Partly in response to the strong critique leveled against Delphi in the 1970s, quantitative applications have become dominated by more qualitative applications (see the history of Delphi above). The proliferation of qualitative applications has been publicized by many authors [10, 26, 30, 32, 101]. They describe Delphi with the metaphor of art instead of science. It can be questioned whether the art-instead-of-science approach can be regarded as a watertight defense against the critique of Delphi's many drawbacks. Quantitative and qualitative statements cannot be strictly divided. Ordering priorities or signaling future developments implies making assumptions about which social, cultural, and political developments will occur and when they will occur. In any judgment that has accuracy as a goal or, at least, as an underlying assumption, accuracy cannot be expelled as a necessary criterion for the particular judgment method used. Only the pure decision Delphi, where the primary goal is consensus, could perhaps be excluded from the demand of the criterion of accuracy. For all other applications, qualitative and quantitative alike, Ascher's [102] remark about forecasts holds: "accuracy is an asset because the utilization of forecasts requires, at a minimum, credibility" (emphasis by the original author). There are no indications that the drawbacks found in quantitative applications of Delphi do not also occur in more qualitative applications. Given the negative verdict on quantitative Delphi applications, it is troubling that the transition from forecasting and estimation to value Delphis "has been lightly made without much in the way of evaluation or reanalysis" [27]. The problems this creates for Delphi are nicely illustrated in a set of statements taken


    from a review by Murray [26], in which the drawbacks of Delphi are recognized, but the qualitative type of Delphi is nevertheless favored: Delphi "is thus not a science but an art," which can be used "when nothing better than opinion can be achieved," while "the final justification for the technique must be on its usefulness to decision makers." Last-resort arguments like this seem, to say the least, a questionable justification for the use of Delphi when it can be clearly shown that Delphi is in no way superior to other (simpler, faster, and cheaper) judgment methods.

    This review was written with the support of a Dutch government grant, aimed at investigating the usefulness of judgment methods in the assessment of the toxic effects of hazardous chemicals on humans.

    References
    1. Dawes, R. M., and Corrigan, B., Linear Models in Decision Making, Psychological Bulletin 81, 95-106 (1974).
    2.
    3.
    4.
    5. Hogarth, R. M., Cognitive Processes and the Assessment of Subjective Probability Distributions, Journal of the American Statistical Association.


    Opinion, Forecasting, and Group Process by H. Sackman, Technological Forecasting and Social Change 7, 195-213 (1975).
    25. Ludlow, J., Delphi Inquiries and Knowledge Utilization, in The Delphi Method: Techniques and Applications, H. A. Linstone and M. Turoff, eds., Addison-Wesley, Reading, MA, 1975.
    26. Murray, T. J., Delphi Methodologies: A Review and Critique, Urban Systems 4, 153-158 (1979).
    27. Skutsch, M., and Schafer, J. L., Goals-Delphis for Urban Planning:


    56. Martino, J. P., Technological Forecasting for Decision Making, North-Holland, New York, 1983.
    57. Fischer, G. W., When Oracles Fail: A Comparison of Four Procedures for Aggregating Subjective Probability Forecasts, Organizational Behavior and Human Performance 28, 96-110 (1981).
    58. Ven van de, A. H., and Delbecq, A. L., The Effectiveness of Nominal, Delphi, and Interacting Group Decision Making Processes, Academy of Management Journal 17, 605-621 (1974).
    59. Brockhoff, K., The Performance of Forecasting Groups in Computer Dialogue and Face-to-Face Discussion, in The Delphi Method: Techniques and Applications, H. A. Linstone and M. Turoff, eds., Addison-Wesley, Reading, MA, 1975.
    60. Brockhaus, W. L., A Quantitative Analytical Methodology for Judgmental and Policy Decisions, Technological Forecasting and Social Change 7, 127-137 (1975).
    61. Shneiderman, M. V., Empirical Studies of Procedures for Forming Group Expert Judgments, Automation and Remote Control 49, 547-557 (1988).
    62. Helmer, O., Analysis of the Future: The Delphi Method, in Technological Forecasting for Industry and Government, J. R. Bright, ed., Prentice-Hall, Englewood Cliffs, NJ, 1968.
    63. Milkovich, G. T., Annoni, A. J., and Mahoney, T. A., The Use of Delphi Procedures in Manpower Forecasting, University of Minnesota Center for the Study of Organizational Performance and Human Effectiveness, TR-7007 (AD747651), Minneapolis, 1972.
    64. Scheibe, M., Skutsch, M., and Schafer, J., Experiments in Delphi Methodology, in The Delphi Method: Techniques and Applications, H. A. Linstone and M. Turoff, eds., Addison-Wesley, Reading, MA, 1975.
    65. Hill, K. Q., and Fowles, J., The Methodological Worth of the Delphi Forecasting Technique, Technological Forecasting and Social Change 7, 179-192 (1975).
    66. Vinokur, A., Distribution of Initial Risk Levels and Group Decisions Involving Risk, Journal of Personality and Social Psychology 13, 207-214 (1969).
    67. Welty, G., Problems of Selecting Experts for Delphi Exercises, Academy of Management Journal 15, 121-124 (1972).
    68. Welty, G., The Necessity, Sufficiency and Desirability of Experts as Value Forecasters, in Developments in the Methodology of Social Science, W. Leinfellner and E. Kohler, eds., Reidel, Boston, 1974.
    69. Dietz, T., Methods for Analyzing Data from Delphi Panels: Some Evidence from a Forecasting Study, Technological Forecasting and Social Change 31, 79-85 (1987).
    70. Wise, G., The Accuracy of Technological Forecasts, 1890-1940, Futures 8, 411-419 (1976).
    71. Best, R. J., An Experiment in Delphi Estimation in Marketing Decision Making, Journal of Marketing Research 11, 447-452 (1974).
    72. Dalkey, N., Brown, B., and Cochran, S., Use of Self-Ratings to Improve Group Estimates: Experimental Evaluation of Delphi Procedures, Technological Forecasting 1, 283-291 (1970).
    73. Einhorn, H. J., Hogarth, R. M., and Klempner, E., Quality of Group Judgment, Psychological Bulletin 84, 158-172 (1977).
    74. Huber, G. P., Methods for Quantifying Subjective Probabilities and Multi-Attribute Utilities, Decision Sciences 5, 430-458 (1974).
    75. Stael von Holstein, C. A. S., Probabilistic Forecasting: An Experiment Related to the Stock Market, Organizational Behavior and Human Performance 8, 139-158 (1972).
    76. Erffmeyer, R. C., Erffmeyer, E. S., and Lane, I. M., The Delphi Technique: An Empirical Evaluation of the Optimal Number of Rounds, Group and Organization Studies 11, 120-128 (1986).
    77. Bardecki, M. J., Participants' Response to the Delphi Method: An Attitudinal Perspective, Technological Forecasting and Social Change 25, 281-292 (1984).
    78. Hample, D. J., and Hilpert, F. P., Jr., A Symmetry Effect in Delphi Feedback, paper presented at the International Communication Association Convention, Chicago, 1975 (copies available from the authors at the Department of Speech Communication, University of Illinois, Urbana, IL 61801).
    79. Bamberger, I., and Mair, L., Die Delphi-Methode in der Praxis, Management International Review 16, 81-91 (1976).
    80. Cyphert, F. R., and Gant, W. L., The Delphi Technique: A Tool for Collecting Opinions in Teacher Education, The Journal of Teacher Education 3, 417-425 (1970).
    81. Cyphert, F. R., and Gant, W. L., The Delphi Technique: A Case Study, Phi Delta Kappan 52, 272-273 (1971).
    82. Mulgrave, N. W., and Ducanis, A. J., Propensity to Change Responses in a Delphi Round as a Function of Dogmatism, in The Delphi Method: Techniques and Applications, H. A. Linstone and M. Turoff, eds., Addison-Wesley, Reading, MA, 1975.
    83. Nelson, B. W., Statistical Manipulation of Delphi Statements: Its Success and Effects on Convergence and Stability, Technological Forecasting and Social Change 12, 41-60 (1978).


    84. Scheele, D. S., Consumerism Comes to Delphi: Comments on Delphi Assessment, Expert Opinion, Forecasting, and Group Process by H. Sackman, Technological Forecasting and Social Change 7, 215-219 (1975).
    85. Davis, J. H., Group Decision and Social Interaction: A Theory of Social Decision Schemes, Psychological Review 80, 97-125 (1973).
    86. Helmer, O., The Delphi Method: An Illustration, in Technological Forecasting for Industry and Government, J. R. Bright, ed., Prentice-Hall, Englewood Cliffs, NJ, 1968.
    87. Ament, R. H., Comparison of Delphi Forecasting Studies in 1964 and 1969, Futures 2, 35-44 (1970).
    88. Bender, A. D., Strack, A. E., Ebright, G. W., and von Haunalter, G., Delphic Study Examines Developments in Medicine, Futures 1, 289-303 (1969).
    89. Sahr, R. C., A Collation of Similar Delphi Forecasts, Institute for the Future, WP-5, April 1970.
    90. Welty, G. A., A Critique of Some Long-Range Forecasting Developments, Bulletin of the International Statistical Institute 54, 403-408 (1971).
    91. Rescher, N., A Study of Value Change, Journal of Value Inquiry 1, 21-22 (1967).
    92. Rescher, N., A Questionnaire Study of American Values by 2000 A.D., in Values and the Future, K. Baier and N. Rescher, eds., Free Press, New York, 1969.
    93. Amara, R. C., and Salancik, J. R., Forecasting: From Conjectural Art Toward Science, Technological Forecasting and Social Change 3, 415-426 (1972).
    94. Grabbe, E. M., and Pyke, D. L., An Evaluation of the Forecasting of Information Processing Technology and Applications, Technological Forecasting and Social Change 4, 143-150 (1972).
    95. Huckfeldt, V. E., and Judd, R. C., Issues in Large Scale Delphi Studies, Technological Forecasting and Social Change 6, 75-88 (1974).
    96. Dagenais, F., The Reliability and Convergence of the Delphi Technique, The Journal of General Psychology 98, 307-308 (1978).
    97. Dodge, B. J., and Clark, R. E., Research on the Delphi Technique, Educational Technology, April 1977, 58-59.
    98. Stewart, T. R., and Glantz, M. H., Expert Judgment and Climate Forecasting: A Methodological Critique of Climate Change to the Year 2000, Climatic Change 7, 159-183 (1985).
    99. Overbury, R. E., Technological Forecasting: A Criticism of the Delphi Technique, Long Range Planning 1, 76-77 (1969).
    100. Jillson, I. A., Developing Guidelines for the Delphi Method, Technological Forecasting and Social Change 7, 221-222 (1975).
    101. Dijk van, J. A. G. M., Delphi Questionnaire versus Individual and Group Interviews: A Comparison Case, Technological Forecasting and Social Change 37, 293-304 (1990).
    102. Ascher, W., Forecasting: An Appraisal for Policy-Makers and Planners, Johns Hopkins University Press, Baltimore, 1978, p. 4.

    Received 26 July 1990