Abstract

Evaluating intelligence is a difficult task. This paper examines why it is so difficult, suggests a new model for thinking about the process of evaluating intelligence and tests that model against several documents prepared by the US National Intelligence Council in the run-up to the Iraq War.


Evaluating Intelligence

Kristan J. Wheaton
Assistant Professor
Mercyhurst College

[email protected] 814 824 2023


Evaluating Intelligence [1]

Evaluating intelligence is tricky.

Really tricky.

Sherman Kent, one of the foremost early thinkers on the analytic process in the US national security intelligence community, wrote in 1976, "Few things are asked the estimator more often than 'How good is your batting average?' No question could be more legitimate -- and none could be harder to answer." So difficult was the question that Kent reports not only the failure of a three-year effort in the 1950s to establish the validity of various National Intelligence Estimates but also the immense relief among the analysts in the Office of National Estimates (forerunner of the National Intelligence Council) when the CIA "let the enterprise peter out."

Unfortunately for intelligence professionals, the decisionmakers that intelligence supports have no such difficulty evaluating the intelligence they receive. They routinely and publicly find intelligence to be “wrong” or lacking in some significant respect. Abbot Smith, writing for Studies In Intelligence in 1969, cataloged many of these errors in On The Accuracy Of National Intelligence Estimates. The list of failures at the time included the development of the Soviet H-bomb, the Soviet invasions of Hungary and Czechoslovakia, the Cuban Missile Crisis and the Missile Gap. The Tet Offensive, the collapse of the Soviet Union and the Weapons of Mass Destruction fiasco in Iraq would soon be added to the list of widely recognized (at least by decisionmakers) “intelligence failures”.

[1] This article originated as a series of posts on my blog, Sources and Methods (www.sourcesandmethods.blogspot.com). This form of "experimental scholarship" -- using the medium of the internet and the vehicle of the blog as a way to put my research online -- provides for more or less real-time peer review. Earlier examples of this genre include: A Wiki Is Like A Room..., The Revolution Begins On Page 5, What Is Intelligence? and What Do Words Of Estimative Probability Mean?. Given its origin and the fact that it is stored electronically in the ISA archive, I will retain the hyperlinks as a form of citation.

In addition, astute readers will note that some of what I write here I have previously discussed in other places, most notably in an article written with my long-time collaborator, Diane Chido, for Competitive Intelligence Magazine and in a chapter of our book on Structured Analysis Of Competing Hypotheses (written with Diane, Katrina Altman, Rick Seward and Jim Kelly). Diane and the others clearly deserve full credit for their contribution to this current iteration of my thinking on this topic.


Nor was the US the only intelligence community to suffer such indignities. The Soviets had their Operation RYAN, the Israelis their Yom Kippur War and the British their Falklands. In each case, after the fact, senior government officials, the press and ordinary citizens alike pinned the black rose of failure on their respective intelligence communities.

To be honest, in some cases, the intelligence organization in question deserved the criticism but, in many cases, it did not -- or at least not the full measure of fault it received. However, whether the blame was earned or not, in the aftermath of each of these cases, commissions were duly summoned, investigations into the causes of the failure conducted, recommendations made and changes, to one degree or another, ratified regarding the way intelligence was to be done in the future.

While much of the record is still out of the public eye, I suspect it is safe to say that intelligence successes rarely received such lavish attention.

Why do intelligence professionals find intelligence so difficult, indeed impossible, to evaluate while decisionmakers do so routinely? Is there a practical model for thinking about the problem of evaluating intelligence? What are the logical consequences for both intelligence professionals and decisionmakers that derive from this model? Finally, is there a way to test the model using real world data?

I intend to attempt to answer all of these questions but first I need to tell you a story…

A Tale Of Two Weathermen

I want to tell you a story about two weathermen; one good, competent and diligent and one bad, stupid and lazy. Why weathermen? Well, in the first place, they are not intelligence analysts, so I will not have to concern myself with all the meaningless distinctions that might arise if I use a real example. In the second place, they are enough like intelligence analysts that the lessons derived from this thought experiment – sorry, I mean “story” – will remain meaningful in the intelligence domain.

Imagine first the good weatherman and imagine that he only knows one rule: If it is sunny outside today, then it is likely to be sunny tomorrow (I have no idea why he only knows one rule. Maybe he just got hired. Maybe he hasn’t finished weatherman school yet. Whatever the reason, this is the only rule he knows). While the weatherman only knows this one rule, it is a good rule and has consistently been shown to be correct.

His boss comes along and asks him what the weather is going to be like tomorrow. The good weatherman remembers his rule, looks outside and sees sun. He tells the boss, “It is likely to be sunny tomorrow.”


The next day the weather is sunny and the boss is pleased.

Clearly the weatherman was right. The boss then asks the good weatherman what the weather will be like the next day. “I want to take my family on a picnic,” says the boss, “so the weather tomorrow is particularly important to me.” Once again the good weatherman looks outside and sees sun and says, “It is likely to be sunny tomorrow.”

The next day, however, the rain is coming down in sheets. A wet and bedraggled weatherman is sent straight to the boss’ office as soon as he arrives at work. After the boss has told the good weatherman that he was wrong and given him an earful to boot, the good weatherman apologizes but then asks, “What should I have done differently?”

“Learn more rules!” says the boss.

“I will,” says the weatherman, “but what should I have done differently yesterday? I only knew one rule and I applied it correctly. How can you say I was wrong?”

“Because you said it would be sunny and it rained! You were wrong!” says the boss.

“But I had a good rule and I applied it correctly! I was right!” says the weatherman.

Let’s leave them arguing for a minute and think about the bad weatherman.

This guy is awful. The kind of guy who sets low standards for himself and consistently fails to achieve them, who has hit rock bottom and started to dig, who is not so much of a has-been as a won’t-ever-be (For more of these see British Performance Evaluations). He only knows one rule but has learned it incorrectly! He thinks that if it is cloudy outside today, it is likely to be sunny tomorrow. Moreover, tests have consistently shown that weathermen who use this rule are far more likely to be wrong than right.

The bad weatherman’s boss asks the same question: “What will the weather be like tomorrow?” The bad weatherman looks outside and sees that it is cloudy and he states (with the certainty that only the truly ignorant can muster), “It is likely to be sunny tomorrow.”

The next day, against the odds, the day is sunny. Was the bad weatherman right? Even if you thought he was right, over time, of course, this weatherman is likely to be wrong far more often than he is to be right. Would you evaluate him based solely on his last judgment or would you look at the history of his estimative judgments?

There are several aspects of the weathermen stories that seem to be applicable to intelligence. First, as the story of the good weatherman demonstrates, the traditional notion that intelligence is either "right" or "wrong" is meaningless without a broader understanding of the context in which that intelligence was produced.

Second, as the story of the bad weatherman revealed, considering estimative judgments in isolation, without also evaluating the history of estimative judgments, is a mistake. Any model for evaluating intelligence needs to (at least) take these two factors into consideration.

A Model For Evaluating Intelligence

Clearly there is a need for a more sophisticated model for evaluating intelligence – one that takes not only the results into consideration but also the means by which the analyst arrived at those results. It is not enough to get the answer right; analysts must also “show their work” in order to demonstrate that they were not merely lucky.

For the purpose of this paper, I will refer to the results of the analysis -- the analytic estimate under consideration -- as the product of the analysis. I will call the means by which the analyst arrived at that estimate the process. Analysts, therefore, can be largely (more on this later) correct in their analytic estimate. In this case, I will define the product as true. Likewise, analysts can be largely incorrect in their analytic estimate in which case I will label the product false.

Just as important, however, is the process. If an analyst uses a flawed, invalid process (much like the bad weatherman used a rule proven to be wrong most of the time), then I would say the process is false. Likewise, if the analyst used a generally valid process, one which produced reasonably reliable results over time, then I would say the process was true or largely accurate and correct.

Note that these two spectra are independent of one another. It is entirely possible to have a true process and a false product (consider the story of the good weatherman). It is also possible to have a false process and a true product (such as with the story of the bad weatherman). On the other hand, both product and process are bound tightly together as a true process is more likely to lead to a true product and vice-versa. The Chinese notion of yin and yang or the physicist’s idea of complementarity are useful analogues for the true relationship between product and process.

In fact, it is perhaps convenient to think of this model for evaluating intelligence as a small two-by-two matrix, with the process (true or false) along one axis and the product (true or false) along the other.
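For readers who think in code, here is a minimal sketch (mine, not the paper's, and in Python purely for illustration) that treats the evaluation model as a simple lookup over the four combinations of process and product. The class and function names are invented for this example, and the cell labels summarize the examples discussed in the surrounding text.

from typing import NamedTuple

class Evaluation(NamedTuple):
    process_true: bool   # was the analytic process generally valid?
    product_true: bool   # did the estimate turn out to be largely correct?

# "True" here means "largely valid/accurate," per the paper's usage.
INTERPRETATION = {
    (True,  True):  "Sound process, accurate product (e.g., IPB in combat operations)",
    (True,  False): "Sound process, inaccurate product (the 'good weatherman'; e.g., a polling miss)",
    (False, True):  "Flawed process, accurate product (the 'bad weatherman'; lucky)",
    (False, False): "Flawed process, inaccurate product (e.g., reading tea leaves, and wrong)",
}

def evaluate(e: Evaluation) -> str:
    return INTERPRETATION[(e.process_true, e.product_true)]

print(evaluate(Evaluation(process_true=True, product_true=False)))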


There are a number of examples of each of these four basic combinations. For instance, consider the use of intelligence preparation of the battlefield in the execution of combat operations in the Mideast and elsewhere. Both the product and the process by which it was derived have proven to be accurate. On the other hand, statistical sampling of voters (polling) is unquestionably a true process but has, upon occasion, generated spectacularly incorrect results (see Truman v. Dewey…).

False processes abound. Reading horoscopes, tea leaves and goat entrails are all false processes which, every once in a while, turn out to be amazingly accurate. These same methods, however, are even more likely to be false in both process and product.

What are the consequences of this evaluative model? In the first place, it makes no sense to talk about intelligence being “right” or “wrong”. Such an appraisal is overly simplistic and omits critical evaluative information. Evaluators should be able to specify if they are talking about the intelligence product or process or both. Only at this level of detail does any evaluation of intelligence begin to make sense.

Second, with respect to which is more important, product or process, it is clear that process should receive the most attention. Errors in a single product might well result in poor decisions, but are generally easy to identify in retrospect if the process is valid. On the other hand, errors in the analytic process, which are much more difficult to detect, virtually guarantee a string of failures over time with only luck to save the unwitting analyst. This truism is particularly difficult for an angry public or a congressman on the warpath to remember in the wake of a costly “intelligence failure”. This makes it all the more important to embed this principle deeply in any system for evaluating intelligence from the start when, presumably, heads are cooler.


Finally, and most importantly, it makes no sense to evaluate intelligence in isolation – to examine only one case to determine how well an intelligence organization is functioning. Only by examining both product and process systematically over a series of cases does a pattern emerge that allows for appropriate corrective action, if necessary at all, to be taken.

The Problems With Evaluating Product

The fundamental problem with evaluating intelligence products is that intelligence, for the most part, is probabilistic. Even when an intelligence analyst thinks he or she knows a fact, it is still subject to interpretation or may have been the result of a deliberate campaign of deception.

• The problem is exacerbated when making an intelligence estimate, where good analysts never express conclusions in terms of certainty. Instead, analysts typically use words of estimative probability (or, what linguists call verbal probability expressions) such as "likely" or "virtually certain" to express a probabilistic judgment. While there are significant problems with using words (instead of numbers or number ranges) to express probabilities, using a limited number of such words in a preset order of ascending likelihood currently seems to be considered the best practice by the National Intelligence Council (Iran NIE, page 5).
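As a concrete illustration of what "a limited number of such words in a preset order of ascending likelihood" looks like, the sketch below encodes the ordered scale printed in recent NIEs as a Python list. The ordering follows the NIC's published scale as best I can reconstruct it; the numeric midpoints are entirely my own assumption (the NIE deliberately assigns no numbers) and the function name is invented for this example.

# Illustrative only: the NIC's estimative terms form an ordered scale; the
# numeric midpoints below are my own placeholder values, not the NIC's.
WEP_SCALE = [
    "remote",
    "very unlikely",
    "unlikely",
    "even chance",
    "probably/likely",
    "very likely",
    "almost certainly",
]

ILLUSTRATIVE_MIDPOINT = {
    "remote": 0.05,
    "very unlikely": 0.15,
    "unlikely": 0.30,
    "even chance": 0.50,
    "probably/likely": 0.65,
    "very likely": 0.80,
    "almost certainly": 0.93,
}

def is_stronger(wep_a: str, wep_b: str) -> bool:
    """Return True if wep_a expresses a higher likelihood than wep_b."""
    return WEP_SCALE.index(wep_a) > WEP_SCALE.index(wep_b)

print(is_stronger("very likely", "unlikely"))  # True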

Intelligence products, then, suffer from two broad categories of error: Problems of calibration and problems of discrimination. Anyone who has ever stepped on a scale only to find that they weigh significantly more or significantly less than expected understands the idea of calibration. Calibration is the act of adjusting a value to meet a standard.

In simple probabilistic examples, the concept works well. Consider a fair, ten-sided die. Each number, one through ten, has the same probability of coming up when the die is rolled (10%). If I asked you to tell me the probability of rolling a seven, and you said 10%, we could say that your estimate was perfectly calibrated. If you said the probability was only 5%, then we would say your estimate was poorly calibrated and we could "adjust" it to 10% in order to bring it into line with the standard.

Translating this concept into the world of intelligence analysis is incredibly complex.


To have perfectly calibrated intelligence products, we would have to be able to say that, if a thing is 60% likely to happen, then it happens 60% of the time. Most intelligence questions (beyond the trivial ones), however, are unique, one of a kind. The exact set of circumstances that led to the question being asked in the first place and much of the information relevant to its likely outcome are impossible to replicate, making it difficult to keep score in a meaningful way.
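To make the idea of calibration concrete, here is a rough sketch (mine, in Python) of what a calibration check looks like when, unlike most intelligence problems, you do have a history of comparable forecasts to aggregate. The bucketing scheme and the toy data are assumptions made purely for illustration.

from collections import defaultdict

def calibration_table(forecasts, bucket_width=0.1):
    """Group (stated probability, outcome) pairs into probability buckets and
    compare the average stated probability with the observed frequency."""
    buckets = defaultdict(list)
    for p, occurred in forecasts:
        key = round(p / bucket_width) * bucket_width
        buckets[key].append((p, occurred))
    table = {}
    for key, items in sorted(buckets.items()):
        stated = sum(p for p, _ in items) / len(items)
        observed = sum(1 for _, o in items if o) / len(items)
        table[key] = (stated, observed, len(items))
    return table

# Toy history: a well-calibrated analyst's "60% likely" calls come true ~60% of the time.
history = [(0.6, True)] * 6 + [(0.6, False)] * 4 + [(0.9, True)] * 9 + [(0.9, False)] * 1
for bucket, (stated, observed, n) in calibration_table(history).items():
    print(f"bucket {bucket:.1f}: stated {stated:.0%}, observed {observed:.0%}, n={n}")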

The second problem facing intelligence products is one of discrimination. Discrimination is associated with the idea that intelligence is either “right” or “wrong”. An analyst with a perfect ability to discriminate always gets the answer right, whatever the circumstance. While the ability to perfectly discriminate between right and wrong analytic conclusions might be a theoretical ideal, the ability to actually achieve such a feat exists only in the movies. Most complex systems are subject to a certain sensitive dependence on initial conditions which precludes any such ability to discriminate beyond anything but trivially short time frames.

If it appears that calibration and discrimination are in conflict, they are. The better calibrated an analyst is, the less likely they are to be willing to definitively discriminate between possible estimative conclusions. Likewise, the more willing an analyst is to discriminate between possible estimative conclusions, the less likely he or she is to be properly calibrating the possibilities inherent in the intelligence problem.

For example, an analyst who says X is 60% likely to happen is still 40% "wrong" when X does happen, should an evaluator choose to focus on the analyst's ability to discriminate. Likewise, the analyst who said X will happen is also 40% wrong if the objective probability of X happening was 60% (even though X does happen), should the evaluator choose to focus on the analyst's ability to calibrate.
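The arithmetic in the previous paragraph can be made explicit with a small sketch; the two scoring functions below are my own illustrative simplifications of a "discrimination-focused" and a "calibration-focused" evaluator, not established metrics.

def discrimination_error(stated_p: float, occurred: bool) -> float:
    # Probability mass the analyst placed on the outcome that did NOT occur.
    return 1.0 - stated_p if occurred else stated_p

def calibration_error(stated_p: float, objective_p: float) -> float:
    # Gap between the stated probability and the (here assumed known) objective probability.
    return abs(stated_p - objective_p)

# The well-calibrated analyst: "X is 60% likely." X happens.
print(discrimination_error(0.6, occurred=True))   # 0.4 -- "40% wrong" to a discrimination-focused critic
# The decisive analyst: "X will happen" (i.e., 100%). The objective probability was 60%.
print(calibration_error(1.0, objective_p=0.6))    # 0.4 -- "40% wrong" to a calibration-focused critic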

Failure to understand the tension between these two evaluative principles leaves the unwitting analyst open to a "damned if you do, damned if you don't" attack by critics of the analyst's estimative work. The problem only grows worse if you consider words of estimative probability instead of numbers.

All this, in turn, typically leads analysts to ask for what Philip Tetlock, in his excellent book Expert Political Judgment, called "adjustments" when being evaluated regarding the accuracy of their estimative products. Specifically, Tetlock outlines four key adjustments:

• Value adjustments -- mistakes made were the "right mistakes" given the cost of the alternatives

• Controversy adjustments -- mistakes were made by the evaluator and not the evaluated


• Difficulty adjustments -- mistakes were made because the problem was so difficult or, at least, more difficult than problems a comparable body of analysts typically faced

• Fuzzy set adjustments -- mistakes were made but the estimate was a "near miss" so it should get partial credit

This parade of horribles should not be construed as a defense of the school of thought that says that intelligence cannot be evaluated, that it is too hard to do. It is merely to show that evaluating intelligence products is truly difficult and fraught with traps to catch the unwary. Any system established to evaluate intelligence products needs to acknowledge these issues and, to the greatest degree possible, deal with them.

Many of the "adjustments", however, can also be interpreted as excuses. Just because something is difficult to do doesn't mean you shouldn't do it. An effective and appropriate system for evaluating intelligence is an essential step in figuring out what works and what doesn't, in improving the intelligence process. As Tetlock notes (p. 9), "The list (of adjustments) certainly stretches our tolerance for uncertainty: It requires conceding that the line between rationality and rationalization will often be blurry. But, again, we should not concede too much. Failing to learn everything is not tantamount to learning nothing."

The Problems With Evaluating Process

In addition to product failures, there are a number of ways that the intelligence process can fail as well. Requirements can be vague, collection can be flimsy or undermined by deliberate deception, production values can be poor or intelligence made inaccessible through over-classification. Finally, the intelligence architecture, the system in which all the pieces are embedded, can be cumbersome, inflexible and incapable of responding to the intelligence needs of the decisionmaker. All of these are part of the intelligence process and any of these -- or any combination of these -- reasons can be the cause of an intelligence failure.

In this article (and in this section in particular), I intend to look only at the kinds of problems that arise when attempting to evaluate the analytic part of the process. From this perspective, the most instructive current document available is Intelligence Community Directive (ICD) 203: Analytic Standards. Paragraph D4, the operative paragraph, lays out what makes for a good analytic process in the eyes of the Director of National Intelligence:

• Objectivity
• Independent of Political Considerations
• Timeliness
• Based on all available sources of intelligence
• Properly describes the quality and reliability of underlying sources
• Properly caveats and expresses uncertainties or confidence in analytic judgments
• Properly distinguishes between underlying intelligence and analyst's assumptions and judgments
• Incorporates alternative analysis where appropriate
• Demonstrates relevance to US national security
• Uses logical argumentation
• Exhibits consistency of analysis over time or highlights changes and explains rationale
• Makes accurate judgments and assessments

This is an excellent starting point for evaluating the analytic process. There are a few problems, though. Some are trivial. Statements such as "Demonstrates relevance to US national security" would have to be modified slightly to be entirely relevant to other disciplines of intelligence such as law enforcement and business. Likewise, the distinction between "objectivity" and "independent of political considerations" would likely bother a stricter editor as the latter appears to be redundant (though I suspect the authors of the ICD considered this and still decided to separate the two in order to highlight the notion of political independence).

Some of the problems are not trivial. I have already discussed the difficulties associated with mixing process accountability and product accountability, something the last item on the list, "Makes accurate judgments and assessments" seems to encourage us to do.

Even more problematic, however, is the requirement to "properly caveat and express uncertainties or confidence in analytic judgments." Surely the authors meant to say "express uncertainties and confidence in analytic judgments". While this may seem like hair-splitting, the act of expressing uncertainty and the act of expressing a degree of analytic confidence are quite different things. This distinction is made (though not as clearly as I would like) in the prefatory matter to all of the recently released National Intelligence Estimates. The idea that the analyst can either express uncertainties (typically through the use of words of estimative probability) or express confidence flies in the face of this current practice.

Analytic confidence is (or should be) considered a crucial subsection of an evaluation of the overall analytic process. If the question answered by the estimate is, "How likely is X to happen?" then the question answered by an evaluation of analytic confidence is, "How likely is it that you, the analyst, are wrong?" These concepts are analogous to statistical notions of probability and margin of error (as in polling data that indicates that Candidate X is looked upon favorably by 55% of the electorate with a plus or minus 3% margin of error). Given the lack of a controlled environment, the inability to replicate situations important to intelligence analysts and the largely intuitive nature of most intelligence analysis, an analogy, however, is what it must remain.
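For the statistically inclined, the polling analogy can be made concrete with the standard margin-of-error formula for a sample proportion. This is a textbook formula, not anything drawn from ICD 203 or NIC practice, and the sample size in the example is my own assumption chosen to roughly reproduce the text's 3% figure.

import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Margin of error for a sample proportion at roughly 95% confidence (z = 1.96)."""
    return z * math.sqrt(p * (1 - p) / n)

# Roughly the text's example: Candidate X at 55% with about a 3% margin of error
# implies a sample on the order of a thousand respondents.
print(f"{margin_of_error(0.55, 1050):.1%}")  # ~3.0%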

What contributes legitimately to an increase in analytic confidence? To answer this question, it is essential to go beyond the necessary but by no means sufficient criteria set by the standards of ICD 203. In other words, analysis which is biased or late shouldn't make it through the door but analysis that is only unbiased and on time meets only the minimum standard.

Beyond these entry-level standards for a good analytic process, what are those elements that actually contribute to a better estimative product? The current best answer to this question comes from Josh Peterson's thesis on the topic. In it he argued that seven elements had adequate experimental data to suggest that they legitimately contribute to analytic confidence:

• Use of Structured Methods in Analysis
• Overall Source Reliability
• Level of Source Corroboration/Agreement
• Subject Matter Expertise
• Amount of Collaboration Among Analysts
• Task Complexity
• Time Pressure
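Purely as a thought experiment, one could imagine turning Peterson's seven elements into a rough scoring rubric. The sketch below does so with equal weights, which is entirely my own placeholder assumption -- as the questions that follow make clear, nobody yet knows the real weights or interactions.

# Hypothetical rubric only: Peterson's thesis identifies the elements but does not
# assign weights; equal weighting here is a placeholder, not a finding.
CONFIDENCE_ELEMENTS = [
    "structured_methods",
    "source_reliability",
    "source_corroboration",
    "subject_matter_expertise",
    "analyst_collaboration",
    "low_task_complexity",   # complexity cuts against confidence, so score its absence
    "low_time_pressure",     # likewise for time pressure
]

def analytic_confidence_score(ratings: dict) -> float:
    """Average the 0..1 ratings for each element (equal weights -- an assumption)."""
    return sum(ratings.get(e, 0.0) for e in CONFIDENCE_ELEMENTS) / len(CONFIDENCE_ELEMENTS)

example = {e: 0.7 for e in CONFIDENCE_ELEMENTS}
print(round(analytic_confidence_score(example), 2))  # 0.7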

There are still numerous questions that remain to be answered. Which element is most important? Is there a positive or negative synergy between two or more of the elements? Are these the only elements that legitimately contribute to analytic confidence?

Perhaps the most important question, however, is how the decisionmaker -- the person or organization the intelligence analyst supports -- likely sees this interplay of elements that continuously impacts both the analytic product and process.

The Decisionmaker's Perspective

Decisionmakers are charged with making decisions. While this statement is blindingly obvious, its logical extension actually has some far reaching consequences.

First, even if the decision is to "do nothing" in a particular instance, it is still a decision. Second, with the authority to make a decision comes (or should come) responsibility and accountability for that decision's consequences (the recent kerfuffle surrounding Tom Daschle's withdrawal from consideration for a post in the Obama cabinet is instructive in this matter).

Driving these decisions are typically two kinds of forces. The first is largely internal to the individual or organization making the decision. The focus here is on the capabilities and limitations of the organization itself: How well-trained are my soldiers? How competent are my salespeople? How fast is my production line, how efficient are my logistics or how well equipped are my police units? Good decisionmakers are often comfortable here. They know themselves quite well. Oftentimes they have risen through the ranks of an organization or even started the organization on their own. The internal workings of a decisionmaker's own organization are easiest (if not easy) to see, predict and control.

The same cannot be said of external forces. The current upheaval in the global market is likely, for example, to affect even good, well-run businesses in ways that are extremely difficult to predict, much less control. The opaque strategies of state and non-state actors threaten national security plans and the changing tactics of organized criminal activity routinely frustrate law enforcement professionals. Understanding these external forces is a defining characteristic of intelligence and the complexity of these forces is often used to justify substantial intelligence budgets.

Experienced decisionmakers do not expect intelligence professionals to be able to understand external forces to the same degree that it is possible to understand internal forces. They do expect intelligence to reduce their uncertainty, in tangible ways, regarding these external forces. Sometimes intelligence provides up-to-date descriptive information, unavailable previously to the decisionmaker (such as the U-2 photographs in the run-up to the Cuban Missile Crisis). Decisionmakers, however, find it much more useful when analysts provide estimative intelligence -- assessments about how the relevant external forces are likely to change.

• Note: I am talking about good, experienced decisionmakers here. I do not intend to address concerns regarding bad or stupid decisionmakers in this series of posts, though both clearly exist. These concerns are outside the scope of a discussion about evaluating intelligence and fall more naturally into the realms of management studies or psychology. I have a slightly different take on inexperienced (with intelligence) decisionmakers, however. I teach my students that intelligence professionals have an obligation to teach the decisionmakers they support about what intelligence can and cannot do in the same way the grizzled old platoon sergeant has an obligation to teach the newly minted second lieutenant about the ins and outs of the army.

Obviously then, knowing, with absolute certainty, where the relevant external forces will be and what they will be doing is of primary importance to decisionmakers. Experienced decisionmakers also know that to expect such precision from intelligence is unrealistic. Rather, they expect that the estimates they receive will only reduce their uncertainty about those external forces, allowing them to plan and decide with greater but not absolute clarity.


Imagine, for example, a company commander on a mission to defend a particular piece of terrain. The intelligence officer tells the commander that the enemy has two primary avenues of approach, A and B, and that it is "likely" that the enemy will choose avenue A. How does this intelligence inform the commander's decision about how to defend the objective?

For the sake of argument, let's assume that the company commander interprets the word "likely" as meaning "about 60%". Does this mean that the company commander should allocate about 60% of his forces to defending Avenue A and 40% to defending Avenue B? That is a solution, but there are many, many ways in which such a decision would make no sense at all.

The worst case scenario for the company commander, however, is if he only has enough forces to adequately cover one of the two avenues of approach. In this case, diverting any forces at all will guarantee failure.

Assuming an accurate analytic process and all other things being equal (and I can do that because this is a thought experiment), the commander should align his forces along Avenue A in this worst case situation. This gives him the best chance of stopping the enemy forces. This decisionmaker, with his limited forces, is essentially forced by the situation to treat a 60% probability as 100% accurate for planning purposes. Since many decisions are (or appear to be to the decisionmaker) of this type, it is no wonder that decisionmakers, when they evaluate intelligence, tend to focus on the ability to discriminate between possible outcomes over the ability to calibrate the estimative conclusion.
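A toy expected-value calculation makes the commander's dilemma explicit. The sketch below adds assumptions the text leaves implicit -- the enemy comes down exactly one avenue, and a defense succeeds only if that avenue is fully covered -- so treat the numbers as illustrative rather than doctrinal.

P_AVENUE_A = 0.6   # "likely," read as ~60%
P_AVENUE_B = 0.4

def p_success(coverage_a: float, coverage_b: float) -> float:
    """Probability of stopping the enemy, where 1.0 means an avenue is fully covered
    and (by assumption) anything less than full coverage fails."""
    stops_a = 1.0 if coverage_a >= 1.0 else 0.0
    stops_b = 1.0 if coverage_b >= 1.0 else 0.0
    return P_AVENUE_A * stops_a + P_AVENUE_B * stops_b

# Only enough force to fully cover one avenue:
print(p_success(1.0, 0.0))   # 0.6 -- everything on A
print(p_success(0.0, 1.0))   # 0.4 -- everything on B
print(p_success(0.6, 0.4))   # 0.0 -- splitting 60/40 guarantees failure under these assumptions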

The Iraq WMD Estimate And Other Pre-War Iraq Assessments


Applying all these principles to a real world case is difficult but not impossible – at least in part. Of the many estimates and documents made public regarding the war in Iraq, three seem close enough in time, space, content and authorship to serve as a test case.

Perhaps the most famous document leading up to the war in Iraq is the much-maligned National Intelligence Estimate (NIE) titled Iraq's Continuing Programs for Weapons of Mass Destruction, completed in October 2002 and made public (in part) in April 2004. Subjected to extensive scrutiny by the Commission on the Intelligence Capabilities of the United States Regarding Weapons of Mass Destruction, this NIE was judged "dead wrong" in almost all of its major estimative conclusions (i.e., in the language of this paper, the product was false).

Far less well known are two Intelligence Community Assessments (ICAs), both completed in January 2003. The first, Regional Consequences of Regime Change in Iraq, was made public in April 2007, as was the second ICA, Principal Challenges in Post-Saddam Iraq. Both documents were part of the US Senate Select Committee on Intelligence's report on Pre-War Intelligence Assessments about Post War Iraq and both (heavily redacted) documents are available as appendices to the committee's final report.

The difference between an NIE and an ICA seems modest to an outsider. Both types of documents are produced by the National Intelligence Council and both are coordinated within the US national security intelligence community and, if appropriate, with cleared experts outside the community. The principal differences appear to be the degree of high-level approval (NIEs are approved at a higher level than ICAs) and the intended audiences (NIEs are aimed at high-level policymakers while ICAs are geared more to the desk-analyst policy level).

In this case, there appears to be at least some overlap in the actual drafters of the three documents. Paul Pillar, National Intelligence Officer (NIO) for the Near East and South Asia at the time, was primarily responsible for coordinating (and, presumably, drafting) both of the ICAs. Pillar also assisted Robert D. Walpole, NIO for Strategic and Nuclear Programs, in the preparation of the NIE (along with Lawrence K. Gershwin, NIO for Science and Technology, and Major General John R. Landry, NIO for Conventional Military Issues).

Despite the differences in the purposes of these documents, it is likely safe to say that the fundamental analytic processes -- the tradecraft and evaluative norms -- were largely the same. It is highly unlikely, for example, that standards such as "timeliness" and "objectivity" were maintained in NIEs but abandoned in ICAs.

Why is this important? As discussed in detail earlier in this paper, it is important, in evaluating intelligence, to cast as broad a net as possible -- to look not only at examples where the intelligence product was false but also at cases where the intelligence product was true and, in turn, to examine the process in both cases to determine whether the analysts were good or just lucky, or bad or just unlucky. These three documents, prepared at roughly the same time, under roughly the same conditions, with roughly the same resources, on roughly the same target, allow the accuracy of the estimative conclusions in the documents to be compared with some assurance that doing so may help get at any underlying flaws or successes in the analytic process.

Batting Averages

Despite good reasons to believe that the findings of the Iraq WMD National Intelligence Estimate (NIE) and the two pre-war Intelligence Community Assessments (ICAs) regarding Iraq can be evaluated as a group in order to tease out insights into the quality of the analytic processes used to produce these products, several problems remain before we can determine the "batting average".

• Assumptions vs. Descriptive Intelligence: The NIE drew its estimative conclusions from what the authors believed were the facts based on an analysis of the information collected about Saddam Hussein's WMD programs. Much of this descriptive intelligence (i.e. that information which was not proven but clearly taken as factual for purposes of the estimative parts of the NIE) turned out to be false. The ICAs, however, are largely based on a series of assumptions either explicitly or implicitly articulated in the scope notes to those two documents. This analysis, therefore, will only focus on the estimative conclusions of the three documents and not on the underlying facts.

• Descriptive Intelligence vs. Estimative Intelligence: Good analytic tradecraft has always required analysts to clearly distinguish estimative conclusions from the direct and indirect information that supports those estimative conclusions. The inconsistencies in the estimative language, along with the grammatical structure of some of the findings, make this particularly difficult. For example, the Iraq NIE found: "An array of clandestine reporting reveals that Baghdad has procured covertly the types and quantities of chemicals and equipment sufficient to allow limited CW agent production hidden in Iraq's legitimate chemical industry." Clearly the information gathered suggested that the Iraqis had gathered the chemicals. What is not as clear is whether they were likely using them for limited CW production or whether they merely could use these chemicals for such purposes. A strict constructionist would argue for the latter interpretation whereas the overall context of the Key Judgments would suggest the former. I have elected to focus on the context to determine which statements are estimative in nature. This inserts an element of subjectivity into my analysis and may skew the results.

• Discriminative vs. Calibrative Estimates: The language of the documents uses both discriminative ("Baghdad is reconstituting its nuclear weapons program") and calibrative language ("Saddam probably has stocked at least 100 metric tons ... of CW agents"). Given the seriousness of the situation in the US at that time, the purposes for which these documents were to be used, and the discussion of the decisionmaker’s perspective earlier, I have elected to treat calibrative estimates as discriminative for purposes of evaluation.

• Overly Broad Estimative Conclusions: Overly broad estimates are easy to spot. Typically these statements use highly speculative verbs such as "might" or "could". A good example of such a statement is the claim: "Baghdad's UAVs could threaten Iraq's neighbors, US forces in the Persian Gulf, and if brought close to, or into, the United States, the US homeland." Such alarmism seems silly today but it should have been seen as silly at the time as well. From a theoretical perspective, these types of statements tell the decisionmaker nothing useful (anything "could" happen; everything is "possible"). One option, then, is to mark these statements as meaningless and eliminate them from consideration. To my mind, however, that would only encourage the bad practice, so I intend to count these kinds of statements as false if they turned out to have no basis in fact (and, under the same logic, as true if they turned out to be true, of course).

• Weight of the Estimative Conclusion: Some estimates are clearly more fundamental to a report than others. Conclusions regarding direct threats to US soldiers from regional actors, for example, should trump any minor and indirect consequences regarding regional instability identified in the reports. Engaging in such an exercise might be something appropriate for individuals directly involved in this process and in a better position to evaluate these weights. I, on the other hand, am looking for only the broadest possible patterns (if any) from the data. I have, therefore, decided to weigh all estimative conclusions equally.

• Dealing with Dissent: There were several dissents in the Iraq NIE. While the majority opinion is, in some sense, the final word on the matter, an analytic process that tolerates formal dissent deserves some credit as well. Going simply with the majority opinion does not accomplish this. Likewise, eliminating the dissented opinion from consideration gives too much credit to the process. I have chosen to count those estimative conclusions with dissents as both true and false (for scoring purposes only).
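To make the scoring rules above auditable, here is a small sketch of the tally as I have described it (equal weight per conclusion, dissented conclusions counted in both columns). The sample judgments are placeholders, not the actual coded conclusions from the three documents.

def batting_average(judgments):
    """judgments: list of dicts with keys 'turned_out_true' (bool) and 'dissent' (bool)."""
    true_count = false_count = 0
    for j in judgments:
        if j["dissent"]:
            # Conclusions carrying a formal dissent are counted as both true and false.
            true_count += 1
            false_count += 1
        elif j["turned_out_true"]:
            true_count += 1
        else:
            false_count += 1
    total = true_count + false_count
    return true_count / total, false_count / total

sample = [
    {"turned_out_true": False, "dissent": False},
    {"turned_out_true": True,  "dissent": False},
    {"turned_out_true": False, "dissent": True},   # counted in both columns
]
print(batting_average(sample))  # (0.5, 0.5)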


Clearly, given the caveats and conditions under which I am attempting this analysis, I am looking only for broad patterns of analytic activity. My intent is not to spend hours quibbling about all of the various ways a particular judgment could be interpreted as true or false after the fact. My intent is to merely make the case that evaluating intelligence is difficult but, even with those difficulties firmly in mind, it is possible to go back, after the fact, and, if we look at a broad enough swath of analysis, come to some interesting conclusions about the process.

Within these limits, then, by my count, the Iraq NIE contained 28 (85%) false estimative conclusions and 5 (15%) true ones. This analysis tracks quite well with the WMD Commission's own evaluation that the NIE was incorrect in "almost all of its pre-war judgments about Iraq's weapons of mass destruction." By my count, the Regional Consequences of Regime Change in Iraq ICA fares much better, with 23 (96%) correct estimative conclusions and only one (4%) incorrect one. Finally, the report on the Principal Challenges in Post-Saddam Iraq nets 15 (79%) correct analytic estimates to 4 (21%) incorrect ones. My conclusions are certainly consistent with the tone of the Senate Committee's report.

• It is noteworthy that the Senate Committee did not go to the same pains to compliment analysts on their fairly accurate reporting in the ICAs as the WMD Commission did to pillory the NIE. Likewise, there was no call from Congress to ensure that the process involved in creating the NIE was reconciled with the process used to create the ICAs, no laws proposed to take advantage of this largely accurate work, no restructuring of the US national intelligence community to ensure that the good analytic processes demonstrated in these ICAs would dominate the future of intelligence analysis.

The most interesting number, however, is the combined score for the three documents. Out of the 76 estimative conclusions made in the three reports, 43 (57%) were correct and 33 (43%) incorrect. Is this a good score or a bad score? Such a result is likely much better than mere chance, for example. For each judgment made, there were likely many reasonable hypotheses considered. If there were only three reasonable hypotheses to consider in each case, the base rate would be 33%. On average, the analysts involved were able to nearly double that "batting average".
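The combined score and the base-rate comparison can be checked with a few lines of arithmetic, using only the counts reported above; the one-third base rate is, as stated, an assumption about the typical number of reasonable hypotheses.

# The arithmetic behind the combined score, using only the counts reported above.
counts = {
    "Iraq WMD NIE":              {"true": 5,  "false": 28},
    "Regional Consequences ICA": {"true": 23, "false": 1},
    "Principal Challenges ICA":  {"true": 15, "false": 4},
}

total_true = sum(c["true"] for c in counts.values())
total_false = sum(c["false"] for c in counts.values())
total = total_true + total_false

print(total, total_true / total, total_false / total)   # 76, ~0.57, ~0.43

# If each judgment had only three reasonable hypotheses, chance alone would yield ~33%.
base_rate = 1 / 3
print((total_true / total) / base_rate)                  # ~1.7 -- nearly double the base rate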

Likewise it is consistent with both hard and anecdotal data of historical trends in analytic forecasting. Mike Lyden, in his thesis on Accelerated Analysis, calculated that, historically, US national security intelligence community estimates were correct approximately 2/3 of the time.

Former Director of the CIA, GEN Michael Hayden, made his own estimate of analytic accuracy in May of last year (2008): "Some months ago, I met with a small group of investment bankers and one of them asked me, 'On a scale of 1 to 10, how good is our intelligence today?' I said the first thing to understand is that anything above 7 isn't on our scale. If we're at 8, 9, or 10, we're not in the realm of intelligence -- no one is asking us the questions that can yield such confidence. We only get the hard sliders on the corner of the plate."

Given these standards, 57%, while a bit low by historical measures, certainly seems to be within normal limits and, even more importantly, consistent with what the intelligence community’s senior leadership expects from its analysts.

Final Thoughts

The purpose of this article was not to rationalize away, in a frenzy of legalese, the obvious failings of the Iraq WMD NIE. Under significant time pressure and operating with what the authors admitted was limited information on key questions, they failed to check their assumptions and saw all of the evidence as confirming an existing conceptual framework (While it should be noted that this conceptual framework was shared by virtually everyone else, the authors do not get a free pass on this either. Testing assumptions and understanding the dangers of overly rigid conceptual models is Intel Analysis 101).

On the other hand, if the focus of inquiry is just a bit broader, to include the two ICAs about Iraq completed by at least some of the same people, using many of the same processes, the picture becomes much brighter. When evaluators consider the three documents together, the analysts seem to track pretty well with historical norms and leadership expectations. Like the good weatherman discussed earlier, it is difficult to see how they got it "wrong".

Moreover, the failure by evaluators to look at intelligence successes as well as intelligence failures and to examine them for where the analysts were actually good or bad (vs. where the analysts were merely lucky or unlucky) is a recipe for turmoil. Imagine a football coach who only watched game film when the team lost and ignored lessons from when the team won. This is clearly stupid but it is very close to what happens to the intelligence community every 5 to 10 years. From the Hoover Commission to today, so-called intelligence failures get investigated while intelligence successes get, well, nothing.

The intelligence community, in the past, has done itself no favors when the investigations do inevitably come, however. The lack of clarity and consistency in the estimative language used in these documents, coupled with the lack of an auditable process, made coming to any sort of conclusion about the veracity of product or process far more difficult than it needed to be. While I do not expect that other investigators would come to startlingly different conclusions than mine, I would expect there to be areas where we would disagree -- perhaps strongly -- due to different interpretations of the same language. This is not in the intelligence community's interest as it creates the impression that intelligence is merely an "opinion" or, even worse, that the analysts are "just guessing".

Finally, there appears to be one more lesson to be learned from an examination of these three documents. Beyond the scope of evaluating intelligence, it goes to the heart of what intelligence is and what role it serves in a policy debate.

In the days before the vote to go to war, the Iraq NIE clearly answered the question it had been asked, albeit in a predictable way (so predictable, in fact, that few in Washington bothered to read it). The Iraq ICAs, on the other hand, came out in January 2003, two months before the start of the war. They were generated in response to a request from the Director of Policy Planning at the State Department and were intended, as are all ICAs, for lower level policymakers. These reports quite accurately -- as it turns out -- predicted the tremendous difficulties should the eventual solution (of the several available to the policymakers at the time) to the problem of Saddam Hussein's presumed WMDs be war.

What if all three documents had come out at the same time and had all been NIEs? There does not appear to be, from the record, any reason why they could not have been issued simultaneously. The Senate Committee states on page 2 of its report that there was no special collection involved in the ICAs, that it was "not an issue well-suited to intelligence collection." The report went on to state, "Analysts based their judgments primarily on regional and country expertise, historical evidence and," significantly, in light of this paper, "analytic tradecraft." In short, open sources and sound analytic processes. Time was of the essence, of course, but it is clear from the record that the information necessary to write the reports was already in the analysts' heads.

It is hard to imagine that such a trio of documents would not have significantly altered the debate in Washington. The outcome might still have been war, but the ability of policymakers to dodge their fair share of the blame would have been severely limited. In the end, it is perhaps the best practice for intelligence to answer not only those questions it is asked but also those questions it should have been asked.