

© 2014. For information, contact Deloitte Touche Tohmatsu Limited.

Issue 15 | 2014

Complimentary article reprint

By James Guszcza and Bryan Richardson > Illustration by Ulla Puggaard

Two dogmas of big data

Understanding the power of analytics for predicting human behavior

Deloitte Review


“Society became statistical. A new type of law came into being, analogous to the laws of nature, but pertaining to people. These new laws were expressed in terms of probability.”

—Ian Hacking


PREDICTING THE PRESENT

Roughly ten years ago, The Economist magazine quoted the science fiction author William Gibson as saying, “The future is already here—it's just not very evenly distributed.”1 Gibson’s comment is not a bad description of the varying degrees to which analytics and data-driven decision-making have been adopted in the public and private spheres. Much has been done; much remains to be done.



Today few doubt that, properly planned and executed, data analytic methods enable organizations to make more effective decisions. Anecdotal evidence abounds. The city of New York recently began deploying building inspectors using the indications of a predictive model that flags problematic sites. Before the model was implemented, roughly 13 percent of building inspections resulted in a vacate order. Using the model, this figure rose to 70 percent.2 During the 2012 United States presidential election, the data journalist Nate Silver exemplified with considerable flair the superiority of rigorous data analysis and statistical thinking over unaided expert judgment in forecasting election results.3 Netflix decided to produce its hit series House of Cards, and partially chose the creative team for the series, based on an analysis of fine-grained subscriber viewing patterns.4 This cursory list could be extended for pages.5

Academic research corroborates the abundant anecdotal evidence. For example, Erik Brynjolfsson and his collaborators studied a sample of publicly traded firms. They concluded that the firms in the sample that had adopted a data-driven decision-making approach enjoyed 5–6 percent higher output and productivity than would be expected given their other investments and level of information technology usage.6

This story, itself hardly over a decade old, has lately been complicated by the emergence of “big data” as a dominant theme of discussion. Big data is routinely discussed in transformative terms as a source for innovation. “Data is the new oil,” the saying goes, and it will enable scientific breakthroughs, new business models, and societal transformations. A zeitgeist-capturing book title declares that it is a “revolution that will transform the way we live, work, and think.”7 The Cornell computer scientist Jon Kleinberg judiciously declared, “The term itself is vague, but it is getting at something that is real… big data is a tagline for a process that has the potential to transform everything.”8

While there is little doubt that the topic is important, its newness and the term’s vagueness have led to misconceptions that, if left unchecked, can lead to expensive strategic errors. One major misconception is that big data is necessary for analytics to provide big value. Not only is this false, it obscures the fact that the economic value of analytics projects often has as much to do with the psychology of de-biasing decisions and the sociology of corporate culture change as with the volumes and varieties of data involved.

The second misconception is the epistemological fallacy that more bytes yield more benefits. This is an example of what philosophers call a “category error.” Decisions are not based on raw data; they are based on relevant information. And data volume is at best a rough proxy for the value and relevance of the underlying information.



This essay will tackle each of these points in turn, focusing on applications involving the prediction of human behavior in such contexts as students at university, employees on the job, voters at the polls, shoppers in the store, drivers behind the wheel, physicians in the emergency room, and individuals trying to stick to health and medical regimens. An implication of the first point is that rather than wait for the mastery of big data, it is typically possible, and indeed advisable, to pursue near-term applications of analytics that involve readily available data sources. An implication of the second point is that big data is important in predicting behavior, but perhaps not for the reasons most commonly discussed.

MAKING THE GRADE WITHOUT BIG DATA

An example from the domain of university admissions illustrates how analytics can enable better decisions through more granular and disciplined use of traditional data sources. We recently had the opportunity to work with the University of Toronto, a globally ranked Canadian university, to assess the value of incorporating predictive analytics into the undergraduate admissions process. The specific goal was to build a predictive model capable of distinguishing likely high-achieving students from the rest of the pack. Such a model would enable the university to make offers to students most likely to succeed, earn high marks, and go on to graduate. The data at our disposal contained millions, not billions, of records in a structured form. This is “big data” in the colloquial sense that programming and statistical science, not just spreadsheet analysis, are needed to make sense of it. But it is not “big data” in the more formal “3V” sense of having such high volume, variety, and velocity as to create problems for traditional data processing and analysis technologies.

The potential benefits of this application to students, the university, and society as a whole are apparent. A predictive model provides the admissions officer with a tool that can be used to support making decisions more accurately, consistently, and economically.




Working in close collaboration with the university’s admissions team, we decided early on to build a transparent and easily interpretable predictive model that uses readily available high school transcript information to predict a particular indicator of academic success at university. This planning phase of the project is analogous to an architect discussing with the client the overall vision for a new dwelling being commissioned. Just as form follows function in architecture, the technical specifics of a model (the mathematical form, the input data) are often affected by its intended use.

With these design elements in place, the hard work began. A well-kept secret of analytics is that, even when the data being analyzed are readily available (in this case it was high school transcript data), considerable effort is needed to prepare the data in a form required for the fun part: data exploration and statistical analysis. This process is called “data scrubbing,” connoting the idea that “messy” (raw, transactional, incomplete, or inconsistently formatted) data must be converted into “clean” (rows and columns) data amenable to data analysis.




While it sounds (and indeed can be) tedious, data scrubbing is counterintuitively the project phase where the greatest value is created. Working with detailed high school transcript data, we constructed hundreds of descriptors for each student applicant. This variable creation (or “feature engineering,” in the vernacular) stage is a major point at which domain expertise, the tacit knowledge of experienced data analysts, and creativity can be introduced into the process. Extending the “data is the new oil” metaphor, this is the process of refining the oil into usable form. Notably, this is the aspect of data science that is most difficult to convey in textbooks and university courses.
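As a minimal sketch of what such variable creation can look like, the Python below collapses a handful of raw transcript rows into one row of descriptors per student. The record layout, the sample marks, and the three descriptors are hypothetical illustrations chosen for this sketch; they are not the variables built in the project described here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw transcript rows: (student_id, grade_level, subject, mark out of 100).
transcript_rows = [
    ("s1", 11, "math", 78), ("s1", 11, "english", 85),
    ("s1", 12, "math", 88), ("s1", 12, "english", 84),
    ("s2", 11, "math", 92), ("s2", 11, "english", 70),
    ("s2", 12, "math", 81), ("s2", 12, "english", 72),
]

def build_features(rows):
    """Collapse course-level rows into one dictionary of descriptors per student."""
    by_student = defaultdict(list)
    for student_id, grade_level, subject, mark in rows:
        by_student[student_id].append((grade_level, subject, mark))

    features = {}
    for student_id, records in by_student.items():
        marks = [m for _, _, m in records]
        final_year = max(level for level, _, _ in records)
        final_marks = [m for level, _, m in records if level == final_year]
        earlier_marks = [m for level, _, m in records if level < final_year]
        features[student_id] = {
            "overall_average": mean(marks),
            # Trend: final-year average minus the average of all earlier years.
            "trend": mean(final_marks) - mean(earlier_marks) if earlier_marks else 0.0,
            # Spread between the strongest and weakest course.
            "mark_range": max(marks) - min(marks),
        }
    return features

student_features = build_features(transcript_rows)
```

Two students with similar overall averages can differ sharply on descriptors such as trend or spread, which is the sense in which engineered variables are more granular than a single grade point average.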

At this point, the stage was set for the centerpiece of the project: We used an iterative process, guided in equal measures by statistical science and common sense, to select a predictive model containing a small subset of the hundreds of variables created for consideration. Each of the model variables, somewhat predictive on their own, contributed to a model whose predictive power is greater than the sum of its parts. The model can be viewed as a more granular, and more accurate, alternative to a tried-and-tested predictor: high school grade point average.

Based on an analysis of the model’s predictive accuracy, we estimate that the university can use the model to boost the number of high-achieving students admitted by between 5 and 10 percent. Additional data sources and future projects could be considered to further iterate and improve the model’s predictive accuracy and/or build analogous models to support other types of decisions. But for the purpose of this discussion, the major point is that the university has the means to improve key admissions decisions using a transparent, interpretable model constructed from an uncontroversial data source using common sense, standard statistical methodology, and a dash of inspired creativity.

“IT IS TIME TO DRAW A PRACTICAL CONCLUSION.”

By now there are hundreds of examples, structurally similar to our case study, in which analytics involving the most traditional of data sources outperform traditional modes of decision-making. It is perhaps surprising that, while such examples have appeared in the business press for hardly a decade, they have been known in the academic psychology community for 60 years. Furthermore, they are explained by advances in the behavioral sciences from the past 30 years. And this explanation has nothing to do with big data.9

Consider a few other examples:

• An emergency room physician making a triage decision

• An educational psychologist recommending a student for specialized tutoring



• A human resources manager making a hiring decision

• An insurance underwriter deciding whether to sell insurance to a complex risk

• A political campaign worker evaluating which voters could most likely be persuaded to vote for a particular candidate.

Each case (as well as any number of analogous cases) involves “sorting” or “prioritization” decisions that (a) are central to an organization’s operations (medical triage, student retention, hiring); (b) are made repeatedly, typically by experts relying on professional judgment in varying degrees; and (c) incorporate quantifiable information that is readily available, yet commonly used only in informal or limited ways.

And furthermore, it turns out that in each case a fairly simple predictive scoring equation can be counted on to outperform unaided professional judgment.

The finding in fact dates back to the 1954 publication of the psychologist Paul Meehl’s book Clinical Versus Statistical Prediction. Meehl’s “disturbing little book,” as he later called it, documented 20 studies comparing the predictions of human experts with those of simple models. The types of predictions ranged from how well schizophrenic patients would respond to electroshock to how well prisoners would respond to parole. Meehl concluded that in none of the 20 cases could human experts outperform the models.

If this reminds the reader of Michael Lewis’ Moneyball, it is for a very good reason. Lewis’ book recounted the story of a cash-strapped baseball team that out of necessity began to analyze, and act upon, readily available data sources when making scouting decisions. Because the scouting industry was largely judgment-driven at the time, the market for talent was, literally speaking, inefficient: The “price” (salary) of the “asset” (players) simply did not reflect important publicly available information. Because of this market inefficiency, “better management was able to run circles around taller piles of cash.”10 In a recent Vanity Fair profile of Daniel Kahneman, Lewis reported that while writing his book, he was unaware that Meehl’s findings and subsequent findings in behavioral economics (see the sidebar “Go ask Linda”) explained the market inefficiency he had “stumbled upon.”11

Near the end of his career, surveying the field he initiated three decades earlier, Meehl wrote:

There is no controversy in social science which shows such a large body of quantitatively diverse studies coming out so uniformly in the same direction as this one. When you are pushing over 100 investigations, predicting everything from the outcome of football games to the diagnosis of liver disease, and when you can hardly come up with half a dozen studies showing even a weak tendency in favor of the clinician, it is time to draw a practical conclusion.



It is hard to overstate the importance of Meehl’s “practical conclusion” in an age of cheap computing power and open-source statistical analysis software. Decision-making is central to all aspects of business, public administration, medicine, and education. Meehl’s lesson, routinely echoed in case studies ranging from baseball scouting to evidence-based medicine to university admissions, is that in virtually any domain, statistical analysis can be used to drive better expert decisions. The reason has nothing to do with data volume and everything to do with human psychology.
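To make the idea of a “simple predictive scoring equation” concrete, here is a minimal Python sketch of a logistic score used to prioritize cases. The two input variables and all of the weights are hypothetical, chosen only for illustration; in practice the weights would be estimated from historical outcomes, and nothing here reproduces a model from the studies cited above.

```python
import math

# Hypothetical weights for illustration only; real weights would be
# estimated from historical cases (for example by logistic regression).
WEIGHTS = {"intercept": -2.0, "prior_incidents": 0.8, "years_experience": -0.1}

def score(case):
    """Return a score between 0 and 1 for a single case."""
    z = WEIGHTS["intercept"]
    z += WEIGHTS["prior_incidents"] * case["prior_incidents"]
    z += WEIGHTS["years_experience"] * case["years_experience"]
    return 1.0 / (1.0 + math.exp(-z))

def prioritize(cases):
    """Sort cases from highest to lowest score."""
    return sorted(cases, key=score, reverse=True)

# Hypothetical cases to be sorted for review.
cases = [
    {"id": "a", "prior_incidents": 0, "years_experience": 10},
    {"id": "b", "prior_incidents": 3, "years_experience": 2},
    {"id": "c", "prior_incidents": 1, "years_experience": 5},
]
ranked_ids = [case["id"] for case in prioritize(cases)]
```

The equation applies the same weights to every case, every time. That consistency, rather than any sophistication in the formula, is what the psychological explanation of Meehl’s finding emphasizes.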

GO ASK LINDA: WHY EXPERTS NEED EQUATIONS

In Thinking, Fast and Slow, the Nobel Prize-winning founder of behavioral economics Daniel Kahneman wrote that during his student days, Paul Meehl was one of his heroes.12 So it’s perhaps no coincidence that the subsequent work of Kahneman and his collaborators has done much to clarify both our understanding of Meehl’s “disturbing” findings and the widespread applicability of business analytics.

Kahneman writes of two fictitious mental processes that he calls System 1 (“thinking fast”) and System 2 (“thinking slow”). System 1 mental operations are rapid and automatic; they are biased toward belief and confirmation rather than analysis and skepticism; they tend to jump to conclusions and infer causal relations based on thin, “cognitively available” evidence. They tend to neglect the importance of evidence that is neither emotionally vivid nor in plain sight. In contrast, System 2 mental operations are slow and deliberate, and they seek logical coherence rather than “narrative” or “associative” coherence.

The bulk of our mental operations are System 1 in nature. And the rub is that System 1 thinking turns out to be terrible at statistics. Without time, effort, and either tools or special training, the human mind will reliably make novice statistical errors. Surprisingly, this often applies to trained mathematicians and laypeople alike.

So far are we from being natural statistical thinkers that Kahneman calls the human mind “a machine for jumping to conclusions.” This central theme of behavioral economics is famously illustrated with “the Linda story.” A fictional character named Linda is described as a highly intelligent political activist. Now don’t think, blink: Is it more likely that Linda is a bank teller, or a feminist who happens to work as a bank teller? Most people answer the latter even though a moment’s thought reveals that this can’t possibly be right.13 Narrative coherence trumps logical coherence in a surprising way.

Predictive models, while fast in a literal sense, are “slow thinkers” par excellence. They can accurately weigh together 5, 50, or 5,000 pieces of information with equal ease; they never suffer from low blood sugar; and they are immune to cognitive biases and narrative fallacies. Perhaps in hindsight, Paul Meehl’s disturbing finding isn’t so surprising after all.
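The “moment’s thought” behind the Linda story is the conjunction rule of probability. Writing T for “Linda is a bank teller” and F for “Linda is a feminist” (labels introduced here purely for illustration):

\[
P(T \cap F) \;=\; P(T)\,P(F \mid T) \;\le\; P(T), \qquad \text{since } 0 \le P(F \mid T) \le 1 .
\]

However well Linda’s description fits the feminist stereotype, every feminist bank teller is also a bank teller, so the combined description can never be the more probable one.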



B IS FOR “BEHAVIORAL”: THE PROMISE OF BIG DATA

Hopefully the view of business analytics outlined so far lets a bit of air out of the big data bubble. A timely implication of the decades-old work of Paul Meehl, Daniel Kahneman, and their followers is that analytics projects need not be predicated on big data (in the “3V” sense of the term) to yield economic value and even transform industries. Even in cases where only traditional data sources are brought to the table, predictive models and analytically derived business decision rules provide value by warding off inefficient or biased decisions. In such applications, models have a prosthetic character: They serve as “eyeglasses” for myopic human minds.

But none of this bursts the big data bubble entirely. Once again the realm of “people analytics” applied to professional sports provides a bellwether example. Sports analytics has rapidly evolved in the decade since Moneyball appeared. For example, the National Basketball Association employs player tracking software that feeds real-time data into proprietary software so that the data can be analyzed to assess player and team performance.14 Returning to William Gibson’s image, professional sports analytics is a domain where “the future is already here.”

Given the time and expense involved in gathering and using big data, it pays to ask when, why, and how big data yields commensurately big value. Discussions of the issue typically focus on various aspects of size or the questionable premise that big data means analyzing entire populations (“N=all” as one slogan has it), rather than mere samples. In reality, data volume, variety, and velocity are but a few of many considerations. The paramount issue is gathering the right data: the data that carries the most useful information for the problem at hand.

In the context of predicting or analyzing human behavior, the relevant aspect is the behavioral content of emerging data sources. Anyone who has worked with large volumes of behavioral data knows that past behavior often does predict future behavior, and often in surprising ways. For example, personal credit information not only predicts who is likely to default on a loan; it is also strongly predictive of who is more or less likely to experience an auto accident. Marketing and lifestyle data can be used to predict not only future purchase behavior but also the presence of such lifestyle diseases as diabetes and hypertension.

The computational social scientist Alex “Sandy” Pentland forcefully articulates this point:

I believe that the power of big data is that it is information about people's behavior instead of information about their beliefs. It's about the behavior of customers, employees, and prospects for your new business. It's not about the things you post on Facebook, and it's not about your searches on Google, which is what most people think about, and it's not data from internal company processes and RFIDs. This sort of big data comes from things like location data off of your cell phone or credit card: It's the little data breadcrumbs that you leave behind you as you move around in the world.

What those breadcrumbs tell is the story of your life... Who you actually are is determined by where you spend time, and which things you buy. Big data is increasingly about real behavior, and by analyzing this sort of data, scientists can tell an enormous amount about you. They can tell whether you are the sort of person who will pay back loans. They can tell you if you're likely to get diabetes.15

A recent study conducted at the University of Cambridge Psychometrics Centre dramatically illustrates the power of such “digital breadcrumbs.” The researchers focused on the social network “likes” (positive attitudes about various pieces of online content) of a sample of 58,000 users. They found that using only this information, they were able to predict ethnic origin with 95 percent accuracy; male sexual orientation with 88 percent accuracy; political leanings (Democrat or Republican) with 85 percent accuracy; religion (Christian or Muslim) with 82 percent accuracy, and so on. The researchers also found weaker, but still significant, correlations between this information and such latent psychological traits as intelligence, openness, extraversion, and emotional stability. For example, the researchers found that the information gleaned from social network “likes” is nearly as informative as a personality test score measuring an individual’s openness to change.16

Such results raise major ethical and privacy issues that are far from being resolved. Many organizations will want to avoid such sources of data altogether. Still, such data are already changing business and societal landscapes. Furthermore, it is useful to consider whether or how such information could be used in innovative and societally useful, as opposed to invasive, ways.17

Human resources is one promising domain for behavioral data-inspired innovation. Important aspects of the value that individuals bring to their organizations (healthy habits of informal group engagement, communication, and team participation) are currently measured inconsistently or approximately at best. Because of this, their contributions to organizational success are often understood only murkily and rewarded inconsistently. For example, the leaders of Google’s “Project Oxygen” leadership analytics study were surprised to find that technical ability ranked least important on the list of eight attributes they found characteristic of effective managers.18 Less quantifiable attributes such as being results-oriented and caring for the career development of team members were found to be more important than the technical abilities initially assumed to be most important for technical managers. Such findings call to mind Woody Allen’s quip, “80 percent of success is showing up.”

Pentland’s own work illustrates the emerging possibilities for creating digital proxies of traditionally unquantifiable traits. Pentland has developed a device known as the “sociometer,” a wearable electronic badge that captures second-by-second information about people’s tones of voice, body language, and communication patterns. He calls the sort of “digital breadcrumbs” collected by such devices “honest signals” because, unlike survey responses or social media posts, they are not consciously edited. Sociometric data captures aspects of non-verbal communication and social network relationships that can be surprisingly predictive.

For example, sociometric data are predictive of dating behavior and the outcomes of job interviews and salary negotiations.19 They also shine a light on the dynamics of effective teams. Pentland reports being able to predict which team will win a business plan contest using only sociometric data captured about the interactions of the team members at a cocktail reception. Analysis of sociometric data suggests the recipe for winning teams’ success: Successful teams are characterized by people talking and listening in equal measure, emanating helpful body language, speaking directly with one another rather than through a domineering leader, and so on.20

Indeed, it turns out that there is a measurable concept of the “collective intelligence” of groups, highly analogous to individual IQ, which can be partially characterized through the use of sociometric data. Anita Woolley of Carnegie Mellon University and her collaborators constructed a measure of collective intelligence and found that it is roughly as predictive of group performance as IQ is of individual performance. Surprisingly, collective intelligence is not explained by factors such as group satisfaction, cohesion, or motivation. Instead, the strongest predictors of collective intelligence (and of group success) are equality of conversational turn-taking (measured using sociometric data) as well as the ability of the group’s members to read social signals (measured using more traditional psychometric data).21 It is likely that these traits contribute to group performance by enabling better flows of ideas.22

Because they have traditionally been hard to measure quantitatively, behavioral traits such as openness and social intelligence are often viewed as ephemeral or unreliable. However, the steadily increasing availability of computational social science tools and methods suggests the practical possibility of harnessing behavioral data to create more effective teams and systematically reward beneficial behaviors and personality traits that are currently recognized only sporadically.

It is, to say the least, unclear whether real-time monitoring devices will gain widespread acceptance in the business world or society at large. Still, between the scenario of equipping all employees with sociometric devices and the opposite extreme of basing human resource decisions on limited, judgmentally interpreted data, many possibilities can be explored. For example, Salesforce.com is taking steps to hire and cultivate employees based partly on their social intelligence. The data they gather to measure this psychological trait are gleaned from such methods as team-structured workshop-style interview days, personality modeling exercises, and participation on the organization’s online social collaboration tool. The data are gathered transparently and shared with job candidates.23 Similarly, the measure of social sensitivity that Woolley and her collaborators used (alongside sociometric data) to predict collective intelligence is a psychometric test that can be voluntarily taken online in approximately 10 minutes.24

COUNTING WHAT COUNTS

We have told a two-part story to counter the two dogmas of big data. The first half of the story is the more straightforward: In domains ranging from the admissions office to the emergency room to the baseball diamond, measurably improved decisions will more often than not result from a disciplined, analytically driven use of uncontroversial, currently available data sources. While more data often enables better predictions, it is not necessary for organizations to master “big data” in order to realize near-term economic benefits. Behavioral science teaches us that this has as much to do with the idiosyncrasies of human cognition as with the power of data and statistics.

The second half of the story relates to emerging sources of behavioral data, and therefore has an ending yet to be written. Clearly the capture and use of data emanating from and pertaining to people’s behaviors is rife with ethical issues that must be faced. At the same time, if the required privacy safeguards can be established, one can envision opt-in uses of behavioral data that serve everyone’s interests.

For example, data from massive open online courses (MOOCs) can be used to design better courses of study. Richer behavioral data sources can be used in human resources contexts to select and reward such skills as teamwork and social intelligence in addition to more readily measured technical abilities. Behavioral data could be used to quantify physician “bedside manner” and to improve patient satisfaction and reduce the frequency of malpractice claims. Telematics data from automobiles can be used to help older drivers stay behind the wheel longer; supermarket club-card data can be used to provide early warnings of lifestyle disease risks; and self-tracking data can be used to help people maintain their health.

A sign hanging on Albert Einstein’s door in Princeton’s Institute for Advanced Study read, “Not everything that can be counted counts, and not everything that counts can be counted.” While Einstein’s motto is timeless, emerging behavioral data sources and computational social sciences methods are expanding the domain of what we can count.

James Guszcza is a senior fellow of the Singapore Deloitte Analytics Institute and the US national predictive analytics lead for Deloitte Consulting LLP’s Actuarial, Risk, and Advanced Analytics practice.

Bryan Richardson is a manager with Deloitte Canada and a predictive modeling lead for the Advanced Analytics practice in Canada.

Endnotes

1. The Economist, December 4, 2003.

2. Viktor Mayer-Schönberger and Kenneth Cukier, “Big data in the Big Apple,” Slate, March 2013, <http://www.slate.com/articles/technology/future_tense/2013/03/big_data_excerpt_how_mike_flowers_revolutionized_new_york_s_building_inspections.html>. For a fuller discussion, with many other examples, see Mayer-Schönberger and Cukier, Big Data: A Revolution that Will Transform How We Live, Work, and Think, Houghton Mifflin Harcourt.

3. In an election-day blog, New York Times columnist Timothy Egan memorably characterized judgment-driven political forecasting, writing, “In the last days of the election, Peggy Noonan had a ‘feel’ that things were moving Mitt Romney’s way. George Will was more cerebral: His brain told him it would be Romney in a rout. And Michael Barone, who used to have a good divining rod to go along with an encyclopedic knowledge for all numbers political, also predicted a Romney landslide. What they had in common, aside from putting up a brick Tuesday that completely missed the electoral net, was a last-hurrah push for the old-fashioned prediction by gut,” <http://campaignstops.blogs.nytimes.com/2012/11/06/e-day/#Beeson>.

4. David Carr, “Giving viewers what they want,” The New York Times, February 24, 2013, <http://www.nytimes.com/2013/02/25/business/media/for-house-of-cards-using-big-data-to-guarantee-its-popularity.html?pagewanted=all>.

5. Our own consulting work has given us many similar examples. For example, we have had the occasion to build models to predict which drivers are likely to experience automobile accidents, which individuals are at highest risk of contracting such lifestyle diseases as diabetes and hypertension, which physicians are more likely to be sued for malpractice, which industrial sites pose the greatest worker safety risks, and which non-custodial divorced parents are most likely to go into arrears on their child support payments.

6. Erik Brynjolfsson, Lorin Hitt, and Heekyung Kim, Strength in numbers: How does data-driven decision making affect firm performance?, April 2011, <http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1819486>.

7. Mayer-Schönberger and Cukier, Big Data: A Revolution that Will Transform How We Live, Work, and Think.

8. Kleinberg quoted in “How big data became so big,” The New York Times, August 11, 2012.

9. In the official sense of the term, involving petabyte-class data that strains the limits of traditional database technology. “Big data” is also used more informally in the sense of “data that is too rich and complex to be analyzed in spreadsheets without concepts from university-level statistics.” This sense of “big data” is analogous to the colloquial use of the term “rocket scientist” to describe anyone who works with advanced quantitative methods. We have no qualms with the colloquial use of “big data,” only with discussions that conflate the two meanings, leading to the false assumption that expensive IT platforms are necessary for most garden-variety analytics projects. See James Guszcza, David Schweidel, and Shantanu Dutta, “The personalized and the personal,” Deloitte Review issue 14, January 2014, <http://dupress.com/articles/dr14-personalized-and-personal/> and James Guszcza et al., “Too big to ignore,” Deloitte Review issue 12, 2013, <http://deloitte.wsj.com/cfo/files/2013/09/TooBigIgnore.pdf>.



10. Michael Lewis, Moneyball: The Art of Winning an Unfair Game (London, New York: W.W. Norton and Company, 2003).

11. In “The king of human error,” Vanity Fair, December 2011, Lewis states that the findings of Paul Meehl, Daniel Kahneman, and their collaborators explain the mystery he “stumbled upon.” He reports that he became aware of the psychological and behavioral economic implications of his story only after he read the review of Moneyball by the behavioral economists Richard Thaler and Cass Sunstein in The New Republic. Our previous essay, “Beyond the numbers,” discusses Moneyball as well as the Thaler/Sunstein review.

12. Daniel Kahneman, “Intuitions versus formulas,” Thinking, Fast and Slow (New York: Farrar, Straus and Giroux, 2011).

13. The set of all feminist bank tellers is a subpopulation of the set of all bank tellers.

14. See, for example, John Schuhmann, “SportVU adds to the conversation,” Hang Time Blog, September 13, 2013, <http://hangtime.blogs.nba.com/2013/09/16/sportvu-adds-to-the-conversation/>.

15. Edge, “Reinventing society in the wake of big data,” a conversation with Alex (Sandy) Pentland, <http://www.edge.org/conversation/reinventing-society-in-the-wake-of-big-data>.

16. Michal Kosinski, David Stillwell, and Thore Graepel, “Private traits and attributes are predictable from digital records of human behavior,” Proceedings of the National Academy of Sciences.

17. Guszcza, Schweidel, and Dutta, “The personalized and the personal.”

18. Adam Bryant, “Google’s quest to build a better boss,” The New York Times, March 12, 2011, <http://www.nytimes.com/2011/03/13/business/13hire.html?pagewanted=all&_r=0>.

19. Alex “Sandy” Pentland, “Understanding ‘honest signals’ in business,” MIT Sloan Management Review, Fall 2008, pp. 70–75.

20. Alex “Sandy” Pentland, “The new science of building great teams,” Harvard Business Review, April 2012.

21. Anita Williams Woolley, Christopher F. Chabris, Alex Pentland, et al., “Evidence of a collective intelligence factor in the performance of human groups,” Science, October 2010, <http://www.chabris.com/Woolley2010a.pdf>. Woolley and her collaborators denote their measure of collective intelligence c, by analogy with the standard measure of human intelligence g, formulated over a century ago by Charles Spearman. Spearman’s methods are now standard in psychometrics and some areas of modern industrial data science. And indeed Woolley and her collaborators use Spearman’s method of factor analysis to derive c.

22. Alex Pentland, “Chapter 5,” Social Physics: How Good Ideas Spread—The Lessons from a New Science (Penguin, 2014).

23. John Henry and Peter MacLean, “Courting the candidate-customer: The unlikely art of attraction,” Deloitte Review, Issue 13, 2013, <http://dupress.com/articles/courting-the-candidate-customer/>.

24. This is the “Mind in the Eyes” test developed by the University of Cambridge psychometrician Simon Baron-Cohen. The test measures an individual’s ability to understand others’ emotional states by showing a series of photographs of faces cropped around the eyes. Each image is accompanied by a multiple-choice question asking the emotional state of the person pictured. Baron-Cohen developed his test in 2001 as part of a research project on autism, well before the work of Woolley and her collaborators on collective intelligence. The research suggests that Baron-Cohen’s measure of social intelligence is substantially more important to team success than either the average or the maximum IQ of the team members. Interested readers can test themselves by going to <http://kgajos.eecs.harvard.edu/mite/>. In passing, we can’t help mentioning the irony that Simon Baron-Cohen is the cousin of Sacha Baron-Cohen, a master of confounding the perceptions of even the most socially aware among us.
