cctst - diversity and team science...30 countries and the winner is… ensemble 10.06% bellkor...

Diversity and Team Science

Scott E. Page

Santa Fe Institute

University of Michigan

Outline

A Great Big Complex World

Diversity and Prediction

Diversity Prediction Theorem

Model Diversity Theorem

Categorical Diversity Theorem

Big Science

A Great Big (Complex) World

http://library.sc.edu

http://crissp.eu

Authors Per Paper: Computer Science

Identity and Cognitive Diversity


Courtesy: Jacob Foster

“Few if any funding programs support research on the effectiveness of science teams and larger groups.”

Conclusions 1&2

#1: Team composition matters.

Diversity is critical

#2: Team composition systematic

Prediction

“All Models Are Wrong”

- George Box

“Hence, our truth is the intersection of independent lies.”

- Richard Levins

1197

Average = 1,197

Diversity Prediction Theorem


Crowd Error = Average Error - Diversity


0.6 = 2,956.0 – 2,955.4
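The identity above can be checked numerically. The following is a minimal sketch, assuming that crowd error is the squared error of the mean prediction, average error is the mean squared error of the individual predictions, and diversity is the variance of the predictions around their mean. The function name and the guesses are hypothetical, not data from the talk.

```python
def diversity_prediction(predictions, truth):
    """Return (crowd_error, average_error, diversity) in squared-error units."""
    n = len(predictions)
    crowd = sum(predictions) / n  # the crowd's prediction is the simple mean
    crowd_error = (crowd - truth) ** 2
    average_error = sum((p - truth) ** 2 for p in predictions) / n
    diversity = sum((p - crowd) ** 2 for p in predictions) / n
    return crowd_error, average_error, diversity


# Hypothetical guesses for a quantity whose true value is 103.
guesses = [90.0, 60.0, 45.0, 85.0]
crowd_error, average_error, diversity = diversity_prediction(guesses, truth=103.0)

# Crowd Error = Average Error - Diversity holds for any set of guesses.
print(crowd_error, average_error - diversity)
```

Because diversity is never negative, the crowd's squared error can never exceed the average individual squared error.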

Diversity Prediction Theorem

Crowd Error = Average Error – Diversity

Number of Finnish Athletes
Actual Number: 103
Class Prediction: 70.04
1,086 = 3,456 – 2,370

Number of Finnish Athletes
Actual Number: 103
Class Prediction: 65.6
1,401 = 7,039 – 5,638

Latvian Cars Per 1,000
Actual Number: 319
Class Prediction: 318.58
0.1 = 202,323.6 – 202,323.5

Latvian Cars Per 1,000
Actual Number: 319
Class Prediction: 369
2,542 = 89,743 – 87,201

Christina Romer

N Models

Distribution across models:

$(P_1, P_2, P_3, \ldots, P_n)$

$P_j$ = probability someone uses model $j$

F = probability two individuals have the same model

Probability two people use the same model:

$\mathrm{Match}(P) = P_1^2 + P_2^2 + \cdots + P_n^2$

Diversity Index: $\Delta = 1/\mathrm{Match}(P)$

Δ = effective number of parties (political science), firms (economics), or species (ecology)

Diversity Index

$\Delta = (P_1^2 + P_2^2 + P_3^2)^{-1}$

$(1/3)^2 + (1/3)^2 + (1/3)^2 = 3(1/9) = 1/3$

$\Delta = 3$

$(1/2)^2 + (1/4)^2 + (1/4)^2 = 6/16$

$\Delta = 16/6 \approx 2.67$
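The two worked examples can be reproduced in a few lines. This is a minimal sketch; the function names are hypothetical and the only assumption is that the shares form a probability distribution.

```python
def match_probability(shares):
    """Probability that two independently drawn individuals use the same model."""
    return sum(p * p for p in shares)


def diversity_index(shares):
    """Diversity index: the reciprocal of the match probability."""
    if abs(sum(shares) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return 1.0 / match_probability(shares)


even_split = diversity_index([1 / 3, 1 / 3, 1 / 3])  # three equally used models
uneven_split = diversity_index([1 / 2, 1 / 4, 1 / 4])  # one dominant model
print(even_split, uneven_split)
```

With three models in both cases, the uneven split behaves like fewer than three effective models, which is why the index rather than the raw count is the relevant measure.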

Claim: N independent predictive models with a diversity index of Δ and an average variance of V have an expected squared error equal to V/Δ.

Economo, Hong, and Page (2015)
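One way to see where V/Δ comes from is the short derivation below. It is a sketch under assumptions that the slide does not state: the crowd prediction is the $P$-weighted average of the model predictions, each model's error $\varepsilon_j$ is unbiased and independent of the others, and every model has the same error variance $V$ (a simplification of "average variance"). The symbols $y$, $\hat{y}_j$ and $\varepsilon_j$ are introduced here for illustration only.

```latex
\begin{align*}
\hat{y} &= \sum_{j=1}^{n} P_j \hat{y}_j,
  \qquad \hat{y}_j = y + \varepsilon_j,
  \qquad \mathbb{E}[\varepsilon_j] = 0,
  \qquad \mathrm{Var}(\varepsilon_j) = V \\
\mathbb{E}\big[(\hat{y} - y)^2\big]
  &= \mathrm{Var}\Big(\sum_{j=1}^{n} P_j \varepsilon_j\Big)
   = \sum_{j=1}^{n} P_j^2 \, V
   = V \cdot \mathrm{Match}(P)
   = \frac{V}{\Delta}
\end{align*}
```

The second line uses independence, so the variance of the weighted sum is the sum of the weighted variances with no covariance terms.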


Size DIVERSITY (Δ) Matters

Gonçalo Abecasis

Partition the set of possible instances into categories

Make a prediction for each category

Computer Science

PAC Learning: Valiant

Robust Classification: Provost and Fawcett

Ensemble Learning: Dasarathy and Sheela

Categorical Predictive Models

Partition the set of possible instances into categories

Make a prediction for each category

Chloride in Water (mg/L)

Sample:  A    B    C    D
Value:   300  200  100  200

Category predictions: 240 (category 1: A, B), 170 (category 2: C, D)

Variation in Data: 20,000

$20{,}000 = (200-200)^2 + (300-200)^2 + (100-200)^2 + (200-200)^2$

Residual Variation: 11,000

$11{,}000 = (200-240)^2 + (300-240)^2 + (100-170)^2 + (200-170)^2$

$R^2 = 1 - 11{,}000/20{,}000 = 0.45$

Total Variation (20,000)



Categorization Loss: 10,000

Mean of Category 1: 250

Categorization Loss 1: 5,000

$5{,}000 = (200-250)^2 + (300-250)^2$

Mean of Category 2: 150

Categorization Loss 2: 5,000

$5{,}000 = (200-150)^2 + (100-150)^2$

Categorization Error

Possible Variation Explained: 10,000

Categorization Loss: 10,000


Prediction Error: 1,000 = 200 + 800

Prediction for Category 1: 240

Actual Value (category mean): 250

Prediction Error: $200 = (250-240)^2 + (250-240)^2$

Prediction for Category 2: 170

Actual Value (category mean): 150

Prediction Error: $800 = (170-150)^2 + (170-150)^2$

Categorization Error

Variation Explained: 9,000

Categorization Loss: 10,000

Prediction Error: 1,000


Variation = Explained + Category Loss + Predictive Error
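The decomposition can be reproduced from the chloride example. The sketch below assumes that category 1 holds samples A and B and category 2 holds samples C and D, which is inferred from the category means of 250 and 150. The variable names are hypothetical.

```python
values = {"A": 300, "B": 200, "C": 100, "D": 200}  # chloride, mg/L
categories = {1: ["A", "B"], 2: ["C", "D"]}        # assumed grouping
predictions = {1: 240, 2: 170}                     # one prediction per category


def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)


overall_mean = mean(values.values())
total_variation = sum((v - overall_mean) ** 2 for v in values.values())

categorization_loss = 0.0  # spread of the data around each category's own mean
prediction_error = 0.0     # gap between each prediction and its category mean
residual_variation = 0.0   # spread of the data around the predictions
for cat, members in categories.items():
    cat_mean = mean(values[m] for m in members)
    categorization_loss += sum((values[m] - cat_mean) ** 2 for m in members)
    prediction_error += len(members) * (predictions[cat] - cat_mean) ** 2
    residual_variation += sum((values[m] - predictions[cat]) ** 2 for m in members)

explained = total_variation - categorization_loss - prediction_error
r_squared = 1 - residual_variation / total_variation
print(total_variation, explained, categorization_loss, prediction_error, r_squared)
```

Residual variation equals categorization loss plus prediction error, so the model could explain at most 10,000 of the 20,000 even with perfect per-category predictions.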

Value of Distinct Categories

Distinct categories result in diverse categorization losses and, by the Diversity Prediction Theorem, lower error.

Six years of data

Half a million users

17,700 movies

Data divided into (training, testing)

Testing data divided into (probe, quiz, test)

Singular Value Decomposition

Each movie represented by a vector: $(p_1, p_2, p_3, p_4, \ldots, p_n)$

Each person represented by a vector: $(q_1, q_2, q_3, q_4, \ldots, q_n)$
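In factorization models of this kind, a predicted rating is commonly taken to be the inner product of the movie vector and the person vector. The slides do not spell this step out, so the sketch below is an assumption about how the two vectors are combined; the function name and the three-dimensional vectors are hypothetical.

```python
def predict_rating(movie_vector, person_vector):
    """Inner product of a movie's factor vector and a person's factor vector."""
    if len(movie_vector) != len(person_vector):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(p * q for p, q in zip(movie_vector, person_vector))


# Hypothetical factors: how strongly the movie loads on each dimension,
# and how much this person values each dimension.
movie = [0.9, 0.1, 0.5]
person = [4.0, 1.0, 0.0]
rating = predict_rating(movie, person)
print(rating)
```

Models that differ in the number of dimensions or in how the vectors are fitted make different errors, which is what gives a combination of them room to improve on the best single model.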

Christina and David Rom

Robert Bell

BellKor

50 dimensions

107 models

Best Model: 6.8%


Combination of Models: 8.4%

BellKor’s Pragmatic Chaos

Best Model: 8.4%

Ensemble: 10.1%

Enter “The Ensemble”

23 Teams

30 Countries

And the Winner is…

Ensemble 10.06%

BellKor 10.06%

But, the Real Winner is…

Ensemble 10.06%

BellKor 10.06%

50-50 Combination 10.19%

Combine accurate and diverse models to make good predictions.
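The effect of a 50-50 combination can be illustrated with two equally accurate models whose errors point in different directions. This is a minimal sketch with hypothetical ratings and predictions, not contest data, and it uses root mean squared error (RMSE) as the accuracy measure.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


actual = [4.0, 3.0, 5.0, 2.0]   # hypothetical true ratings
model_a = [4.5, 3.5, 4.5, 2.5]  # off by 0.5 on every rating
model_b = [3.5, 3.5, 5.5, 1.5]  # also off by 0.5, but mostly in the opposite direction
blend = [(a + b) / 2 for a, b in zip(model_a, model_b)]  # 50-50 combination

rmse_a = rmse(model_a, actual)
rmse_b = rmse(model_b, actual)
rmse_blend = rmse(blend, actual)
print(rmse_a, rmse_b, rmse_blend)
```

The blend beats both inputs here because their errors largely cancel. If the two models made identical errors, the blend would gain nothing, which is the Diversity Prediction Theorem applied to two predictors.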

Big Science

Leaving Our Silos

Medicine Sociology

Chemistry Economics

Freeman and Huang

Variable        Citations   Impact Factor
# Addresses     +           +
# References    +           +
# Past papers   +           +
Homophily       -           -

[Figure: Papers > 100 Cites, as a percentage of papers (axis 0.00% to 0.14%), Team vs. Solo, for Science & Engineering and Social Science]

Jones B, Wuchty S, Uzzi B (2008) Multi-University Research Teams: Shifting Impact, Geography, and Stratification in Science. Science 322: 1259

Inter-University Collaboration Increases Impact

Cummings, J. N., Kiesler, S., Zadeh, R., & Balakrishnan, A. (2013). Group heterogeneity increases the risks of large group size: A longitudinal study of productivity in research groups. Psychological Science, 24(6), 880-890.

Best papers have low proximity

Best patents have low proximity

Atypical Connections

Variable           Odds Ratio
Years since PhD    1.14
Prior Cites        2.25
Author Count       0.8
Depth (HHI)        3.29
Atypical Connect   15.17

Schilling, M. A., & Green, E. (2011). Recombinant search and breakthrough idea generation: An analysis of high impact papers in the social sciences. Research Policy, 40(10), 1321-1331.

Melissa Schilling

Q?
