TRANSCRIPT
-
Diversity and Team Science
Scott E Page Santa Fe Institute
University of Michigan
-
Outline
-
A Great Big Complex World
Diversity and Prediction
Diversity Prediction Theorem
Model Diversity Theorem
Categorical Diversity Theorem
Big Science
-
A Great Big (Complex) World
-
http://library.sc.edu
-
http://crissp.eu
-
Authors Per Paper: Computer Science
-
Identity and Cognitive Diversity
-
Authors Per Paper: Computer Science
Courtesy: Jacob Foster
-
“Few if any funding programs support research on the effectiveness of science teams and larger groups.”
-
Conclusions 1&2
#1: Team composition matters.
Diversity is critical
#2: Team composition systematic
-
Prediction
-
“All Models Are Wrong”
- George Box
-
“Hence, our truth is the intersection of independent lies.”
- Richard Levins
-
1197
-
1197
Average = 1,197
-
Diversity Prediction Theorem
-
Diversity Prediction Theorem
Crowd Error = Average Error - Diversity
-
Crowd Error = Average Error - Diversity
-
Crowd Error = Average Error – Diversity
0.6 = 2,956.0 – 2,955.4
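The identity Crowd Error = Average Error – Diversity can be checked numerically. The sketch below assumes the crowd's prediction is the simple mean of the individual guesses and that all three terms are squared errors; the guesses and the true value are hypothetical, not data from the talk.

```python
def diversity_prediction(predictions, truth):
    """Return (crowd_error, average_error, diversity) for a list of guesses."""
    n = len(predictions)
    crowd = sum(predictions) / n  # crowd prediction = mean of the guesses
    crowd_error = (crowd - truth) ** 2
    # Average squared error of the individual guesses.
    average_error = sum((p - truth) ** 2 for p in predictions) / n
    # Diversity = average squared distance of each guess from the crowd.
    diversity = sum((p - crowd) ** 2 for p in predictions) / n
    return crowd_error, average_error, diversity


# Hypothetical guesses and true value, chosen only for illustration.
guesses = [90, 120, 105, 97]
crowd_error, average_error, diversity = diversity_prediction(guesses, truth=100)
print(crowd_error, average_error - diversity)  # both values are 9.0
```

Because the identity is algebraic, it holds for any set of guesses: the crowd can never do worse than the average individual.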
-
Actual Number 103
Class Prediction 70.04
Diversity Prediction Theorem Crowd = Average – Diversity
1086 = 3456 – 2370
Number of Finnish Athletes
-
Actual Number 103
Class Prediction 65.6
Diversity Prediction Theorem Crowd = Average – Diversity
1401 = 7039 – 5638
Number of Finnish Athletes
-
Actual Number 319
Class Prediction 318.58
Diversity Prediction Theorem Crowd = Average – Diversity
0.1 = 202,323.6 – 202,323.5
Latvian Cars Per 1000
-
Actual Number 319
Class Prediction 369
Diversity Prediction Theorem Crowd = Average – Diversity
2542 = 89,743 – 87,201
Latvian Cars Per 1000
-
Christina Romer
-
Model Diversity Theorem
-
N Models
Distribution across models:
(P1, P2, P3 … Pn)
Pj = probability someone uses model j
F = probability two individuals have the same model
-
Probability two people use the same model.
$\mathrm{Match}(P) = P_1^2 + P_2^2 + \dots + P_n^2$
-
Diversity Index: Δ = 1/Match(P)
Δ = Effective number of parties (political science)
firms (economics)
species (ecology)
-
Diversity Index
$\Delta = (p_1^2 + p_2^2 + p_3^2)^{-1}$
$(1/3)^2 + (1/3)^2 + (1/3)^2 = 3(1/9) = 1/3$
$\Delta = 3$
$(1/2)^2 + (1/4)^2 + (1/4)^2 = 6/16$
$\Delta = 16/6 \approx 2.67$
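The two worked cases on this slide can be reproduced in a few lines. This is a minimal sketch; the function names `match_probability` and `diversity_index` are illustrative labels, not names from the talk.

```python
def match_probability(shares):
    """Probability that two randomly drawn people use the same model."""
    return sum(p * p for p in shares)


def diversity_index(shares):
    """Effective number of models: the inverse of the match probability."""
    return 1.0 / match_probability(shares)


# Three equally used models: effective number is about 3.
print(round(diversity_index([1/3, 1/3, 1/3]), 2))
# One dominant model and two minor ones: effective number is about 2.67.
print(round(diversity_index([1/2, 1/4, 1/4]), 2))
```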
-
Claim: N independent predictive models with a diversity index of Δ and an average variance of V have an expected squared error equal to:
V/Δ
Economo, Hong, and Page (2015)
Model Diversity Theorem
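One way to see where V/Δ comes from is a small simulation. The sketch below rests on assumptions that go beyond the slide: every model is unbiased with the same error variance V, model errors are independent and normally distributed, and the crowd prediction weights each model by the share of people using it. Under those assumptions the crowd's expected squared error is V(P1² + … + Pn²) = V/Δ.

```python
import random


def simulated_crowd_error(shares, variance, trials=20000, seed=0):
    """Estimate the crowd's mean squared error by Monte Carlo."""
    rng = random.Random(seed)
    sigma = variance ** 0.5
    total = 0.0
    for _ in range(trials):
        # Assumption: each model's error is an independent zero-mean draw.
        errors = [rng.gauss(0.0, sigma) for _ in shares]
        # Crowd error = share-weighted average of the model errors.
        crowd = sum(p * e for p, e in zip(shares, errors))
        total += crowd ** 2
    return total / trials


shares = [0.5, 0.25, 0.25]  # hypothetical distribution across models
V = 4.0                     # hypothetical common variance
delta = 1.0 / sum(p * p for p in shares)
estimate = simulated_crowd_error(shares, V)
print(round(estimate, 2), V / delta)  # simulated value vs. theoretical 1.5
```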
-
Size DIVERSITY (Δ) Matters
-
Gonçalo Abecasis
-
Categorical Diversity Theorem
-
Partition the set of possible instances into categories
Make a prediction for each category
-
Computer Science
PAC Learning: Valiant
Robust Classification: Provost and Fawcett
Ensemble Learning: Dasarathy and Sheela
-
Categorical Predictive Models
Partition the set of possible instances into categories
Make a prediction for each category
-
300 200 100 200
A B C D
Chloride in Water: mg/L
-
300 200 100 200
A B C D
Chloride in Water: mg/L
240 170
-
Variation in Data: 20,000
$20{,}000 = (200-200)^2 + (300-200)^2 + (100-200)^2 + (200-200)^2$
-
Variation in Data: 20,000
$20{,}000 = (200-200)^2 + (300-200)^2 + (100-200)^2 + (200-200)^2$
Residual Variation: 11,000
$11{,}000 = (200-240)^2 + (300-240)^2 + (100-170)^2 + (200-170)^2$
-
Variation in Data: 20,000
$20{,}000 = (200-200)^2 + (300-200)^2 + (100-200)^2 + (200-200)^2$
Residual Variation: 11,000
$11{,}000 = (200-240)^2 + (300-240)^2 + (100-170)^2 + (200-170)^2$
$R^2$: 0.45
-
Total Variation (20,000)
-
300 200 100 200
A B C D
Chloride in Water: mg/L
-
Categorization Loss: 10,000
Mean of Category 1: 250
Categorization Loss 1: 5,000
$5{,}000 = (200-250)^2 + (300-250)^2$
Mean of Category 2: 150
Categorization Loss 2: 5,000
$5{,}000 = (200-150)^2 + (100-150)^2$
-
Categorization Error
Possible Variation Explained: 10,000
Categorization Loss: 10,000
-
300 200 100 200
A B C D
Chloride in Water: mg/L
240 170
-
Prediction Error: 1,000 = 200 + 800
Prediction Category 1: 240
Actual Value: 250
Prediction Error: $200 = (250-240)^2 + (250-240)^2$
Prediction Category 2: 170
Actual Value: 150
Prediction Error: $800 = (170-150)^2 + (170-150)^2$
-
Categorization Error
Variation Explained: 9,000
Categorization Loss: 10,000
Prediction Error: 1,000
-
Categorical Diversity Theorem
Variation = Explained + Category Loss + Predictive Error
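The decomposition can be verified on the chloride example from the preceding slides. This sketch uses the four readings, the two categories, and the category predictions (240 and 170) exactly as given there; the variable names are illustrative.

```python
def sum_sq(values, center):
    """Sum of squared deviations of values from center."""
    return sum((v - center) ** 2 for v in values)


# Chloride readings (mg/L) grouped into the two categories from the slides.
categories = {"category 1": [300, 200], "category 2": [100, 200]}
predictions = {"category 1": 240, "category 2": 170}

data = [v for vals in categories.values() for v in vals]
grand_mean = sum(data) / len(data)
total_variation = sum_sq(data, grand_mean)

category_loss = 0.0     # spread of the data inside each category
prediction_error = 0.0  # gap between each prediction and its category mean
residual = 0.0          # spread of the data around the predictions
for name, vals in categories.items():
    mean = sum(vals) / len(vals)
    category_loss += sum_sq(vals, mean)
    prediction_error += len(vals) * (predictions[name] - mean) ** 2
    residual += sum_sq(vals, predictions[name])

explained = total_variation - category_loss - prediction_error
r_squared = 1 - residual / total_variation
print(total_variation, explained, category_loss, prediction_error)
# prints: 20000.0 9000.0 10000.0 1000.0
```

The residual variation (11,000) is the sum of the categorization loss and the prediction error, which is why R² comes out at 0.45.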
-
Value of Distinct Categories
Distinct categories result in diverse categorization losses and, by the Diversity Prediction Theorem, lower error.
-
Six years of data
Half a million users
17,700 movies
Data divided into (training, testing)
Testing data divided into (probe, quiz, test)
-
Singular Value Decomposition
Each movie represented by a vector: (p1,p2,p3,p4…pn)
Each person represented by a vector: (q1,q2,q3,q4…qn)
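The slide does not say how the two vectors are combined. In the standard matrix-factorization setup, the predicted rating is the inner product of the movie vector and the person vector, and the sketch below assumes that convention. The vectors are hypothetical three-dimensional examples.

```python
def predicted_rating(movie_vector, person_vector):
    """Inner product of a movie's factors and a person's factors."""
    if len(movie_vector) != len(person_vector):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(p * q for p, q in zip(movie_vector, person_vector))


# Hypothetical latent factors, for illustration only.
movie = [0.9, 0.1, 0.5]   # (p1, p2, p3): how strongly the movie loads on each factor
person = [4.0, 1.0, 2.0]  # (q1, q2, q3): how much the person values each factor
print(round(predicted_rating(movie, person), 2))
```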
-
Christina and David Romer
Robert Bell
-
BellKor
50 dimensions
107 models
Best Model: 6.8%
-
BellKor
50 dimensions
107 models
Best Model: 6.8%
Combination of Models: 8.4%
-
BellKor’s Pragmatic Chaos
Best Model: 8.4%
Ensemble: 10.1%
-
Enter “The Ensemble”
23 Teams
30 Countries
-
And the Winner is…
Ensemble 10.06%
BellKor 10.06%
-
But, the Real Winner is…
Ensemble 10.06%
BellKor 10.06%
50-50 Combination 10.19%
-
Combine accurate and diverse models to make good predictions.
-
Big Science
-
Leaving Our Silos
-
Medicine Sociology
Chemistry Economics
-
Freeman and Huang
Variable        Citations   Impact Factor
# Addresses     +           +
# References    +           +
# Past papers   +           +
Homophily       -           -
-
[Chart: share of papers with more than 100 cites (axis 0.00% to 0.14%), Team vs. Solo, for Science & Engineering and Social Science]
-
Jones B, Wuchty S, Uzzi B (2008) Multi-University Research Teams: Shifting Impact, Geography, and Stratification in Science. Science 322: 1259
Inter-University Collaboration Increases Impact
-
Cummings, J. N., Kiesler, S., Zadeh, R., & Balakrishnan, A. (2013). Group heterogeneity increases the risks of large group size: A longitudinal study of productivity in research groups. Psychological Science, 24(6), 880-890.
-
Best papers have low proximity
Best patents have low proximity
-
Atypical Connections
Variable            Odds Ratio
Years since PhD     1.14
Prior Cites         2.25
Author Count        0.8
Depth (HHI)         3.29
Atypical Connect    15.17
“Recombinant search and breakthrough idea generation: An analysis of high impact papers in the social sciences,” Melissa A. Schilling and Elad Green, Research Policy, 2011, vol. 40, issue 10, pages 1321-1331
Melissa Schilling
-
Q?