The Data-Driven World Kaur Alasoo

Upload: kaur-alasoo

Post on 13-Jun-2015


Category:

Technology



DESCRIPTION

The presentation that I gave at the JCI ECM 2011 conference.

TRANSCRIPT

Page 1: The Data-Driven World

The Data-Driven World

Kaur Alasoo

Page 2: The Data-Driven World

Kaur Alasoo

• Computer Science, University of Tartu (2007-2010)

• Intern at European Molecular Biology Laboratory (April - August 2010)

• Systems Biology at Aalto University (2010 - 2012)

Page 3: The Data-Driven World

Outline

• Where are we now?

• Where are we heading to?

• How to get there?

Page 4: The Data-Driven World

Where are we now?

Page 5: The Data-Driven World

Where are we now?

1. Science: Culturomics

2. Society: OkCupid

3. Personal life: Tracking sites

Page 6: The Data-Driven World

1. Science: Culturomics

Page 7: The Data-Driven World

Culturomics: Using Google Books to analyze culture

exclude proper nouns (fig. S4) and compound words (“whalewatching”). Even accounting for these factors, we found many undocumented words, such as “aridification” (the process by which a geographic region becomes dry), “slenthem” (a musical instrument), and, appropriately, the word “deletable.”

This gap between dictionaries and the lexicon results from a balance that every dictionary must strike: It must be comprehensive enough to be a useful reference but concise enough to be printed, shipped, and used. As such, many infrequent words are omitted. To gauge how well dictionaries reflect the lexicon, we ordered our year-2000 lexicon by frequency, divided it into eight deciles (ranging from 10^-9 to 10^-8, to 10^-2 to 10^-1) and sampled each decile (7). We manually checked how many sample words were listed in the Oxford English Dictionary (OED) (12) and in the Merriam-Webster Unabridged Dictionary (MWD). (We excluded proper nouns, because neither the OED nor MWD lists them.) Both dictionaries had excellent coverage of high-frequency words but less coverage for frequencies below 10^-6: 67% of words in the 10^-9 to 10^-8 range were listed in neither dictionary (Fig. 2B). Consistent with Zipf’s famous law, a large fraction of the words in our lexicon (63%) were in this lowest-frequency bin. As a result, we estimated that 52% of the English lexicon (the majority of the words used in English books) consists of lexical “dark matter” undocumented in standard references (12).
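The binning step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the word frequencies, and the tiny "dictionary" set are hypothetical, and the paper's actual sampling procedure is not reproduced here.

```python
import math
from collections import defaultdict


def frequency_bin(freq):
    """Return the power-of-ten bin of a frequency, e.g. 3e-9 -> -9 (10^-9 to 10^-8)."""
    return math.floor(math.log10(freq))


def undocumented_share(word_freqs, dictionary):
    """For each frequency bin, the fraction of words not listed in the dictionary."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for word, freq in word_freqs.items():
        b = frequency_bin(freq)
        totals[b] += 1
        if word not in dictionary:
            missing[b] += 1
    return {b: missing[b] / totals[b] for b in totals}


# Hypothetical frequencies, chosen only to illustrate the binning.
word_freqs = {
    "aridification": 3e-9,
    "slenthem": 5e-9,
    "netiquette": 4e-9,
    "found": 5e-4,
}
dictionary = {"netiquette", "found"}  # stand-in for a real dictionary word list
shares = undocumented_share(word_freqs, dictionary)
```

With real data, `shares[-9]` would correspond to the share of the lowest-frequency bin that appears in neither reference.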

To keep up with the lexicon, dictionaries are updated regularly (13). We examined how well these changes corresponded with changes in actual usage by studying the 2077 1-gram headwords added to AHD4 in 2000. The overall frequency of these words, such as “buckyball” and “netiquette”, has soared since 1950: Two-thirds exhibited recent sharp increases in frequency (>2× from 1950 to 2000) (Fig. 2C). Nevertheless, there was a lag between lexicographers and the lexicon. Over half the words added to AHD4 were part of the English lexicon a century ago (frequency >10^-9 from 1890 to 1900). In fact, some newly added words, such as “gypseous” and “amplidyne”, have already undergone a steep decline in frequency (Fig. 2D).

Not only must lexicographers avoid adding words that have fallen out of fashion, they must also weed obsolete words from earlier editions. This is an imperfect process. We found 2220 obsolete 1-gram headwords (“diestock”, “alkalescent”) in AHD4. Their mean frequency declined throughout the 20th century and dipped below 10^-9 decades ago (Fig. 2D, inset).

Our results suggest that culturomic tools will aid lexicographers in at least two ways: (i) finding low-frequency words that they do not list, and (ii) providing accurate estimates of current frequency trends to reduce the lag between changes in the lexicon and changes in the dictionary.

The evolution of grammar. Next, we examined grammatical trends. We studied the English irregular verbs, a classic model of grammatical change (14–17). Unlike regular verbs, whose past tense is generated by adding -ed (jump/jumped), irregular verbs are conjugated idiosyncratically (stick/stuck, come/came, get/got) (15).

All irregular verbs coexist with regular competitors (e.g., “strived” and “strove”) that threaten to supplant them (Fig. 2E and fig. S5). High-frequency irregulars, which are more readily remembered, hold their ground better. For instance, we found “found” (frequency: 5 × 10^-4) 200,000 times more often than we finded “finded.” In contrast, “dwelt” (frequency: 1 × 10^-5) dwelt in our data only 60 times as often as “dwelled” dwelled. We defined a verb’s “regularity” as the percentage of instances in the past tense (i.e., the sum of “drived”, “drove”, and “driven”) in which the regular form is used. Most irregulars have been stable for the past 200 years, but 16% underwent a change in regularity of 10% or more (Fig. 2F).
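The "regularity" measure defined above is a simple ratio, sketched below. The function name and the counts for "drive" are hypothetical and only illustrate the arithmetic.

```python
def regularity(regular_count, irregular_counts):
    """Percentage of past-tense instances that use the regular (-ed) form."""
    total = regular_count + sum(irregular_counts)
    if total == 0:
        raise ValueError("no past-tense instances observed")
    return 100.0 * regular_count / total


# Hypothetical counts: "drived" vs. "drove" and "driven".
drive_regularity = regularity(regular_count=5, irregular_counts=[900, 95])
```

A verb whose regularity moves from 10% to 90% over time, as reported for "chide", would show up as a rising series of such values computed year by year.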

These changes occurred slowly: It took 200 years for our fastest-moving verb (“chide”) to go from 10% to 90%. Otherwise, each trajectory was sui generis; we observed no characteristic shape. For instance, a few verbs, such as “spill”, regularized at a constant speed, but others, such as “thrive” and “dig”, transitioned in fits and starts (7). In some cases, the trajectory suggested a reason for the trend. For example, with “sped/speeded” the shift in meaning from “to move rapidly” and toward “to exceed the legal limit” appears to have been the driving cause (Fig. 2G).

Six verbs (burn, chide, smell, spell, spill, and thrive) regularized between 1800 and 2000 (Fig. 2F). Four are remnants of a now-defunct phonological process that used -t instead of -ed; they are members of a pack of irregulars that survived by virtue of similarity (bend/bent, build/built, burn/burnt, learn/learnt, lend/lent, rend/rent, send/sent, smell/smelt, spell/spelt, spill/spilt, and spoil/spoilt). Verbs have been defecting from this coalition for centuries (wend/went, pen/pent, gird/girt, geld/gelt, and gild/gilt all blend/blent into the dominant -ed rule). Culturomic analysis reveals that the collapse of this alliance has been the most significant driver of regularization in the past 200 years. The regularization of burnt, smelt, spelt, and spilt originated in the United States; the forms still cling to life in British English (Fig. 2, E and F). But the -t irregulars may be doomed in England too. Each year, a population the size of Cambridge adopts “burned” in lieu of “burnt”.

Fig. 1. Culturomic analyses study millions of books at once. (A) Top row: Authors have been writing for millennia; ~129 million book editions have been published since the advent of the printing press (upper left). Second row: Libraries and publishing houses provide books to Google for scanning (middle left). Over 15 million books have been digitized. Third row: Each book is associated with metadata. Five million books are chosen for computational analysis (bottom left). Bottom row: A culturomic time line shows the frequency of “apple” in English books over time (1800–2000). (B) Usage frequency of “slavery”. The Civil War (1861–1865) and the civil rights movement (1955–1968) are highlighted in red. The number in the upper left (1e-4 = 10^-4) is the unit of frequency. (C) Usage frequency over time for “the Great War” (blue), “World War I” (green), and “World War II” (red).

[Figure 1, panels A to C. Panel A labels: 129 million books published; 15 million books scanned; 5 million books analyzed. Axes: frequency of the word "apple" vs. year.]

www.sciencemag.org SCIENCE VOL 331 14 JANUARY 2011 177

RESEARCH ARTICLE


Page 8: The Data-Driven World

Fame depends on profession

enter a regime marked by slower forgetting: Collective memory has both a short-term and a long-term component.

But there have been changes. The amplitude of the plots is rising every year: Precise dates are increasingly common. There is also a greater focus on the present. For instance, “1880” declined to half its peak value in 1912, a lag of 32 years. In contrast, “1973” declined to half its peak by 1983, a lag of only 10 years. We are forgetting our past faster with each passing year (Fig. 3A).
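The half-peak lag quoted above can be computed from a yearly frequency series as in the following sketch. It assumes the lag is measured from the peak year to the first later year at or below half the peak value; the function name and the toy series are hypothetical, not the paper's data.

```python
def half_life_lag(series):
    """series maps year -> frequency. Returns (peak_year, half_year, lag_in_years)."""
    peak_year = max(series, key=series.get)
    half = series[peak_year] / 2.0
    for year in sorted(y for y in series if y > peak_year):
        if series[year] <= half:
            return peak_year, year, year - peak_year
    # The series never drops to half its peak within the observed years.
    return peak_year, None, None


# Toy series shaped like the "1880" example: peak in 1880, half the peak by 1912.
toy_1880 = {1880: 8.0, 1890: 6.0, 1900: 5.0, 1912: 4.0, 1920: 3.0}
peak_year, half_year, lag = half_life_lag(toy_1880)
```

Comparing the lag across many year-terms is what supports the statement that the decline to half peak is getting shorter.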

We were curious whether our increasing tendency to forget the old was accompanied by more rapid assimilation of the new (21). We divided a list of 147 inventions into time-resolved cohorts based on the 40-year interval in which they were first invented (1800–1840, 1840–1880, and 1880–1920) (7). We tracked the frequency of each invention in the nth year after it was invented as compared to its maximum value and plotted the median of these rescaled trajectories for each cohort.

The inventions from the earliest cohort (1800–1840) took over 66 years from invention

[Figure 3, panels A to F. Axis labels: year of invention; frequency; frequency (log); median frequency (% of peak value); median frequency; median frequency (log). Annotations: doubling time 4 yrs; half-life 73 yrs.]

Fig. 3. Cultural turnover is accelerating. (A) We forget: frequency of “1883” (blue), “1910” (green), and “1950” (red). Inset: We forget faster. The half-life of the curves (gray dots) is getting shorter (gray line: moving average). (B) Cultural adoption is quicker. Median trajectory for three cohorts of inventions from three different time periods (1800–1840, blue; 1840–1880, green; 1880–1920, red). Inset: The telephone (green; date of invention, green arrow) and radio (blue; date of invention, blue arrow). (C) Fame of various personalities born between 1920 and 1930. (D) Frequency of the 50 most famous people born in 1871 (gray lines; median, thick dark gray line). Five examples are highlighted. (E) The median trajectory of the 1865 cohort is characterized by four parameters: (i) initial age of celebrity (34 years old, tick mark); (ii) doubling time of the subsequent rise to fame (4 years, blue line); (iii) age of peak celebrity (70 years after birth, tick mark), and (iv) half-life of the post-peak forgetting phase (73 years, red line). Inset: The doubling time and half-life over time. (F) The median trajectory of the 25 most famous personalities born between 1800 and 1920 in various careers.


Page 9: The Data-Driven World

Tracking censorship

to widespread impact (frequency >25% of peak). Since then, the cultural adoption of technology has become more rapid. The 1840–1880 invention cohort was widely adopted within 50 years; the 1880–1920 cohort within 27 (Fig. 3B and fig. S7).
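The cohort analysis (rescale each invention's trajectory to its own maximum, take the median across the cohort, then read off when the median first exceeds 25% of peak) can be sketched as follows. The function names and the three toy trajectories are hypothetical; each list holds frequencies for years 0, 1, 2, ... after invention.

```python
from statistics import median


def rescale(trajectory):
    """Express each value as a fraction of the trajectory's own maximum."""
    peak = max(trajectory)
    if peak == 0:
        raise ValueError("trajectory has no non-zero value")
    return [value / peak for value in trajectory]


def cohort_median(trajectories):
    """Median of the rescaled trajectories, year by year after invention."""
    rescaled = [rescale(t) for t in trajectories]
    length = min(len(t) for t in rescaled)
    return [median(t[n] for t in rescaled) for n in range(length)]


def years_to_threshold(curve, threshold=0.25):
    """First year index at which the curve exceeds the threshold, or None."""
    for n, value in enumerate(curve):
        if value > threshold:
            return n
    return None


# Toy cohort of three inventions (hypothetical values).
cohort = [[1, 2, 4, 8], [2, 2, 6, 10], [0, 1, 1, 4]]
curve = cohort_median(cohort)
adoption_years = years_to_threshold(curve)
```

Running the same steps per cohort and comparing `adoption_years` gives the kind of comparison reported above (over 66, 50, and 27 years).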

“In the future, everyone will be famous for 7.5 minutes” – Whatshisname. People, too, rise to prominence, only to be forgotten (22). Fame can be tracked by measuring the frequency of a person’s name (Fig. 3C). We compared the rise to fame of the most famous people of different eras. We took all 740,000 people with entries in Wikipedia, removed cases where several famous individuals share a name, and sorted the rest by birth date and frequency (23). For every year from 1800 to 1950, we constructed a cohort consisting of the 50 most famous people born in that year. For example, the 1882 cohort includes “Virginia Woolf” and “Felix Frankfurter”; the 1946 cohort includes “Bill Clinton” and “Steven Spielberg”. We plotted the median frequency for the names in each cohort over time (Fig. 3, D and E). The resulting trajectories were all similar. Each cohort had a pre-celebrity period (median frequency <10^-9), followed by a rapid rise to prominence, a peak, and a slow decline. We therefore characterized each cohort using four parameters: (i) the age of initial celebrity, (ii) the doubling time of the initial rise, (iii) the age of peak celebrity, and (iv) the half-life of the decline (Fig. 3E). The age of peak celebrity has been consistent over time: about 75 years after birth. But the other parameters have been changing (fig. S8).

Fame comes sooner and rises faster. Between the early 19th century and the mid-20th century, the age of initial celebrity declined from 43 to 29 years, and the doubling time fell from 8.1 to 3.3 years. As a result, the most famous people alive today are more famous, in books, than their predecessors. Yet this fame is increasingly short-lived: The post-peak half-life dropped from 120 to 71 years during the 19th century.

We repeated this analysis with all 42,358 people in the databases of the Encyclopaedia Britannica (24), which reflect a process of expert curation that began in 1768. The results were similar (7) (fig. S9). Thus, people are getting more famous than ever before but are being forgotten more rapidly than ever.
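One way to estimate a doubling time from a rising frequency curve is a least-squares line through log2(frequency) against year; the doubling time is then the reciprocal of the slope. The transcript does not say how the authors fitted this parameter, so the method, the function name, and the toy numbers below are all assumptions made for illustration.

```python
import math


def doubling_time(years, freqs):
    """Estimate doubling time (in years) from a log2-linear least-squares fit."""
    logs = [math.log2(f) for f in freqs]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    slope = sxy / sxx  # doublings per year
    return 1.0 / slope


# Toy rise: the frequency doubles every 4 years (hypothetical values).
estimate = doubling_time([0, 4, 8, 12], [1e-9, 2e-9, 4e-9, 8e-9])
```

The same fit applied to the declining part of a curve gives a negative slope, whose magnitude corresponds to a half-life instead.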

Fig. 4. Culturomics can be used to detect censorship. (A) Usage frequency of “Marc Chagall” in German (red) as compared to English (blue). (B) Suppression of Leon Trotsky (blue), Grigory Zinoviev (green), and Lev Kamenev (red) in Russian texts, with noteworthy events indicated: Trotsky’s assassination (blue arrow), Zinoviev and Kamenev executed (red arrow), the Great Purge (red highlight), and perestroika (gray arrow). (C) The 1976 and 1989 Tiananmen Square incidents both led to elevated discussion in English texts (scale shown on the right). Response to the 1989 incident is largely absent in Chinese texts (blue, scale shown on the left), suggesting government censorship. (D) While the Hollywood Ten were blacklisted (red highlight) from U.S. movie studios, their fame declined (median: thick gray line). None of them were credited in a film until 1960’s (aptly named) Exodus. (E) Artists and writers in various disciplines were suppressed by the Nazi regime (red highlight). In contrast, the Nazis themselves (thick red line) exhibited a strong fame peak during the war years. (F) Distribution of suppression indices for both English (blue) and German (red) for the period from 1933–1945. Three victims of Nazi suppression are highlighted at left (red arrows). Inset: Calculation of the suppression index for “Henri Matisse”.

[Figure 4, panels A to F. Y-axes: frequency. Panel C search term: 天安門.]


wikipedia.org

Page 10: The Data-Driven World


Page 11: The Data-Driven World

History of science

Occupational choices affect the rise to fame. We focused on the 25 most famous individuals born between 1800 and 1920 in seven occupations (actors, artists, writers, politicians, biologists, physicists, and mathematicians), examining how their fame grew as a function of age (Fig. 3F and fig. S10).

Actors tend to become famous earliest, at around 30. But the fame of the actors we studied, whose ascent preceded the spread of television, rises slowly thereafter. (Their fame peaked at a frequency of 2 × 10^-7.) The writers became famous about a decade after the actors, but rose for longer and to a much higher peak (8 × 10^-7). Politicians did not become famous until their 50s, when, upon being elected president of the United States (in 11 of 25 cases; 9 more were heads of other states), they rapidly rose to become the most famous of the groups (1 × 10^-6).

Science is a poor route to fame. Physicists and biologists eventually reached a similar level of fame as actors (1 × 10^-7), but it took them far longer. Alas, even at their peak, mathematicians tend not to be appreciated by the public (2 × 10^-8).

Detecting censorship and suppression. Suppression of a person or an idea leaves quantifiable fingerprints (25). For instance, Nazi censorship of the Jewish artist Marc Chagall is evident by comparing the frequency of “Marc Chagall” in English and in German books (Fig. 4A). In both languages, there is a rapid ascent starting in the late 1910s (when Chagall was in his early 30s). In English, the ascent continues. But in German, the artist’s popularity decreases, reaching a nadir from 1936 to 1944, when his full name appears only once. (In contrast, from 1946 to 1954, “Marc Chagall” appears nearly 100 times in the German corpus.) Such examples are found in many countries, including Russia (Trotsky), China (Tiananmen Square), and the United States (the Hollywood Ten, blacklisted in 1947) (Fig. 4, B to D, and fig. S11).

We probed the impact of censorship on a person’s cultural influence in Nazi Germany. Led by such figures as the librarian Wolfgang Hermann, the Nazis created lists of authors and artists whose “undesirable”, “degenerate” work was banned from libraries and museums and publicly burned (26–28). We plotted median usage in German for five such lists: artists (100 names) and writers of literature (147), politics (117), history (53), and philosophy (35) (Fig. 4E and fig. S12). We also included a collection of Nazi party members [547 names (7)]. The five suppressed groups exhibited a decline. This decline was modest for writers of history (9%) and literature (27%), but pronounced in politics (60%), philosophy (76%), and art (56%). The only group whose signal increased during the Third Reich was the Nazi party members [a 500% increase (7)].

Given such strong signals, we tested whether one could identify victims of Nazi repression de novo. We computed a “suppression index” (s) for each person by dividing their frequency from 1933 to 1945 by the mean frequency in 1925–1933 and in 1955–1965 (Fig. 4F, inset). In English, the distribution of suppression indices is tightly centered around unity. Fewer than 1% of individuals lie at the extremes (s < 1/5 or s > 5).
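The suppression index lends itself to a short sketch. The version below assumes that "their frequency from 1933 to 1945" means the mean yearly frequency over those years, and that the two reference periods are pooled into a single mean with year ranges taken inclusively as written; these readings, the function names, and the toy series are assumptions, not the authors' code.

```python
from statistics import mean


def suppression_index(freq_by_year):
    """Mean frequency in 1933-1945 divided by the mean over 1925-1933 and 1955-1965."""
    during = mean(freq_by_year[y] for y in range(1933, 1946))
    reference_years = list(range(1925, 1934)) + list(range(1955, 1966))
    baseline = mean(freq_by_year[y] for y in reference_years)
    return during / baseline


def classify(s):
    """Label the extremes using the thresholds quoted in the text."""
    if s < 1 / 5:
        return "strongly suppressed"
    if s > 5:
        return "dramatic rise"
    return "unremarkable"


# Toy series: flat at 1.0, but dropping to 0.1 from 1934 through 1945.
toy = {year: 1.0 for year in range(1925, 1966)}
for year in range(1934, 1946):
    toy[year] = 0.1
s = suppression_index(toy)
label = classify(s)
```

A flat series gives s = 1, which matches the observation that the English distribution is centered around unity.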

In German, the distribution is much wider, and skewed to the left: Suppression in Nazi Germany was not the exception, but the rule (Fig. 4F). At the far left, 9.8% of individuals showed strong suppression (s < 1/5). This population is highly enriched in documented victims of repression, such as Pablo Picasso (s = 0.12), the Bauhaus architect Walter Gropius (s = 0.16), and Hermann Maas (s < 0.01), an influential Protestant minister who helped many Jews flee (7). (Maas was later recognized by Israel’s Yad Vashem as one of the “Righteous Among the Nations.”) At the other extreme, 1.5% of the population exhibited a dramatic rise (s > 5). This subpopulation is highly enriched in Nazis and Nazi-supporters, who benefited immensely from government propaganda (7).

These results provide a strategy for rapidly identifying likely victims of censorship from a large pool of possibilities, and highlight how culturomic methods might complement existing historical approaches.

Culturomics. Culturomics is the application of high-throughput data collection and analysis to the study of human culture. Books are a beginning, but we must also incorporate newspapers (29), manuscripts (30), maps (31), artwork (32), and a myriad of other human creations (33, 34). Of course, many voices, already lost to time, lie forever beyond our reach.

Culturomic results are a new type of evidence in the humanities. As with fossils of ancient creatures, the challenge of culturomics lies in the interpretation of this evidence. Considerations of space restrict us to the briefest of surveys: a

Fig. 5. Culturomics provides quantitative evidence for scholars in many fields. (A) Historical epidemiology: “influenza” is shown in blue; the Russian, Spanish, and Asian flu epidemics are highlighted. (B) History of the Civil War. (C) Comparative history. (D) Gender studies. (E and F) History of science. (G) Historical gastronomy. (H) History of religion: “God”.


Page 12: The Data-Driven World

Feminism


Page 13: The Data-Driven World

Gender equality

Occupational choices affect the rise to fame.We focused on the 25most famous individuals bornbetween 1800 and 1920 in seven occupations (ac-tors, artists, writers, politicians, biologists, phys-icists, and mathematicians), examining how theirfame grewas a function of age (Fig. 3F and fig. S10).

Actors tend to become famous earliest, ataround 30. But the fame of the actors we studied,whose ascent preceded the spread of television,rises slowly thereafter. (Their fame peaked at afrequency of 2 ! 10!7.) The writers became fa-mous about a decade after the actors, but rose forlonger and to a much higher peak (8 ! 10!7).Politicians did not become famous until their 50s,when, upon being elected president of the UnitedStates (in 11 of 25 cases; 9 more were heads ofother states), they rapidly rose to become themost famous of the groups (1 ! 10!6).

Science is a poor route to fame. Physicists andbiologists eventually reached a similar level offame as actors (1 ! 10!7), but it took them farlonger. Alas, even at their peak, mathematicianstend not to be appreciated by the public (2 ! 10!8).

Detecting censorship and suppression. Sup-pression of a person or an idea leaves quantifiablefingerprints (25). For instance, Nazi censorship ofthe Jewish artist Marc Chagall is evident bycomparing the frequency of “Marc Chagall” inEnglish and in German books (Fig. 4A). In bothlanguages, there is a rapid ascent starting in thelate 1910s (when Chagall was in his early 30s). InEnglish, the ascent continues. But in German, theartist’s popularity decreases, reaching a nadir from1936 to 1944, when his full name appears onlyonce. (In contrast, from 1946 to 1954, “MarcChagall” appears nearly 100 times in the German

corpus.) Such examples are found in many coun-tries, includingRussia (Trotsky), China (TiananmenSquare), and theUnited States (theHollywoodTen,blacklisted in 1947) (Fig. 4, B to D, and fig. S11).

We probed the impact of censorship on aperson’s cultural influence in Nazi Germany. Ledby such figures as the librarianWolfgangHermann,the Nazis created lists of authors and artists whose“undesirable”, “degenerate” work was bannedfrom libraries and museums and publicly burned(26–28). We plotted median usage in German forfive such lists: artists (100 names) and writers ofliterature (147), politics (117), history (53), andphilosophy (35) (Fig. 4E and fig. S12). We alsoincluded a collection of Nazi party members [547names (7)]. The five suppressed groups exhibiteda decline. This decline was modest for writers ofhistory (9%) and literature (27%), but pronouncedin politics (60%), philosophy (76%), and art(56%). The only group whose signal increasedduring the Third Reich was the Nazi party mem-bers [a 500% increase (7)].

Given such strong signals, we tested whetherone could identify victims of Nazi repression denovo.We computed a “suppression index” (s) foreach person by dividing their frequency from1933 to 1945 by themean frequency in 1925–1933and in 1955–1965 (Fig. 4F, inset). In English, thedistribution of suppression indices is tightly cen-tered around unity. Fewer than 1% of individualslie at the extremes (s < 1/5 or s > 5).

In German, the distribution is much wider, andskewed to the left: Suppression in Nazi Germanywas not the exception, but the rule (Fig. 4F). At thefar left, 9.8% of individuals showed strongsuppression (s < 1/5). This population is highlyenriched in documented victims of repression,such as Pablo Picasso (s = 0.12), the Bauhausarchitect Walter Gropius (s = 0.16), and HermannMaas (s < 0.01), an influential Protestant ministerwho helped many Jews flee (7). (Maas was laterrecognized by Israel’s Yad Vashem as one of the“Righteous Among the Nations.”) At the otherextreme, 1.5% of the population exhibited a dra-matic rise (s > 5). This subpopulation is highlyenriched in Nazis andNazi-supporters, who bene-fited immensely from government propaganda (7).

These results provide a strategy for rapidly identifying likely victims of censorship from a large pool of possibilities, and highlight how culturomic methods might complement existing historical approaches.

Culturomics. Culturomics is the application of high-throughput data collection and analysis to the study of human culture. Books are a beginning, but we must also incorporate newspapers (29), manuscripts (30), maps (31), artwork (32), and a myriad of other human creations (33, 34). Of course, many voices, already lost to time, lie forever beyond our reach.

Culturomic results are a new type of evidence in the humanities. As with fossils of ancient creatures, the challenge of culturomics lies in the interpretation of this evidence. Considerations of space restrict us to the briefest of surveys: a

Fig. 5. Culturomics provides quantitative evidence for scholars in many fields. (A) Historical epidemiology: “influenza” is shown in blue; the Russian, Spanish, and Asian flu epidemics are highlighted. (B) History of the Civil War. (C) Comparative history. (D) Gender studies. (E and F) History of science. (G) Historical gastronomy. (H) History of religion: “God”.

www.sciencemag.org SCIENCE VOL 331 14 JANUARY 2011 181


Page 14: The Data-Driven World
Page 15: The Data-Driven World

2. Society: OkCupid

Page 16: The Data-Driven World

OkCupid: data mining to analyze dating

Page 17: The Data-Driven World

Men want to date the most attractive women

Page 18: The Data-Driven World

Most men are below average

Page 19: The Data-Driven World
Page 20: The Data-Driven World
Page 21: The Data-Driven World

3. Personal Life: Tracking

Page 22: The Data-Driven World

Goodreads tracks all books you have ever read

Page 23: The Data-Driven World

Goodreads tracks all books you have ever read

Page 24: The Data-Driven World

Goodreads tracks all books you have ever read

Page 25: The Data-Driven World
Page 26: The Data-Driven World

• Travel planning
• Visited places
• Read books
• Flight log
• Exercise tracking
• Conversation log
• Activity tracking

Page 27: The Data-Driven World

Fitbit: tracking physical activity

Page 28: The Data-Driven World

Where are we heading to?

Page 29: The Data-Driven World

Where are we heading to?

1. Science: Other fields
2. Society: Data integration
3. Personal life: Data integration

Page 30: The Data-Driven World

1. Science: Other fields

Page 31: The Data-Driven World

Studying art history with image processing?

Page 32: The Data-Driven World

Analyzing literature using text mining?

http://www.textually.org/textually/archives/2009/02/022619.htm

Page 33: The Data-Driven World

Co-mentioning of people

When Arno and his father reached school, the classes had already begun.

Kevade by Oskar Luts

Page 34: The Data-Driven World

Co-mentioning of people

When Arno and his father reached school, the classes had already begun.

Kevade by Oskar Luts

Page 35: The Data-Driven World

Co-mentioning of terms

When Arno and his father reached school, the classes had already begun.

Kevade by Oskar Luts
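The co-mentioning idea on these slides can be illustrated with a short Python sketch: count how often two people (or two terms) appear in the same sentence. This is only one simple way to do it, not necessarily the method behind the slides. The function name `co_mentions`, the naive sentence splitting, and the exact-word matching are all assumptions made for the example.

```python
import re
from collections import Counter
from itertools import combinations


def co_mentions(text, names):
    """Count how often each pair of `names` occurs within the same sentence.

    Assumptions: sentences end with '.', '!' or '?' followed by whitespace,
    and a name is mentioned only if it appears as an exact, case-sensitive word.
    """
    counts = Counter()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = set(re.findall(r"\w+", sentence))
        present = sorted(name for name in names if name in words)
        # Every unordered pair of names found in this sentence is one co-mention.
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


sentence = ("When Arno and his father reached school, "
            "the classes had already begun.")

people = co_mentions(sentence, {"Arno", "father"})
terms = co_mentions(sentence, {"school", "classes"})
print(dict(people))
print(dict(terms))
```

Run over a whole book, the pair counts can serve as edge weights in a network of characters or concepts.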

Page 36: The Data-Driven World

2Society: Data Integration

Page 37: The Data-Driven World

Social network analysis

http://www.psychologytoday.com/blog/mr-personality/201001/the-psychology-social-networking

Page 38: The Data-Driven World

Data Integration

http://strategicbusinessintelligence.biz/images/data-integration.jpg

Page 39: The Data-Driven World

3Personal Life: Data Integration

Page 40: The Data-Driven World

Almost complete log of personal life

Page 41: The Data-Driven World

• Travel planning
• Visited places
• Read books
• Flight log
• Exercise tracking
• Conversation log
• Activity tracking

Page 42: The Data-Driven World

Purchase history

Data integration

Public transport log

Page 43: The Data-Driven World

How to get there?

Page 44: The Data-Driven World

Lack of data won’t be the problem

http://www.sand.com/wp-content/uploads/2011/04/binary_data.jpg

Page 45: The Data-Driven World

Data integration and analysis skills are important

http://www.keralaevents.com/eventphotos/799/Dataanalysis_1.jpg

Page 46: The Data-Driven World

Privacy has to be preserved

http://www.geekologie.com/2008/04/warmth_and_privacy_while_using.php

Page 47: The Data-Driven World

Privacy-preserving data mining

http://sharemind.cyber.ee/

Page 48: The Data-Driven World

The hard part is coming up with good questions

http://www.hhmi.org/bulletin/dec2005/chronicle/crosstalk.html

Page 49: The Data-Driven World

Conclusion

Page 50: The Data-Driven World

Conclusion

• There are already many successful examples of data-rich applications.

• More and more data will become available in many different fields.

• Collecting data is easy. Difficulties lie in analyzing it and understanding what it means.

Page 51: The Data-Driven World

References

• Quantitative Analysis of Culture Using Millions of Digitized Books, Jean-Baptiste Michel, et al. Science 331, 176 (2011)

• OkTrends: Dating research from OkCupid, http://blog.okcupid.com

• The Data-Driven Life, Gary Wolf, The New York Times, http://www.nytimes.com/2010/05/02/magazine/02self-measurement-t.html

• The Quantified Self | self knowledge through numbers, http://quantifiedself.com