TRANSCRIPT
Analyzing NIH Funding Patterns with Statistical Text Analysis
Jihyun Park, Eric Nalisnick, Padhraic Smyth
Dept. of Computer Science, University of California, Irvine
Margaret Blume-Kohout, New Mexico Consortium
Ralf Krestel, Web Science Research Group
Hasso-Plattner-Institut
▸ NIH invests over $30 billion each year
▸ Can we gain insight into this process using text and metadata?
▸ Our approach is to use statistical topic modeling
▸ We used grants data from NCI (National Cancer Institute)
[Figure: NCI funded grants. x-axis: year, 1994 to 2014; y-axis: number of grants (thousands), 4 to 11.]
[Figure: NCI funding amount. x-axis: year, 1994 to 2014; y-axis: billions (US dollars), 1.5 to 6.]
Jihyun Park, AAAI-16 Scholarly Big Data workshop, Feb 2016
Measuring the Impact of NIH (National Institutes of Health) Funding
[Figure: for each grant, bars on a 0 to 1 axis for the categories Genetics, Human Genome, Bioengineering, and Nanotechnology (values shown: 1.0, 0.8, 0.7, 0.2); labeled "ARRA Funded".]
Overview
[Diagram: NCI data (project ID, grant abstract, RCDC labels, funding, year, …) → text classification techniques → probability of each label being associated with the grant; combined with funding + year information → funding patterns over time for each area.]
NCI (National Cancer Institute) Data
▸ Grant abstracts from 1994 through 2013
▸ Text processing
  ▸ BOW (bag-of-words) representation
  ▸ Removed 500 common stop words
  ▸ Extracted noun-phrase terms using an NLP parser
▸ BOW data
  ▸ Total: 149,901 documents
  ▸ Number of documents with labels (training data): 31,628 (2008 to 2011)
  ▸ Number of documents without labels: 118,273
  ▸ Size of vocabulary (W): 29,713
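The text-processing steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the stop-word set is a hypothetical stand-in for the 500-word list, the example abstract is invented, and noun-phrase extraction with an NLP parser is omitted.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the slides mention 500 common stop words
# but do not list them.
STOP_WORDS = {"the", "of", "in", "and", "a", "to", "is", "we", "will"}

def bag_of_words(abstract: str) -> Counter:
    """Lowercase the text, split on non-alphanumeric characters,
    drop stop words, and count the remaining terms."""
    tokens = re.findall(r"[a-z0-9]+", abstract.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

# Invented example abstract, for illustration only.
abstract = "We will study glioma in mice and the role of EGFR in glioma growth."
counts = bag_of_words(abstract)
print(counts.most_common(3))
```

Each grant abstract becomes one row of term counts; stacking these rows gives the document-term matrix that the topic model consumes.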
LDA: Topics are Represented as Distributions over Words
Stock Market: WEEK, DOW_JONES, POINTS, 10_YR_TREASURY_YIELD, PERCENT, CLOSE, NASDAQ_COMPOSITE, STANDARD_POOR, CHANGE, FRIDAY
Wall Street Firms: WALL_STREET, ANALYSTS, INVESTORS, FIRM, GOLDMAN_SACHS, FIRMS, INVESTMENT, MERRILL_LYNCH, COMPANIES, SECURITIES
Terrorism: SEPT_11, WAR, SECURITY, IRAQ, TERRORISM, NATION, KILLED, AFGHANISTAN, ATTACKS, OSAMA_BIN_LADEN
Bankruptcy: BANKRUPTCY, CREDITORS, BANKRUPTCY_PROTECTION, ASSETS, COMPANY, FILED, BANKRUPTCY_FILING, ENRON, BANKRUPTCY_COURT, KMART
Figures from Mark Steyvers
LDA: Documents are Represented as Combinations of Topics
[Figure: the four example topics (Terrorism, Wall Street Firms, Stock Market, Bankruptcy) with example documents (Document 1, Document 2, Document 3, …) drawn as mixtures of topics; mixture weights shown include 70% / 30%, 50% / 50%, and 90%.]
Figures from Mark Steyvers
LDA (Latent Dirichlet Allocation)
▸ Topic models as factor analysis for count data
[Diagram: the W × D word-document count matrix is approximated by the product of a W × T matrix (topic-word probability distributions) and a T × D matrix (T topic weights for each document).]
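The factorization view can be made concrete with a toy calculation: a document's word distribution is the topic-weighted average of the topic-word distributions, p(w | d) = Σ_t p(w | t) p(t | d). The vocabulary, topics, and weights below are invented for illustration and are not taken from the talk.

```python
# Hypothetical toy model: 2 topics over a 4-word vocabulary.
vocab = ["stock", "investor", "war", "attack"]
topic_word = [
    [0.6, 0.4, 0.0, 0.0],  # p(word | topic 0), a "markets"-like topic
    [0.0, 0.0, 0.5, 0.5],  # p(word | topic 1), a "terrorism"-like topic
]
doc_topic = [0.7, 0.3]  # p(topic | document), e.g. a 70% / 30% document

def word_distribution(doc_topic, topic_word):
    """Mix the topic-word distributions using the document's topic weights."""
    n_words = len(topic_word[0])
    return [
        sum(doc_topic[t] * topic_word[t][w] for t in range(len(doc_topic)))
        for w in range(n_words)
    ]

p_word = word_distribution(doc_topic, topic_word)
for word, p in zip(vocab, p_word):
    print(f"{word}: {p:.2f}")
```

Doing this for every document at once is exactly the product of the W × T and T × D matrices in the diagram.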
NIH Data Representation for L-LDA
[Figure, built up over three slides: (1) a documents × words count matrix (doc1 to doc10; example terms: brain, lung_cancer, women, obesity, children, mice, experiment, hbv, quality, glioma, researcher); (2) a documents × codes-or-labels indicator matrix added alongside (example codes: brain cancer, breast cancer, kidney disease, lung cancer, mind and body); (3) two background columns (Background 1, Background 2) added to the label matrix and switched on for every document.]
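One way to read this representation: in a labeled topic model, each document may only use the topics named by its own labels, and the figure suggests that the background topics are available to every document. A minimal sketch under that reading follows; the label assignments are invented, and `allowed_topics` is a hypothetical helper, not a function from the authors' code.

```python
# The figure shows two background columns; the model in the talk uses 10.
N_BACKGROUND = 2

def allowed_topics(doc_labels, n_background=N_BACKGROUND):
    """Return the topics a document may draw on:
    its own labels plus every background topic."""
    background = [f"Background {b}" for b in range(1, n_background + 1)]
    return sorted(doc_labels) + background

# Hypothetical label assignments, for illustration only.
doc_labels = {
    "doc1": {"brain cancer"},
    "doc4": {"breast cancer", "lung cancer"},
}
topics = {doc: allowed_topics(labels) for doc, labels in doc_labels.items()}
print(topics["doc1"])
```

Documents without labels have no such restriction at inference time, which is what lets the trained topics score unlabeled grants.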
Examples of Topics from NCI Abstracts (5 out of 98)
Brain Cancer: glioma, brain tumor, gbm, malignant glioma, glioblastoma, brain
Breast Cancer: breast cancer, women, breast cancer cell, breast, breast cancer patient, brca1
Kidney Disease: rcc, kidney cancer, renal cell carcinoma, vhl, renal cancer, pvhl
Background 1: program, trainee, university, training, candidate, field
Background 7: model, mice, work, experiment, human, mouse model
88 topics from RCDC labels, 10 background topics
Evaluation
▸ Grants with RCDC labels (31,628 documents, 29,713 terms)
  ▸ Train: 90% (28K docs)
  ▸ Test: 10% (3K docs)
▸ Sampling probabilities were averaged over the words in a document to calculate AUC and R-precision scores
$$p(\text{code} \mid \text{doc}) \propto \sum_i p(\text{code} \mid \text{word}_i, \text{doc})$$
▸ Metrics computed from p(code | doc): AUC, R-Precision
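The scoring and the two metrics can be sketched as below. This is an illustration under assumptions: the per-word probabilities are invented, and the metrics are computed per document by ranking codes, which the slides do not specify.

```python
def doc_code_scores(word_code_probs):
    """Average p(code | word_i, doc) over the words of one document."""
    n_words = len(word_code_probs)
    n_codes = len(word_code_probs[0])
    return [sum(row[c] for row in word_code_probs) / n_words for c in range(n_codes)]

def r_precision(scores, true_codes):
    """Fraction of the top-R scored codes that are true, with R = number of true codes."""
    r = len(true_codes)
    top_r = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)[:r]
    return sum(1 for c in top_r if c in true_codes) / r

def auc(scores, true_codes):
    """Probability that a true code outscores a false one (ties count as 0.5)."""
    pos = [s for c, s in enumerate(scores) if c in true_codes]
    neg = [s for c, s in enumerate(scores) if c not in true_codes]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical document with 3 words and 4 candidate codes.
word_code_probs = [
    [0.7, 0.1, 0.1, 0.1],
    [0.5, 0.3, 0.1, 0.1],
    [0.2, 0.2, 0.5, 0.1],
]
true_codes = {0, 1}
scores = doc_code_scores(word_code_probs)
print(r_precision(scores, true_codes), auc(scores, true_codes))
```

In this toy case code 0 ranks first but code 1 ranks below a false code, so only one of the top two codes is correct and three of the four true/false pairs are ordered correctly.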
Logistic Regression Classifier
▸ 88 logistic regression classifiers trained to produce calibrated probabilities using training data
[Diagram: L-LDA topic probabilities p(code = k | doc), for k = 1, 2, …, 87, 88 → logistic regression classifier → calibrated topic probabilities pLR(code = k | d).]
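A rough sketch of the calibration step for a single code is shown below. It assumes each classifier takes only that code's L-LDA probability as input (a Platt-scaling style setup); the slides do not say which features the 88 classifiers use, and the training data here is invented.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_calibrator(scores, labels, lr=0.5, epochs=2000):
    """Fit p = sigmoid(a * score + b) by batch gradient descent on the log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of the log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Hypothetical training data for one code: raw L-LDA probabilities and
# whether the grant actually carries that code (1) or not (0).
scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60]
labels = [0, 0, 0, 1, 1, 1]

a, b = fit_calibrator(scores, labels)
calibrated = [sigmoid(a * s + b) for s in scores]
print([round(p, 2) for p in calibrated])
```

Repeating the fit once per code would give 88 calibrators, one for each RCDC category.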
Evaluation Result
              L-LDA, p(c | d)    L-LDA + Logistic Regression, pLR(c | d)
AUC           0.80               0.89
R-Precision   0.56               0.64
Analyzing Funding Patterns over Time
▸ Fractionally assign the funds in direct proportion to the probabilities from the logistic regression classifiers, pLR(code | doc)
$$w_{cd} = \frac{p_{LR}(c \mid d)}{\sum_k p_{LR}(c = k \mid d)}, \qquad c = 1, 2, \ldots, 88$$
$$F_{cy} = \sum_{d:\, y_d = y} w_{cd}\, x_d$$
▸ $w_{cd}$: weight for the category c for document d
▸ $x_d$: amount of funding for document d (inflation taken into account)
▸ $y_d$: year when document d was funded
▸ $F_{cy}$: total estimated amount of funding for category c in year y
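The fractional assignment can be sketched in a few lines. The grants, amounts, and probabilities below are invented, and `allocate_funding` is a hypothetical helper name; only the two formulas come from the slide.

```python
from collections import defaultdict

def allocate_funding(grants):
    """Split each grant's funding across codes in proportion to its
    calibrated probabilities, then total by (code, year).

    grants: list of dicts with 'year', 'funding', and 'p_lr' (code -> probability).
    Returns a dict mapping (code, year) to the estimated funding F_cy.
    """
    totals = defaultdict(float)
    for g in grants:
        norm = sum(g["p_lr"].values())
        if norm == 0:
            continue  # no probability mass to distribute
        for code, p in g["p_lr"].items():
            w_cd = p / norm  # weights sum to 1 within a grant
            totals[(code, g["year"])] += w_cd * g["funding"]
    return dict(totals)

# Hypothetical grants, for illustration only.
grants = [
    {"year": 2009, "funding": 100.0, "p_lr": {"Genetics": 0.6, "Obesity": 0.2}},
    {"year": 2009, "funding": 50.0, "p_lr": {"Genetics": 0.5, "Obesity": 0.5}},
    {"year": 2010, "funding": 80.0, "p_lr": {"Genetics": 0.1, "Obesity": 0.3}},
]
F = allocate_funding(grants)
for (code, year), amount in sorted(F.items()):
    print(code, year, round(amount, 2))
```

Because the weights are normalized within each grant, the per-category totals for a year add up to that year's total funding, so dividing by the yearly total gives the percentages plotted over time.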
[Figure: Estimated percentage of funding allocated to 4 general RCDC categories (Nanotechnology, Networking and Information Technology R&D, Human Genome, Obesity). x-axis: year, 1994 to 2014; y-axis: funding percentage, 0 to 1.6.]
[Figure: Estimated percentage of funding allocated to 4 specific RCDC disease categories (Lung Cancer, Liver Cancer, Brain Cancer, Infectious Diseases). x-axis: year, 1994 to 2014; y-axis: funding percentage, 0 to 1.6.]
[Figure: funding percentage by year for Breast Cancer, Translational Research, and Epidemiology and Longitudinal Studies. x-axis: year, 1994 to 2014; y-axis: funding percentage, 0 to 7.]
[Figure: funding percentage by year for Genetics, Biotechnology, Breast Cancer, Obesity, and Sleep Research. x-axis: year, 1994 to 2014; y-axis: funding percentage, 0 to 14.]
Conclusions
▸ Summary
  ▸ Labeled topic modeling and logistic classifiers can be combined to analyze NIH grant funding data
  ▸ Statistical topic modeling allows linking of text with metadata in a quantifiable manner
▸ Future Work
  ▸ Jointly analyze grants and scientific articles related to the grants (ongoing)
  ▸ Broader analysis of the economic and policy implications
  ▸ Improvements on topic model
    ▸ How to best calibrate
    ▸ Selecting the right hyperparameters
    ▸ Methods such as using seed words
THANK YOU
BACKUP SLIDES
NCI (National Cancer Institute) Data
[Diagram: NCI data fields: project ID, grant abstract, RCDC labels, funding, year, …]
[Figure: Number of tokens in a document. x-axis: number of tokens, 0 to 250; y-axis: 0 to 5 (×10^4).]
[Figure: Number of labels in a document. x-axis: number of labels per document, 5 to 30; y-axis: 0 to 10,000.]
Examples of Topics from NCI Abstracts (5 out of 88)
Brain Cancer: glioma, brain tumor, gbm, malignant glioma, glioblastoma, brain
Breast Cancer: breast cancer, women, breast cancer cell, breast, breast cancer patient, brca1
Kidney Disease: rcc, kidney cancer, renal cell carcinoma, vhl, renal cancer, pvhl
Hepatitis: hcv, hbv, liver cancer, hepatitis virus, hbv infection, hbv replication
Lung Cancer: glioma lung cancer, nsclc, lung, leading cause, cancer death, egfr
Labeled-LDA for NCI Grants
▸ 88 topics (RCDC codes)
▸ 10 background topics
▸ Hyperparameters
  ▸ Dirichlet prior for word-topic distribution
    ▸ $\beta_w = 0.01$
  ▸ Dirichlet prior for doc-topic distribution
    ▸ Used proportional alphas: $\sum_{c=1}^{88} \alpha_c = 5$ and $\sum_{b=1}^{B} \alpha_b = 1$
Analyzing Funding Patterns over Time
▸ Fractionally assign the funds in direct proportion to the probability
$$w_{cd} = \frac{p(c \mid d)}{\sum_k p(c = k \mid d)}, \qquad c = 1, 2, \ldots, 88$$
$$F_{cy} = \sum_{d:\, y_d = y} w_{cd}\, x_d$$
NCI (National Cancer Institute) Data
▸ 149,901 grants in total
  ▸ for FY1994 to FY2013
▸ Number of grants with labels: 31,628 (2008 to 2011)
▸ Number of grants without labels: 118,273
▸ Size of vocabulary (W): 29,713
[Figure: example word-count matrix with dimensions W (words) and D (documents).]