Datamining Project: Update
Marcus Gutierrez, Jessica Rebollosa, Vince (Wenqing) Sun, Lauren Waldrop
http://dataminingmed.weebly.com
Recap
• Data
▫ Non-homogeneous datasets (Clinical Trial/PubMed)
▫ Cancer-related
▫ Relations (explicit links)
• Motivation
▫ Implicit links between clinical trials and PubMed articles may exist
• Aim
▫ Provide scientists in the biological community insight into related clinical trials and/or other publications of interest
Data Preprocessing
• First Trial terms:
1. if. radiation therapy
2. i),. gemcitabine/cisplatin
3. weeks until. disease progression
4. this. regimen
5. serum levels of. interleukin-6
6. biliary adenocarcinomas
7. adenocarcinoma treated
8. post-operative adjuvant paclitaxel + cisplatin
9. phase ii trial of post-operative
10. cardia receiving. post-operative adjuvant paclitaxel
11. gastro-esophageal junction or cardia
Data Preprocessing
• LingPipe code gives terms such as the following, from running the statistical named entity recognizer trained with two models, pos-en-bio-medpost.HiddenMarkovModel and pos-en-general-brown.HiddenMarkovModel:
1. brain metastases
2. patients undergo
3. prophylactic cranial irradiation
4. brain
5. disease small cell lung cancer
6. cranial irradiation
7. health economics
8. therapy vs progression
9. administration
Refer to: http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html
Data Preprocessing
• Final terms:
Trials:
1. bronchi
2. bronchial
3. bsh
4. cachexia
5. calcimimetic
6. calcium
7. hybridization
8. hydrochloride
9. hydrocortisone
10. hydroxyproline
11. hypercortisolism
Refer to: https://github.com/tnunes/becas-python
Pubmed:
1. abdomen
2. acc
3. acetate
4. acitretin
5. actinomycin
6. add
7. dermatitis
8. desmoid
9. desmolase
10. desmoplastic
11. detoxification
Semantic group
Identified entity types:
➢ Chemicals
➢ Enzymes
➢ Genes
➢ Protein
➢ Disease or Syndrome
➢ Anatomical structure
➢ Body System
Entity Extraction - TFIDF
• Using textmodeler code
▫ Extract entities
▫ Calculate TFIDF
• Examples of Features:
▫ “thyroid cancer”
▫ “stem cell”
▫ “cell lung cancer tumor cells tumor cells” X
▫ “arms arm arm oxaliplatin arm arm” X
• Number of Unique Entities:
▫ Pubmed: 1696
▫ Trials: 1492
Term Extraction - TFIDF
• Implement Simple Code
▫ Term Extraction
▫ TFIDF Calculation
• Examples of Features:
▫ “mesothelioma”
▫ “adenocarcinoma”
▫ “neoplasia”
• Number of Unique Entities:
▫ Pubmed: 818
▫ Trials: 802
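The TFIDF calculation named above can be sketched in a few lines. This is a minimal illustration, not the project's code: it assumes term frequency normalized by document length and idf = log(N / df), since the slides do not state the exact weighting. The function name and the two sample documents (built from the example features) are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    tf  = term count / document length          (assumed variant)
    idf = log(N / number of docs with the term) (assumed variant)
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy documents, already tokenized.
docs = [
    ["mesothelioma", "adenocarcinoma", "mesothelioma"],
    ["adenocarcinoma", "neoplasia"],
]
weights = tfidf(docs)
```

A term that appears in every document ("adenocarcinoma" here) gets weight 0 under this variant, which is why such terms carry no information for clustering.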
Results
For the variance analysis, we removed the maximum threshold and used only a minimum threshold to see whether there were any improvements.
Results
We then chose thresholds of 0.00008 and 0.00007.
The two ACS figures are very similar.
Results
We then removed the terms with variance lower than the threshold and obtained the clusters before dependent clustering (K=10). After dependent clustering, however, there is only one giant cluster.
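The variance-based term removal described above can be sketched as follows. This assumes the population variance of each term's TFIDF weight across documents and keeps a term when its variance is at least the minimum threshold; the helper name and the sample weights are hypothetical.

```python
from statistics import pvariance

def remove_low_variance_terms(weights, min_variance):
    """Return the terms whose weight variance across documents is >= min_variance.

    `weights` is a list of {term: weight} dicts, one per document;
    a term missing from a document counts as weight 0.0.
    """
    terms = sorted({term for doc in weights for term in doc})
    kept = []
    for term in terms:
        column = [doc.get(term, 0.0) for doc in weights]
        if pvariance(column) >= min_variance:
            kept.append(term)
    return kept

# Hypothetical TFIDF weights for three documents.
weights = [
    {"tumor": 0.30, "patient": 0.010},
    {"tumor": 0.00, "patient": 0.012},
    {"tumor": 0.10, "patient": 0.011},
]
kept = remove_low_variance_terms(weights, min_variance=0.00008)
```

In this toy input "patient" has nearly the same weight everywhere, so it falls below the 0.00008 threshold and is dropped, while "tumor" is kept.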
Results
Next we used the same data set with additional preprocessing: we removed terms such as “and” and “or”.
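A minimal sketch of this stop word and punctuation removal step. The stop word list below is illustrative only (the project's actual list is not given in the slides), and the function name is hypothetical.

```python
# Illustrative stop word list; an assumption, not the project's list.
STOP_WORDS = {"and", "or", "the", "of", "in", "to", "a"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, strip edge punctuation, drop stop words."""
    tokens = text.lower().split()
    cleaned = (token.strip(".,;:()") for token in tokens)
    return [token for token in cleaned if token and token not in stop_words]

terms = remove_stop_words("Paclitaxel and cisplatin, or radiation therapy.")
```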
Results
This is the variance using the preprocessed data.
We set the thresholds to: 0.00005, 0.00006, 0.00007, 0.00008, 0.00009, 0.0001, 0.00011, 0.00012, 0.00013, 0.00014, 0.00015, 0.00016, 0.00017, 0.00018, 0.00019, 0.0002.
We set the threshold candidates to: 0.00003, 0.00006, 0.00008, 0.00009, 0.0001, 0.00011, 0.00012, 0.00013, 0.00014, 0.00015, 0.00016, 0.00017, 0.00018, 0.00019, 0.0002, 0.00021, 0.00022, 0.00023, 0.00024, 0.00025.
Results
K=5
Results
Before dependent clustering
Results
After dependent clustering
Results
When we considered the medical terms, used the dictionary to preprocess the data, and used entities as features, we obtained:
Results
The clustering results before dependent clustering:
Results
When we considered the medical terms, used the dictionary to preprocess the data, and used each term as a feature, we obtained:
Results
Before dependent clustering
Heterogeneous Naïve Bayes Classification
Find the probability that a relation exists for every document in corpus B, given every document in corpus A
Corpus A
doc t1 t2 t3 t4
A rat cat cat bat
B rat rat bat
C dog dog cat
D bird bat bat dog
Z cat bird dog
Corpus B
doc t1 t2 t3 t4 t5
1 trial boy boy sick
2 trial healthy girl
3 trial cancer treatment girl
4 trial cancer brain cancer
5 trial blind boy girl girl
6 trial brain cancer blind
Relational
Doc (A) Doc (B)
A 2,4
B 1,6
C 1,2
D 4
doc (A) class t1 t2 t3 t4
A trial rat cat cat bat
A healthy rat cat cat bat
A girl rat cat cat bat
A trial rat cat cat bat
A cancer rat cat cat bat
A brain rat cat cat bat
A cancer rat cat cat bat
B trial rat rat bat
B boy rat rat bat
B boy rat rat bat
B sick rat rat bat
B trial rat rat bat
B brain rat rat bat
B cancer rat rat bat
B blind rat rat bat
C trial dog dog cat
C boy dog dog cat
C boy dog dog cat
C sick dog dog cat
C trial dog dog cat
C healthy dog dog cat
C girl dog dog cat
D trial bird bat bat dog
D cancer bird bat bat dog
D brain bird bat bat dog
D cancer bird bat bat dog
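The expanded training table above appears to be built by pairing every term of a linked corpus B document, used as a class label, with the full term list of the linked corpus A document. A small sketch under that reading, with the data copied from the slides and a hypothetical function name:

```python
corpus_a = {
    "A": ["rat", "cat", "cat", "bat"],
    "B": ["rat", "rat", "bat"],
    "C": ["dog", "dog", "cat"],
    "D": ["bird", "bat", "bat", "dog"],
    "Z": ["cat", "bird", "dog"],
}
corpus_b = {
    1: ["trial", "boy", "boy", "sick"],
    2: ["trial", "healthy", "girl"],
    3: ["trial", "cancer", "treatment", "girl"],
    4: ["trial", "cancer", "brain", "cancer"],
    5: ["trial", "blind", "boy", "girl", "girl"],
    6: ["trial", "brain", "cancer", "blind"],
}
relations = {"A": [2, 4], "B": [1, 6], "C": [1, 2], "D": [4]}

def build_training_rows(corpus_a, corpus_b, relations):
    """One row (doc A id, class label, doc A terms) per term of each linked doc B."""
    rows = []
    for doc_a, linked in relations.items():
        for doc_b in linked:
            for label in corpus_b[doc_b]:
                rows.append((doc_a, label, corpus_a[doc_a]))
    return rows

rows = build_training_rows(corpus_a, corpus_b, relations)
```

Documents without explicit links (Z in corpus A; 3 and 5 in corpus B) contribute no training rows, which matches the table.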
docs doc 1 doc 2 doc 3 doc 4 doc 5 doc 6
doc A 0.001645 0.001628 0.002001 0.002752 0.001866 0.002091
doc B 0.011329 0.004655 0.007571 0.013608 0.013202 0.016166
doc C 0.010044 0.008304 0.006061 0.003992 0.011153 0.003543
doc D 0.000146 0.000146 0.000435 0.000851 0.000146 0.000561
doc Z 0.000584 0.000584 0.001033 0.001655 0.000584 0.001206
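Naïve Bayes Formulation
\[
\begin{aligned}
P(\text{doc 1} \mid \text{doc A}) \propto{} & P(\text{trial}) \cdot P(\text{rat} \mid \text{trial}) \cdot P(\text{cat} \mid \text{trial}) \cdot P(\text{cat} \mid \text{trial}) \cdot P(\text{bat} \mid \text{trial}) \\
{}+{} & P(\text{boy}) \cdot P(\text{rat} \mid \text{boy}) \cdot P(\text{cat} \mid \text{boy}) \cdot P(\text{cat} \mid \text{boy}) \cdot P(\text{bat} \mid \text{boy}) \\
{}+{} & P(\text{boy}) \cdot P(\text{rat} \mid \text{boy}) \cdot P(\text{cat} \mid \text{boy}) \cdot P(\text{cat} \mid \text{boy}) \cdot P(\text{bat} \mid \text{boy}) \\
{}+{} & P(\text{sick}) \cdot P(\text{rat} \mid \text{sick}) \cdot P(\text{cat} \mid \text{sick}) \cdot P(\text{cat} \mid \text{sick}) \cdot P(\text{bat} \mid \text{sick}) \\
\approx{} & 0.001645
\end{aligned}
\]
A rat cat cat bat
1 trial boy boy sick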
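The score for doc A and doc 1 can be reproduced with a short sketch. It rests on two assumptions: each class prior is the share of training rows carrying that label, and each per-term factor is the frequency of the corpus A term among the training rows of that class, with no smoothing. Under that reading the sketch gives 0.001645, matching the table. The function names are hypothetical; the data is copied from the slides.

```python
from collections import Counter, defaultdict

corpus_a = {
    "A": ["rat", "cat", "cat", "bat"],
    "B": ["rat", "rat", "bat"],
    "C": ["dog", "dog", "cat"],
    "D": ["bird", "bat", "bat", "dog"],
    "Z": ["cat", "bird", "dog"],
}
corpus_b = {
    1: ["trial", "boy", "boy", "sick"],
    2: ["trial", "healthy", "girl"],
    3: ["trial", "cancer", "treatment", "girl"],
    4: ["trial", "cancer", "brain", "cancer"],
    5: ["trial", "blind", "boy", "girl", "girl"],
    6: ["trial", "brain", "cancer", "blind"],
}
relations = {"A": [2, 4], "B": [1, 6], "C": [1, 2], "D": [4]}

# Training: every term of a linked corpus B document labels the corpus A document.
class_count = Counter()            # training rows per class label
term_count = defaultdict(Counter)  # corpus A term frequencies per class label
for doc_a, linked in relations.items():
    for doc_b in linked:
        for label in corpus_b[doc_b]:
            class_count[label] += 1
            term_count[label].update(corpus_a[doc_a])
total_rows = sum(class_count.values())

def class_score(label, features):
    """Class prior times the product of per-term frequencies (no smoothing)."""
    if class_count[label] == 0:
        return 0.0  # label never seen in training, e.g. "treatment"
    n_tokens = sum(term_count[label].values())
    score = class_count[label] / total_rows
    for term in features:
        score *= term_count[label][term] / n_tokens
    return score

def relation_score(doc_a, doc_b):
    """Sum the class scores over every term of the corpus B document."""
    return sum(class_score(label, corpus_a[doc_a]) for label in corpus_b[doc_b])

print(round(relation_score("A", 1), 6))  # 0.001645
```

Because nothing is smoothed, any class that never co-occurs with one of the document's terms contributes exactly zero to the sum.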
Naïve Bayes Laplace Smoothing
• This better handles terms that do not appear at all; however, we lose even more accuracy.
• This raises the question: Do we need to improve accuracy?
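For reference, the standard add-one (Laplace) estimate replaces count / total with (count + 1) / (total + vocabulary size), so an unseen term no longer zeroes out a whole product. The sketch below is illustrative; the slides do not show the project's exact formula. The numbers come from the toy corpus: the class "boy" has 12 training tokens, "rat" occurs 4 times, "bird" never, and corpus A has 5 distinct terms.

```python
def smoothed_likelihood(term_count, class_token_total, vocabulary_size, alpha=1.0):
    """Add-alpha estimate of P(term | class); alpha=1.0 is Laplace smoothing."""
    return (term_count + alpha) / (class_token_total + alpha * vocabulary_size)

# Unseen term: 0/12 becomes 1/17 instead of zero.
p_bird_given_boy = smoothed_likelihood(0, 12, 5)
# Seen term: 4/12 shrinks to 5/17, since probability mass moves to unseen terms.
p_rat_given_boy = smoothed_likelihood(4, 12, 5)
```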
Naïve Bayes Accuracy
• We MAY not need to improve accuracy
• We are more interested in relative ratings
4.214256697426845E-61 9.545714918515275E-62 6.375720538007726E-69 …
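Since only the relative order of such small scores matters, one option is to rank in log space, which keeps the ordering and avoids floating-point underflow when many probabilities are multiplied. This is a suggestion, not something the slides describe; all helper names and candidate labels below are hypothetical.

```python
import math

def log_product(factors):
    """Log of a product of probabilities, computed as a sum of logs."""
    if any(f <= 0.0 for f in factors):
        return float("-inf")  # a zero factor makes the whole product zero
    return sum(math.log(f) for f in factors)

def log_sum_exp(log_values):
    """Stable log(sum(exp(v))); adds per-class products that are held as logs."""
    finite = [v for v in log_values if v != float("-inf")]
    if not finite:
        return float("-inf")
    peak = max(finite)
    return peak + math.log(sum(math.exp(v - peak) for v in finite))

def rank(log_scores):
    """Candidate ids ordered from highest to lowest log score."""
    return sorted(log_scores, key=log_scores.get, reverse=True)

# 300 factors of 0.001 underflow to 0.0 as a plain product but not as a log.
naive = math.prod([1e-3] * 300)
stable = log_product([1e-3] * 300)

# Hypothetical labels attached to the three values shown above.
log_scores = {
    "candidate 1": math.log(4.214256697426845e-61),
    "candidate 2": math.log(9.545714918515275e-62),
    "candidate 3": math.log(6.375720538007726e-69),
}
order = rank(log_scores)
```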
Naïve Bayes Future Improvement
• Improve accuracy
• Improve speed
• Determine criteria for predicting new links
• Find out if new links improve or harm dependent clustering
Contributions from each Member
Contribution: SW/P Removal (1), Non-BT Removal (2), VTR (3), TFIDF Terms, TFIDF Entities, DC (4), DenC (5), NB (6), DA & DV (7), Web-site (8)
Jessica X X X X X
Lauren X X X X X X
Marcus X
Vince X X X X
1: Stop Word/Punctuation Removal
2: Non-biological term removal
3: Variance Term Removal
4: Dependent Clustering
5: Density Clustering
6: Naïve Bayes – New Algorithm
7: Data Analysis & Data Visualization