Fostering Serendipity through Big Linked Data
Muhammad Saleem, Maulik R. Kamdar, Aftab Iqbal, Shanmukha Sampath, Helena F. Deus, and Axel-Cyrille Ngonga Ngomo
Semantic Web Challenge at ISWC2013, October 21-25, 2013, Sydney, Australia
Agenda
• Motivation
• Datasets
• Architecture
• Evaluation
• Requirements
• Demo
• Conclusion and Future Work
Motivation
Fostering Serendipity through Big Data Triplification, Continuous Integration,
and Visualization
Triplification: Linked TCGA
• TCGA is a publicly accessible atlas of cancer-related data from the National Cancer Institute (NCI)
– 9000 patients
– 33 cancer types
– 147,645 raw data files
– 12.7 TB
• This is only 46% of the total expected data, with new data being submitted every day
• The goal is to enable cancer researchers to make and validate important discoveries
• Total Linked TCGA > 30 billion triples (largest dataset of LOD)
Triplification: PubMed
• Collection of publications from the biomedical domain
• Large amount of metadata (MeSH terms)
• 23+ million publications
• 10,000 new publications/month
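To make the triplification step concrete, here is a minimal sketch of turning one publication record into N-Triples lines. It is an illustration under stated assumptions, not the project's RDFizer: the `example.org` namespaces, the record layout, and the helper names `escape_literal` and `triplify` are hypothetical, and the sample record is made up. Only the two Dublin Core property URIs are real vocabulary terms.

```python
# Hypothetical namespaces: placeholders, not the vocabulary used by the project.
PUB_NS = "http://example.org/pubmed/"
MESH_NS = "http://example.org/mesh/"
DCT_TITLE = "http://purl.org/dc/terms/title"
DCT_SUBJECT = "http://purl.org/dc/terms/subject"


def escape_literal(text):
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def triplify(record):
    """Turn one publication record (a dict) into a list of N-Triples lines."""
    subject = "<" + PUB_NS + record["pmid"] + ">"
    title = escape_literal(record["title"])
    lines = [f'{subject} <{DCT_TITLE}> "{title}" .']
    for term in record["mesh_terms"]:
        # One triple per MeSH term; spaces are replaced to keep the URI valid.
        term_uri = MESH_NS + term.replace(" ", "_")
        lines.append(f"{subject} <{DCT_SUBJECT}> <{term_uri}> .")
    return lines


# Made-up record, for illustration only.
record = {
    "pmid": "0000001",
    "title": 'A "sample" title',
    "mesh_terms": ["Neoplasms", "Gene Expression"],
}
triples = triplify(record)
print("\n".join(triples))
```

Each record yields one title triple plus one triple per metadata term, so the output can be appended to a file and bulk-loaded into a triple store.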
Big Data Continuous Integration
[Figure: system architecture. TopFed (Parser, Federator, Optimizer, Integrator, Index) receives a SPARQL query, issues sub-queries to three SPARQL endpoints over RDF data, and returns the results. Data sources: TCGA Data Portal and PubMed (Entrez Utilities), processed by the RDFizer and Auto Loader.]
Predicate and class sets
A = {chromosome, result, bcr_patient_barcode}
G = {start, stop}
D = {seg_mean, rpmmm, scaled_est, p_exp_val}
C = {CNV, SNP, E-Gene, E-Protein, miRNA, Clinical}
E = {RPKM}
F = {Expression-Exon}
M = {beta_value, position}
B = {DNA-Methylation}

For each query triple t(s, p, o) ∈ T
C-1 = {{p ∈ D ∪ A ∪ G} ∨ {p = rdf:type ∧ o ∈ C}} ∧ {{S-Join(p, D ∪ C) ∨ P-Join(p, D ∪ C)} ∨ {!S-Join(p, M ∪ B ∪ E ∪ F) ∧ !P-Join(p, M ∪ B ∪ E ∪ F)}}
C-2 = {{p ∈ E ∪ A ∪ G} ∨ {p = rdf:type ∧ o ∈ F}} ∧ {{S-Join(p, E ∪ F) ∨ P-Join(p, E ∪ F)} ∨ {!S-Join(p, M ∪ B ∪ D ∪ C) ∧ !P-Join(p, M ∪ B ∪ D ∪ C)}}
C-3 = {{p ∈ M ∪ A} ∨ {p = rdf:type ∧ o ∈ B}} ∧ {{S-Join(p, M ∪ B) ∨ P-Join(p, M ∪ B)} ∨ {!S-Join(p, E ∪ F ∪ D ∪ C) ∧ !P-Join(p, E ∪ F ∪ D ∪ C)}}

Categories
• C-1 category (CNV, SNP, E-Gene, miRNA, E-Protein, Clinical): colour = blue
• C-2 category (Exon-Expression): colour = pink
• C-3 category (Methylation): colour = green

IF tumour lookup is successful, forward to the corresponding leaf; ELSE broadcast to every endpoint

SPARQL endpoints and the tumours they hold
• b1, b2: 1-16, 17-33
• p1 to p6: 1-5, 6-11, 12-16, 17-22, 23-27, 28-33
• g1 to g9: 1-4, 5-8, 9-12, 13-16, 17-20, 21-24, 25-27, 28-30, 31-33
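The category rules and the tumour lookup above can be sketched in a few lines of Python. This is a reading of the slide under explicit assumptions, not TopFed's actual code. The slide does not define S-Join and P-Join, so the sketch assumes a subject-subject join and an object-subject (path) join with another triple pattern of the query. It also assumes that endpoint labels pair with tumour ranges in reading order, that b, p and g stand for the blue, pink and green categories, and that a failed lookup broadcasts to all endpoints of the selected category. All function names are hypothetical.

```python
RDF_TYPE = "rdf:type"

# Predicate and class sets as listed on the slide.
A = {"chromosome", "result", "bcr_patient_barcode"}
G = {"start", "stop"}
D = {"seg_mean", "rpmmm", "scaled_est", "p_exp_val"}
C = {"CNV", "SNP", "E-Gene", "E-Protein", "miRNA", "Clinical"}
E = {"RPKM"}
F = {"Expression-Exon"}
M = {"beta_value", "position"}
B = {"DNA-Methylation"}

# Per category: (own predicates, own classes, shared predicates).
CATEGORIES = {
    "C-1 (blue)": (D, C, A | G),
    "C-2 (pink)": (E, F, A | G),
    "C-3 (green)": (M, B, A),
}

# Assumed pairing of endpoint labels with tumour ranges (reading order).
ENDPOINTS = {
    "C-1 (blue)": [("b1", 1, 16), ("b2", 17, 33)],
    "C-2 (pink)": [("p1", 1, 5), ("p2", 6, 11), ("p3", 12, 16),
                   ("p4", 17, 22), ("p5", 23, 27), ("p6", 28, 33)],
    "C-3 (green)": [("g1", 1, 4), ("g2", 5, 8), ("g3", 9, 12),
                    ("g4", 13, 16), ("g5", 17, 20), ("g6", 21, 24),
                    ("g7", 25, 27), ("g8", 28, 30), ("g9", 31, 33)],
}


def mentions(pattern, names):
    """True if the pattern's predicate, or its rdf:type object, is in names."""
    _, p, o = pattern
    return p in names or (p == RDF_TYPE and o in names)


def s_join(t, query, names):
    """Assumed S-Join: another pattern shares t's subject and mentions names."""
    return any(u != t and u[0] == t[0] and mentions(u, names) for u in query)


def p_join(t, query, names):
    """Assumed P-Join: another pattern is chained to t via object-subject."""
    return any(u != t and (u[2] == t[0] or u[0] == t[2]) and mentions(u, names)
               for u in query)


def categories_for(t, query):
    """Apply the C-1, C-2, C-3 rules to one triple pattern of the query."""
    _, p, o = t
    selected = []
    for name, (preds, classes, shared) in CATEGORIES.items():
        own = preds | classes
        others = set()
        for other, (pr, cl, _) in CATEGORIES.items():
            if other != name:
                others |= pr | cl
        relevant = p in (preds | shared) or (p == RDF_TYPE and o in classes)
        joined = s_join(t, query, own) or p_join(t, query, own)
        isolated = not s_join(t, query, others) and not p_join(t, query, others)
        if relevant and (joined or isolated):
            selected.append(name)
    return selected


def route(category, tumour=None):
    """Forward to the matching leaf if the tumour is known, else broadcast."""
    leaves = ENDPOINTS[category]
    if tumour is not None:
        hits = [label for label, lo, hi in leaves if lo <= tumour <= hi]
        if hits:
            return hits
    return [label for label, _, _ in leaves]


query = [("?s", "bcr_patient_barcode", "?b"), ("?s", "beta_value", "?v")]
for t in query:
    for category in categories_for(t, query):
        print(t, category, route(category, tumour=7))
```

In this example the shared predicate `bcr_patient_barcode` would on its own match all three categories, but its subject join with `beta_value` restricts it to the methylation category, which is how the join conditions cut down the number of endpoints contacted.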
Highly Scalable
Evaluation: Number of Sub-Queries Submitted
• TopFed submits about one third as many sub-queries as FedX
• Number of ASK requests
– FedX: 480
– TopFed: 10
[Figure: bar chart of the number of sub-queries submitted by FedX and TopFed for queries 1-10 and on average (y-axis from 0 to 60).]
Evaluation: Query Runtime
[Figure: query execution time in msec (log scale, 10 to 100,000) of FedX and TopFed for queries 1-10 and on average.]
• TopFed significantly outperforms FedX on 90% of the queries
• On average, the query runtime of TopFed is about one third of that of FedX
• TopFed's best runtimes (query 2, query 3) are more than 75 times smaller than those of FedX
Big Data Track Requirements
• Data Volume
– 7.36 billion triples from Linked TCGA
– 23 million publications from PubMed
• Data Variety
– The Linked TCGA data was extracted from raw text files of different structures
– The metadata associated with PubMed publications was processed and transformed into RDF
– Unstructured data (publication abstracts) is processed to extract mentions of gene names and cancers
• Data Velocity
– TCGA data doubles every 2 months
– About 10,000 new PubMed publications per month
Big Data Visualization
Tumor-wise Visualization
PubMed Paper-wise Visualization
Genome-wise Patients Results Visualization
Everything is Public
• Demo: http://srvgal78.deri.ie/tcga-pubmed/
• TopFed: https://code.google.com/p/topfed/
• TCGA Data Refiner, RDFizer: http://goo.gl/vSnBEJ
• Utilities: http://goo.gl/kNrFdI
• Linked TCGA: http://tcga.deri.ie/
AKSW, University of Leipzig, Germany