genomic data model and genometric query language as
TRANSCRIPT
Dipartimento di Elettronica, Informazione e
Bioingegneria
Genomic ComputingPolitecnico di Milano 13-17 March, 2017
Genomic Data Model and GenoMetric Query Language as research enabler to discover genome propertiesMarco Masseroli and Stefano Ceri(joint work with several PhD students)
Politecnico di Milano, BioInformatics Group
2
Genomic Computing Background: Next Generation Sequencing
• Next Generation Sequencing technology is about to provide affordable (in time and cost) and precise determinations of genome wide:
– DNA sequence / variations (DNA-seq)– gene subregions’ activity (RNA-seq) [all gene test]– protein-DNA interaction regions (ChIP-seq) – open chromatin (DNase-seq)
Goal of $1,000 full genome sequencing in under an hour has just met
• Very many DNA-interacting proteins / subjects / conditions will be soon evaluated
– Personalized medicine (diagnosis and treatment)– Each NGS test can generate 0.4TB -> Big Data scenario
Source: http://blog.goldenhelix.com/grudy/a-hitchhiker%E2%80%99s- guide-to-next-generation-sequencing-part-2/
Genomic Computing Big data analysis with Next Generation Sequencing
My talk
3
4
HeterogeneousGenomic
Data Sources
HeterogeneousClinic
Data SourcesPersonal Patient Data
Genome Browser Genome Browser
Data Analytics
Data Analytics
Ontological Knowledge Ontological Knowledge
Genom
ic Data Integration
Publishing / Crawling / Searching
Genom
ic Data Integration
Publishing / Crawling / Searching
Clinical D
ata IntegrationC
linical Data Integration
Biologist ClinicianMedical Literature
Clinical Protocols
Genomic Computing The big picture: Distributed heterogeneous data
5
A n
umbe
r of g
enom
ic fe
atur
es (t
rack
s)
One macro genomic region
Dat
a tra
cks
Genomic Computing Current practice – UCSC Genome Browser
6
Genomic Computing The challenge: Understanding biologists’ needs
• Working together with biologists for giving answers to the problems behind the «courtesy» slide
Courtesy of Prof. Pelicci, IEO
7
Genomic Computing Challenge: Genotype–phenotype discovery
• (Epi)genotype-phenotype relationship discovery: understanding genomic regions, genome variations and their associations with different phenotypes
– highly heterogeneous scenario
• It requires evaluating, in several different conditions and types of individuals:
– genome (DNA) sequence variations – gene activity & its regulation– occurring interactions
8
Genomic Computing Main questions
Scientist’s typical questions(from our interaction with IEO - European Oncology Instituteand IIT - Italian Institute of Technology)
• Can interesting DNA regions and their relationships be discovered using genome-wide queries?
• Can genomic data of patients be grouped according to clinical phenotype and compared?
• Can the genomic features of all the genes involved in the same biological process be extracted and then analyzed?
• Can we retrieve portions of the genome of given patients, extracting them from remote servers and comparing them?
9
Genomic Computing Research agenda
• Can interesting DNA regions and their relationships be discovered using genome-wide queries?
Genometric query language• Can genomic data of patients be grouped according to
clinical phenotype and compared? Genometric query language + clustering
• Can all the features of the genes involved in the same biological process be extracted and then analyzed?
Genometric query language + data analysis
• Can we retrieve portions of the genome of given patients, extracting them from remote servers and comparing them?
Genometric query language + indexing & search
10
Genomic Computing Research agenda by topics
• Data model: design a simple and format-independent data model for describing datasets with both genomic regions and general provenance information (including phenotype)
• Query language: design a query language where both genometric aspects (about the placement of regions on the genome) and provenance can be queried at a high level of data independence and transparency
• Integrative data analysis: translating query results into a genome space which is the ideal start point for correlation and network analysis
• Data search: design protocols for data crawling and indexing based on the data model
Genomic Data Model
11
12
Genomic Computing Genomic Data Model
Within the same sample, two kinds of data:• Region values aligned w.r.t. a given reference, with specific
left-right ends within a chromosome, and with several associated attributes (e.g. p-value of region significance)
• Metadata, with free-format attribute-value pairs, storing all the knowledge about the sample
A C G T T A A C G G A T A C C A A C
left (position) right (position)
chr (chromosome)
strand (direction)
DNAr
Genomic Computing Model rationale
• Regions of the model are data format independent and provide an interoperability framework for comparing data on mutations, expression or regulation using regions as common ground
• Metadata attribute-value pairs of the model are info-system independent and provide an interoperability framework for comparing samples based upon their biological aspects
13
14
Genomic Computing Genomic Data Model – Example
0.1 0.6 Tumor_type = brcaPatient_age = 75
0.5 0Tumor_type = brcaPatient_age = 63Sex = Female
0.1 0 0.8 0.1Tumor_type = brcaPatient_age = 58
Sample 1
Sample 3
Sample 2
15
Genomic Computing Genomic Data Model – Example
• Region values: {expID, region:(chr, left, right, strand), p-value}
• Metadata: {expID, attribute, value}
Genomic ComputingGenomic Data Model - Samples and datasets
16
Samples and datasets• Every sample corresponds to an «experiment», with an ID• Every dataset is a named collection of samples with the
same region data schemaData format independent; interoperability framework for comparing data samples based upon their biological aspects
Genomic ComputingGenomic Data Model - Mapping examples
17
Genomic ComputingGenomic Data Model - Mapping examples
18
Genomic ComputingGenomic Data Model - Mapping examples
19
DNA-seq (mutations)(id, ('chr,start,stop,strand), (A,G,C,T,del,ins,inserted,ambig,Max,Error,A2T,A2C,A2G,C2A,C2G,C2T))(1, (chr1, 917179, 917180,*), (0,0,0,0,1,0,'.','.',0,0,0,0,0,0,0,0))(1, (chr1, 917179, 917179,*), (0,0,0,0,0,1,G,'.',0,0,0,0,0,0,0,0))
RNA-seq (gene expression)(id, ((chr,start,stop,strand), (source,type,score,frame,geneID,transcriptID,RPKM1,RPKM2,iIDR))(1, (chr8, 101960824, 101964847,-), ('GencodeV10', 'transcript', 0.026615, NULL, 'ENSG00000164924.11', 'ENST00000418997.1', 0.209968, 0.193078, 0.058))
Annotations(id, (chr,start,stop,strand), (proteinID,alignID,type))(1, (chr1, 11873, 11873, +), ('uc001aaa.3', 'uc001aaa.3', 'cds')) (1, (chr1, 11873, 12227, +), ('uc001aaa.3', 'uc001aaa.3', 'exon')) (1, (chr1, 12612, 12721, +), ('uc001aaa.3', 'uc001aaa.3', 'exon')) (1, (chr1, 13220, 14409, +), ('uc001aaa.3', 'uc001aaa.3', 'exon'))
ChIA-PET (denoting 3D genomic loops, head is assembled with coordinates, tail is in the schema)(id,(chr,headstart,headstop,strand), (loopType, tailChr, tailStart, tailStop, PETcount, pValue, qValue))(1, (chr1,7385626,7389841,*), ('Inter-Chromosome', chr17, 3081653, 3084755, 50, 0.0, 0.0)
20
Genomic ComputingGenomic Data Model - Other mapping examples
Query Language
21
(Motivational example and detailed description)
22
Genomic Computing GMQL motivational example
The language allows for queries on the genome involving large datasets describing:
• Genomic signals (i.e. experiment dataset regions)• Reference regions (e.g. TSS, genes, promoters, enhancers)• Distance rules (e.g. the nearest enhancer that stands
at least at 100 kb from the nearest gene)
Enhancer Promoter
GeneReference DNAExperimental dataset 1
Distance pattern
Experimental dataset 2
Experimental dataset 3
Reference regions
23
Identification of distal bindings in transcription regulatory regionsFind all CTCF transcription factor (TF) binding regions of ChIP-seq data regarding human cancer cell line HeLa-S3, which are farther than x kb (e.g. 1000 kb) from the TSS (transcription start site) of the nearest gene. Then, find all H3K4me1 histone modification (HM) regions that are also farther than x kb from the TSS of the nearest gene.Finally, consider known enhancer (EN) regions and return a list of EN-HM-overlapping TF regions.
Genomic Computing GMQL motivational example – Distal bindings
Nearest gene
HM
TF1
TF2
REFx
TSS
EN
GMQL result
HM: Histone mark experiment
TF: Transcription factorexperiment
REF: Reference DNA regionsEN: Enhancerx: Threshold distance
DNA region
Not respectingthe distance thresholdNot EN or HMoverlapping
24
Genomic Computing GMQL motivational example – Distal bindings
HM = SELECT(dataType == 'ChipSeq‘ AND cell == 'HeLa-S3‘AND antibody == ‘H3K4me1') PEAK;
TF = SELECT(dataType == 'ChipSeq‘ AND cell == 'HeLa-S3‘AND antibody == ‘CTCF') PEAK;
TSS = SELECT(type == ‘TSS‘) ANNOTATION;EN = SELECT(type == ‘enhancer‘) ANNOTATION; HMa = JOIN(minDistance(5) AND distance > 1000000, right) TSS HM;TFa = JOIN(minDistance(5) AND distance > 1000000, right) TSS TF;TFb = JOIN(distance < 0, left) TFa EN;HMb = JOIN(distance < 0, left) HMa EN;TF_res = JOIN(distance < 0, left) TFb HMb;
Nearest gene
HM
TF1
TF2
REFx
TSS
EN
GMQL result
HM: Histone mark experiment
TF: Transcription factorexperiment
REF: Reference DNA regionsEN: Enhancerx: Threshold distance
DNA region
Not respectingthe distance thresholdNot EN or HMoverlapping
25
Genomic Computing GenoMetric Query Language
GenoMetric Query Language (GMQL) is defined as a sequence of algebraic operations following the structure:
< variable > = < operator > (< parameters >) < variable >
– Every variable is a dataset including many samples
– Offers high-level, declarative operations which operate both on regions and meta-data -> thus, each operation progressively builds the regions and meta-data of its result
– Inspired by SQL and Pig Latin
– Targeted towards cloud computing
Genomic Computing Overall view on GMQL operations
Classic relational operations – with genomic extensions• SELECT, PROJECT, EXTEND, ORDER, GROUP, MERGE, UNION, DIFFERENCE
Domain-specific genomic operations:
• COVER, (GENOMETRIC) JOIN, MAPUtilities:
• MATERIALIZE
27
28
Genomic Computing Sample selection – Example SELECT
0.1 0.6 Tumor_type = brcaPatient_age = 75
Selection of the samples where a selection predicate p is true (e.g. select patients younger than 70 years)
0.5 0Tumor_type = brcaPatient_age = 63Gender = Female
0.1 0 0.8 0.1Tumor_type = brcaPatient_age = 58
S2 = SELECT(p) S1;
Example: S2 = SELECT(Patient_age < 70) S1;
30
Genomic Computing Region projection – Example PROJECT
0.1 0.6 Tumor_type = brcaPatient_age = 75
0.5 0Tumor_type = brcaPatient_age = 63Gender = Female
0.1 0 0.8 0.1Tumor_type = brcaPatient_age = 58
Selection of the regions where a selection predicate p is true (e.g. select those regions which have a score greater than 0.5)
S2 = PROJECT(p) S1;
Example: S2 = PROJECT(score > 0.5) S1;
31
Genomic Computing Region projection – Example PROJECT
Tumor_type = brcaPatient_age = 75
Tumor_type = brcaPatient_age = 75
Projection of the regions: for each gene in a set, take its promoter (e.g. from -2kbp, to +1kbp from the TSS)
S2 = PROJECT(p) S1;
Example: S2 = PROJECT(start = start – 2000, stop = start + 1000) S1;
S1
S2
33
Genomic Computing Metadata aggregation – Example AGGREGATE
5 3102 2
5 01
Tumor_type = brcaPatient_age = 75Region_count = 3
Tumor_type = escaPatient_age = 78Region_count = 5
5 3
Tumor_type = cholPatient_age = 85Region_count = 2
Count the regions in each sample and store it in metadata
S2 = AGGREGATE(Ai AS gi) S1;
Example: S2 = AGGREGATE(Region_count AS COUNT) S1;
34
Genomic Computing Order and select top k – Example ORDER
Tumor_type = brcaPatient_age = 75Region_count = 3Order = 2
Tumor_type = escaPatient_age = 78Region_count = 5Order = 1
5 01
5 3102 2
5 3
Tumor_type = cholPatient_age = 85Region_count = 2Order = 3
Order by region_count metadata and take the top 2 samplesS2 = ORDER(Ai; [TOP: k]) S1;
Example: S2 = ORDER(Region_count; TOP: 2) S1;
35
Genomic Computing Group by metadata – Example GROUP
5 01
Tumor_type = brcaPatient_age = 75Group = 1Min = 0
5 3102 1
Tumor_type = escaPatient_age = 78Group = 2Min = 1
5 3
Tumor_type = cholPatient_age = 87Group = 3Min = 3
534 6
Tumor_type = escaPatient_age = 78Group = 2Min = 1
Group samples according to the value of tumor and compute the region minimum score of each group
36
Genomic Computing Region merge – Example MERGE
Type = ChipSeqAntibody = CTCFReplicate = 1Type = ChipSeqAntibody = CTCFReplicate = 2
Type = ChipSeqAntibody = CTCFReplicate = 3
Type = ChipSeqAntibody = CTCFReplicate = 1Replicate = 2Replicate = 3
Collapse a bunch of samples (both region and metadata) into an unique one S2 = MERGE() S1;
S1.s1
S1.s2
S1.s3
S2
37
Genomic Computing Region union – Example UNION
Tumor_type = brcaExperiment = mirna
Tumor_type = brcaExperiment = rnaseq
Return a single dataset with all the samples in two input datasets, merging their region attributes if different
Tumor_type = brcaExperiment = mirna
Tumor_type = brcaExperiment = rnaseq
S3 = UNION() S1 S2;
S1
S2
S3
38
Genomic Computing Region difference – Example DIFFERENCE
Return all the regions in the first dataset that do not overlap any region in the second one
Tumor_type = brcaExperiment = mirna
Tumor_type = brcaExperiment = rnaseq
S1.Tumor_type = brcaS1.Experiment = rnaseqS2.Tumor_type = brcaS2.Experiment = mirna
S1
S2
S3
S3 = DIFFERENCE() S1 S2;
COVER(ALL,ALL) AND COVER(2,ANY)COVER(1,ANY) OR
S2 = COVER(min, max) S1;
39
Genomic Computing Dataset operations: COVER
• Jaccard indexes can be used instead of min-max• An aggregate function f can be computed for regions
forming the cover
• Produces new regions where there are between MIN and MAX regions of the operand dataset
40
Tumor_type = brcaTumor_grade = g3
Tumor_type = brcaTumor_grade = g2
Tumor_type = brcaTumor_grade = g2
Tumor_type = brcaTumor_grade = g2Tumor_grade = g3
2 2 3 1 2 1 1
COVER(2, ANY): find portions of the genome that are covered by at least two regions
Genomic Computing Region Cover – Example COVER
S2 = COVER(2, ANY) S1;
S1.s1
S1.s2
S1.s3
S2 23 22
• Given two sets of samples, JOIN builds the pairs of regions and metadata where a join predicate p is true.
• Region of results are composed from regions of the operands
S3 = JOIN(p, comp-op) S1 S2;
• Functions minDistance and distance can be used in the predicate
Genomic Computing Dataset operations: JOIN
41
42
Genomic Computing Metadata join – Example JOIN
Metadata join: select pairs of matching samples (e.g. with the same “Type”)
Type = uvmPatient = 123Gender = M
Type = brcaPatient = 10Age = 88
Type = brcaPatient = 211
Type = sarcPatient = 12
Type = sarcPatient = 444Age = 88
Type = brcaPatient = 333Grade = g3
Type = sarcPatient = 12
43
Join at min-distance: associate each region in the former dataset with the closest in the latter
Genomic Computing Region Join – Example JOIN
A B feature = transcripts
2 feature = TFBSs
S1.feature = transcriptsS2.feature = TFBSs
S3 = JOIN(mindistance, RIGHT) S1 S2;
1-A 3-B
1 3
S1
S2
S3
S2 = JOIN(distance < 1000, CON) R, S1;
d = 400
d = 2500 d = 1100
d = 0
Matching pairs and region
composition
R
3 5
e1
3 5 e2
All pairs
44
Genomic Computing Dataset operations: JOIN example
53
R
e1
S2 = MAP(newAttr AS MIN(attr)) R S1;
2 1 3
2 1 43 2
1
2 1
S1
R
S2
45
Genomic Computing Dataset operations: MAP
• Computes aggregate functions over samples of S1 which intersect with the regions of R
0
46
Compute an aggregate function (e.g. COUNT) on al the regions intersecting the reference
Genomic Computing Region Map – Example MAP
annotation = genesprovider = RefSeqA B
feature = SNP
2 4
R.annotation = genesR.provider = RefSeqS1.features = SNP
S2 = MAP(count_R_S1 AS COUNT) R S1;
R
S1
S2
47
Genomic Computing MAP opens to Genome Space abstraction
• MAP operations, through reference regions R, extract and standardize genomic features expressed in distinct datasets
• Genome Space: simplified structured outcome, ideal format for data analysis
R1 R2 R3 DHS
RNAPII H3K4me1
… Gene A TSS Enhancer E
GMQL MAP
48
Genomic Computing Next: MAP – Genometric space abstraction
• Genometric spaces represent adjacency matrices, i.e. networks
– Network analysis methods (e.g. page rank, hub/authority, community detection, …)
R1 R2 R3
DHS RNAPII H3K4me1
… Gene A TSS
Enhancer E
GMQL MAP
49
Genomic Computing Pattern-based querying & clustering
On the Genome Space (rows = regions, columns = datasets):• Convert genome space values (e.g. to Boolean format)• Extract patterns (regions with given values in different datasets,
i.e. with same genomic traits)• Filter rows (find regions with given values in given datasets)• Cluster rows (based on identity or similarity)• Cluster columns (based on identity or similarity)• Extract relevant row attributes (e.g. common aspects of clustered
regions)• Extract relevant column attributes (e.g. common aspects of
clustered datasets, e.g. prevalent phenotype)G1 e1 e2 e3 e4 e5 e6 e7 e8 e9 e10 e11
r1 X X x Xr2 X X X Xr3 X Xr4 X Xr5 X Xr6 X X X
51
Genomic Computing MAP result visualization: Genome Browser
Res = MAP(mutCount AS count) Genes Dataset;
Dataset
Genes
Res
52
Genomic Computing MAP result visualization: Heatmap viewer
It requires: • Partitioning by experiment classes• Adding names to regions and to experiments (from metadata)• Adding colors
53
Genomic Computing Data Viewer: Region clustering
Cluster3 (low densitypattern-2)
Cluster2 (low densitypattern-1)
Cluster1 (high density)
Cluster4 (basal)
54
Genomic Computing Data Viewer: MAP result visualization
55
Genomic Computing Data Viewer: Dendrogram
Cut-1: 2 clusters
Cut-2: 4 clusters
Distance
56
Genomic Computing Data Viewer: Pattern extraction on samples
57
Genomic Computing Data Viewer: Metadata aggregation
For biological/clinical interpretation of genomic data processing, and data stratification based on of biological/clinical metadata values and/or patterns of different genomic feature regions
GMQL at work(Examples)
59
• Annotating unknown data to known annotations(e.g. annotating transcription factors to genes)
• Defining (epi)genomic features on the base ofexperimental data (e.g. putative enhancers)
• Searching for patterns of (epi)genomic features inexperimental data and relating them to specific biological phenomena (e.g. short-range interactions)
• Looking for patterns of (epi)genomic features related tothe tridimensional structure of the genome and relatingthem to specific biological phenomena (e.g. long-range interactions)
Genomic Computing applicationsTypical GMQL applications
60
61
Genomic Computing GMQL examples – Genetic mutations
Finding genetic mutations in CpG IslandsConsider DNA-seq data of distinct human cancer cell lines; for each of them quantify the mutations in each CpG island.Then, select the CpG islands with at least a mutation.Return the list of cancer cell lines ordered by the number of such CpG islands.
MUT = SELECT(dataType == ‘DnaSeq‘ AND cell_karyotype == ‘cancer‘) EXP;CpG = SELECT(type == ‘CpG islands') ANNOTATION;
CpG1 = MAP(MutCount AS COUNT) CpG MUT;CpG2 = PROJECT(MutCount > 0) CpG1;CpG3 = AGGREGATE(CpGCount AS COUNT) CpG2;CpG_res = ORDER(DESC CpGCount) CpG3;MATERIALIZE CpG_res;
62
Genomic Computing GMQL examples – Heterogeneous datasets
Combining ChIP-seq and DNase-seq data in different formats and sourcesExtract broad peaks of ChIP-seq transcription factor binding sites and histone modifications from ENCODE samples that intersect DNase-seq open chromatin regions from Roadmap Epigenomics in normal H1 embryonic stem cells.
CHIPSEQ = SELECT(dataType == 'ChipSeq' AND view == 'Peaks' AND setType == 'exp' AND cell == 'H1-hESC') HG19_ENCODE_BROAD;
DNASESEQ = SELECT(assay == 'DNase.hotspot.broad' AND Standardized_Epigenome_name == 'H1 Cells') HG19_ROADMAP_EPIGENOMICS_BED;
DNASESEQ1 = COVER(1, ANY) DNASESEQ;
CHIPSEQ_IN_DNASESEQ = JOIN(distance < 0, project_right_distinct)DNASESEQ1 CHIPSEQ;
MATERIALIZE CHIPSEQ_IN_DNASESEQ;
63
Genomic Computing GMQL examples – Heterogeneous datasets
Associating transcriptomics and epigenomicsIn RNA-seq experiments of human cancer cell line HeLa-S3, find the average expression of each gene. Then, in ChIP-seq experiments of the same cell line, in each gene find the average signal of H3K27me3 and H3K4me3 histone modifications. Map these experiments to known genes.
Based on all such average values, evaluate the pairwise similarity of each gene with the BRCA1 gene, and return the ordered list of the 20 genes most similar to BRCA1.Also, through bi-clustering, identify and return groups of genes and their histone and transcription signals with similar patterns.
GENES = SELECT(type == ‘gene‘ AND provider == ‘UCSC’) ANNOTATION; RNA = SELECT(dataType == ‘RnaChip‘ AND cell == 'HeLa-S3‘) PEAK;HM = SELECT(dataType == ‘ChipSeq‘ AND cell = ‘HeLa-S3‘ AND
(antibody == ‘H3K27me3‘ OR antibody == ‘H3K4me3‘)) PEAK;
EXP = UNION RNA HM;GenomeSpace = MAP(EXPavg AS avg(signal)) GENES EXP;……
64
Genomic Computing GMQL examples – Patient datasets
Counting distinct DNA mutations in patient groupsGroup patients by tumor type and ethnicity, and count the distinct DNA somatic mutations in each group.
MUTATION = SELECT(dataType == ‘dnaseq’) TCGA_dnaseq;
MUTATION_BY_RACE = COVER(1, ANY; GROUP BY tumor_tag, race;overlap_count AS COUNT, barcodes AS BAG(tumor_sample_barcode)) MUTATION;
MUTATION_COUNT = AGGREGATE(mutation_count AS COUNT) MUTATION_BY_RACE;
MATERIALIZE MUTATION_COUNT;
65
Genomic Computing GMQL examples – Different datasets of patients
Combining different data types of multiple patients Match DNA copy number variation (CNV) and microRNA (miRNA) data samples regarding the same biospecimen and extract the CNVs occurring within expressed miRNA genes in the paired samples.
CNV = SELECT(dataType == ’cnv’) TCGA_cnv;
MIRNA_GENE = SELECT(dataType == ’mirnaseq’) TCGA_mirnaseq_mirna;
CNV_GENE_0 = MAP(left -> bcr_sample_barcode == right -> bcr_sample_barcode, gene_count AS COUNT, mirna_genes AS BAG(mirna_id)) CNV MIRNA_GENE;
CNV_GENE = PROJECT(gene_count > 0) CNV_GENE_0;
MATERIALIZE CNV_GENE;
66
Genomic Computing GMQL examples – Different datasets of patients
Combining and processing heterogeneous omics data of patients In TCGA data of breast cancer patients, find the DNA somatic mutations within the first 2000 bp outside of the genes that are expressed and methylated in at least one of these patients, and extract the top five patients with the highest number of such mutations and their somatic mutations.
EXPRESSED_GENE = SELECT(dataType == ‘rnaseqv2’ AND tumor_tag == 'brca') HG19_TCGA_RnaSeqV2_Gene;
METHYLATION = SELECT(dataType == ‘dnamethylation’ AND tumor_tag == 'brca') HG19_TCGA_Dnamethylation;
MUTATION = SELECT(data_type == ‘dnaseq’ AND tumor_tag == 'brca') HG19_TCGA_DnaSeq;
GENE_METHYL = JOIN(left->bcr_sample_barcode == right-> bcr_sample_barcode, distance < 0, project_left_distinct) EXPRESSED_GENE METHYLATION;
GENE_METHYL1 = COVER(1, ANY) GENE_METHYL;MUTATION_GENE = JOIN(distance < 2000 AND distance > 0, left) MUTATION
GENE_METHYL1;MUTATION_GENE_count = AGGREGATE(mutation_count AS COUNT) MUTATION_GENE;MUTATION_GENE_top = ORDER(DESC mutation_count; TOP 5) MUTATION_GENE_count;MATERIALIZE MUTATION_GENE_top;
System Architecture
(GenData 2020)http://www.bioinformatics.deib.polimi.it/genData/
67
Genomic Computing Overall architecture of system (GenData 2020)
68
GMQL
Stores experimental datasets and annotations collected from external databases
• ENCODE (more than 4000 processed datasets for humans and mices, relevant to epigenomic research)
• Epigenomics Roadmap (about 1000 human epigenomic datasets for stem cells and ex-vivo tissues)
• TCGA (The Cancer Genome Atlas, providing more than 50,000 processed datasets for more than 30 cancer types, including mutations, copy number variations, gene and miRNA expressions, methylations)
Genomic Computing Repository
69
Genomic Computing Repository
70
Annotation data are also extracted from external references, based upon the needs of given research projects
• Genes (UCSC, RefSeq, Ensembl)• Transcription Start Sites (SwitchGear)• Transcription Factor Binding Sites (UCSC, ENCODE)• CpG islands (UCSC)• miRNA target sites (UCSC)• Enhancers (Vista)
Genomic Computing Repository - Annotations
71
Implementation
72
73
Genomic Computing GMQL implementation, V1
• GMQL similar to Pig Latin (by Yahoo! Research)– Algebraic language for data-intensive applications on
Apache Hadoop, a framework for parallel computing which executes Google MapReduce programs
• Implementation strategy: develop a translator to Pig Latin– Easier development and maintenance – Big company involvement ensures development– Use cloud computing power to obtain efficiency and
scalability
Genomic Computing System architecture
74
Genomic Computing System architecture
75
Genomic Computing System architecture - Repository
76
Genomic Computing GMQL query translation to PIG over Hadoop
77
GMQL query Translator Pig over Hadoop
Motivation:• Clear & compact user code• User-transparent optimization
Genomic Computing Translation example
78
• 1 statement => 25 Pig Latin lines of code + auxiliary Java function
• The translator takes also care of updating the variable schema
• Error handling
79
Genomic Computing GMQL to Pig Latin translation
•
Genomic Computing Optimization in translation
1. Parallelism by splitting computations:• By chromosome• By experiment
2. Join and Map have a translation which avoids crossproducts, based on sequential scan of regions
Pig Latin shows its ability to scale on hundreds or thousands of experiments and multi-node systems
80
Genomic Computing MAP and JOIN vs. competitors
81
Genomic Computing System architecture, V2
• Holistic data management system for genomics• Uses cloud-based computing for querying thousands of
heterogeneous datasets
82
83
Genomic Computing GMQL implementation, V2
• A different approach, with language-independent intermediate representation
• Targeting also usability from within R and Galaxy
IR Semantics
GMQL EmbeddedGMQL
LogicalGMQL
Spark Flink ???
Syntax
Implementation API to IR
Genomic Computing GMQL implementation, V2 - Scala API
84
Genomic Computing GMQL implementation, V2 - IR Example
85
86
Genomic Computing GMQL implementation, V2
• New optimization options
GMQLGMQL IR OptimizedIR Implementation
QueryOptimizer
Low LevelOptimizer
1) Node reordering / deletion2) Select condition refinement
1) Alternative algorithms2) Parallelism tuning3) Data partitioning4) Caching
Genomic Computing Optimizations at compile-time & execution-time
Idea:• Let Flink/Spark/… engines implement common and
well known optimization• Exploit the intermediate representation in order to
implement optimizations which are driven by the semantics of GMQL
• Meta-first optimization• Operator swapping optimization
• Other optimizations based on algorithms for parallel execution on the cloud
87
Genomic Computing Meta-first optimization
Under certain conditions (meta-separability), it is possible to compute the metadata side of the query strictly before the region data side.
88
Genomic Computing Meta-separability
GMQL queries are always meta-separable, except for the ones which use the EXTEND (AGGREGATE) operator
(EXTEND operator computes and aggregates on the region data and stores the result in the metadata)
89
Genomic Computing Meta-first optimization - Workflow
ReadMD
StoreMD
ReadRD
StoreRD
ID s
• Compute metadata side of the query
• Retrieve the IDs from the metadata result
• Use the IDs to selectively load only the files that will appear in the output
90
Genomic Computing Where it helps?
Affected queries are the ones which contain one or more metadata selection (far from the Readings), metadata join and metadata group by; those operations cut the size of the output
91
Genomic Computing Operator swapping optimization
Some reordering of the execution plan can not be inferred by lower level optimizer, since they are motivated by GMQL semantics
DS1 DS2 DS3
Joinleft
Diff
DS1 DS3 DS2
Joinleft
Diff
92
Genomic Computing Binning strategy optimization
Bin1 Bin2 Bin3 Bin4 Bin5 Bin6 Bin75bin 6
n7
Strategy for intersection:1. Partition the genome in bins2. Assign each region to all the bins it overlaps3. Search for intersections within each bin
In the case of more complex operations, we change the way in which the regions are assigned to the bins
93
In order to avoid the duplicates production, when two regions overlap, an output is emitted if, and only if, at least one of them begins in the considered bin
• Bin 2: overlap => red region begins => Output e• Bin 3: overlap => no region begins => Output not emitted!
Genomic Computing Binning strategy
Avoiding output duplicates:
bin 1 bin 2 bin 3
94
Genomic Computing Binning tradeoff
• Smaller bins: smaller search space, but higher number of replicates
• Optimal binning size depends on:– Number of regions and local density– Region length distribution– GMQL operation and parameters– System settings (e.g., number of nodes, amount of
memory, …)
050100150200250300
1K 5K 10K 50K 100K 200K 500K 1M 10M
Bin length
Exec. time
95
User Interface
97
Genomic Computing GMQL Web interface
98
http://www.bioinformatics.deib.polimi.it/GMQL/interfaces/
100
Genomic ComputingGMQL results on Integrated Genome Browser
Genomic Computing GMQL examples – Distal bindings: Visualization of results
101
Results are provided to user in GTF or Tab-delimited format
Genomic Computing GMQL examples – Distal bindings: Visualization of results
102
Applications
103
Genomic Computing Transcription Factor query & Genome Space
Source: ENCODE ChIP-seq datasets for transcription factors (TF)Goal: Generation of a transcriptional network from ChIP-seq dataMethod: Select differentially TFs which are also genes and derive TF-Genes and TF-TF links. Build a TFxGenes matrix Mij such that Mij = 1 if there is at least one binding site i in the gene region j, Mij = 0 otherwise.
# Extract gene information, some of them tagged with TF-encoding
GENE_ANN = SELECT(type == 'gene' AND provider == 'RefSeq') ANNOTATION;
GENES = PROJECT(feature == ’gene’) GENE_ANN; TF_GENES = PROJECT(encode_TF == ‘yes’) GENES; # red in next slide
# Collect TF samples (122)
TF = SELECT(dataType = 'ChipSeq' AND subType = 'TF' AND cell == 'k562' AND treatement == 'None') ENCODE_PEAK;
# Build TF Genomic Space
GS_TF = MAP(signal AS EXISTS) GENES TF;104
105
Genomic Computing Genome Space building
G1 G2 G4
TF1TF2TF4
Ann
Cell line: K562 (CML)# TFs: 122# Nodes: 6240# Edges: 30587
TF2
TF1 G1
G2
TF4G4
G3=TF3
G3=TF3
106
Genomic Computing Restriction of TF to open chromatin regions
# Collect open chromatin samples
DHS_2 = SELECT(dataType == 'DnaseSeq' AND cell == 'k562') ENCODE_PEAK; # 2 samples
FAIRE = SELECT(dataType == 'FaireSeq' AND cell == 'k562') ENCODE_PEAK; # 1 sample
# Merge DHS replicates in one sample
DHS = COVER_FLAT(2, ANY; pValue AS MIN(pValue)) DHS_2;
# Merge open chromatin regions from DHS and FAIRE assays
DHS_FAIRE = UNION DHS FAIRE;OPEN = COVER(1, ANY; pValue AS MIN(pValue)) DHS_FAIRE;
# Extract TFs in open chromatin regions only (active DNA # binding)
TF_OPEN = JOIN(distance < 0, left) TF OPEN;
107
Genomic Computing Transcription Factor Genome Space
# TF Genomic Space
GS_TF = MAP(signal AS EXISTS) GENES TF_OPEN;
G1 G2 G3TF3
G4 G5 … Gn
TF1 1 0 1 1 0 … 1
TF2 1 0 1 0 0 … 1
TF3 0 0 0 0 1 … 0
… 0 0 0 1 0 … 1
TFn 0 0 0 1 1 … 1
108
Genomic Computing Genome Space building
G1 G2 G4Ann
TF1
TF2
TF4
DHS
FAIRE
TF2
TF1
TF4
G4
G2 Cell line: K562 (CML)# TFs: 95# Nodes: 1717# Edges: 2367
G3=TF3
G3=TF3
G1
From Genometric Space to networks: K562 transcription network
109
Genomic Computing Differential binding research
Source: ENCODE JUN ChIP-seq datasets relative to cell line ‘k562’ (chronic myelogenous leukemia), with one control and 4 cases treated using interferon alpha or gamma at 30 minutes and 6 hours respectively, each with two replicas
Questions: • Q1: Find those JUN binding regions in treated cases where there are no
JUN binding peaks in the control• Q2: Find the JUN binding regions which are differentially present between
treated cases (e.g. in alpha and not in gamma interferon treated samples)
• Q3: Find the genes that overlap with JUN binding regions either in control or treated samples
• Q4: Find the average JUN ChIP-seq signal (in bedGraph files) of the treated samples in promoters (areas surrounding the TSS) of genes intersecting JUN binding regions either in control or treated samples
110
Select IFNa30 non overlapping with CONTROLS (Question 1, same for IFNa6h, IFNg30, IFNg6h)
# select controlsJUN_2 = SELECT (antibody == 'c-Jun' AND dataType == 'ChipSeq' AND
cell == 'k562' AND laboratory == 'Stanford' AND treatment == 'None') ENCODE_PEAK;
# extract the peak regions in at least two replicas and their minimum p-valueJUN = COVER(2, ANY; pValue AS MIN(pValue)) JUN_2;
# select treated samplesJUN_IFNa30_2 = SELECT(antibody == 'c-Jun' AND dataType == 'ChipSeq'
AND cell == 'k562' AND laboratory == 'Stanford' AND treatment == 'IFNa30') ENCODE_PEAK;
# extract the peak regions in at least two replicas and their minimum p-valueJUN_IFNa30_3 = COVER(2, ANY; pValue AS MIN(pValue)) JUN_IFNa30_2;
# extract peaks which do not overlap with controls (question 1)JUN_IFNa30 =JOIN(minDistance AND distance >= 0, left) JUN_IFNa30_3 JUN;
111
Regions expressed differentially by treatment alpha/gamma (Question 2)
# all treated samples: union of all peaks regions non-overlapping with controlsJUN_IFNa = UNION JUN_IFNa30 JUN_IFNa6h;JUN_IFNg = UNION JUN_IFNg30 JUN_IFNg6h;JUN_IFNag = UNION JUN_IFNa JUN_IFNg;
# regions that are present in all treatments alpha and not in any treatment# gamma (question 2)JUN_IFNag_ONLYa = DIFFERENCE() JUN_IFNa JUN_IFNg;
# regions that are present in all treatments gamma and not in any treatment# alpha (question 2)JUN_IFNag_ONLYg = DIFFERENCE() JUN_IFNg JUN_IFNa;
112
JUN overlapping gene lists (Question 3)
# extract all genesGENE_2 = SELECT(type == 'gene' AND provider == 'RefSeq')
ANNOTATION;GENE = PROJECT(feature == ’gene’) GENE_2;
# all samples: union of peak regions in controls or in treated samplesJUN_ALL = UNION JUN JUN_IFNag;
# extract genes intersecting with peak regions in control or treated samples# (question 3)GENE_JUN_ALL = JOIN(distance < 0, left) GENE JUN_ALL;
113
Prepare relevant promoter regions
# extract transcription start sites (TSS) of all genesTSS = SELECT(type == ‘TSS’) ANNOTATION;
# extend TSS to obtain promoter regions PROM = PROJECT(start = start - 1000, stop = stop + 500) TSS;
# extract the peak regions present in at least one sampleJUN_ALL_2 = COVER(1, ANY) JUN_ALL;
# extract all promoter regions intersecting with at least one peak region # in control or treated samplesPROM_JUN_ALL = JOIN(distance < 0, right) JUN_ALL_2 PROM;
114
Compute average signal on extended promoter regions (Question 4)
# Load bedGraph datasetsJUN_ALL_BG = SELECT(antibody == 'c-Jun' AND dataType == 'ChipSeq' AND
cell == 'k562' AND laboratory == 'Stanford' AND(treatment == 'IFNa30' OR treatment == 'IFNa6h' ORtreatment == 'IFNg30' OR treatment == 'IFNg6h' OR treatment == none)) ENCODE_BEDGRAPH;
# Map peak binding regions on promoters intersecting with at least one peak# region in control or treated samples, producing average promoter signal # (question 4)JUN_ALL_SIGNAL = MAP(promAvg AS AVG(signal)) PROM_JUN_ALL
JUN_ALL_BG;
115
PR1 PR2 PR3 PR4 PR5 … PRn
JUN1 6.7 7.5 5.2 3.2 5.2 … 15.4
JUN2 15.4 0.4 0.8 9.4 6.2 … 21.6
JUN3 5.5 14.5 7.2 5.4 5.2 … 4.5
… … … … … … … …
JUNn 2.2 2.1 2.5 6.5 6.2 … 6.7
Summary & Outlook
116
117
Genomic Computing Conclusions
• GDM: a data format-independent genomic data model– For genomic region data and related metadata– Easing integration and processing of heterogeneous
genomic data
• GMQL: a high-level declarative language– Easing the expression of even complex queries on
numerous data of multiple different types– Running also on cloud computing environments– Supporting a first processing also of big data, to extract
the relevant (usually smaller) ones for further processing
• Several GDM & GMQL application examples– Characterizing interplay and function of genomic regions
118
Genomic Computing Discovering biologically useful results
Working together with biologists at IEO & IIT for giving answers to biological problems. Ongoing projects:
1. 3D chromatine structure (Pelicci)2. DNA replication and gene expression (Pelicci)3. Analysis of unknown P53 binding loci (Amati)4. Chromatin state change in time-course Myc bindings (Amati)5. Myc bindings saturation patterns (Amati)6. Transcription factors co-occurrence with TEAD binding sites
(Campaner)7. Correlating hotspots of RAD21 bindings with high density TF
regions (Campaner)8. Copy number variants (CNVs) in Chromosome 7 (Testa)
119
A p
atte
rn o
f gen
omic
feat
ures
Genomic Computing FutureVision: Pattern-based queries from genome browser
120
Genomic Computing Future Vision: Cycle Query–Analysis–Visualization
• Query– By example– Using public DBs & ontologies– Search remote data– Query / extract remote data
• Analysis– (un)supervised learning– Region finding– Motif / pattern finding
• Visualization– Clustering– Long range interactions
VisualizationVisualization
QueryQuery
AnalysisAnalysis
Genomic Computing Future Long-term vision: Internet of Genomes
• The platform (client & servers) and language should support queries/computations involving different servers- Minimizing the information to be transferred among
servers and between them and the client• Each server should expose its own data for access by
exploratory search & crawlers
1 2 1 121
Genomic Computing Resources & Web sites & acknowledgements
Overview: http://www.bioinformatics.deib.polimi.it/genomic_computing/GMQL web site: http://www.bioinformatics.deib.polimi.it/GMQL/
Includes:• Local mode or MapReduce mode (over Hadoop, or Hadoop YARN)
for GNU/Linux systems - Download (122 MB)• Web services (over Hadoop YARN) - Download (60 MB)• Quick start - Install GMQL and get started• GMQL tutorial & Complete documentation• Functional comparison with BEDTools & BEDOPS• Pointer to publications (Bioinformatics, IEEE/ACM TCBB, Methods)
http://www.bioinformatics.deib.polimi.it/GMQL/interfaces/Includes:• User-friendly interface to creating/managing GMQL queries• Repository of ENCODE / Roadmap Epigenomics / TCGA datasets
122
Genomic Computing Resources, Web sites & acknowledgements
European Research Council “Data-Driven Genomic Computing”(GeCo) project: http://www.bioinformatics.deib.polimi.it/GeCo/
Our group, with Prof. Stefano Ceri and several PhD & master studentsand postDoc: • Abdulrahman Kaitoua• Pietro Pinoli• Arif Canakoglu
123
Thank you for your attention!
Any question?
Thank you for your attention!
Genomic Computing
124
http://www.bioinformatics.deib.polimi.it/genomic_computing/