
Dipartimento di Elettronica, Informazione e Bioingegneria

Genomic Computing

Politecnico di Milano, 13-17 March 2017

Genomic Data Model and GenoMetric Query Language as research enabler to discover genome properties

Marco Masseroli and Stefano Ceri (joint work with several PhD students)

Politecnico di Milano, BioInformatics Group


Genomic Computing Background: Next Generation Sequencing

• Next Generation Sequencing technology is about to provide affordable (in time and cost) and precise genome-wide determinations of:

– DNA sequence / variations (DNA-seq)
– gene subregions’ activity (RNA-seq) [all gene test]
– protein-DNA interaction regions (ChIP-seq)
– open chromatin (DNase-seq)

The goal of $1,000 full-genome sequencing in under an hour has just been met

• Very many DNA-interacting proteins / subjects / conditions will soon be evaluated

– Personalized medicine (diagnosis and treatment)
– Each NGS test can generate 0.4 TB -> Big Data scenario

Source: http://blog.goldenhelix.com/grudy/a-hitchhiker%E2%80%99s-guide-to-next-generation-sequencing-part-2/

Genomic Computing Big data analysis with Next Generation Sequencing

My talk


[Figure labels: Heterogeneous Genomic Data Sources; Heterogeneous Clinical Data Sources; Personal Patient Data; Genome Browser; Data Analytics; Ontological Knowledge; Genomic Data Integration; Clinical Data Integration; Publishing / Crawling / Searching; Biologist; Clinician; Medical Literature; Clinical Protocols]

Genomic Computing The big picture: Distributed heterogeneous data

[Figure labels: A number of genomic features (tracks); One macro genomic region; Data tracks]

Genomic Computing Current practice – UCSC Genome Browser


Genomic Computing The challenge: Understanding biologists’ needs

• Working together with biologists to answer the problems behind the «courtesy» slide

Courtesy of Prof. Pelicci, IEO


Genomic Computing Challenge: Genotype–phenotype discovery

• (Epi)genotype-phenotype relationship discovery: understanding genomic regions, genome variations and their associations with different phenotypes

– highly heterogeneous scenario

• It requires evaluating, in several different conditions and types of individuals:

– genome (DNA) sequence variations
– gene activity & its regulation
– occurring interactions


Genomic Computing Main questions

Scientists’ typical questions (from our interaction with IEO - European Institute of Oncology and IIT - Italian Institute of Technology)

• Can interesting DNA regions and their relationships be discovered using genome-wide queries?

• Can genomic data of patients be grouped according to clinical phenotype and compared?

• Can the genomic features of all the genes involved in the same biological process be extracted and then analyzed?

• Can we retrieve portions of the genome of given patients, extracting them from remote servers and comparing them?


Genomic Computing Research agenda

• Can interesting DNA regions and their relationships be discovered using genome-wide queries?

Genometric query language

• Can genomic data of patients be grouped according to clinical phenotype and compared?

Genometric query language + clustering

• Can all the features of the genes involved in the same biological process be extracted and then analyzed?

Genometric query language + data analysis

• Can we retrieve portions of the genome of given patients, extracting them from remote servers and comparing them?

Genometric query language + indexing & search


Genomic Computing Research agenda by topics

• Data model: design a simple and format-independent data model for describing datasets with both genomic regions and general provenance information (including phenotype)

• Query language: design a query language where both genometric aspects (about the placement of regions on the genome) and provenance can be queried at a high level of data independence and transparency

• Integrative data analysis: translate query results into a genome space, which is the ideal starting point for correlation and network analysis

• Data search: design protocols for data crawling and indexing based on the data model

Genomic Data Model


Genomic Computing Genomic Data Model

Within the same sample, two kinds of data:
• Region values aligned w.r.t. a given reference, with specific left-right ends within a chromosome, and with several associated attributes (e.g. p-value of region significance)

• Metadata, with free-format attribute-value pairs, storing all the knowledge about the sample

[Figure: a region r on a DNA sequence (A C G T T A A C G G A T A C C A A C), identified by chr (chromosome), left (position), right (position) and strand (direction)]

Genomic Computing Model rationale

• Regions of the model are data format independent and provide an interoperability framework for comparing data on mutations, expression or regulation using regions as common ground

• Metadata attribute-value pairs of the model are info-system independent and provide an interoperability framework for comparing samples based upon their biological aspects


Genomic Computing Genomic Data Model – Example

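[Figure: three samples (Sample 1, Sample 2, Sample 3), each with region values and metadata]
• Region values: 0.1, 0.6. Metadata: Tumor_type = brca, Patient_age = 75
• Region values: 0.5, 0. Metadata: Tumor_type = brca, Patient_age = 63, Sex = Female
• Region values: 0.1, 0, 0.8, 0.1. Metadata: Tumor_type = brca, Patient_age = 58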


Genomic Computing Genomic Data Model – Example

• Region values: {expID, region:(chr, left, right, strand), p-value}

• Metadata: {expID, attribute, value}
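The two schemas above can be mirrored in a small in-memory structure. The following is only a sketch under assumptions: the class names `Region`, `Sample` and `Dataset`, the coordinates, and the dataset name `PEAKS` are hypothetical and are not taken from the GMQL implementation.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Region:
    chrom: str          # chromosome, e.g. "chr1"
    left: int           # left end position
    right: int          # right end position
    strand: str         # "+", "-" or "*" (unstranded)
    values: tuple = ()  # schema-dependent attributes, e.g. (p_value,)

@dataclass
class Sample:
    exp_id: int
    regions: list = field(default_factory=list)
    # free-format (attribute, value) pairs; an attribute may repeat
    metadata: list = field(default_factory=list)

@dataclass
class Dataset:
    name: str
    schema: tuple       # names of the region value attributes
    samples: list = field(default_factory=list)

# Hypothetical sample: two regions with a p-value, plus two metadata pairs.
sample = Sample(
    exp_id=1,
    regions=[Region("chr1", 100, 200, "*", (0.1,)),
             Region("chr1", 300, 450, "*", (0.6,))],
    metadata=[("Tumor_type", "brca"), ("Patient_age", "75")],
)
peaks = Dataset(name="PEAKS", schema=("p_value",), samples=[sample])
```

Keeping metadata as a list of pairs rather than a dictionary lets one attribute carry several values, which is what a merged sample needs.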

Genomic Computing Genomic Data Model - Samples and datasets

Samples and datasets
• Every sample corresponds to an «experiment», with an ID
• Every dataset is a named collection of samples with the same region data schema

Data format independent; interoperability framework for comparing data samples based upon their biological aspects

Genomic Computing Genomic Data Model - Mapping examples

DNA-seq (mutations)
(id, (chr,start,stop,strand), (A,G,C,T,del,ins,inserted,ambig,Max,Error,A2T,A2C,A2G,C2A,C2G,C2T))
(1, (chr1, 917179, 917180, *), (0,0,0,0,1,0,'.','.',0,0,0,0,0,0,0,0))
(1, (chr1, 917179, 917179, *), (0,0,0,0,0,1,G,'.',0,0,0,0,0,0,0,0))

RNA-seq (gene expression)
(id, (chr,start,stop,strand), (source,type,score,frame,geneID,transcriptID,RPKM1,RPKM2,iIDR))
(1, (chr8, 101960824, 101964847, -), ('GencodeV10', 'transcript', 0.026615, NULL, 'ENSG00000164924.11', 'ENST00000418997.1', 0.209968, 0.193078, 0.058))

Annotations
(id, (chr,start,stop,strand), (proteinID,alignID,type))
(1, (chr1, 11873, 11873, +), ('uc001aaa.3', 'uc001aaa.3', 'cds'))
(1, (chr1, 11873, 12227, +), ('uc001aaa.3', 'uc001aaa.3', 'exon'))
(1, (chr1, 12612, 12721, +), ('uc001aaa.3', 'uc001aaa.3', 'exon'))
(1, (chr1, 13220, 14409, +), ('uc001aaa.3', 'uc001aaa.3', 'exon'))

ChIA-PET (denoting 3D genomic loops; the head is expressed by the region coordinates, the tail is in the schema)
(id, (chr,headstart,headstop,strand), (loopType, tailChr, tailStart, tailStop, PETcount, pValue, qValue))
(1, (chr1, 7385626, 7389841, *), ('Inter-Chromosome', chr17, 3081653, 3084755, 50, 0.0, 0.0))

Genomic Computing Genomic Data Model - Other mapping examples

Query Language


(Motivational example and detailed description)


Genomic Computing GMQL motivational example

The language allows for queries on the genome involving large datasets describing:

• Genomic signals (i.e. experiment dataset regions)
• Reference regions (e.g. TSS, genes, promoters, enhancers)
• Distance rules (e.g. the nearest enhancer located at least 100 kb from the nearest gene)

[Figure labels: Reference DNA; Reference regions (Enhancer, Promoter, Gene); Experimental dataset 1; Experimental dataset 2; Experimental dataset 3; Distance pattern]

Identification of distal bindings in transcription regulatory regions

Find all CTCF transcription factor (TF) binding regions of ChIP-seq data regarding human cancer cell line HeLa-S3, which are farther than x kb (e.g. 1000 kb) from the TSS (transcription start site) of the nearest gene. Then, find all H3K4me1 histone modification (HM) regions that are also farther than x kb from the TSS of the nearest gene. Finally, consider known enhancer (EN) regions and return a list of EN-HM-overlapping TF regions.

Genomic Computing GMQL motivational example – Distal bindings

[Figure: DNA regions of tracks HM, TF1, TF2, REF, TSS, EN and the GMQL result, relative to the nearest gene. Legend: HM = histone mark experiment; TF = transcription factor experiment; REF = reference DNA regions; EN = enhancer; x = threshold distance. Marked regions: "Not respecting the distance threshold"; "Not EN or HM overlapping"]

Genomic Computing GMQL motivational example – Distal bindings

HM = SELECT(dataType == 'ChipSeq' AND cell == 'HeLa-S3' AND antibody == 'H3K4me1') PEAK;

TF = SELECT(dataType == 'ChipSeq' AND cell == 'HeLa-S3' AND antibody == 'CTCF') PEAK;

TSS = SELECT(type == 'TSS') ANNOTATION;
EN = SELECT(type == 'enhancer') ANNOTATION;
HMa = JOIN(minDistance(5) AND distance > 1000000, right) TSS HM;
TFa = JOIN(minDistance(5) AND distance > 1000000, right) TSS TF;
TFb = JOIN(distance < 0, left) TFa EN;
HMb = JOIN(distance < 0, left) HMa EN;
TF_res = JOIN(distance < 0, left) TFb HMb;


Genomic Computing GenoMetric Query Language

GenoMetric Query Language (GMQL) is defined as a sequence of algebraic operations following the structure:

< variable > = < operator > (< parameters >) < variable >

– Every variable is a dataset including many samples

– Offers high-level, declarative operations which operate both on regions and meta-data -> thus, each operation progressively builds the regions and meta-data of its result

– Inspired by SQL and Pig Latin

– Targeted towards cloud computing

Genomic Computing Overall view on GMQL operations

Classic relational operations – with genomic extensions:

• SELECT, PROJECT, EXTEND, ORDER, GROUP, MERGE, UNION, DIFFERENCE

Domain-specific genomic operations:

• COVER, (GENOMETRIC) JOIN, MAP

Utilities:

• MATERIALIZE


Genomic Computing Sample selection – Example SELECT


Selection of the samples where a selection predicate p is true (e.g. select patients younger than 70 years)

[Figure: sample with region values 0.5, 0 and metadata Tumor_type = brca, Patient_age = 63, Gender = Female]


S2 = SELECT(p) S1;

Example: S2 = SELECT(Patient_age < 70) S1;
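The sample selection above can be pictured with a minimal Python sketch. It assumes a simplified in-memory representation (a dataset as a list of dictionaries with `meta` and `regions` keys); the function name `select` and this layout are illustrative assumptions, not the GMQL engine's API.

```python
def select(dataset, predicate):
    """Keep the samples whose metadata satisfy the predicate.

    Regions of the kept samples are passed through untouched.
    """
    return [s for s in dataset if predicate(s["meta"])]

# Two hypothetical samples, using the metadata shown on the slides.
S1 = [
    {"id": 1, "meta": {"Tumor_type": "brca", "Patient_age": 75},
     "regions": [0.1, 0.6]},
    {"id": 2, "meta": {"Tumor_type": "brca", "Patient_age": 63, "Gender": "Female"},
     "regions": [0.5, 0]},
]

# Counterpart of: S2 = SELECT(Patient_age < 70) S1;
S2 = select(S1, lambda m: m["Patient_age"] < 70)
```

The point of the sketch is that SELECT filters whole samples by metadata and never inspects individual regions.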


Genomic Computing Region projection – Example PROJECT


[Figure: sample with region values 0.5, 0 and metadata Tumor_type = brca, Patient_age = 63, Gender = Female]


Selection of the regions where a selection predicate p is true (e.g. select those regions which have a score greater than 0.5)

S2 = PROJECT(p) S1;

Example: S2 = PROJECT(score > 0.5) S1;


Genomic Computing Region projection – Example PROJECT

[Figure: S1 and S2 carry the same metadata: Tumor_type = brca, Patient_age = 75]

Projection of the regions: for each gene in a set, take its promoter (e.g. from -2kbp, to +1kbp from the TSS)

S2 = PROJECT(p) S1;

Example: S2 = PROJECT(start = start - 2000, stop = start + 1000) S1;
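Both uses of PROJECT (filtering regions by a predicate, and rewriting region coordinates) can be sketched in Python as follows. The dictionary layout, the function name `project`, and the coordinates are hypothetical; the sketch also assumes that both new coordinates are computed from the original `start`, and it ignores the strand, as the slide example does.

```python
def project(dataset, keep=lambda r: True, update=lambda r: r):
    """Filter the regions of every sample, then rewrite the kept ones.

    Metadata are copied unchanged; no sample is dropped.
    """
    return [
        {**s, "regions": [update(r) for r in s["regions"] if keep(r)]}
        for s in dataset
    ]

S1 = [{
    "meta": {"Tumor_type": "brca", "Patient_age": 75},
    "regions": [
        {"chr": "chr1", "start": 5000, "stop": 9000, "score": 0.7},
        {"chr": "chr1", "start": 20000, "stop": 26000, "score": 0.2},
    ],
}]

# Counterpart of: S2 = PROJECT(score > 0.5) S1;
high_score = project(S1, keep=lambda r: r["score"] > 0.5)

# Counterpart of: S2 = PROJECT(start = start - 2000, stop = start + 1000) S1;
promoters = project(
    S1,
    update=lambda r: {**r, "start": r["start"] - 2000, "stop": r["start"] + 1000},
)
```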


Genomic Computing Metadata aggregation – Example AGGREGATE

[Figure: three samples; each gains a Region_count metadata attribute]
• Tumor_type = brca, Patient_age = 75, Region_count = 3
• Tumor_type = esca, Patient_age = 78, Region_count = 5
• Tumor_type = chol, Patient_age = 85, Region_count = 2

Count the regions in each sample and store the count in its metadata

S2 = AGGREGATE(Ai AS gi) S1;

Example: S2 = AGGREGATE(Region_count AS COUNT) S1;


Genomic Computing Order and select top k – Example ORDER

[Figure: three samples ranked by Region_count]
• Tumor_type = brca, Patient_age = 75, Region_count = 3, Order = 2
• Tumor_type = esca, Patient_age = 78, Region_count = 5, Order = 1
• Tumor_type = chol, Patient_age = 85, Region_count = 2, Order = 3

Order by Region_count metadata and take the top 2 samples

S2 = ORDER(Ai; [TOP: k]) S1;

Example: S2 = ORDER(Region_count; TOP: 2) S1;
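The AGGREGATE and ORDER steps of the last two slides can be chained in a short Python sketch. The data layout and function names are assumptions made for illustration; descending order is also an assumption, chosen because the figure ranks the sample with 5 regions first.

```python
def aggregate_count(dataset, name):
    """Store the number of regions of each sample in its metadata."""
    return [{**s, "meta": {**s["meta"], name: len(s["regions"])}} for s in dataset]

def order(dataset, attr, top=None, descending=True):
    """Rank samples by a metadata attribute, add an Order attribute, keep the top k."""
    ranked = sorted(dataset, key=lambda s: s["meta"][attr], reverse=descending)
    ranked = [{**s, "meta": {**s["meta"], "Order": i}}
              for i, s in enumerate(ranked, start=1)]
    return ranked if top is None else ranked[:top]

def make_sample(tumor, age, n_regions):
    # Placeholder regions: only their number matters here.
    regions = [("chr1", 100 * i, 100 * i + 50) for i in range(n_regions)]
    return {"meta": {"Tumor_type": tumor, "Patient_age": age}, "regions": regions}

S1 = [make_sample("brca", 75, 3), make_sample("esca", 78, 5), make_sample("chol", 85, 2)]

counted = aggregate_count(S1, "Region_count")   # AGGREGATE(Region_count AS COUNT)
top2 = order(counted, "Region_count", top=2)    # ORDER(Region_count; TOP: 2)
```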


Genomic Computing Group by metadata – Example GROUP

[Figure: four samples grouped by Tumor_type, with the minimum region score of each group]
• Tumor_type = brca, Patient_age = 75, Group = 1, Min = 0
• Tumor_type = esca, Patient_age = 78, Group = 2, Min = 1
• Tumor_type = chol, Patient_age = 87, Group = 3, Min = 3
• Tumor_type = esca, Patient_age = 78, Group = 2, Min = 1

Group samples according to the value of Tumor_type and compute the minimum region score of each group
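A minimal Python sketch of this grouping follows. The function name `group_min`, the data layout, and the region scores are illustrative assumptions; the scores were chosen only so that the group minima match the figure (0, 1 and 3).

```python
from collections import defaultdict

def group_min(dataset, key, score="score"):
    """Group samples by a metadata attribute; annotate each with group id and group minimum."""
    groups = defaultdict(list)
    for s in dataset:
        groups[s["meta"][key]].append(s)
    out = []
    for gid, (_, members) in enumerate(groups.items(), start=1):
        # The minimum is taken over the regions of all samples in the group.
        gmin = min(r[score] for s in members for r in s["regions"])
        out += [{**s, "meta": {**s["meta"], "Group": gid, "Min": gmin}} for s in members]
    return out

def make_sample(tumor, scores):
    return {"meta": {"Tumor_type": tumor}, "regions": [{"score": v} for v in scores]}

S1 = [
    make_sample("brca", [5, 0, 1]),
    make_sample("esca", [5, 3, 10, 2, 1]),
    make_sample("chol", [5, 3]),
    make_sample("esca", [5, 3, 4, 6]),
]
S2 = group_min(S1, "Tumor_type")
```

Note that the second esca sample receives Min = 1 even though its own smallest score is 3, because the aggregate is computed per group, not per sample.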


Genomic Computing Region merge – Example MERGE

[Figure: samples S1.s1, S1.s2 and S1.s3 are merged into the single sample of S2]
• S1.s1: Type = ChipSeq, Antibody = CTCF, Replicate = 1
• S1.s2: Type = ChipSeq, Antibody = CTCF, Replicate = 2
• S1.s3: Type = ChipSeq, Antibody = CTCF, Replicate = 3
• S2: Type = ChipSeq, Antibody = CTCF, Replicate = 1, Replicate = 2, Replicate = 3

Collapse a set of samples (both regions and metadata) into a single one

S2 = MERGE() S1;

Genomic Computing Region union – Example UNION

[Figure: metadata of the input samples]
• Tumor_type = brca, Experiment = mirna
• Tumor_type = brca, Experiment = rnaseq

Return a single dataset with all the samples in two input datasets, merging their region attributes if different



S3 = UNION() S1 S2;


Genomic Computing Region difference – Example DIFFERENCE

Return all the regions in the first dataset that do not overlap any region in the second one



[Figure: datasets S1, S2 and result S3; metadata shown:]
• S1.Tumor_type = brca, S1.Experiment = rnaseq
• S2.Tumor_type = brca, S2.Experiment = mirna

S3 = DIFFERENCE() S1 S2;
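The region-level effect of DIFFERENCE can be sketched in a few lines of Python. Regions are assumed to be `(chr, left, right)` tuples with half-open coordinates; the function name and the coordinates are hypothetical, and metadata handling is left out.

```python
def difference(first, second):
    """Keep the regions of `first` that overlap no region of `second`."""
    def overlaps(a, b):
        # Same chromosome and intersecting half-open intervals.
        return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]
    return [a for a in first if not any(overlaps(a, b) for b in second)]

S1 = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 100, 200)]
S2 = [("chr1", 150, 250)]
S3 = difference(S1, S2)
```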

[Figure labels: COVER(1, ANY) = OR; COVER(ALL, ALL) = AND; COVER(2, ANY)]

S2 = COVER(min, max) S1;

Genomic Computing Dataset operations: COVER

• Produces new regions where there are between MIN and MAX regions of the operand dataset
• Jaccard indexes can be used instead of min-max
• An aggregate function f can be computed for regions forming the cover

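[Figure: samples S1.s1, S1.s2, S1.s3 and result S2]
• Input sample metadata (as shown): Tumor_type = brca, Tumor_grade = g3
• Result metadata: Tumor_type = brca, Tumor_grade = g2, Tumor_grade = g3
• Accumulation counts along the genome: 2 2 3 1 2 1 1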

COVER(2, ANY): find portions of the genome that are covered by at least two regions

Genomic Computing Region Cover – Example COVER

S2 = COVER(2, ANY) S1;
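One way to picture COVER(min, ANY) is a sweep over region end points that tracks how many regions are open at each position. The sketch below is a simplified reading, not the GMQL implementation: it handles a single chromosome, half-open `(left, right)` intervals, and only the ANY upper bound; the function name and coordinates are assumptions.

```python
def cover_at_least(regions, min_acc):
    """Return the maximal stretches covered by at least `min_acc` regions."""
    events = []
    for left, right in regions:
        events.append((left, 1))    # a region opens
        events.append((right, -1))  # a region closes
    # At equal positions closings sort before openings, so regions
    # that merely touch are not counted as overlapping.
    events.sort()
    out, depth, start = [], 0, None
    for pos, delta in events:
        was_covered = depth >= min_acc
        depth += delta
        is_covered = depth >= min_acc
        if is_covered and not was_covered:
            start = pos
        elif was_covered and not is_covered and pos > start:
            out.append((start, pos))
    return out

# Regions of three hypothetical samples, pooled on one chromosome.
pooled = [(0, 10), (5, 15), (12, 20)]
S2 = cover_at_least(pooled, 2)   # counterpart of COVER(2, ANY)
```

In this simplified version two output stretches that touch are reported separately rather than fused.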


• Given two sets of samples, JOIN builds the pairs of regions and metadata where a join predicate p is true.

• Result regions are composed from the regions of the operands

S3 = JOIN(p, comp-op) S1 S2;

• Functions minDistance and distance can be used in the predicate
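The role of a distance predicate in JOIN can be illustrated with a small Python sketch. It assumes one chromosome, `(left, right)` tuples, and a genometric distance that is the gap between the closest ends and becomes negative when regions overlap (consistent with `distance < 0` being used for overlaps in the queries of this deck). The function names, the coordinates, and the output as plain pairs are illustrative assumptions; how the two regions are composed into one output region (e.g. left, right) is not modelled.

```python
def distance(a, b):
    """Gap between two regions; negative when they overlap."""
    return max(a[0], b[0]) - min(a[1], b[1])

def join(anchors, experiments, predicate):
    """Pair every anchor with the experiment regions whose distance satisfies the predicate."""
    return [(a, e) for a in anchors for e in experiments if predicate(distance(a, e))]

def nearest(anchor, experiments):
    """The single closest experiment region (a minDistance(1)-like choice)."""
    return min(experiments, key=lambda e: distance(anchor, e))

R = [(1000, 2000)]
S1 = [(2400, 2600), (4500, 5000), (1500, 1800)]

close_pairs = join(R, S1, lambda d: d < 1000)   # like JOIN(distance < 1000, ...)
overlapping = join(R, S1, lambda d: d < 0)      # like JOIN(distance < 0, ...)
closest = nearest(R[0], S1)
```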

Genomic Computing Dataset operations: JOIN


Genomic Computing Metadata join – Example JOIN

Metadata join: select pairs of matching samples (e.g. with the same “Type”)

[Figure: metadata of the samples shown]
• Type = uvm, Patient = 123, Gender = M
• Type = brca, Patient = 10, Age = 88
• Type = brca, Patient = 211
• Type = sarc, Patient = 12
• Type = sarc, Patient = 444, Age = 88
• Type = brca, Patient = 333, Grade = g3
• Type = sarc, Patient = 12

Join at min-distance: associate each region in the former dataset with the closest in the latter

Genomic Computing Region Join – Example JOIN

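[Figure: S1 with regions A, B (feature = transcripts); S2 with regions 1, 2, 3 (feature = TFBSs); S3 with pairs 1-A and 3-B, output as regions 1 and 3, and metadata S1.feature = transcripts, S2.feature = TFBSs]

S3 = JOIN(mindistance, RIGHT) S1 S2;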

S2 = JOIN(distance < 1000, CON) R, S1;

[Figure: distances between regions of R and of S1 samples e1, e2 (d = 400, d = 2500, d = 1100, d = 0). Panels: All pairs; Matching pairs and region composition]

Genomic Computing Dataset operations: JOIN example


S2 = MAP(newAttr AS MIN(attr)) R S1;


Genomic Computing Dataset operations: MAP

• Computes aggregate functions over samples of S1 which intersect with the regions of R

Compute an aggregate function (e.g. COUNT) on all the regions intersecting the reference

Genomic Computing Region Map – Example MAP

[Figure: reference R with regions A, B (annotation = genes, provider = RefSeq); S1 (feature = SNP); S2 with counts 2 and 4 and metadata R.annotation = genes, R.provider = RefSeq, S1.features = SNP]

S2 = MAP(count_R_S1 AS COUNT) R S1;
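The counting MAP above can be sketched in Python as follows. Regions are assumed to be tuples starting with `(chr, left, right)` and half-open; the function name, gene names and SNP positions are hypothetical and do not reproduce the figure's values.

```python
def map_count(reference, experiment):
    """For each reference region, count the experiment regions that intersect it."""
    def overlaps(ref, exp):
        return ref[0] == exp[0] and exp[1] < ref[2] and ref[1] < exp[2]
    return {ref[3]: sum(1 for e in experiment if overlaps(ref, e)) for ref in reference}

# Reference genes: (chr, left, right, name)
R = [("chr1", 100, 500, "A"), ("chr1", 1000, 2000, "B")]
# One experiment sample of SNPs: (chr, left, right)
S1 = [("chr1", 150, 151), ("chr1", 499, 500), ("chr1", 500, 501),
      ("chr1", 1500, 1501), ("chr2", 200, 201)]

count_R_S1 = map_count(R, S1)   # counterpart of MAP(count_R_S1 AS COUNT) R S1
```

Every reference region appears in the output, including those with a count of zero, which is what makes the result usable as one column of a genome space.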


Genomic Computing MAP opens to Genome Space abstraction

• MAP operations, through reference regions R, extract and standardize genomic features expressed in distinct datasets

• Genome Space: simplified structured outcome, ideal format for data analysis

[Figure labels: reference regions R1, R2, R3; Gene A, TSS, Enhancer E; datasets DHS, RNAPII, H3K4me1, …; GMQL MAP]

Genomic Computing Next: MAP – Genometric space abstraction

• Genometric spaces represent adjacency matrices, i.e. networks

– Network analysis methods (e.g. page rank, hub/authority, community detection, …)

[Figure labels: reference regions R1, R2, R3; Gene A, TSS, Enhancer E; datasets DHS, RNAPII, H3K4me1, …; GMQL MAP]

Genomic Computing Pattern-based querying & clustering

On the Genome Space (rows = regions, columns = datasets):
• Convert genome space values (e.g. to Boolean format)
• Extract patterns (regions with given values in different datasets, i.e. with same genomic traits)
• Filter rows (find regions with given values in given datasets)
• Cluster rows (based on identity or similarity)
• Cluster columns (based on identity or similarity)
• Extract relevant row attributes (e.g. common aspects of clustered regions)
• Extract relevant column attributes (e.g. common aspects of clustered datasets, e.g. prevalent phenotype)

[Figure: Boolean genome space G1 with rows r1-r6 and columns e1-e11; an X marks a region present in a dataset]
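The first operations in the list above (Boolean conversion, pattern extraction, row filtering) can be sketched on a tiny genome space. The matrix values, row names and the three-column layout are hypothetical, chosen only to keep the example short.

```python
from collections import defaultdict

# Rows = regions, columns = datasets e1..e3 (hypothetical MAP counts).
genome_space = {
    "r1": [3, 0, 1],
    "r2": [0, 0, 2],
    "r3": [5, 0, 4],
}

# Convert values to Boolean format: is the feature present at all?
boolean_space = {r: tuple(v > 0 for v in row) for r, row in genome_space.items()}

# Extract patterns: group the regions that share the same Boolean row.
patterns = defaultdict(list)
for region, pattern in boolean_space.items():
    patterns[pattern].append(region)

# Filter rows: regions with a given value in a given dataset (here e1 present).
with_e1 = [r for r, pattern in boolean_space.items() if pattern[0]]
```

Grouping by identical rows is the identity-based case of row clustering; similarity-based clustering would replace the dictionary lookup with a distance between rows.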


Genomic Computing MAP result visualization: Genome Browser

Res = MAP(mutCount AS count) Genes Dataset;


Genomic Computing MAP result visualization: Heatmap viewer

It requires:
• Partitioning by experiment classes
• Adding names to regions and to experiments (from metadata)
• Adding colors

Genomic Computing Data Viewer: Region clustering

[Figure labels: Cluster1 (high density); Cluster2 (low density pattern-1); Cluster3 (low density pattern-2); Cluster4 (basal)]

Genomic Computing Data Viewer: MAP result visualization


Genomic Computing Data Viewer: Dendrogram

Cut-1: 2 clusters

Cut-2: 4 clusters

Distance


Genomic Computing Data Viewer: Pattern extraction on samples


Genomic Computing Data Viewer: Metadata aggregation

For biological/clinical interpretation of genomic data processing, and for data stratification based on biological/clinical metadata values and/or patterns of different genomic feature regions

GMQL at work (Examples)

• Annotating unknown data to known annotations (e.g. annotating transcription factors to genes)

• Defining (epi)genomic features on the basis of experimental data (e.g. putative enhancers)

• Searching for patterns of (epi)genomic features in experimental data and relating them to specific biological phenomena (e.g. short-range interactions)

• Looking for patterns of (epi)genomic features related to the three-dimensional structure of the genome and relating them to specific biological phenomena (e.g. long-range interactions)

Genomic Computing applications: Typical GMQL applications

Genomic Computing GMQL examples – Genetic mutations

Finding genetic mutations in CpG Islands

Consider DNA-seq data of distinct human cancer cell lines; for each of them quantify the mutations in each CpG island. Then, select the CpG islands with at least one mutation. Return the list of cancer cell lines ordered by the number of such CpG islands.

MUT = SELECT(dataType == 'DnaSeq' AND cell_karyotype == 'cancer') EXP;
CpG = SELECT(type == 'CpG islands') ANNOTATION;

CpG1 = MAP(MutCount AS COUNT) CpG MUT;
CpG2 = PROJECT(MutCount > 0) CpG1;
CpG3 = AGGREGATE(CpGCount AS COUNT) CpG2;
CpG_res = ORDER(DESC CpGCount) CpG3;
MATERIALIZE CpG_res;

Genomic Computing GMQL examples – Heterogeneous datasets

Combining ChIP-seq and DNase-seq data in different formats and sources

Extract broad peaks of ChIP-seq transcription factor binding sites and histone modifications from ENCODE samples that intersect DNase-seq open chromatin regions from Roadmap Epigenomics in normal H1 embryonic stem cells.

CHIPSEQ = SELECT(dataType == 'ChipSeq' AND view == 'Peaks' AND setType == 'exp' AND cell == 'H1-hESC') HG19_ENCODE_BROAD;

DNASESEQ = SELECT(assay == 'DNase.hotspot.broad' AND Standardized_Epigenome_name == 'H1 Cells') HG19_ROADMAP_EPIGENOMICS_BED;

DNASESEQ1 = COVER(1, ANY) DNASESEQ;

CHIPSEQ_IN_DNASESEQ = JOIN(distance < 0, project_right_distinct) DNASESEQ1 CHIPSEQ;

MATERIALIZE CHIPSEQ_IN_DNASESEQ;


Genomic Computing GMQL examples – Heterogeneous datasets

Associating transcriptomics and epigenomics

In RNA-seq experiments of human cancer cell line HeLa-S3, find the average expression of each gene. Then, in ChIP-seq experiments of the same cell line, in each gene find the average signal of H3K27me3 and H3K4me3 histone modifications. Map these experiments to known genes.

Based on all such average values, evaluate the pairwise similarity of each gene with the BRCA1 gene, and return the ordered list of the 20 genes most similar to BRCA1. Also, through bi-clustering, identify and return groups of genes and their histone and transcription signals with similar patterns.

GENES = SELECT(type == 'gene' AND provider == 'UCSC') ANNOTATION;
RNA = SELECT(dataType == 'RnaChip' AND cell == 'HeLa-S3') PEAK;
HM = SELECT(dataType == 'ChipSeq' AND cell == 'HeLa-S3' AND (antibody == 'H3K27me3' OR antibody == 'H3K4me3')) PEAK;

EXP = UNION RNA HM;
GenomeSpace = MAP(EXPavg AS avg(signal)) GENES EXP;
……

Genomic Computing GMQL examples – Patient datasets

Counting distinct DNA mutations in patient groups

Group patients by tumor type and ethnicity, and count the distinct DNA somatic mutations in each group.

MUTATION = SELECT(dataType == 'dnaseq') TCGA_dnaseq;

MUTATION_BY_RACE = COVER(1, ANY; GROUP BY tumor_tag, race; overlap_count AS COUNT, barcodes AS BAG(tumor_sample_barcode)) MUTATION;

MUTATION_COUNT = AGGREGATE(mutation_count AS COUNT) MUTATION_BY_RACE;

MATERIALIZE MUTATION_COUNT;

Genomic Computing GMQL examples – Different datasets of patients

Combining different data types of multiple patients

Match DNA copy number variation (CNV) and microRNA (miRNA) data samples regarding the same biospecimen and extract the CNVs occurring within expressed miRNA genes in the paired samples.

CNV = SELECT(dataType == 'cnv') TCGA_cnv;

MIRNA_GENE = SELECT(dataType == 'mirnaseq') TCGA_mirnaseq_mirna;

CNV_GENE_0 = MAP(left -> bcr_sample_barcode == right -> bcr_sample_barcode, gene_count AS COUNT, mirna_genes AS BAG(mirna_id)) CNV MIRNA_GENE;

CNV_GENE = PROJECT(gene_count > 0) CNV_GENE_0;

MATERIALIZE CNV_GENE;


Genomic Computing GMQL examples – Different datasets of patients

Combining and processing heterogeneous omics data of patients

In TCGA data of breast cancer patients, find the DNA somatic mutations within the first 2000 bp outside of the genes that are expressed and methylated in at least one of these patients, and extract the top five patients with the highest number of such mutations and their somatic mutations.

EXPRESSED_GENE = SELECT(dataType == 'rnaseqv2' AND tumor_tag == 'brca') HG19_TCGA_RnaSeqV2_Gene;

METHYLATION = SELECT(dataType == 'dnamethylation' AND tumor_tag == 'brca') HG19_TCGA_Dnamethylation;

MUTATION = SELECT(data_type == 'dnaseq' AND tumor_tag == 'brca') HG19_TCGA_DnaSeq;

GENE_METHYL = JOIN(left->bcr_sample_barcode == right->bcr_sample_barcode, distance < 0, project_left_distinct) EXPRESSED_GENE METHYLATION;

GENE_METHYL1 = COVER(1, ANY) GENE_METHYL;
MUTATION_GENE = JOIN(distance < 2000 AND distance > 0, left) MUTATION GENE_METHYL1;
MUTATION_GENE_count = AGGREGATE(mutation_count AS COUNT) MUTATION_GENE;
MUTATION_GENE_top = ORDER(DESC mutation_count; TOP 5) MUTATION_GENE_count;
MATERIALIZE MUTATION_GENE_top;

System Architecture

(GenData 2020)

http://www.bioinformatics.deib.polimi.it/genData/

Genomic Computing Overall architecture of system (GenData 2020)


Stores experimental datasets and annotations collected from external databases

• ENCODE (more than 4000 processed datasets for humans and mice, relevant to epigenomic research)

• Epigenomics Roadmap (about 1000 human epigenomic datasets for stem cells and ex-vivo tissues)

• TCGA (The Cancer Genome Atlas, providing more than 50,000 processed datasets for more than 30 cancer types, including mutations, copy number variations, gene and miRNA expressions, methylations)

Genomic Computing Repository


Annotation data are also extracted from external references, based upon the needs of given research projects

• Genes (UCSC, RefSeq, Ensembl)
• Transcription Start Sites (SwitchGear)
• Transcription Factor Binding Sites (UCSC, ENCODE)
• CpG islands (UCSC)
• miRNA target sites (UCSC)
• Enhancers (Vista)

Genomic Computing Repository - Annotations

Implementation


Genomic Computing GMQL implementation, V1

• GMQL similar to Pig Latin (by Yahoo! Research)
– Algebraic language for data-intensive applications on Apache Hadoop, a framework for parallel computing which executes Google MapReduce programs

• Implementation strategy: develop a translator to Pig Latin
– Easier development and maintenance
– Big company involvement ensures development
– Use cloud computing power to obtain efficiency and scalability

Genomic Computing System architecture


Genomic Computing System architecture


Genomic Computing System architecture - Repository


Genomic Computing GMQL query translation to PIG over Hadoop


GMQL query Translator Pig over Hadoop

Motivation:
• Clear & compact user code
• User-transparent optimization

Genomic Computing Translation example


• 1 statement => 25 Pig Latin lines of code + auxiliary Java function

• The translator also takes care of updating the variable schema

• Error handling


Genomic Computing GMQL to Pig Latin translation


Genomic Computing Optimization in translation

1. Parallelism by splitting computations:
• By chromosome
• By experiment

2. Join and Map have a translation which avoids cross products, based on a sequential scan of regions

Pig Latin shows its ability to scale on hundreds or thousands of experiments and multi-node systems


Genomic Computing MAP and JOIN vs. competitors


Genomic Computing System architecture, V2

• Holistic data management system for genomics
• Uses cloud-based computing for querying thousands of heterogeneous datasets


• A different approach, with language-independent intermediate representation

• Targeting also usability from within R and Galaxy

[Figure labels: Syntax; GMQL; Embedded GMQL; Logical GMQL; API to IR; IR; Semantics; Implementation; Spark; Flink; ???]

Genomic Computing GMQL implementation, V2 - Scala API

Genomic Computing GMQL implementation, V2 - IR Example



• New optimization options

[Figure: GMQL -> GMQL IR -> Optimized IR -> Implementation. Query Optimizer: 1) node reordering / deletion, 2) select condition refinement. Low Level Optimizer: 1) alternative algorithms, 2) parallelism tuning, 3) data partitioning, 4) caching]

Genomic Computing Optimizations at compile-time & execution-time

Idea:
• Let Flink/Spark/… engines implement common and well-known optimizations
• Exploit the intermediate representation in order to implement optimizations which are driven by the semantics of GMQL
– Meta-first optimization
– Operator swapping optimization
• Other optimizations based on algorithms for parallel execution on the cloud

Genomic Computing Meta-first optimization

Under certain conditions (meta-separability), it is possible to compute the metadata side of the query strictly before the region data side.


Genomic Computing Meta-separability

GMQL queries are always meta-separable, except for the ones which use the EXTEND (AGGREGATE) operator

(EXTEND operator computes and aggregates on the region data and stores the result in the metadata)


Genomic Computing Meta-first optimization - Workflow

[Figure labels: ReadMD; StoreMD; IDs; ReadRD; StoreRD]

• Compute metadata side of the query

• Retrieve the IDs from the metadata result

• Use the IDs to selectively load only the files that will appear in the output
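The three workflow steps above can be sketched as follows. This is a toy illustration under assumptions: the function name `run_meta_first`, the in-memory metadata dictionary, and the `load_regions` callback (standing in for reading a region file) are hypothetical and do not reflect the actual engine.

```python
def run_meta_first(metadata_by_id, meta_query, load_regions):
    """Evaluate the metadata side first, then load only the surviving samples."""
    # Steps 1 and 2: compute the metadata side and retrieve the resulting IDs.
    selected_ids = [sid for sid, meta in metadata_by_id.items() if meta_query(meta)]
    # Step 3: load region data only for those IDs.
    return {sid: load_regions(sid) for sid in selected_ids}

metadata = {
    "s1": {"cell": "HeLa-S3", "antibody": "CTCF"},
    "s2": {"cell": "K562", "antibody": "CTCF"},
    "s3": {"cell": "HeLa-S3", "antibody": "H3K4me1"},
}

loaded = []  # records which region files would have been read

def load_regions(sample_id):
    loaded.append(sample_id)
    return []  # region data omitted in this sketch

result = run_meta_first(metadata, lambda m: m["cell"] == "HeLa-S3", load_regions)
```

The saving comes from the region files that are never opened: here one of three, and potentially most of a repository when the metadata predicate is selective.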

Genomic Computing Where does it help?

Affected queries are those which contain one or more metadata selections (far from the Read operations), metadata joins or metadata group-bys; these operations cut the size of the output

Genomic Computing Operator swapping optimization

Some reorderings of the execution plan cannot be inferred by lower-level optimizers, since they are motivated by GMQL semantics

[Figure: two execution plans over DS1, DS2 and DS3, each with a left Join and a Diff; the second plan swaps the order in which DS2 and DS3 are used]

Genomic Computing Binning strategy optimization

[Figure: genome partitioned into bins Bin1 … Bin7]

Strategy for intersection:
1. Partition the genome in bins
2. Assign each region to all the bins it overlaps
3. Search for intersections within each bin

In the case of more complex operations, we change the way in which the regions are assigned to the bins

To avoid producing duplicates, when two regions overlap, an output is emitted if, and only if, at least one of them begins in the considered bin

• Bin 2: overlap => red region begins => Output emitted
• Bin 3: overlap => no region begins => Output not emitted!

Genomic Computing Binning strategy

Avoiding output duplicates:

bin 1 bin 2 bin 3
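The binning strategy and its duplicate-avoidance rule can be sketched in Python. This is a single-machine illustration under assumptions: one chromosome, half-open `(left, right)` regions of positive length, and hypothetical function names and coordinates; in a distributed setting each bin would be processed by a separate task.

```python
from collections import defaultdict
from itertools import product

def binned_overlaps(refs, exps, bin_size):
    """Find overlapping (ref, exp) pairs bin by bin, emitting each pair once."""
    def bins_of(region):
        # All the bins the region overlaps.
        return range(region[0] // bin_size, (region[1] - 1) // bin_size + 1)

    ref_bins, exp_bins = defaultdict(list), defaultdict(list)
    for r in refs:
        for b in bins_of(r):
            ref_bins[b].append(r)
    for e in exps:
        for b in bins_of(e):
            exp_bins[b].append(e)

    out = []
    for b, bin_refs in ref_bins.items():
        for r, e in product(bin_refs, exp_bins.get(b, [])):
            overlap = r[0] < e[1] and e[0] < r[1]
            # Emit only in the bin where at least one of the two regions begins.
            begins_here = r[0] // bin_size == b or e[0] // bin_size == b
            if overlap and begins_here:
                out.append((r, e))
    return out

refs = [(50, 350), (400, 420)]
exps = [(120, 330), (340, 410), (900, 950)]
pairs = binned_overlaps(refs, exps, bin_size=100)
```

The pair (50, 350) / (120, 330) shares bins 1, 2 and 3, but is emitted only in bin 1, where the second region begins; in any later shared bin neither region begins, so no duplicate appears.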


Genomic Computing Binning tradeoff

• Smaller bins: smaller search space, but higher number of replicates

• Optimal binning size depends on:
– Number of regions and local density
– Region length distribution
– GMQL operation and parameters
– System settings (e.g., number of nodes, amount of memory, …)

[Figure: execution time (y-axis, 0-300) vs. bin length (x-axis: 1K, 5K, 10K, 50K, 100K, 200K, 500K, 1M, 10M)]

User Interface


Genomic Computing GMQL Web interface


http://www.bioinformatics.deib.polimi.it/GMQL/interfaces/

Genomic Computing GMQL results on Integrated Genome Browser

Genomic Computing GMQL examples – Distal bindings: Visualization of results


Results are provided to user in GTF or Tab-delimited format

Genomic Computing GMQL examples – Distal bindings: Visualization of results


Applications


Genomic Computing Transcription Factor query & Genome Space

Source: ENCODE ChIP-seq datasets for transcription factors (TF)

Goal: Generation of a transcriptional network from ChIP-seq data

Method: Select differentially TFs which are also genes and derive TF-Genes and TF-TF links. Build a TFxGenes matrix Mij such that Mij = 1 if there is at least one binding site i in the gene region j, Mij = 0 otherwise.

# Extract gene information, some of them tagged with TF-encoding

GENE_ANN = SELECT(type == 'gene' AND provider == 'RefSeq') ANNOTATION;

GENES = PROJECT(feature == 'gene') GENE_ANN;
TF_GENES = PROJECT(encode_TF == 'yes') GENES; # red in next slide

# Collect TF samples (122)

TF = SELECT(dataType == 'ChipSeq' AND subType == 'TF' AND cell == 'k562' AND treatement == 'None') ENCODE_PEAK;

# Build TF Genomic Space

GS_TF = MAP(signal AS EXISTS) GENES TF;

Genomic Computing Genome Space building

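[Figure: annotation track (Ann) with genes G1, G2, G3=TF3, G4 and tracks TF1, TF2, TF4, next to the derived network linking TF1, TF2, TF4 and G1, G2, G3=TF3, G4. Cell line: K562 (CML); # TFs: 122; # Nodes: 6240; # Edges: 30587]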

Genomic Computing Restriction of TF to open chromatin regions

# Collect open chromatin samples

DHS_2 = SELECT(dataType == 'DnaseSeq' AND cell == 'k562') ENCODE_PEAK; # 2 samples

FAIRE = SELECT(dataType == 'FaireSeq' AND cell == 'k562') ENCODE_PEAK; # 1 sample

# Merge DHS replicates in one sample

DHS = COVER_FLAT(2, ANY; pValue AS MIN(pValue)) DHS_2;

# Merge open chromatin regions from DHS and FAIRE assays

DHS_FAIRE = UNION DHS FAIRE;
OPEN = COVER(1, ANY; pValue AS MIN(pValue)) DHS_FAIRE;

# Extract TFs in open chromatin regions only (active DNA binding)

TF_OPEN = JOIN(distance < 0, left) TF OPEN;

Genomic Computing Transcription Factor Genome Space

# TF Genomic Space

GS_TF = MAP(signal AS EXISTS) GENES TF_OPEN;

       G1   G2   G3 (=TF3)   G4   G5   …   Gn
TF1    1    0    1           1    0    …   1
TF2    1    0    1           0    0    …   1
TF3    0    0    0           0    1    …   0
…      0    0    0           1    0    …   1
TFn    0    0    0           1    1    …   1

Genomic Computing Genome Space building

[Figure: annotation track (Ann) with genes G1, G2, G3=TF3, G4 and tracks TF1, TF2, TF4, DHS, FAIRE, next to the derived network. Cell line: K562 (CML); # TFs: 95; # Nodes: 1717; # Edges: 2367]

From Genometric Space to networks: K562 transcription network

Genomic Computing Differential binding research

Source: ENCODE JUN ChIP-seq datasets relative to cell line ‘k562’ (chronic myelogenous leukemia), with one control and 4 cases treated using interferon alpha or gamma at 30 minutes and 6 hours respectively, each with two replicas

Questions:

• Q1: Find those JUN binding regions in treated cases where there are no JUN binding peaks in the control

• Q2: Find the JUN binding regions which are differentially present between treated cases (e.g. in alpha and not in gamma interferon treated samples)

• Q3: Find the genes that overlap with JUN binding regions either in control or treated samples

• Q4: Find the average JUN ChIP-seq signal (in bedGraph files) of the treated samples in promoters (areas surrounding the TSS) of genes intersecting JUN binding regions either in control or treated samples


Select IFNa30 non overlapping with CONTROLS (Question 1, same for IFNa6h, IFNg30, IFNg6h)

# select controls
JUN_2 = SELECT(antibody == 'c-Jun' AND dataType == 'ChipSeq' AND cell == 'k562' AND laboratory == 'Stanford' AND treatment == 'None') ENCODE_PEAK;

# extract the peak regions in at least two replicas and their minimum p-value
JUN = COVER(2, ANY; pValue AS MIN(pValue)) JUN_2;

# select treated samples
JUN_IFNa30_2 = SELECT(antibody == 'c-Jun' AND dataType == 'ChipSeq' AND cell == 'k562' AND laboratory == 'Stanford' AND treatment == 'IFNa30') ENCODE_PEAK;

# extract the peak regions in at least two replicas and their minimum p-value
JUN_IFNa30_3 = COVER(2, ANY; pValue AS MIN(pValue)) JUN_IFNa30_2;

# extract peaks which do not overlap with controls (question 1)
JUN_IFNa30 = JOIN(minDistance AND distance >= 0, left) JUN_IFNa30_3 JUN;
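The last step depends on a signed distance between regions: negative when two regions overlap, zero when they are adjacent, positive otherwise. The Python sketch below illustrates the join condition under the assumption that it is read as "take the nearest control peak, then keep the treated peak only if that distance is non-negative". The function names and coordinates are hypothetical, and a single chromosome with half-open intervals is assumed.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, stop), half-open, single chromosome


def distance(a: Interval, b: Interval) -> int:
    """Gap in bases between two regions; negative when they overlap."""
    return max(a[0], b[0]) - min(a[1], b[1])


def not_overlapping_control(treated: List[Interval],
                            control: List[Interval]) -> List[Interval]:
    """Treated peaks whose nearest control peak is at distance >= 0."""
    kept = []
    for peak in treated:
        if not control:
            continue  # a join emits nothing when there is no partner region
        nearest = min(distance(peak, c) for c in control)
        if nearest >= 0:
            kept.append(peak)
    return kept


# Hypothetical control and IFNa30 peaks.
control = [(100, 200), (1000, 1100)]
treated = [(150, 250), (400, 500), (1100, 1200)]
print(not_overlapping_control(treated, control))
```

The first treated peak overlaps a control peak and is dropped; the adjacent peak at distance 0 is kept, which follows from the `>= 0` condition in the query.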


Regions expressed differentially by treatment alpha/gamma (Question 2)

# all treated samples: union of all peaks regions non-overlapping with controls
JUN_IFNa = UNION JUN_IFNa30 JUN_IFNa6h;
JUN_IFNg = UNION JUN_IFNg30 JUN_IFNg6h;
JUN_IFNag = UNION JUN_IFNa JUN_IFNg;

# regions that are present in all treatments alpha and not in any treatment
# gamma (question 2)
JUN_IFNag_ONLYa = DIFFERENCE() JUN_IFNa JUN_IFNg;

# regions that are present in all treatments gamma and not in any treatment
# alpha (question 2)
JUN_IFNag_ONLYg = DIFFERENCE() JUN_IFNg JUN_IFNa;
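The two DIFFERENCE calls are symmetric: each keeps the regions of its first operand that do not intersect any region of the second. A minimal Python sketch of that behaviour, with hypothetical names and coordinates and a single chromosome assumed:

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, stop), half-open, single chromosome


def difference(left: List[Interval], right: List[Interval]) -> List[Interval]:
    """Left regions that intersect no right region."""
    return [l for l in left
            if not any(l[0] < r[1] and r[0] < l[1] for r in right)]


# Hypothetical peaks after interferon alpha and gamma treatment.
ifn_alpha = [(100, 200), (400, 500)]
ifn_gamma = [(150, 250), (700, 800)]
only_alpha = difference(ifn_alpha, ifn_gamma)
only_gamma = difference(ifn_gamma, ifn_alpha)
print(only_alpha, only_gamma)
```

Regions are kept or dropped whole; the sketch does not trim the overlapping part of a region.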


JUN overlapping gene lists (Question 3)

# extract all genes
GENE_2 = SELECT(type == 'gene' AND provider == 'RefSeq') ANNOTATION;
GENE = PROJECT(feature == 'gene') GENE_2;

# all samples: union of peak regions in controls or in treated samples
JUN_ALL = UNION JUN JUN_IFNag;

# extract genes intersecting with peak regions in control or treated samples
# (question 3)
GENE_JUN_ALL = JOIN(distance < 0, left) GENE JUN_ALL;


Prepare relevant promoter regions

# extract transcription start sites (TSS) of all genes
TSS = SELECT(type == 'TSS') ANNOTATION;

# extend TSS to obtain promoter regions
PROM = PROJECT(start = start - 1000, stop = stop + 500) TSS;

# extract the peak regions present in at least one sample
JUN_ALL_2 = COVER(1, ANY) JUN_ALL;

# extract all promoter regions intersecting with at least one peak region
# in control or treated samples
PROM_JUN_ALL = JOIN(distance < 0, right) JUN_ALL_2 PROM;
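The promoter step is plain coordinate arithmetic on each TSS. The Python sketch below shows it with the same offsets as the query (1000 bases before the start, 500 after the stop). Like the query as written, it ignores strand. Clamping the start at 0 is an extra assumption of this sketch, and the function name and coordinates are hypothetical.

```python
from typing import List, Tuple

Site = Tuple[str, int, int]  # (chromosome, start, stop)


def promoters(tss: List[Site], before: int = 1000, after: int = 500) -> List[Site]:
    """Extend each TSS into a promoter window, never going below coordinate 0."""
    return [(chrom, max(0, start - before), stop + after)
            for chrom, start, stop in tss]


# Hypothetical TSS annotations; the second one sits near the chromosome start.
tss_sites = [("chr1", 5000, 5001), ("chr1", 300, 301)]
print(promoters(tss_sites))
```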


Compute average signal on extended promoter regions (Question 4)

# Load bedGraph datasets
JUN_ALL_BG = SELECT(antibody == 'c-Jun' AND dataType == 'ChipSeq' AND cell == 'k562' AND laboratory == 'Stanford' AND (treatment == 'IFNa30' OR treatment == 'IFNa6h' OR treatment == 'IFNg30' OR treatment == 'IFNg6h' OR treatment == 'None')) ENCODE_BEDGRAPH;

# Map peak binding regions on promoters intersecting with at least one peak
# region in control or treated samples, producing average promoter signal
# (question 4)
JUN_ALL_SIGNAL = MAP(promAvg AS AVG(signal)) PROM_JUN_ALL JUN_ALL_BG;


        PR1   PR2   PR3   PR4   PR5   …   PRn
JUN1    6.7   7.5   5.2   3.2   5.2   …   15.4
JUN2   15.4   0.4   0.8   9.4   6.2   …   21.6
JUN3    5.5  14.5   7.2   5.4   5.2   …    4.5
…        …     …     …     …     …    …     …
JUNn    2.2   2.1   2.5   6.5   6.2   …    6.7
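Each cell of the table above is an average of signal values falling on one promoter for one sample. As a hedged illustration, the Python sketch below computes, for every promoter, the plain (not length-weighted) mean of the signal values of the bedGraph intervals that intersect it, and returns `None` when nothing intersects. The unweighted mean, the function name `map_avg` and the sample data are assumptions of this sketch.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]        # promoter (start, stop), half-open
Signal = Tuple[int, int, float]   # bedGraph-like (start, stop, value)


def map_avg(references: List[Interval],
            signals: List[Signal]) -> List[Optional[float]]:
    """Average signal value of the intervals intersecting each reference region."""
    averages = []
    for ref_start, ref_stop in references:
        hits = [value for start, stop, value in signals
                if start < ref_stop and ref_start < stop]
        averages.append(sum(hits) / len(hits) if hits else None)
    return averages


# Hypothetical promoters and signal intervals for one sample.
promoter_regions = [(0, 1500), (5000, 6500), (9000, 10500)]
bedgraph = [(100, 200, 2.0), (1400, 1600, 4.0), (5200, 5300, 7.5)]
print(map_avg(promoter_regions, bedgraph))
```

Running the function once per sample and stacking the results row by row gives a matrix with the same shape as the table: samples as rows, promoters as columns.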

Summary & Outlook


Genomic Computing Conclusions

• GDM: a data format-independent genomic data model
– For genomic region data and related metadata
– Easing integration and processing of heterogeneous genomic data

• GMQL: a high-level declarative language
– Easing the expression of even complex queries on numerous data of multiple different types
– Running also on cloud computing environments
– Supporting a first processing pass over big data, to extract the relevant (usually smaller) subsets for further processing

• Several GDM & GMQL application examples
– Characterizing interplay and function of genomic regions


Genomic Computing Discovering biologically useful results

Working together with biologists at IEO & IIT to answer biological problems. Ongoing projects:

1. 3D chromatin structure (Pelicci)
2. DNA replication and gene expression (Pelicci)
3. Analysis of unknown P53 binding loci (Amati)
4. Chromatin state change in time-course Myc bindings (Amati)
5. Myc bindings saturation patterns (Amati)
6. Transcription factors co-occurrence with TEAD binding sites (Campaner)
7. Correlating hotspots of RAD21 bindings with high density TF regions (Campaner)
8. Copy number variants (CNVs) in Chromosome 7 (Testa)


Genomic Computing Future Vision: Pattern-based queries from genome browser

[Figure: a pattern of genomic features]


Genomic Computing Future Vision: Cycle Query–Analysis–Visualization

• Query
– By example
– Using public DBs & ontologies
– Search remote data
– Query / extract remote data

• Analysis
– (un)supervised learning
– Region finding
– Motif / pattern finding

• Visualization
– Clustering
– Long range interactions

[Figure: Query, Analysis, Visualization cycle]

Genomic Computing Future Long-term vision: Internet of Genomes

• The platform (client & servers) and language should support queries/computations involving different servers
– Minimizing the information to be transferred among servers and between them and the client

• Each server should expose its own data for access by exploratory search & crawlers


Genomic Computing Resources & Web sites & acknowledgements

Overview: http://www.bioinformatics.deib.polimi.it/genomic_computing/

GMQL web site: http://www.bioinformatics.deib.polimi.it/GMQL/

Includes:
• Local mode or MapReduce mode (over Hadoop, or Hadoop YARN) for GNU/Linux systems - Download (122 MB)
• Web services (over Hadoop YARN) - Download (60 MB)
• Quick start - Install GMQL and get started
• GMQL tutorial & Complete documentation
• Functional comparison with BEDTools & BEDOPS
• Pointer to publications (Bioinformatics, IEEE/ACM TCBB, Methods)

http://www.bioinformatics.deib.polimi.it/GMQL/interfaces/

Includes:
• User-friendly interface for creating/managing GMQL queries
• Repository of ENCODE / Roadmap Epigenomics / TCGA datasets


Genomic Computing Resources, Web sites & acknowledgements

European Research Council “Data-Driven Genomic Computing” (GeCo) project: http://www.bioinformatics.deib.polimi.it/GeCo/

Our group, with Prof. Stefano Ceri and several PhD & master students and postdocs:
• Abdulrahman Kaitoua
• Pietro Pinoli
• Arif Canakoglu


Thank you for your attention!

Any questions?


http://www.bioinformatics.deib.polimi.it/genomic_computing/
