genomic data model and genometric query language as

117
Dipartimento di Elettronica, Informazione e Bioingegneria Genomic Computing Politecnico di Milano 13-17 March, 2017 Genomic Data Model and GenoMetric Query Language as research enabler to discover genome properties Marco Masseroli and Stefano Ceri (joint work with several PhD students) Politecnico di Milano, BioInformatics Group

Upload: others

Post on 23-May-2022

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Genomic Data Model and GenoMetric Query Language as

Dipartimento di Elettronica, Informazione e

Bioingegneria

Genomic ComputingPolitecnico di Milano 13-17 March, 2017

Genomic Data Model and GenoMetric Query Language as research enabler to discover genome propertiesMarco Masseroli and Stefano Ceri(joint work with several PhD students)

Politecnico di Milano, BioInformatics Group

Page 2: Genomic Data Model and GenoMetric Query Language as

2

Genomic Computing Background: Next Generation Sequencing

• Next Generation Sequencing technology is about to provide affordable (in time and cost) and precise determinations of genome wide:

– DNA sequence / variations (DNA-seq)– gene subregions’ activity (RNA-seq) [all gene test]– protein-DNA interaction regions (ChIP-seq) – open chromatin (DNase-seq)

Goal of $1,000 full genome sequencing in under an hour has just met

• Very many DNA-interacting proteins / subjects / conditions will be soon evaluated

– Personalized medicine (diagnosis and treatment)– Each NGS test can generate 0.4TB -> Big Data scenario

Page 3: Genomic Data Model and GenoMetric Query Language as

Source: http://blog.goldenhelix.com/grudy/a-hitchhiker%E2%80%99s- guide-to-next-generation-sequencing-part-2/

Genomic Computing Big data analysis with Next Generation Sequencing

My talk

3

Page 4: Genomic Data Model and GenoMetric Query Language as

4

HeterogeneousGenomic 

Data Sources

HeterogeneousClinic

Data SourcesPersonal Patient Data

Genome Browser Genome Browser

Data Analytics

Data Analytics

Ontological Knowledge Ontological Knowledge

Genom

ic Data Integration 

Publishing / Crawling / Searching

Genom

ic Data Integration 

Publishing / Crawling / Searching

Clinical D

ata IntegrationC

linical Data Integration

Biologist ClinicianMedical Literature

Clinical Protocols

Genomic Computing The big picture: Distributed heterogeneous data

Page 5: Genomic Data Model and GenoMetric Query Language as

5

A n

umbe

r of g

enom

ic fe

atur

es (t

rack

s)

One macro genomic region

Dat

a tra

cks

Genomic Computing Current practice – UCSC Genome Browser

Page 6: Genomic Data Model and GenoMetric Query Language as

6

Genomic Computing The challenge: Understanding biologists’ needs

• Working together with biologists for giving answers to the problems behind the «courtesy» slide

Courtesy of Prof. Pelicci, IEO

Page 7: Genomic Data Model and GenoMetric Query Language as

7

Genomic Computing Challenge: Genotype–phenotype discovery

• (Epi)genotype-phenotype relationship discovery: understanding genomic regions, genome variations and their associations with different phenotypes

– highly heterogeneous scenario

• It requires evaluating, in several different conditions and types of individuals:

– genome (DNA) sequence variations – gene activity & its regulation– occurring interactions

Page 8: Genomic Data Model and GenoMetric Query Language as

8

Genomic Computing Main questions

Scientist’s typical questions(from our interaction with IEO - European Oncology Instituteand IIT - Italian Institute of Technology)

• Can interesting DNA regions and their relationships be discovered using genome-wide queries?

• Can genomic data of patients be grouped according to clinical phenotype and compared?

• Can the genomic features of all the genes involved in the same biological process be extracted and then analyzed?

• Can we retrieve portions of the genome of given patients, extracting them from remote servers and comparing them?

Page 9: Genomic Data Model and GenoMetric Query Language as

9

Genomic Computing Research agenda

• Can interesting DNA regions and their relationships be discovered using genome-wide queries?

Genometric query language• Can genomic data of patients be grouped according to

clinical phenotype and compared? Genometric query language + clustering

• Can all the features of the genes involved in the same biological process be extracted and then analyzed?

Genometric query language + data analysis

• Can we retrieve portions of the genome of given patients, extracting them from remote servers and comparing them?

Genometric query language + indexing & search

Page 10: Genomic Data Model and GenoMetric Query Language as

10

Genomic Computing Research agenda by topics

• Data model: design a simple and format-independent data model for describing datasets with both genomic regions and general provenance information (including phenotype)

• Query language: design a query language where both genometric aspects (about the placement of regions on the genome) and provenance can be queried at a high level of data independence and transparency

• Integrative data analysis: translating query results into a genome space which is the ideal start point for correlation and network analysis

• Data search: design protocols for data crawling and indexing based on the data model

Page 11: Genomic Data Model and GenoMetric Query Language as

Genomic Data Model

11

Page 12: Genomic Data Model and GenoMetric Query Language as

12

Genomic Computing Genomic Data Model

Within the same sample, two kinds of data:• Region values aligned w.r.t. a given reference, with specific

left-right ends within a chromosome, and with several associated attributes (e.g. p-value of region significance)

• Metadata, with free-format attribute-value pairs, storing all the knowledge about the sample

A C G T T A A C G G A T A C C A A C

left (position) right (position)

chr (chromosome)

strand (direction)

DNAr

Page 13: Genomic Data Model and GenoMetric Query Language as

Genomic Computing Model rationale

• Regions of the model are data format independent and provide an interoperability framework for comparing data on mutations, expression or regulation using regions as common ground

• Metadata attribute-value pairs of the model are info-system independent and provide an interoperability framework for comparing samples based upon their biological aspects

13

Page 14: Genomic Data Model and GenoMetric Query Language as

14

Genomic Computing Genomic Data Model – Example

0.1 0.6 Tumor_type = brcaPatient_age = 75

0.5 0Tumor_type = brcaPatient_age = 63Sex = Female

0.1 0 0.8 0.1Tumor_type = brcaPatient_age = 58

Sample 1

Sample 3

Sample 2

Page 15: Genomic Data Model and GenoMetric Query Language as

15

Genomic Computing Genomic Data Model – Example

• Region values: {expID, region:(chr, left, right, strand), p-value}

• Metadata: {expID, attribute, value}

Page 16: Genomic Data Model and GenoMetric Query Language as

Genomic ComputingGenomic Data Model - Samples and datasets

16

Samples and datasets• Every sample corresponds to an «experiment», with an ID• Every dataset is a named collection of samples with the

same region data schemaData format independent; interoperability framework for comparing data samples based upon their biological aspects

Page 17: Genomic Data Model and GenoMetric Query Language as

Genomic ComputingGenomic Data Model - Mapping examples

17

Page 18: Genomic Data Model and GenoMetric Query Language as

Genomic ComputingGenomic Data Model - Mapping examples

18

Page 19: Genomic Data Model and GenoMetric Query Language as

Genomic ComputingGenomic Data Model - Mapping examples

19

Page 20: Genomic Data Model and GenoMetric Query Language as

DNA-seq (mutations)(id, ('chr,start,stop,strand), (A,G,C,T,del,ins,inserted,ambig,Max,Error,A2T,A2C,A2G,C2A,C2G,C2T))(1, (chr1, 917179, 917180,*), (0,0,0,0,1,0,'.','.',0,0,0,0,0,0,0,0))(1, (chr1, 917179, 917179,*), (0,0,0,0,0,1,G,'.',0,0,0,0,0,0,0,0))

RNA-seq (gene expression)(id, ((chr,start,stop,strand), (source,type,score,frame,geneID,transcriptID,RPKM1,RPKM2,iIDR))(1, (chr8, 101960824, 101964847,-), ('GencodeV10', 'transcript', 0.026615, NULL, 'ENSG00000164924.11', 'ENST00000418997.1', 0.209968, 0.193078, 0.058))

Annotations(id, (chr,start,stop,strand), (proteinID,alignID,type))(1, (chr1, 11873, 11873, +), ('uc001aaa.3', 'uc001aaa.3', 'cds')) (1, (chr1, 11873, 12227, +), ('uc001aaa.3', 'uc001aaa.3', 'exon')) (1, (chr1, 12612, 12721, +), ('uc001aaa.3', 'uc001aaa.3', 'exon')) (1, (chr1, 13220, 14409, +), ('uc001aaa.3', 'uc001aaa.3', 'exon'))

ChIA-PET (denoting 3D genomic loops, head is assembled with coordinates, tail is in the schema)(id,(chr,headstart,headstop,strand), (loopType, tailChr, tailStart, tailStop, PETcount, pValue, qValue))(1, (chr1,7385626,7389841,*), ('Inter-Chromosome', chr17, 3081653, 3084755, 50, 0.0, 0.0)

20

Genomic ComputingGenomic Data Model - Other mapping examples

Page 21: Genomic Data Model and GenoMetric Query Language as

Query Language

21

(Motivational example and detailed description)

Page 22: Genomic Data Model and GenoMetric Query Language as

22

Genomic Computing GMQL motivational example

The language allows for queries on the genome involving large datasets describing:

• Genomic signals (i.e. experiment dataset regions)• Reference regions (e.g. TSS, genes, promoters, enhancers)• Distance rules (e.g. the nearest enhancer that stands

at least at 100 kb from the nearest gene)

Enhancer Promoter

GeneReference DNAExperimental dataset 1

Distance pattern

Experimental dataset 2

Experimental dataset 3

Reference regions

Page 23: Genomic Data Model and GenoMetric Query Language as

23

Identification of distal bindings in transcription regulatory regionsFind all CTCF transcription factor (TF) binding regions of ChIP-seq data regarding human cancer cell line HeLa-S3, which are farther than x kb (e.g. 1000 kb) from the TSS (transcription start site) of the nearest gene. Then, find all H3K4me1 histone modification (HM) regions that are also farther than x kb from the TSS of the nearest gene.Finally, consider known enhancer (EN) regions and return a list of EN-HM-overlapping TF regions.

Genomic Computing GMQL motivational example – Distal bindings

Nearest gene

HM

TF1

TF2

REFx

TSS

EN

GMQL result

HM: Histone mark experiment

TF: Transcription factorexperiment

REF: Reference DNA regionsEN: Enhancerx: Threshold distance

DNA region

Not respectingthe distance thresholdNot EN or HMoverlapping

Page 24: Genomic Data Model and GenoMetric Query Language as

24

Genomic Computing GMQL motivational example – Distal bindings

HM = SELECT(dataType == 'ChipSeq‘ AND cell == 'HeLa-S3‘AND antibody == ‘H3K4me1') PEAK;

TF = SELECT(dataType == 'ChipSeq‘ AND cell == 'HeLa-S3‘AND antibody == ‘CTCF') PEAK;

TSS = SELECT(type == ‘TSS‘) ANNOTATION;EN = SELECT(type == ‘enhancer‘) ANNOTATION; HMa = JOIN(minDistance(5) AND distance > 1000000, right) TSS HM;TFa = JOIN(minDistance(5) AND distance > 1000000, right) TSS TF;TFb = JOIN(distance < 0, left) TFa EN;HMb = JOIN(distance < 0, left) HMa EN;TF_res = JOIN(distance < 0, left) TFb HMb;

Nearest gene

HM

TF1

TF2

REFx

TSS

EN

GMQL result

HM: Histone mark experiment

TF: Transcription factorexperiment

REF: Reference DNA regionsEN: Enhancerx: Threshold distance

DNA region

Not respectingthe distance thresholdNot EN or HMoverlapping

Page 25: Genomic Data Model and GenoMetric Query Language as

25

Genomic Computing GenoMetric Query Language

GenoMetric Query Language (GMQL) is defined as a sequence of algebraic operations following the structure:

< variable > = < operator > (< parameters >) < variable >

– Every variable is a dataset including many samples

– Offers high-level, declarative operations which operate both on regions and meta-data -> thus, each operation progressively builds the regions and meta-data of its result

– Inspired by SQL and Pig Latin

– Targeted towards cloud computing

Page 26: Genomic Data Model and GenoMetric Query Language as

Genomic Computing Overall view on GMQL operations

Classic relational operations – with genomic extensions• SELECT, PROJECT, EXTEND, ORDER, GROUP, MERGE, UNION, DIFFERENCE

Domain-specific genomic operations:

• COVER, (GENOMETRIC) JOIN, MAPUtilities:

• MATERIALIZE

27

Page 27: Genomic Data Model and GenoMetric Query Language as

28

Genomic Computing Sample selection – Example SELECT

0.1 0.6 Tumor_type = brcaPatient_age = 75

Selection of the samples where a selection predicate p is true (e.g. select patients younger than 70 years)

0.5 0Tumor_type = brcaPatient_age = 63Gender = Female

0.1 0 0.8 0.1Tumor_type = brcaPatient_age = 58

S2 = SELECT(p) S1;

Example: S2 = SELECT(Patient_age < 70) S1;

Page 28: Genomic Data Model and GenoMetric Query Language as

30

Genomic Computing Region projection – Example PROJECT

0.1 0.6 Tumor_type = brcaPatient_age = 75

0.5 0Tumor_type = brcaPatient_age = 63Gender = Female

0.1 0 0.8 0.1Tumor_type = brcaPatient_age = 58

Selection of the regions where a selection predicate p is true (e.g. select those regions which have a score greater than 0.5)

S2 = PROJECT(p) S1;

Example: S2 = PROJECT(score > 0.5) S1;

Page 29: Genomic Data Model and GenoMetric Query Language as

31

Genomic Computing Region projection – Example PROJECT

Tumor_type = brcaPatient_age = 75

Tumor_type = brcaPatient_age = 75

Projection of the regions: for each gene in a set, take its promoter (e.g. from -2kbp, to +1kbp from the TSS)

S2 = PROJECT(p) S1;

Example: S2 = PROJECT(start = start – 2000, stop = start + 1000) S1;

S1

S2

Page 30: Genomic Data Model and GenoMetric Query Language as

33

Genomic Computing Metadata aggregation – Example AGGREGATE

5 3102 2

5 01

Tumor_type = brcaPatient_age = 75Region_count = 3

Tumor_type = escaPatient_age = 78Region_count = 5

5 3

Tumor_type = cholPatient_age = 85Region_count = 2

Count the regions in each sample and store it in metadata

S2 = AGGREGATE(Ai AS gi) S1;

Example: S2 = AGGREGATE(Region_count AS COUNT) S1;

Page 31: Genomic Data Model and GenoMetric Query Language as

34

Genomic Computing Order and select top k – Example ORDER

Tumor_type = brcaPatient_age = 75Region_count = 3Order = 2

Tumor_type = escaPatient_age = 78Region_count = 5Order = 1

5 01

5 3102 2

5 3

Tumor_type = cholPatient_age = 85Region_count = 2Order = 3

Order by region_count metadata and take the top 2 samplesS2 = ORDER(Ai; [TOP: k]) S1;

Example: S2 = ORDER(Region_count; TOP: 2) S1;

Page 32: Genomic Data Model and GenoMetric Query Language as

35

Genomic Computing Group by metadata – Example GROUP

5 01

Tumor_type = brcaPatient_age = 75Group = 1Min = 0

5 3102 1

Tumor_type = escaPatient_age = 78Group = 2Min = 1

5 3

Tumor_type = cholPatient_age = 87Group = 3Min = 3

534 6

Tumor_type = escaPatient_age = 78Group = 2Min = 1

Group samples according to the value of tumor and compute the region minimum score of each group

Page 33: Genomic Data Model and GenoMetric Query Language as

36

Genomic Computing Region merge – Example MERGE

Type = ChipSeqAntibody = CTCFReplicate = 1Type = ChipSeqAntibody = CTCFReplicate = 2

Type = ChipSeqAntibody = CTCFReplicate = 3

Type = ChipSeqAntibody = CTCFReplicate = 1Replicate = 2Replicate = 3

Collapse a bunch of samples (both region and metadata) into an unique one S2 = MERGE() S1;

S1.s1

S1.s2

S1.s3

S2

Page 34: Genomic Data Model and GenoMetric Query Language as

37

Genomic Computing Region union – Example UNION

Tumor_type = brcaExperiment = mirna

Tumor_type = brcaExperiment = rnaseq

Return a single dataset with all the samples in two input datasets, merging their region attributes if different

Tumor_type = brcaExperiment = mirna

Tumor_type = brcaExperiment = rnaseq

S3 = UNION() S1 S2;

S1

S2

S3

Page 35: Genomic Data Model and GenoMetric Query Language as

38

Genomic Computing Region difference – Example DIFFERENCE

Return all the regions in the first dataset that do not overlap any region in the second one

Tumor_type = brcaExperiment = mirna

Tumor_type = brcaExperiment = rnaseq

S1.Tumor_type = brcaS1.Experiment = rnaseqS2.Tumor_type = brcaS2.Experiment = mirna

S1

S2

S3

S3 = DIFFERENCE() S1 S2;

Page 36: Genomic Data Model and GenoMetric Query Language as

COVER(ALL,ALL) AND COVER(2,ANY)COVER(1,ANY) OR

S2 = COVER(min, max) S1;

39

Genomic Computing Dataset operations: COVER

• Jaccard indexes can be used instead of min-max• An aggregate function f can be computed for regions

forming the cover

• Produces new regions where there are between MIN and MAX regions of the operand dataset

Page 37: Genomic Data Model and GenoMetric Query Language as

40

Tumor_type = brcaTumor_grade = g3

Tumor_type = brcaTumor_grade = g2

Tumor_type = brcaTumor_grade = g2

Tumor_type = brcaTumor_grade = g2Tumor_grade = g3

2 2 3 1 2 1 1

COVER(2, ANY): find portions of the genome that are covered by at least two regions

Genomic Computing Region Cover – Example COVER

S2 = COVER(2, ANY) S1;

S1.s1

S1.s2

S1.s3

S2 23 22

Page 38: Genomic Data Model and GenoMetric Query Language as

• Given two sets of samples, JOIN builds the pairs of regions and metadata where a join predicate p is true.

• Region of results are composed from regions of the operands

S3 = JOIN(p, comp-op) S1 S2;

• Functions minDistance and distance can be used in the predicate

Genomic Computing Dataset operations: JOIN

41

Page 39: Genomic Data Model and GenoMetric Query Language as

42

Genomic Computing Metadata join – Example JOIN

Metadata join: select pairs of matching samples (e.g. with the same “Type”)

Type = uvmPatient = 123Gender = M

Type = brcaPatient = 10Age = 88

Type = brcaPatient = 211

Type = sarcPatient = 12

Type = sarcPatient = 444Age = 88

Type = brcaPatient = 333Grade = g3

Type = sarcPatient = 12

Page 40: Genomic Data Model and GenoMetric Query Language as

43

Join at min-distance: associate each region in the former dataset with the closest in the latter

Genomic Computing Region Join – Example JOIN

A B feature = transcripts

2 feature = TFBSs

S1.feature = transcriptsS2.feature = TFBSs

S3 = JOIN(mindistance, RIGHT) S1 S2;

1-A 3-B

1 3

S1

S2

S3

Page 41: Genomic Data Model and GenoMetric Query Language as

S2 = JOIN(distance < 1000, CON) R, S1;

d = 400

d = 2500 d = 1100

d = 0

Matching pairs and region

composition

R

3 5

e1

3 5 e2

All pairs

44

Genomic Computing Dataset operations: JOIN example

53

R

e1

Page 42: Genomic Data Model and GenoMetric Query Language as

S2 = MAP(newAttr AS MIN(attr)) R S1;

2 1 3

2 1 43 2

1

2 1

S1

R

S2

45

Genomic Computing Dataset operations: MAP

• Computes aggregate functions over samples of S1 which intersect with the regions of R

0

Page 43: Genomic Data Model and GenoMetric Query Language as

46

Compute an aggregate function (e.g. COUNT) on al the regions intersecting the reference

Genomic Computing Region Map – Example MAP

annotation = genesprovider = RefSeqA B

feature = SNP

2 4

R.annotation = genesR.provider = RefSeqS1.features = SNP

S2 = MAP(count_R_S1 AS COUNT) R S1;

R

S1

S2

Page 44: Genomic Data Model and GenoMetric Query Language as

47

Genomic Computing MAP opens to Genome Space abstraction

• MAP operations, through reference regions R, extract and standardize genomic features expressed in distinct datasets

• Genome Space: simplified structured outcome, ideal format for data analysis

    R1   R2    R3   DHS      

RNAPII          H3K4me1 

 

 …       Gene       A       TSS          Enhancer      E                                  

 

GMQL MAP

Page 45: Genomic Data Model and GenoMetric Query Language as

48

Genomic Computing Next: MAP – Genometric space abstraction

• Genometric spaces represent adjacency matrices, i.e. networks

– Network analysis methods (e.g. page rank, hub/authority, community detection, …)

  R1 R2  R3

DHS     RNAPII           H3K4me1 

 

…   Gene  A TSS 

 

     Enhancer  E                                  

 GMQL MAP

Page 46: Genomic Data Model and GenoMetric Query Language as

49

Genomic Computing Pattern-based querying & clustering

On the Genome Space (rows = regions, columns = datasets):• Convert genome space values (e.g. to Boolean format)• Extract patterns (regions with given values in different datasets,

i.e. with same genomic traits)• Filter rows (find regions with given values in given datasets)• Cluster rows (based on identity or similarity)• Cluster columns (based on identity or similarity)• Extract relevant row attributes (e.g. common aspects of clustered

regions)• Extract relevant column attributes (e.g. common aspects of

clustered datasets, e.g. prevalent phenotype)G1 e1 e2 e3 e4 e5 e6 e7 e8 e9 e10 e11

r1 X X x Xr2 X X X Xr3 X Xr4 X Xr5 X Xr6 X X X

Page 47: Genomic Data Model and GenoMetric Query Language as

51

Genomic Computing MAP result visualization: Genome Browser

Res = MAP(mutCount AS count) Genes Dataset;

Dataset

Genes

Res

Page 48: Genomic Data Model and GenoMetric Query Language as

52

Genomic Computing MAP result visualization: Heatmap viewer

It requires: • Partitioning by experiment classes• Adding names to regions and to experiments (from metadata)• Adding colors

Page 49: Genomic Data Model and GenoMetric Query Language as

53

Genomic Computing Data Viewer: Region clustering

Cluster3 (low densitypattern-2)

Cluster2 (low densitypattern-1)

Cluster1 (high density)

Cluster4 (basal)

Page 50: Genomic Data Model and GenoMetric Query Language as

54

Genomic Computing Data Viewer: MAP result visualization

Page 51: Genomic Data Model and GenoMetric Query Language as

55

Genomic Computing Data Viewer: Dendrogram

Cut-1: 2 clusters

Cut-2: 4 clusters

Distance

Page 52: Genomic Data Model and GenoMetric Query Language as

56

Genomic Computing Data Viewer: Pattern extraction on samples

Page 53: Genomic Data Model and GenoMetric Query Language as

57

Genomic Computing Data Viewer: Metadata aggregation

For biological/clinical interpretation of genomic data processing, and data stratification based on of biological/clinical metadata values and/or patterns of different genomic feature regions

Page 54: Genomic Data Model and GenoMetric Query Language as

GMQL at work(Examples)

59

Page 55: Genomic Data Model and GenoMetric Query Language as

• Annotating unknown data to known annotations(e.g. annotating transcription factors to genes)

• Defining (epi)genomic features on the base ofexperimental data (e.g. putative enhancers)

• Searching for patterns of (epi)genomic features inexperimental data and relating them to specific biological phenomena (e.g. short-range interactions)

• Looking for patterns of (epi)genomic features related tothe tridimensional structure of the genome and relatingthem to specific biological phenomena (e.g. long-range interactions)

Genomic Computing applicationsTypical GMQL applications

60

Page 56: Genomic Data Model and GenoMetric Query Language as

61

Genomic Computing GMQL examples – Genetic mutations

Finding genetic mutations in CpG IslandsConsider DNA-seq data of distinct human cancer cell lines; for each of them quantify the mutations in each CpG island.Then, select the CpG islands with at least a mutation.Return the list of cancer cell lines ordered by the number of such CpG islands.

MUT = SELECT(dataType == ‘DnaSeq‘ AND cell_karyotype == ‘cancer‘) EXP;CpG = SELECT(type == ‘CpG islands') ANNOTATION;

CpG1 = MAP(MutCount AS COUNT) CpG MUT;CpG2 = PROJECT(MutCount > 0) CpG1;CpG3 = AGGREGATE(CpGCount AS COUNT) CpG2;CpG_res = ORDER(DESC CpGCount) CpG3;MATERIALIZE CpG_res;

Page 57: Genomic Data Model and GenoMetric Query Language as

62

Genomic Computing GMQL examples – Heterogeneous datasets

Combining ChIP-seq and DNase-seq data in different formats and sourcesExtract broad peaks of ChIP-seq transcription factor binding sites and histone modifications from ENCODE samples that intersect DNase-seq open chromatin regions from Roadmap Epigenomics in normal H1 embryonic stem cells.

CHIPSEQ = SELECT(dataType == 'ChipSeq' AND view == 'Peaks' AND setType == 'exp' AND cell == 'H1-hESC') HG19_ENCODE_BROAD;

DNASESEQ = SELECT(assay == 'DNase.hotspot.broad' AND Standardized_Epigenome_name == 'H1 Cells') HG19_ROADMAP_EPIGENOMICS_BED;

DNASESEQ1 = COVER(1, ANY) DNASESEQ;

CHIPSEQ_IN_DNASESEQ = JOIN(distance < 0, project_right_distinct)DNASESEQ1 CHIPSEQ;

MATERIALIZE CHIPSEQ_IN_DNASESEQ;

Page 58: Genomic Data Model and GenoMetric Query Language as

63

Genomic Computing GMQL examples – Heterogeneous datasets

Associating transcriptomics and epigenomicsIn RNA-seq experiments of human cancer cell line HeLa-S3, find the average expression of each gene. Then, in ChIP-seq experiments of the same cell line, in each gene find the average signal of H3K27me3 and H3K4me3 histone modifications. Map these experiments to known genes.

Based on all such average values, evaluate the pairwise similarity of each gene with the BRCA1 gene, and return the ordered list of the 20 genes most similar to BRCA1.Also, through bi-clustering, identify and return groups of genes and their histone and transcription signals with similar patterns.

GENES = SELECT(type == ‘gene‘ AND provider == ‘UCSC’) ANNOTATION; RNA = SELECT(dataType == ‘RnaChip‘ AND cell == 'HeLa-S3‘) PEAK;HM = SELECT(dataType == ‘ChipSeq‘ AND cell = ‘HeLa-S3‘ AND

(antibody == ‘H3K27me3‘ OR antibody == ‘H3K4me3‘)) PEAK;

EXP = UNION RNA HM;GenomeSpace = MAP(EXPavg AS avg(signal)) GENES EXP;……

Page 59: Genomic Data Model and GenoMetric Query Language as

64

Genomic Computing GMQL examples – Patient datasets

Counting distinct DNA mutations in patient groupsGroup patients by tumor type and ethnicity, and count the distinct DNA somatic mutations in each group.

MUTATION = SELECT(dataType == ‘dnaseq’) TCGA_dnaseq;

MUTATION_BY_RACE = COVER(1, ANY; GROUP BY tumor_tag, race;overlap_count AS COUNT, barcodes AS BAG(tumor_sample_barcode)) MUTATION;

MUTATION_COUNT = AGGREGATE(mutation_count AS COUNT) MUTATION_BY_RACE;

MATERIALIZE MUTATION_COUNT;

Page 60: Genomic Data Model and GenoMetric Query Language as

65

Genomic Computing GMQL examples – Different datasets of patients

Combining different data types of multiple patients Match DNA copy number variation (CNV) and microRNA (miRNA) data samples regarding the same biospecimen and extract the CNVs occurring within expressed miRNA genes in the paired samples.

CNV = SELECT(dataType == ’cnv’) TCGA_cnv;

MIRNA_GENE = SELECT(dataType == ’mirnaseq’) TCGA_mirnaseq_mirna;

CNV_GENE_0 = MAP(left -> bcr_sample_barcode == right -> bcr_sample_barcode, gene_count AS COUNT, mirna_genes AS BAG(mirna_id)) CNV MIRNA_GENE;

CNV_GENE = PROJECT(gene_count > 0) CNV_GENE_0;

MATERIALIZE CNV_GENE;

Page 61: Genomic Data Model and GenoMetric Query Language as

66

Genomic Computing GMQL examples – Different datasets of patients

Combining and processing heterogeneous omics data of patients In TCGA data of breast cancer patients, find the DNA somatic mutations within the first 2000 bp outside of the genes that are expressed and methylated in at least one of these patients, and extract the top five patients with the highest number of such mutations and their somatic mutations.

EXPRESSED_GENE = SELECT(dataType == ‘rnaseqv2’ AND tumor_tag == 'brca') HG19_TCGA_RnaSeqV2_Gene;

METHYLATION = SELECT(dataType == ‘dnamethylation’ AND tumor_tag == 'brca') HG19_TCGA_Dnamethylation;

MUTATION = SELECT(data_type == ‘dnaseq’ AND tumor_tag == 'brca') HG19_TCGA_DnaSeq;

GENE_METHYL = JOIN(left->bcr_sample_barcode == right-> bcr_sample_barcode, distance < 0, project_left_distinct) EXPRESSED_GENE METHYLATION;

GENE_METHYL1 = COVER(1, ANY) GENE_METHYL;MUTATION_GENE = JOIN(distance < 2000 AND distance > 0, left) MUTATION

GENE_METHYL1;MUTATION_GENE_count = AGGREGATE(mutation_count AS COUNT) MUTATION_GENE;MUTATION_GENE_top = ORDER(DESC mutation_count; TOP 5) MUTATION_GENE_count;MATERIALIZE MUTATION_GENE_top;

Page 62: Genomic Data Model and GenoMetric Query Language as

System Architecture

(GenData 2020)http://www.bioinformatics.deib.polimi.it/genData/

67

Page 63: Genomic Data Model and GenoMetric Query Language as

Genomic Computing Overall architecture of system (GenData 2020)

68

GMQL

Page 64: Genomic Data Model and GenoMetric Query Language as

Stores experimental datasets and annotations collected from external databases

• ENCODE (more than 4000 processed datasets for humans and mices, relevant to epigenomic research)

• Epigenomics Roadmap (about 1000 human epigenomic datasets for stem cells and ex-vivo tissues)

• TCGA (The Cancer Genome Atlas, providing more than 50,000 processed datasets for more than 30 cancer types, including mutations, copy number variations, gene and miRNA expressions, methylations)

Genomic Computing Repository

69

Page 65: Genomic Data Model and GenoMetric Query Language as

Genomic Computing Repository

70

Page 66: Genomic Data Model and GenoMetric Query Language as

Annotation data are also extracted from external references, based upon the needs of given research projects

• Genes (UCSC, RefSeq, Ensembl)• Transcription Start Sites (SwitchGear)• Transcription Factor Binding Sites (UCSC, ENCODE)• CpG islands (UCSC)• miRNA target sites (UCSC)• Enhancers (Vista)

Genomic Computing Repository - Annotations

71

Page 67: Genomic Data Model and GenoMetric Query Language as

Implementation

72

Page 68: Genomic Data Model and GenoMetric Query Language as

73

Genomic Computing GMQL implementation, V1

• GMQL similar to Pig Latin (by Yahoo! Research)– Algebraic language for data-intensive applications on

Apache Hadoop, a framework for parallel computing which executes Google MapReduce programs

• Implementation strategy: develop a translator to Pig Latin– Easier development and maintenance – Big company involvement ensures development– Use cloud computing power to obtain efficiency and

scalability

Page 69: Genomic Data Model and GenoMetric Query Language as

Genomic Computing System architecture

74

Page 70: Genomic Data Model and GenoMetric Query Language as

Genomic Computing System architecture

75

Page 71: Genomic Data Model and GenoMetric Query Language as

Genomic Computing System architecture - Repository

76

Page 72: Genomic Data Model and GenoMetric Query Language as

Genomic Computing GMQL query translation to PIG over Hadoop

77

GMQL query Translator Pig over Hadoop

Motivation:• Clear & compact user code• User-transparent optimization

Page 73: Genomic Data Model and GenoMetric Query Language as

Genomic Computing Translation example

78

• 1 statement => 25 Pig Latin lines of code + auxiliary Java function

• The translator takes also care of updating the variable schema

• Error handling

Page 74: Genomic Data Model and GenoMetric Query Language as

79

Genomic Computing GMQL to Pig Latin translation

Page 75: Genomic Data Model and GenoMetric Query Language as

Genomic Computing Optimization in translation

1. Parallelism by splitting computations:• By chromosome• By experiment

2. Join and Map have a translation which avoids crossproducts, based on sequential scan of regions

Pig Latin shows its ability to scale on hundreds or thousands of experiments and multi-node systems

80

Page 76: Genomic Data Model and GenoMetric Query Language as

Genomic Computing MAP and JOIN vs. competitors

81

Page 77: Genomic Data Model and GenoMetric Query Language as

Genomic Computing System architecture, V2

• Holistic data management system for genomics• Uses cloud-based computing for querying thousands of

heterogeneous datasets

82

Page 78: Genomic Data Model and GenoMetric Query Language as

83

Genomic Computing GMQL implementation, V2

• A different approach, with language-independent intermediate representation

• Targeting also usability from within R and Galaxy

IR Semantics

GMQL EmbeddedGMQL

LogicalGMQL

Spark Flink ???

Syntax

Implementation API to IR

Page 79: Genomic Data Model and GenoMetric Query Language as

Genomic Computing GMQL implementation, V2 - Scala API

84

Page 80: Genomic Data Model and GenoMetric Query Language as

Genomic Computing GMQL implementation, V2 - IR Example

85

Page 81: Genomic Data Model and GenoMetric Query Language as

86

Genomic Computing GMQL implementation, V2

• New optimization options

GMQLGMQL IR OptimizedIR Implementation

QueryOptimizer

Low LevelOptimizer

1) Node reordering / deletion2) Select condition refinement

1) Alternative algorithms2) Parallelism tuning3) Data partitioning4) Caching

Page 82: Genomic Data Model and GenoMetric Query Language as

Genomic Computing Optimizations at compile-time & execution-time

Idea:• Let Flink/Spark/… engines implement common and

well known optimization• Exploit the intermediate representation in order to

implement optimizations which are driven by the semantics of GMQL

• Meta-first optimization• Operator swapping optimization

• Other optimizations based on algorithms for parallel execution on the cloud

87

Page 83: Genomic Data Model and GenoMetric Query Language as

Genomic Computing Meta-first optimization

Under certain conditions (meta-separability), it is possible to compute the metadata side of the query strictly before the region data side.

88

Page 84: Genomic Data Model and GenoMetric Query Language as

Genomic Computing Meta-separability

GMQL queries are always meta-separable, except for the ones which use the EXTEND (AGGREGATE) operator

(EXTEND operator computes and aggregates on the region data and stores the result in the metadata)

89

Page 85: Genomic Data Model and GenoMetric Query Language as

Genomic Computing Meta-first optimization - Workflow

ReadMD

StoreMD

ReadRD

StoreRD

ID s

• Compute metadata side of the query

• Retrieve the IDs from the metadata result

• Use the IDs to selectively load only the files that will appear in the output

90

Page 86: Genomic Data Model and GenoMetric Query Language as

Genomic Computing Where it helps?

Affected queries are the ones which contain one or more metadata selection (far from the Readings), metadata join and metadata group by; those operations cut the size of the output

91

Page 87: Genomic Data Model and GenoMetric Query Language as

Genomic Computing Operator swapping optimization

Some reordering of the execution plan can not be inferred by lower level optimizer, since they are motivated by GMQL semantics

DS1 DS2 DS3

Joinleft

Diff

DS1 DS3 DS2

Joinleft

Diff

92

Page 88: Genomic Data Model and GenoMetric Query Language as

Genomic Computing Binning strategy optimization

Bin1 Bin2 Bin3 Bin4 Bin5 Bin6 Bin75bin 6

n7

Strategy for intersection:1. Partition the genome in bins2. Assign each region to all the bins it overlaps3. Search for intersections within each bin

In the case of more complex operations, we change the way in which the regions are assigned to the bins

93

Page 89: Genomic Data Model and GenoMetric Query Language as

In order to avoid the duplicates production, when two regions overlap, an output is emitted if, and only if, at least one of them begins in the considered bin

• Bin 2: overlap => red region begins => Output e• Bin 3: overlap => no region begins => Output not emitted!

Genomic Computing Binning strategy

Avoiding output duplicates:

bin 1 bin 2 bin 3

94

Page 90: Genomic Data Model and GenoMetric Query Language as

Genomic Computing Binning tradeoff

• Smaller bins: smaller search space, but higher number of replicates

• Optimal binning size depends on:– Number of regions and local density– Region length distribution– GMQL operation and parameters– System settings (e.g., number of nodes, amount of

memory, …)

050100150200250300

1K 5K 10K 50K 100K 200K 500K 1M 10M

Bin length 

Exec. time

95

Page 91: Genomic Data Model and GenoMetric Query Language as

User Interface

97

Page 92: Genomic Data Model and GenoMetric Query Language as

Genomic Computing GMQL Web interface

98

http://www.bioinformatics.deib.polimi.it/GMQL/interfaces/

Page 93: Genomic Data Model and GenoMetric Query Language as

100

Genomic ComputingGMQL results on Integrated Genome Browser

Page 94: Genomic Data Model and GenoMetric Query Language as

Genomic Computing GMQL examples – Distal bindings: Visualization of results

101

Results are provided to user in GTF or Tab-delimited format

Page 95: Genomic Data Model and GenoMetric Query Language as

Genomic Computing GMQL examples – Distal bindings: Visualization of results

102

Page 96: Genomic Data Model and GenoMetric Query Language as

Applications

103

Page 97: Genomic Data Model and GenoMetric Query Language as

Genomic Computing Transcription Factor query & Genome Space

Source: ENCODE ChIP-seq datasets for transcription factors (TF)Goal: Generation of a transcriptional network from ChIP-seq dataMethod: Select differentially TFs which are also genes and derive TF-Genes and TF-TF links. Build a TFxGenes matrix Mij such that Mij = 1 if there is at least one binding site i in the gene region j, Mij = 0 otherwise.

# Extract gene information, some of them tagged with TF-encoding

GENE_ANN = SELECT(type == 'gene' AND provider == 'RefSeq') ANNOTATION;

GENES = PROJECT(feature == ’gene’) GENE_ANN; TF_GENES = PROJECT(encode_TF == ‘yes’) GENES; # red in next slide

# Collect TF samples (122)

TF = SELECT(dataType = 'ChipSeq' AND subType = 'TF' AND cell == 'k562' AND treatement == 'None') ENCODE_PEAK;

# Build TF Genomic Space

GS_TF = MAP(signal AS EXISTS) GENES TF;104

Page 98: Genomic Data Model and GenoMetric Query Language as

105

Genomic Computing Genome Space building

G1 G2 G4

TF1TF2TF4

Ann

Cell line: K562 (CML)# TFs: 122# Nodes: 6240# Edges: 30587

TF2

TF1 G1

G2

TF4G4

G3=TF3

G3=TF3

Page 99: Genomic Data Model and GenoMetric Query Language as

106

Genomic Computing Restriction of TF to open chromatin regions

# Collect open chromatin samples

DHS_2 = SELECT(dataType == 'DnaseSeq' AND cell == 'k562') ENCODE_PEAK; # 2 samples

FAIRE = SELECT(dataType == 'FaireSeq' AND cell == 'k562') ENCODE_PEAK; # 1 sample

# Merge DHS replicates in one sample

DHS = COVER_FLAT(2, ANY; pValue AS MIN(pValue)) DHS_2;

# Merge open chromatin regions from DHS and FAIRE assays

DHS_FAIRE = UNION DHS FAIRE;OPEN = COVER(1, ANY; pValue AS MIN(pValue)) DHS_FAIRE;

# Extract TFs in open chromatin regions only (active DNA # binding)

TF_OPEN = JOIN(distance < 0, left) TF OPEN;

Page 100: Genomic Data Model and GenoMetric Query Language as

107

Genomic Computing Transcription Factor Genome Space

# TF Genomic Space

GS_TF = MAP(signal AS EXISTS) GENES TF_OPEN;

G1 G2 G3TF3

G4 G5 … Gn

TF1 1 0 1 1 0 … 1

TF2 1 0 1 0 0 … 1

TF3 0 0 0 0 1 … 0

… 0 0 0 1 0 … 1

TFn 0 0 0 1 1 … 1

Page 101: Genomic Data Model and GenoMetric Query Language as

108

Genomic Computing Genome Space building

G1 G2 G4Ann

TF1

TF2

TF4

DHS

FAIRE

TF2

TF1

TF4

G4

G2 Cell line: K562 (CML)# TFs: 95# Nodes: 1717# Edges: 2367

G3=TF3

G3=TF3

G1

Page 102: Genomic Data Model and GenoMetric Query Language as

From Genometric Space to networks: K562 transcription network

109

Page 103: Genomic Data Model and GenoMetric Query Language as

Genomic Computing Differential binding research

Source: ENCODE JUN ChIP-seq datasets relative to cell line ‘k562’ (chronic myelogenous leukemia), with one control and 4 cases treated using interferon alpha or gamma at 30 minutes and 6 hours respectively, each with two replicas

Questions: • Q1: Find those JUN binding regions in treated cases where there are no

JUN binding peaks in the control• Q2: Find the JUN binding regions which are differentially present between

treated cases (e.g. in alpha and not in gamma interferon treated samples)

• Q3: Find the genes that overlap with JUN binding regions either in control or treated samples

• Q4: Find the average JUN ChIP-seq signal (in bedGraph files) of the treated samples in promoters (areas surrounding the TSS) of genes intersecting JUN binding regions either in control or treated samples

110

Page 104: Genomic Data Model and GenoMetric Query Language as

Select IFNa30 non overlapping with CONTROLS (Question 1, same for IFNa6h, IFNg30, IFNg6h)

# select controlsJUN_2 = SELECT (antibody == 'c-Jun' AND dataType == 'ChipSeq' AND

cell == 'k562' AND laboratory == 'Stanford' AND treatment == 'None') ENCODE_PEAK;

# extract the peak regions in at least two replicas and their minimum p-valueJUN = COVER(2, ANY; pValue AS MIN(pValue)) JUN_2;

# select treated samplesJUN_IFNa30_2 = SELECT(antibody == 'c-Jun' AND dataType == 'ChipSeq'

AND cell == 'k562' AND laboratory == 'Stanford' AND treatment == 'IFNa30') ENCODE_PEAK;

# extract the peak regions in at least two replicas and their minimum p-valueJUN_IFNa30_3 = COVER(2, ANY; pValue AS MIN(pValue)) JUN_IFNa30_2;

# extract peaks which do not overlap with controls (question 1)JUN_IFNa30 =JOIN(minDistance AND distance >= 0, left) JUN_IFNa30_3 JUN;

111

Page 105: Genomic Data Model and GenoMetric Query Language as

Regions expressed differentially by treatment alpha/gamma (Question 2)

# all treated samples: union of all peaks regions non-overlapping with controlsJUN_IFNa = UNION JUN_IFNa30 JUN_IFNa6h;JUN_IFNg = UNION JUN_IFNg30 JUN_IFNg6h;JUN_IFNag = UNION JUN_IFNa JUN_IFNg;

# regions that are present in all treatments alpha and not in any treatment# gamma (question 2)JUN_IFNag_ONLYa = DIFFERENCE() JUN_IFNa JUN_IFNg;

# regions that are present in all treatments gamma and not in any treatment# alpha (question 2)JUN_IFNag_ONLYg = DIFFERENCE() JUN_IFNg JUN_IFNa;

112

Page 106: Genomic Data Model and GenoMetric Query Language as

JUN overlapping gene lists (Question 3)

# extract all genesGENE_2 = SELECT(type == 'gene' AND provider == 'RefSeq')

ANNOTATION;GENE = PROJECT(feature == ’gene’) GENE_2;

# all samples: union of peak regions in controls or in treated samplesJUN_ALL = UNION JUN JUN_IFNag;

# extract genes intersecting with peak regions in control or treated samples# (question 3)GENE_JUN_ALL = JOIN(distance < 0, left) GENE JUN_ALL;

113

Page 107: Genomic Data Model and GenoMetric Query Language as

Prepare relevant promoter regions

# extract transcription start sites (TSS) of all genesTSS = SELECT(type == ‘TSS’) ANNOTATION;

# extend TSS to obtain promoter regions PROM = PROJECT(start = start - 1000, stop = stop + 500) TSS;

# extract the peak regions present in at least one sampleJUN_ALL_2 = COVER(1, ANY) JUN_ALL;

# extract all promoter regions intersecting with at least one peak region # in control or treated samplesPROM_JUN_ALL = JOIN(distance < 0, right) JUN_ALL_2 PROM;

114

Page 108: Genomic Data Model and GenoMetric Query Language as

Compute average signal on extended promoter regions (Question 4)

# Load bedGraph datasetsJUN_ALL_BG = SELECT(antibody == 'c-Jun' AND dataType == 'ChipSeq' AND

cell == 'k562' AND laboratory == 'Stanford' AND(treatment == 'IFNa30' OR treatment == 'IFNa6h' ORtreatment == 'IFNg30' OR treatment == 'IFNg6h' OR treatment == none)) ENCODE_BEDGRAPH;

# Map peak binding regions on promoters intersecting with at least one peak# region in control or treated samples, producing average promoter signal # (question 4)JUN_ALL_SIGNAL = MAP(promAvg AS AVG(signal)) PROM_JUN_ALL

JUN_ALL_BG;

115

PR1 PR2 PR3 PR4 PR5 … PRn

JUN1 6.7 7.5 5.2 3.2 5.2 … 15.4

JUN2 15.4 0.4 0.8 9.4 6.2 … 21.6

JUN3 5.5 14.5 7.2 5.4 5.2 … 4.5

… … … … … … … …

JUNn 2.2 2.1 2.5 6.5 6.2 … 6.7

Page 109: Genomic Data Model and GenoMetric Query Language as

Summary & Outlook

116

Page 110: Genomic Data Model and GenoMetric Query Language as

117

Genomic Computing Conclusions

• GDM: a data format-independent genomic data model– For genomic region data and related metadata– Easing integration and processing of heterogeneous

genomic data

• GMQL: a high-level declarative language– Easing the expression of even complex queries on

numerous data of multiple different types– Running also on cloud computing environments– Supporting a first processing also of big data, to extract

the relevant (usually smaller) ones for further processing

• Several GDM & GMQL application examples– Characterizing interplay and function of genomic regions

Page 111: Genomic Data Model and GenoMetric Query Language as

118

Genomic Computing Discovering biologically useful results

Working together with biologists at IEO & IIT for giving answers to biological problems. Ongoing projects:

1. 3D chromatine structure (Pelicci)2. DNA replication and gene expression (Pelicci)3. Analysis of unknown P53 binding loci (Amati)4. Chromatin state change in time-course Myc bindings (Amati)5. Myc bindings saturation patterns (Amati)6. Transcription factors co-occurrence with TEAD binding sites

(Campaner)7. Correlating hotspots of RAD21 bindings with high density TF

regions (Campaner)8. Copy number variants (CNVs) in Chromosome 7 (Testa)

Page 112: Genomic Data Model and GenoMetric Query Language as

119

A p

atte

rn o

f gen

omic

feat

ures

Genomic Computing FutureVision: Pattern-based queries from genome browser

Page 113: Genomic Data Model and GenoMetric Query Language as

120

Genomic Computing Future Vision: Cycle Query–Analysis–Visualization

• Query– By example– Using public DBs & ontologies– Search remote data– Query / extract remote data

• Analysis– (un)supervised learning– Region finding– Motif / pattern finding

• Visualization– Clustering– Long range interactions

VisualizationVisualization

QueryQuery

AnalysisAnalysis

Page 114: Genomic Data Model and GenoMetric Query Language as

Genomic Computing Future Long-term vision: Internet of Genomes

• The platform (client & servers) and language should support queries/computations involving different servers- Minimizing the information to be transferred among

servers and between them and the client• Each server should expose its own data for access by

exploratory search & crawlers

1 2 1 121

Page 115: Genomic Data Model and GenoMetric Query Language as

Genomic Computing Resources & Web sites & acknowledgements

Overview: http://www.bioinformatics.deib.polimi.it/genomic_computing/GMQL web site: http://www.bioinformatics.deib.polimi.it/GMQL/

Includes:• Local mode or MapReduce mode (over Hadoop, or Hadoop YARN)

for GNU/Linux systems - Download (122 MB)• Web services (over Hadoop YARN) - Download (60 MB)• Quick start - Install GMQL and get started• GMQL tutorial & Complete documentation• Functional comparison with BEDTools & BEDOPS• Pointer to publications (Bioinformatics, IEEE/ACM TCBB, Methods)

http://www.bioinformatics.deib.polimi.it/GMQL/interfaces/Includes:• User-friendly interface to creating/managing GMQL queries• Repository of ENCODE / Roadmap Epigenomics / TCGA datasets

122

Page 116: Genomic Data Model and GenoMetric Query Language as

Genomic Computing Resources, Web sites & acknowledgements

European Research Council “Data-Driven Genomic Computing”(GeCo) project: http://www.bioinformatics.deib.polimi.it/GeCo/

Our group, with Prof. Stefano Ceri and several PhD & master studentsand postDoc: • Abdulrahman Kaitoua• Pietro Pinoli• Arif Canakoglu

123

Page 117: Genomic Data Model and GenoMetric Query Language as

Thank you for your attention!

Any question?

Thank you for your attention!

Genomic Computing

124

http://www.bioinformatics.deib.polimi.it/genomic_computing/