reconstructing the past: deep learning for population genetics€¦ · to learn about hidden...

28
Reconstructing the past: deep learning for population genetics Flora Jay and Guillaume Charpiat LRI (Bioinfo and TAO) CDS Pitching Day

Upload: others

Post on 18-Jul-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Reconstructing the past: deep learning for population genetics€¦ · to learn about hidden genealogies and thus demography (1) from “expert statistics” ... distant histories

Reconstructing the past:deep learning for population genetics

Flora Jay and Guillaume CharpiatLRI (Bioinfo and TAO)

CDS Pitching Day

Page 2: Reconstructing the past: deep learning for population genetics€¦ · to learn about hidden genealogies and thus demography (1) from “expert statistics” ... distant histories

Overview

1. Introduction: past demography and genetics

2. Extracting information from genomic data

3. State of the art: summary statistics (hand designed) + ABC

4. Deep learning

4a. Challenges: variable input size

4b. Desired properties: invariances

4c. Plan

5. Summary

Page 3: Reconstructing the past: deep learning for population genetics€¦ · to learn about hidden genealogies and thus demography (1) from “expert statistics” ... distant histories

Introduction

Page 4: Reconstructing the past: deep learning for population genetics€¦ · to learn about hidden genealogies and thus demography (1) from “expert statistics” ... distant histories

Population genetic and demography

Introduction

Methods and applications

GENETICVARIATION

MUTATIONRECOMBINATION

DEMOGRAPHY

Admixture btw populations

Population structure

Expansions

Population size

NATURAL SELECTION

...

Eg. Peopling of the world by modern humans

Page 5: Reconstructing the past: deep learning for population genetics€¦ · to learn about hidden genealogies and thus demography (1) from “expert statistics” ... distant histories

Population genetic and demography

Introduction

GENETICVARIATION

MUTATIONRECOMBINATION

DEMOGRAPHY

Admixture btw populations

Population structure

Expansions

Population size

NATURAL SELECTION

...

Page 6: Reconstructing the past: deep learning for population genetics€¦ · to learn about hidden genealogies and thus demography (1) from “expert statistics” ... distant histories

Population genetic and demography

Introduction

Demographicinference

Methods and applicationsIdentify events, date them, estimate their strength, ...

GENETICVARIATION

MUTATIONRECOMBINATION

DEMOGRAPHY

Admixture btw populations

Population structure

Expansions

Population size

NATURAL SELECTION

...

Page 7: Reconstructing the past: deep learning for population genetics€¦ · to learn about hidden genealogies and thus demography (1) from “expert statistics” ... distant histories

Why inferring past demography ?

If you love at least one of the following...

● History

● Medicine

● Evolution

then you already have a good reason for trying to infer demography!

eg. did we mingle with Neandertals 50k year ago while peopling earth?

eg. is there a mutation increasing the risk of gettingbreast cancer?

eg. are Tibetan adapted to altitude and why?Are plant populations adapted to their environment and what could be the impact of climate change?

→ gives a null model to test non-neutral hypotheses eg. observed signal at a gene due to [demography] versus [demography+selection] ?

Introduction

Page 8: Reconstructing the past: deep learning for population genetics€¦ · to learn about hidden genealogies and thus demography (1) from “expert statistics” ... distant histories

8

Where is the information?

Page 9: Reconstructing the past: deep learning for population genetics€¦ · to learn about hidden genealogies and thus demography (1) from “expert statistics” ... distant histories

Past demography leaves signatures in genetic data

● Population sizes ● Migration / admixture between populations

Sousa & Hey 2013© Slatkin

Introduction

Page 10: Reconstructing the past: deep learning for population genetics€¦ · to learn about hidden genealogies and thus demography (1) from “expert statistics” ... distant histories

10

Project:Inferring demography using whole genomes

Develop a method using sequence data...

...to identify complex demographic histories.

past

Research

size

present pastpresent

Mutation → polymorphism (SNP)

Dim ~ n x 3Gb

Page 11: Reconstructing the past: deep learning for population genetics€¦ · to learn about hidden genealogies and thus demography (1) from “expert statistics” ... distant histories

11

Genetic process

Adapted from Wikimedia Commons

ancestors ancestors ancestors ...

Identical segment

(same recent ancestor)

Recombination point

On individual = 1 chromosome inherited from the father and 1 from the mother = mosaic of different ancestors

Page 12: Reconstructing the past: deep learning for population genetics€¦ · to learn about hidden genealogies and thus demography (1) from “expert statistics” ... distant histories

Genetic process

● Recombination along the DNA sequence at each generation→ not a single tree for the whole genome BUT multiple genealogies

● Genealogies are influenced by past demography

© Sheehan

Not observed

Palamara et al. 2012

recombination

recomb.

More coalescence when population is small

Page 13: Reconstructing the past: deep learning for population genetics€¦ · to learn about hidden genealogies and thus demography (1) from “expert statistics” ... distant histories

13

Inferring demographic history

The approach: Combining different type of informationto learn about hidden genealogies and thus demography

(1) from “expert statistics” (past and current research)(site-frequency spectrum, linkage disequilibrium, diversity, ...)Using Approximate Bayesian Computation

(2) by learning interesting features from raw data (PROJECT)Deep learning: build a deep neural network tuned forpopulation genetic data

Page 14: Reconstructing the past: deep learning for population genetics€¦ · to learn about hidden genealogies and thus demography (1) from “expert statistics” ... distant histories

14

State of the art: summary statistics + ABC

Page 15: Reconstructing the past: deep learning for population genetics€¦ · to learn about hidden genealogies and thus demography (1) from “expert statistics” ... distant histories

15

Understanding the data - Summary statistics

DNA is spatially structured

1. Distant sites: distant histories (different evolutionary trees)→ site frequency spectrum = histogram of allele counts

1...

2 n

coun

ts

Goal: extract information from genomes about hidden genealogies

Research – What's the data?

Page 16: Reconstructing the past: deep learning for population genetics€¦ · to learn about hidden genealogies and thus demography (1) from “expert statistics” ... distant histories

16

Summary statistics

1. histogram of allele counts at all segregating sites

1

...2 n

coun

ts

Long external branches → excess of Mutations carried

by only ONE individual

1

...2 n

coun

ts

Nb Mutations carried by onlyONE individual

present present

past past

Research – What's the data?

Page 17: Reconstructing the past: deep learning for population genetics€¦ · to learn about hidden genealogies and thus demography (1) from “expert statistics” ... distant histories

17

Summary statistics

DNA is spatially structured

1. Distant sites: distant histories (different evolutionary trees)

2. Less distant sites: related histories→ linkage disequilibrium (correlation between SNPs)

Research – What's the data?

Page 18: Reconstructing the past: deep learning for population genetics€¦ · to learn about hidden genealogies and thus demography (1) from “expert statistics” ... distant histories

18

Summary statistics

DNA is spatially structured

1. Distant sites: distant histories (different evolutionary trees)

2. Less distant sites: related histories

3. Adjacent sites share the same history→ Diversity per regions

Research – What's the data?

recent ancestorold common ancestor

Page 19: Reconstructing the past: deep learning for population genetics€¦ · to learn about hidden genealogies and thus demography (1) from “expert statistics” ... distant histories

19

Summary statistics

1. Distant sites: distant histories (different evolutionary trees)

2. Less distant sites: related histories

3. Adjacent sites share the same history→ Diversity per regions

And so on...

Research – What's the data?

Page 20: Reconstructing the past: deep learning for population genetics€¦ · to learn about hidden genealogies and thus demography (1) from “expert statistics” ... distant histories

20

Approximate Bayesian Computation (ABC)

Generate randomlythousands of histories

present pasttime

Ne

Ne

Ne

1...

2 nco

unts

LD

dist

1...

2 n

coun

ts

LD

dist

1...

2 n

coun

ts

LD

dist

Compute summarystatistics

Keep histories that produce sum. stat. similar to real ones

Infer history

present pasttime

Ne

PopsizeABC pipeline example

Page 21: Reconstructing the past: deep learning for population genetics€¦ · to learn about hidden genealogies and thus demography (1) from “expert statistics” ... distant histories

21

Application to real data

Log scale

Human dataBottleneck + Paleolithic and

neolithic expansions in European data

past

Jay et al (in prep)

Bos taurusBos primigenius (wild aurochs)

Danubianroute

Mediterraneanroute

Neolithic

Paleolithic

Cattle breed datapopulation decline

pastpresent

PopSizeABCBoitard, Rodríguez, Jay et al.

(PloS Genet. 2016)

Page 22: Reconstructing the past: deep learning for population genetics€¦ · to learn about hidden genealogies and thus demography (1) from “expert statistics” ... distant histories

22

Project: Deep Learning instead

Page 23: Reconstructing the past: deep learning for population genetics€¦ · to learn about hidden genealogies and thus demography (1) from “expert statistics” ... distant histories

23

Project

Simulations

Learning- data features

- functions predicting demographic model/parameters/...

Pseudo data

Comparison with

observationsdemo∼ prior

Summary statsSummary stats

Summary stats

Summary stats...

pseu

do

data

dem

ogra

phy

Multi-layer neural networks

ProjectLearn a global relationship between the data and the demographic

parameters with a deep neural network

ABC

Page 24: Reconstructing the past: deep learning for population genetics€¦ · to learn about hidden genealogies and thus demography (1) from “expert statistics” ... distant histories

24

Project

Learn a global relationship between the data and the demographic parameters with a deep neural network

Motivation from previous work in population genetics:

● Improvement when using a simple neural network inside ABC to learn locally the relationship between the summary statistics S(.) and the demographic parameters (Blum&François 2010, Boitard et al 2016, Jay et al in prep, …)

● A global relationship can be learned between S(.) and popgen. parameters using deep neural networks (fully connected) (Sheehan&Song (2016))→ they get rid of the rejection step (Method tested on coarse demographic models only)

● Natural next step: learn automatically the features from raw data → get rid of S(), gain information?

Project

Page 25: Reconstructing the past: deep learning for population genetics€¦ · to learn about hidden genealogies and thus demography (1) from “expert statistics” ... distant histories

25

Project - Challenges

Learn a global relationship between the data and the demographic parameters with a deep neural network

Challenges due to input data: ● Raw genetic data are large (larger than images)● Number of sequenced individuals vary● Length of sequences vary

➔ Need for flexibility & generalization w.r.t. input size➔ e.g.: if knowing how to predict past demography for sets of 10 sequences

of length 100 000 000, want not to start from scratch for a new set of 9 sequences of length 70 000 000.

➔ Recurrent networks can somehow deal with (1D) variable length, but not necessarily suited here (2D, more information by contemplating a whole column at once...)

Project

Page 26: Reconstructing the past: deep learning for population genetics€¦ · to learn about hidden genealogies and thus demography (1) from “expert statistics” ... distant histories

26

Project – Desired properties

Learn a global relationship between the data and the demographic parameters with a deep neural network

Incorporate coalescence knowledge in the architecture:

● Invariance by translation along the genome ● Invariance by permutation of the individuals*● Correlation decreases with distance (but rate depends on the demography)● ...

Project

* but maybe not on the permutation of the haplotypes

Page 27: Reconstructing the past: deep learning for population genetics€¦ · to learn about hidden genealogies and thus demography (1) from “expert statistics” ... distant histories

27

Project - Plan

Learn a global relationship between the data and the demographic parameters with a deep neural network

● Step 1: describe all properties specific to genetic data AND population genetic data to be used in the DNN

● Step 2: practical test with common DNN layers such as recurrent and convolution layers

● Step 3: study flexible architectures that scale to different input size → naturally scalable node functions (e.g.: max, average, variance...) → training a family of neural nets, by combining pre-defined families of layers (indexed by input size) → meta DNN generating architectures (take as input the data size and outputs a neural network)

Project

Page 28: Reconstructing the past: deep learning for population genetics€¦ · to learn about hidden genealogies and thus demography (1) from “expert statistics” ... distant histories

28

Summary

Asked funding: Grant for a Master internship = 3000 euros

References:- Boitard S, Rodriguez W, Jay F, et al. Inferring population size history from large samples of genome wide molecular data an approximate Bayesian computation approach. PLoS Genet. 2016 12(3):e1005877. - Sheehan S, Song YS. Deep learning for population genetic inference. PLoS Comput Biol. 2016 12(3):e1004845. - Stanley, Kenneth O., David B. D'Ambrosio, and Jason Gauci. A hypercube based encoding for evolving large scale neural networks. Artificial life 15.2 (2009): 185 -212.