irri - ksiconnectksiconnect.icrisat.org/wp-content/uploads/2014/02/...grisp centers (4) universities...

The 3000 Rice Genome Project: Update and Plans

IRRI GRiSP CAAS

Our Project Team & Donors

BGI: Shuai-Shuai Tai, Xin Liu, Jun Li, Guo-Jie Zhang, Bo Wang, Xun Xu, Gengyun Zhang

IRRI: Ma. Elizabeth B. Naredo, Sheila M. Q. Mercado, Myla C. Rellosa, Renato A. Reaño, Grace Lee S. Capilit, Flora C. de Guzman, N. R. Sackville Hamilton, Jauhar Ali, Ramil P. Mauleon, Nickolai A. Alexandrov, Hei Leung, Kenneth L. McNally

CAAS: Wen-Sheng Wang, Yong-Ming Gao, Xiu-Qing Zhao, Jian-Long Xu, Tian-Qing Zheng, Fan Zhang, Yong-Li Zhou, Bin-Ying Fu, Zhikang Li

GRiSP & CAAS

MOST:

Overview

• The Global Rice Science Partnership

• Building a global rice diversity platform

• IRG and CNCGB rice collections and rice diversity

• Germplasm selection

• 3K rice genome sequences

• Preparing for phenomics

• Adapting to deep data:

– the International Rice Informatics Consortium

GRiSP activities for SNP genotyping and sequencing

Product lines for theme 1: Harnessing genetic diversity to chart new productivity, quality, and health horizons

GWAS

1.1. Ex situ conservation and dissemination of rice germplasm

1.2. Characterizing genetic diversity and creating novel gene pools

1.2.1 SNP Consortium for high density genotypes

1.2.2 Global phenotyping network for key traits

1.2.3 Whole genome sequencing of genebank stocks

1.2.4 Specialized populations for genetic studies

1.3. Genes and allelic diversity conferring stress tolerance and enhanced nutrition

1.4. Converting rice from C3 to C4 architecture and metabolism

IRRI

CRP 3.3: Global Rice Science Partnership (GRiSP)

computation, bioinformatics, databases

Public Genetic Diversity Research Platform

Rice Diversity as Foundation: Conserved Germplasm, Breeding Lines, Specialized Genetic Stocks

[Diagram labels: conservation, dissemination, genotype-phenotype association, use]

Current problems: drought tolerance, durable disease-pest resistance, problem soils

Future challenges: C4 rice, grain quality

Rice germplasm – IRGC and CNCGB

IRGC – the International Rice Genebank Collection

World’s largest collection of rice germplasm, held in trust for the world community and source countries

• Over 122,000 registered and incoming accessions from 117 source countries

• Two cultivated species: Oryza sativa and Oryza glaberrima

• 22 wild species

• Relatively few accessions have donated alleles to current, high-yielding varieties

• http://www.irri.org/GRC

http://www.fao.org/rice2004/

China National Crop Gene Bank (http://icgr.caas.net.cn/cgrisintroduction.html)

Crop name           Accessions
Rice                     63297
Wild rice                 6944
C-rice                    1605
Genotype of rice           120
Total                    71966   (2014-1-10)

CAAS working collection for Green Super Rice project (GSR)

~500 inbred and hybrid lines

Global distribution of Isozyme variety groups

Germplasm selection

Semi-stratified design (IRGC) of 12K core

Genome type (Species)

Subspecies/isozyme group (eco-cultural type)

Source country (geographical location)

Usage/deployment (trait characterization)

Random selection (covering name-space)

Nominations

[Figure: bar chart of No. of accessions (0 to 3,000) by source country, for source countries with >100 entries]

3000 genotype set

• 2466 from IRGC 12K core

– Random selection from 12K core

– Nominated entries from Cirad (Orytage panel) and IRRI (founders of Indica-MAGIC)

• 534 from CNCGB

– Mini-core based on diversity assessment of morpho-agronomic and molecular data

– Entries from GSR working collection

• All entries underwent purification by single-seed descent (SSD) prior to sequencing.
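The idea of drawing a random subset within strata (for example isozyme group and source country) can be illustrated with a minimal Python sketch. It is not the project's actual selection procedure: the record layout, the proportional allocation rule, and the toy accessions are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(accessions, key, n_total, seed=0):
    """Draw roughly n_total accessions, giving each stratum a share
    proportional to its size (at least one per non-empty stratum).
    Proportional allocation is an assumption of this sketch."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for acc in accessions:
        strata[key(acc)].append(acc)
    picked = []
    for _, members in sorted(strata.items()):
        quota = max(1, round(n_total * len(members) / len(accessions)))
        picked.extend(rng.sample(members, min(quota, len(members))))
    return picked

# Hypothetical records: (accession id, isozyme group, source country)
groups = ([("indica", "India")] * 60 + [("japonica", "Japan")] * 30
          + [("aus", "Bangladesh")] * 10)
toy = [(f"ACC{i:04d}", g, c) for i, (g, c) in enumerate(groups)]

# Stratify on (isozyme group, source country) and draw 20 entries
core = stratified_sample(toy, key=lambda a: (a[1], a[2]), n_total=20)
print(len(core), "entries selected")
```

Nominated entries, as on the slide, would simply be appended to the random draw before removing duplicates.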

Global distribution of 3000 genotypes from IRGC and CAAS

[Figure: regions (countries) as % fraction of total; 14.5% from India]

3000 genome sequences

IRG Traditional Germplasm 100,000 cultivated accessions

3 to 10% sampling, then NGS @ ≥10X depth

Apply low-cost sequencing by next generation technology

• Final sequencing data on 3,000 samples now obtained

• 17 TB of filtered/trimmed reads produced at an average depth of 14X per genome

• Depth ranges from ~5X to >50X

• Variant calling relative to Nipponbare completed

• Other analyses underway
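As a rough check on what "14X average depth" means, depth is the number of sequenced bases divided by the reference genome length. The sketch below assumes a Nipponbare reference of about 373 Mb (an approximation, not a figure from the slides) and counts bases, not file size, so its result is not directly comparable to the 17 TB of read files.

```python
GENOME_SIZE_BP = 373_000_000  # assumption: approximate Nipponbare reference length

def mean_depth(total_bases, genome_size=GENOME_SIZE_BP):
    """Average sequencing depth: sequenced bases per reference base."""
    return total_bases / genome_size

def bases_needed(depth, n_samples, genome_size=GENOME_SIZE_BP):
    """Total bases required to cover n_samples genomes at the given depth."""
    return depth * n_samples * genome_size

total = bases_needed(depth=14, n_samples=3000)
print(f"{total / 1e12:.1f} trillion bases for 3,000 genomes at 14X")
print(f"per-genome depth check: {mean_depth(total / 3000):.1f}X")
```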

GRiSP Product 1.2.3. Sequencing the Genebank – 3000 Genome project

• Genomes uploaded to GigaDB.org as trimmed/filtered FASTQ (raw), with DOI assignments for project and accession file sets

• Galaxy documentation for workflows and analyses

• Data Note and Commentary papers describing project, goals, and protocols submitted to GigaScience Journal, January 2014

– Accession list, seed and data availability, and pathway for data analysis

• Call for global consortium for value-added analyses in framework of Human Thousand Genomes, Encode, …

Data quality by FastQC (1000 genome sample)

Check                           PASS   WARN   FAIL   % FAIL
Basic Statistics               14686      0      0    0.00%
Kmer Content                   13808    863     15    0.10%
Overrepresented sequences      14672     14      0    0.00%
Per base GC content            14241    385     60    0.40%
Per base N content             14578    108      0    0.00%
Per base sequence content       5152   9443     91    0.60%
Per base sequence quality      13774    130    782    5.30%
Per sequence GC content          214  14030    442    3.00%
Per sequence quality scores    14686      0      0    0.00%
Sequence Duplication Levels    13953    733      0    0.00%
Sequence Length Distribution   14684      2      0    0.00%
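The % FAIL column can be recomputed from the PASS/WARN/FAIL counts. The short Python sketch below does this for the rows that have failures; the counts are copied from the table, while the function name and layout are only illustrative.

```python
# (PASS, WARN, FAIL) counts copied from the FastQC summary table
counts = {
    "Kmer Content": (13808, 863, 15),
    "Per base GC content": (14241, 385, 60),
    "Per base sequence content": (5152, 9443, 91),
    "Per base sequence quality": (13774, 130, 782),
    "Per sequence GC content": (214, 14030, 442),
}

def fail_percent(passed, warned, failed):
    """Share of checked files that failed the module, in percent."""
    return 100.0 * failed / (passed + warned + failed)

for name, c in counts.items():
    print(f"{name:28s} {sum(c):6d} files  {fail_percent(*c):.1f}% FAIL")
```

Each row sums to the same number of checked files, and the recomputed percentages agree with the table to one decimal place.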

N. Alexandrov

http://172.29.4.215:8080/qc3k/

Quality of reads with FastQC

FastQC is a software tool for checking the quality of NGS reads, developed at the Babraham Institute, Cambridge, UK

Basic statistics (seq length, # reads, %GC)

Per Base Sequence Quality

warning: if the lower quartile for any base is less than 10, or if the median for any base is less than 25

failure: if the lower quartile for any base is less than 5, or if the median for any base is less than 20

Check                           PASS   WARN   FAIL   % FAIL
Per base sequence quality      13774    130    782    5.30%
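The warning and failure rules for per-base sequence quality can be written as a small classifier. This is a sketch of the thresholds quoted on the slide, not FastQC's own code: the input layout (one list of Phred scores per read position) and the use of `statistics.quantiles` for the lower quartile are assumptions, and FastQC may compute quartiles slightly differently.

```python
from statistics import median, quantiles

def classify_per_base_quality(columns):
    """columns: one list of Phred scores per read position.
    Returns 'FAIL', 'WARN' or 'PASS' using the slide's thresholds."""
    status = "PASS"
    for scores in columns:
        lower_quartile = quantiles(scores, n=4)[0]
        med = median(scores)
        if lower_quartile < 5 or med < 20:
            return "FAIL"          # any failing position fails the file
        if lower_quartile < 10 or med < 25:
            status = "WARN"        # keep scanning in case a later position fails
    return status

# Made-up score columns for illustration only
good = [[38, 39, 40, 40, 37, 36, 40, 39]] * 3
soft = good + [[30, 30, 24, 24, 24, 22, 23, 30]]   # median below 25
bad = good + [[2, 3, 4, 30, 30, 30, 30, 30]]       # lower quartile below 5

print(classify_per_base_quality(good))
print(classify_per_base_quality(soft))
print(classify_per_base_quality(bad))
```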

N. Alexandrov

Classification based on 5 sets of 200,000 random SNPs

indica

aus/boro

basmati/sadri

intermediate

japonica

tropical japonica temperate japonica

ML Distance

Unweighted NJ tree

Consensus for 5 × 1000 bootstraps

Tai Shuai-shuai
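The sampling step behind "5 sets of 200,000 random SNPs" can be sketched as follows. This is only an illustration: it uses a simple mismatch fraction in place of the ML distance named on the slide, it stops at the distance matrices (tree building and bootstrapping are omitted), and the genotype encoding and toy data are assumptions.

```python
import random
from itertools import combinations

def mismatch_distance(a, b, sites):
    """Fraction of the chosen SNP sites at which two genotype vectors differ."""
    return sum(a[s] != b[s] for s in sites) / len(sites)

def distance_matrices(genotypes, n_sets=5, sites_per_set=200_000, seed=1):
    """Build one pairwise distance table per random subset of SNP sites."""
    rng = random.Random(seed)
    n_sites = len(next(iter(genotypes.values())))
    k = min(sites_per_set, n_sites)
    matrices = []
    for _ in range(n_sets):
        sites = rng.sample(range(n_sites), k)
        matrices.append({
            (x, y): mismatch_distance(genotypes[x], genotypes[y], sites)
            for x, y in combinations(sorted(genotypes), 2)
        })
    return matrices

# Toy genotypes (0 = reference allele, 1 = alternate allele), illustrative only
toy = {
    "A": [0, 0, 0, 0, 0, 0, 0, 0],
    "B": [0, 0, 0, 0, 0, 0, 0, 1],
    "C": [1, 1, 1, 1, 1, 1, 1, 1],
}
mats = distance_matrices(toy, n_sets=5, sites_per_set=4)
for m in mats:
    print(m)
```

Each of the five tables would then feed a separate tree, and agreement across the five trees indicates that the grouping does not depend on which SNPs were drawn.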

Classification based on 5 sets of 200,000 random SNPs

IsoZymeVG Distribution

Chrom     Genic     mRNA  5'-UTR      CDS   Intron  3'-UTR  Intergenic     Total      Syn  Non-Syn  NULL  Total (CDS)  Non-Syn/Syn
Chr1     634912   630396   25880   291817   286601   26098     1252989   1887901   118095   173722     0       291817        1.471
Chr2     528417   524172   20087   243967   238738   21380     1013475   1541892    97306   146661     0       243967        1.507
Chr3     490402   487611   19899   223196   224129   20387      962304   1452706    88477   134719     0       223196        1.523
Chr4     730310   727473   19018   388220   301071   19164     1176274   1906584   160101   228115     4       388220        1.425
Chr5     489370   485848   13623   257327   200307   14591      867799   1357169   103723   153604     0       257327        1.481
Chr6     560506   557361   16943   280933   242635   16850     1023473   1583979   114625   166308     0       280933        1.451
Chr7     548266   546569   16210   280994   231797   17568      973670   1521936   115332   165662     0       280994        1.436
Chr8     582068   580181   16396   302785   244991   16009      998651   1580719   124025   178759     1       302785        1.441
Chr9     436037   434440   10692   222916   190025   10807      763771   1199808    90299   132617     0       222916        1.469
Chr10    476710   473603   11735   258013   192214   11641      806940   1283650   109451   148561     1       258013        1.357
Chr11    684803   681891   16642   354874   291049   19326     1148735   1833538   140772   214101     1       354874        1.521
Chr12    607336   603783   16549   319401   251103   16730     1055044   1662380   129296   190105     0       319401        1.470
ChrUn     19706    19706       0    12615     7091       0       26669     46375     5819     6796     0        12615        1.168
ChrSy     11463    11463       0     7913     3550       0       15043     26506     3846     4067     0         7913        1.057
Total   6800306  6764497  203674  3444971  2905301  210551    12084837  18885143  1401167  2043797     7      3444971        1.459
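The last column of the table is the ratio of non-synonymous to synonymous coding SNPs, and the synonymous, non-synonymous and NULL counts add up to the CDS count. A small Python sketch, using a few rows copied from the table (function name is illustrative), reproduces both relationships:

```python
# chromosome: (synonymous, non-synonymous, NULL, CDS) copied from the table
cds_snps = {
    "Chr1": (118095, 173722, 0, 291817),
    "Chr4": (160101, 228115, 4, 388220),
    "Chr10": (109451, 148561, 1, 258013),
    "Total": (1401167, 2043797, 7, 3444971),
}

def nonsyn_syn_ratio(syn, nonsyn):
    """Non-synonymous to synonymous SNP ratio."""
    return nonsyn / syn

for chrom, (syn, nonsyn, null, cds) in cds_snps.items():
    ok = syn + nonsyn + null == cds
    print(f"{chrom:6s} ratio={nonsyn_syn_ratio(syn, nonsyn):.3f} sums_to_CDS={ok}")
```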

SNP characteristics on alignment to Nipponbare IRGSP-1.0 and MSU V7 gene models

18.9 M SNP sites after population calling by GATK with MAF >0.001

The MSU V7.0 rice gene annotation (55986 genes and 66338 mRNAs) was processed to

1) remove all but the primary mRNA transcript and

2) select the gene models with the highest support in cases of overlapping gene models.

SNP characteristics are reported for 55107 of the 55986 gene models; SNPs in pseudogenes or at sites where the reference base is N are not reported.
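To make the MAF > 0.001 cutoff concrete, the sketch below computes a minor allele frequency from allele counts. It assumes biallelic sites, diploid calls and no missing data across 3,000 samples; the actual filtering in the GATK-based pipeline is not described on the slide.

```python
def minor_allele_frequency(ref_count, alt_count):
    """Frequency of the rarer allele at a biallelic site."""
    total = ref_count + alt_count
    if total == 0:
        raise ValueError("no called alleles at this site")
    return min(ref_count, alt_count) / total

def passes_maf(ref_count, alt_count, threshold=0.001):
    """True when the site is kept under a strict MAF > threshold filter."""
    return minor_allele_frequency(ref_count, alt_count) > threshold

# Assumption: 3,000 diploid samples give 6,000 allele copies per site
n_alleles = 2 * 3000
for alt in (1, 6, 7):
    kept = passes_maf(n_alleles - alt, alt)
    print(f"alt copies={alt}: MAF={alt / n_alleles:.5f} kept={kept}")
```

Under these assumptions a variant must be carried on at least 7 of the 6,000 allele copies to pass the filter.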

Xin Liu & Tai Shuai-shuai

Beyond 3K genome primary (1°) analysis

• Forming IRIC, the International Rice Informatics Consortium, for data analyses, curation, databases, etc. (kickoff meeting at PAG-XXI, San Diego, Jan 16, 2013)

• Scientific advisory panel with experts from large-scale NGS projects, database developers, sequencing centers, etc.

• Many partners for curation, database development, algorithms, population genetics, etc.

iPlant AGI GigaDB

A*Star Bioinformatics USC KAUST

NIAS Cornell TGAC

MIPS Cirad IRD

CAS CAAS BGI

Academia Sinica MPI KZI

…

Phenomics on Sequenced Genetic Stocks: Up and Coming

• 3 band multispectral sensor (400-800 nm)

• Infrared Canopy Temperature

• Ultrasonic Canopy Height

• Air Temperature, RH, PAR

• GPS: lat/lon, sensor height, speed (±2.5 cm)

• HD video

• Biomass, Canopy height, Canopy Temperature

• Growth, lodging, Chla, abiotic/biotic stress...

Scans: 8 plots (5 x 6 hills) @ 100 plots per minute

[Figure: Paddy Scanner. Labels: plots 1 to 8, 3 m wide bunds, GPS, Auto Steer]

Klassen et al.

Fields of View (sensors 100 cm above canopy)

Sensor      Swath (cm)   View area (cm2)
Sonic       13 x 13      120
Spectral    25 x 54      1350
IRT 30°     47 x 54      2008

[Figure: scan pattern and fields of view, with a 25 x 54 footprint labeled]

Single Plot

Multispectral Reflectance

Infrared Canopy Temperature

Ultrasonic Canopy Ht

HD Video

Ambient: T, RH, PAR

Geotagged @ 2 cm resolution

IRRI Field Scanner

Phenotypic Correlation: NIR 760+ (Plant Ht, Biomass; April 8, 2013)

[Figure: four scatter plots with linear fits]

• 0408 Sonic Ht (cm) vs. 0408 NIR760: R² = 0.83

• Ave Ht (cm) vs. 0408 Sonic Ht (m): R² = 0.71

• FWt (g) vs. Ave 0408 NIR760: R² = 0.88

• Ave Ht (cm) vs. Ave 0408 NIR760: R² = 0.88
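For readers less familiar with the R² values quoted for these plots, the sketch below shows how the coefficient of determination of a simple linear fit is computed. The sensor readings and heights in it are made-up numbers for illustration and are not taken from the slides.

```python
def r_squared(x, y):
    """R² of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot

# Hypothetical NIR760 readings and plant heights (cm), illustration only
nir = [0.12, 0.18, 0.22, 0.27, 0.33]
height = [40, 75, 90, 130, 150]
print(f"R² = {r_squared(nir, height):.2f}")
```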

International Rice Informatics Consortium

Initiating IRIC

Invitational side meeting at PAG-XXI on January 16, 2013

• Invitations were sent to contacts at >50 organizations.

• Second meeting held at PAG-XXII on Tuesday, January 14, 2014.

• Consortium agreement under revision.

• Next meetings to be held at PAG-Asia, May 19-21, 2014.

• IRIC workshop on the program.

Contact Organizations

GRiSP Centers (4): IRRI; CIAT; IRD; Cirad

Universities (14): Arizona Genomics Institute; Cornell University; Federal University of Pelotas, Brazil; Huazhong AU, China; Kasetsart University, Thailand; Kyung Hee University, Korea; Louisiana State University; Michigan State University; Oregon State University; Perpignan University, France; UC-Riverside; University of Delaware; University of Queensland, Australia; Wageningen UR, Netherlands

Institutions (16): Academia Sinica, Taiwan; CAAS, Beijing & Shenzhen; CAS, Beijing; Cold Spring Harbor Laboratory; EMBL-EBI, U.K.; EMBRAPA, Brazil; ICAR, India; INRA, France; Kunming Zoo Institute, China; MIPS, Germany; MPI-Tuebingen, Germany; NCGR-CAS, SIBS, Shanghai; NIAS, Japan; The Genome Analysis Centre, U.K.; USDA-Research; RDA, Korea; NCGR, Santa Fe, NM

Breeding Companies (7): Bayer CropSciences; Biogemma; Mahyco; Mars Food Global; Pioneer; RiceTec; Syngenta

Foundations (5): Gates Foundation; Sloan Foundation; NSF, U.S.A.; USDA-CREES; USAID, U.S.A.

Others (5): GigaScience Journal; GigaDB.org; iPlant Collaborative, U.S.A.; Gramene; Plant & Trait Ontology

Tech Companies (3): BGI-Shenzhen; Affymetrix; Pacific Biosciences

Community Input

Development of IRIC is timely and needed.

• IAIC (AIP) and others as examples for path forward

• Build on existing efforts, tools, etc.

• Germplasm and genetic stocks as the foundation and entry point

• Develop a Rice Information Portal to cross-integrate and provide access to different databases/datastores

– Need better name than “RIP”

• Community agreement on standards for annotation of datatypes

• Curate data objects using ontologies, IDs, microcitations that are semantic web/web service enabled

• Need approaches and tools for integrating BigData

• Visualization and analysis tools for access and application (especially for breeding)

Scope of IRIC

Managed information, communication, collaboration portal for rice:

• Germplasm as focal entry point -- diversity

• Genome sequences, genome maps, other derived data products in standardized format

• Gene and gene function information, QTL/marker libraries, gene expression data, etc.

• Well-structured phenotypic data with environmental information

• Specialized bioinformatics tools and applications for rice

• DNA and/or seeds that can be ordered for research

• Repository of documents (research papers, manuals, protocols, standards, training materials, etc.)

• Integration – links with other portals/databases

• Regular meetings, groups, joint research projects

IRIC portal content

• Sequences and analysis of 3,000 genomes*

o SNPs

o assemblies

o phylogenetic trees

o genes associated with traits

o regulatory motifs

o most significant variations

• Other available rice genome sequences (~2,000 rice entries in SRA)

• Sequences of rice microorganisms

• Sequences of other grasses (e.g. for C4 project)

• Genotyping results from GBS, 44K and 700K Affymetrix chips

• Phenotypic data

• Gene expression data

• Gene functions and networks

• Analysis tools

• Linked to rice seed database(s)

• Linked to other IRRI databases and portals

*Total amount of genotyping data: ~3K*20M = 60B
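The footnote's estimate is a simple product, spelled out below. The storage figure is an added illustration that assumes a compact 2-bit encoding per genotype call; the slides do not say how the portal stores genotypes.

```python
n_genomes = 3_000            # "~3K" genomes
n_snp_sites = 20_000_000     # "~20M" SNP sites
n_calls = n_genomes * n_snp_sites
print(f"{n_calls / 1e9:.0f} billion genotype calls")

BITS_PER_CALL = 2            # assumption: 2 bits per call (not from the slides)
size_gb = n_calls * BITS_PER_CALL / 8 / 1e9
print(f"about {size_gb:.0f} GB at {BITS_PER_CALL} bits per call")
```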

IRIC portal development team @ IRRI

Rolando Jay Santos

Victor Jun Ulat

F. Nikki Borja

Venice M. Juanillas

Jeffrey Detras

Roven R. Fuentes

Ramil P. Mauleon

Kenneth L. McNally

Nickolai A. Alexandrov

Dmytro Chebotarov

Millicent Sanciangco

Visionary input

Achim Dobermann

Technological advice

Marco van den Berg

Management team

Hei Leung

Ruaraidh Sackville Hamilton

Kenneth McNally

Ramil Mauleon

Nickolai Alexandrov

Conclusions

• Potentially over 30 million SNPs now discovered in rice (3K and other projects)

• Advances in sequencing and phenotyping have immensely accelerated the amount of data being generated.

• Integration of the data along with passport data, pedigrees and models will add significant value and lead to enhanced utilization of conserved germplasm.

• Establishing the International Rice Informatics Consortium to integrate data across sources and facilitate collaborations

• These efforts should lead to the ability to predict the utility of an accession for breeding based on its sequence.

Questions?
