The 3000 Rice Genome Project: Update and Plans
IRRI GRiSP CAAS
Our Project Team & Donors
BGI: Shuai-Shuai Tai, Xin Liu, Jun Li, Guo-Jie Zhang, Bo Wang, Xun Xu, Gengyun Zhang
IRRI: Ma. Elizabeth B. Naredo, Sheila M. Q. Mercado, Myla C. Rellosa, Renato A. Reaño, Grace Lee S. Capilit, Flora C. de Guzman, N. R. Sackville Hamilton, Jauhar Ali, Ramil P. Mauleon, Nickolai A. Alexandrov, Hei Leung, Kenneth L. McNally
CAAS: Wen-Sheng Wang, Yong-Ming Gao, Xiu-Qing Zhao, Jian-Long Xu, Tian-Qing Zheng, Fan Zhang, Yong-Li Zhou, Bin-Ying Fu, Zhikang Li
GRiSP & CAAS
MOST:
Overview
• The Global Rice Science Partnership
• Building a global rice diversity platform
• IRG and CNCGB rice collections and rice
diversity
• Germplasm selection
• 3K rice genome sequences
• Preparing for phenomics
• Adapting to deep data:
– the International Rice Informatics Consortium
GRiSP activities for SNP genotyping and
sequencing
Product lines for theme 1: Harnessing genetic diversity to chart new productivity, quality, and health horizons
GWAS
1.1. Ex situ conservation and dissemination of rice germplasm
1.2. Characterizing genetic diversity and creating novel gene pools
1.2.1 SNP Consortium for high density genotypes
1.2.2 Global phenotyping network for key traits
1.2.3 Whole genome sequencing of genebank stocks
1.2.4 Specialized populations for genetic studies
1.3. Genes and allelic diversity conferring stress tolerance and enhanced nutrition
1.4. Converting rice from C3 to C4 architecture and metabolism
IRRI
CRP 3.3: Global Rice Science Partnership (GRiSP)
[Diagram: Public Genetic Diversity Research Platform]
• Foundation: rice diversity (conserved germplasm, breeding lines, specialized genetic stocks)
• Diagram labels: conservation, dissemination, genotype-phenotype association, use; computation, bioinformatics, databases
• Current problems: drought tolerance, durable disease-pest resistance, problem soils
• Future challenges: C4 rice, grain quality
Rice germplasm – IRGC and CNCGB
IRGC – the International Rice Genebank Collection
World’s largest collection of rice germplasm, held in trust for the world community and source countries
• Over 122,000 registered and incoming accessions from 117 source countries
• Two cultivated species: Oryza sativa and Oryza glaberrima
• 22 wild species
• Relatively few accessions have donated alleles to current, high-yielding varieties
• http://www.irri.org/GRC
China National Crop Gene Bank (http://icgr.caas.net.cn/cgrisintroduction.html)
CROP NAME          Accessions
Rice                    63297
Wild rice                6944
C-rice                   1605
Genotype of rice          120
Total                   71966
2014-1-10
CAAS working collection for Green
Super Rice project (GSR)
~500 inbred and hybrid lines
Global distribution of Isozyme variety groups
Germplasm selection
Semi-stratified design (IRGC) of 12K core
• Genome type (Species)
• Subspecies/isozyme group (eco-cultural type)
• Source country (geographical location)
• Usage/deployment (trait characterization)
• Random selection (covering name-space)
• Nominations
[Figure: No. of accessions (axis 0 to 3,000) for each source country with >100 entries]
3000 genotype set
• 2466 from IRGC 12K core
– Random selection from 12K core
– Nominated entries from Cirad (Orytage panel) and IRRI (founders of Indica-MAGIC)
• 534 from CNCGB
– Mini-core based on diversity assessment of morpho-agronomic and molecular data
– Entries from GSR working collection
• All entries underwent purification by single seed descent (SSD) prior to sequencing.
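The selection steps above (nominated entries kept, the rest drawn at random across strata) can be sketched roughly as follows. This is a minimal illustration, not the procedure actually used for the 3000 genotype set: the function name, the round-robin allocation across strata, and the toy accession IDs are all assumptions made for the example.

```python
import random
from collections import defaultdict

def semi_stratified_sample(accessions, n_total, nominated=(), seed=0):
    """Hypothetical sketch: keep nominated IDs, then fill up to n_total by
    drawing one random accession per stratum in turn (round-robin)."""
    rng = random.Random(seed)
    chosen = list(dict.fromkeys(nominated))  # nominated first, duplicates dropped
    taken = set(chosen)

    # Group the remaining accession IDs by stratum (e.g. isozyme group or country).
    strata = defaultdict(list)
    for acc_id, stratum in accessions:
        if acc_id not in taken:
            strata[stratum].append(acc_id)
    for ids in strata.values():
        rng.shuffle(ids)

    # Cycle over strata so every stratum is represented before any is sampled twice.
    queues = [strata[key] for key in sorted(strata)]
    while len(chosen) < n_total and any(queues):
        for queue in queues:
            if queue and len(chosen) < n_total:
                chosen.append(queue.pop())
    return chosen

# Toy data: 30 made-up accessions in three made-up strata A, B, C.
toy = [(f"{s}{i}", s) for s in "ABC" for i in range(10)]
picked = semi_stratified_sample(toy, 10, nominated=["A0"])
```

Round-robin allocation gives each stratum an equal share; a proportional allocation would be an equally plausible reading of "semi-stratified" and would only change the loop that fills the quota.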
Global distribution of 3000 genotypes from IRGC and CAAS
[Figure: regions (countries) as % fraction of total; 14.5% from India]
3000 genome sequences
IRG Traditional Germplasm 100,000 cultivated accessions
3 to 10% sampling,
then NGS @ ≥10X depth
Apply low-cost sequencing by next-generation technology
• Final sequencing data on 3,000 samples now obtained
• 17 TB of filtered/trimmed reads produced at an average depth of 14X per genome
• Depth ranges from ~5X to >50X
• Variant calling relative to Nipponbare completed
• Other analyses underway
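As a reminder of what the depth figures mean, sequencing depth is the number of sequenced bases divided by the genome length. The sketch below assumes a reference length of about 373 Mb for Nipponbare; that value and the function names are assumptions for illustration and do not come from the slides.

```python
# Assumption: approximate length of the Nipponbare reference, in base pairs.
GENOME_SIZE_BP = 373_000_000

def mean_depth(total_bases, genome_size_bp=GENOME_SIZE_BP):
    """Average depth (X) = sequenced bases / genome length."""
    return total_bases / genome_size_bp

def bases_needed(depth, genome_size_bp=GENOME_SIZE_BP):
    """Bases of sequence required to reach a target average depth."""
    return depth * genome_size_bp

bases_for_14x = bases_needed(14)          # bases per genome at the reported average
depth_check = mean_depth(bases_for_14x)   # round-trips back to 14
```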
GRiSP Product 1.2.3. Sequencing the Genebank – 3000 Genome project
• Genomes uploaded to GigaDB.org as trimmed/filtered FASTQ (raw) – DOI assignments for project and accession file sets
• Galaxy documentation for workflows and analyses
• Data Note and Commentary papers describing project,
goals, protocols submitted to GigaScience Journal
January 2014
– Accession list, seed and data availability and pathway for data
analysis
• Call for global consortium for value-added analyses in
framework of Human Thousand Genomes, Encode, …
Data quality by FastQC (1000 genome sample)
Module                         PASS   WARN  FAIL  % FAIL
Basic Statistics              14686      0     0   0.00%
Kmer Content                  13808    863    15   0.10%
Overrepresented sequences     14672     14     0   0.00%
Per base GC content           14241    385    60   0.40%
Per base N content            14578    108     0   0.00%
Per base sequence content      5152   9443    91   0.60%
Per base sequence quality     13774    130   782   5.30%
Per sequence GC content         214  14030   442   3.00%
Per sequence quality scores   14686      0     0   0.00%
Sequence Duplication Levels   13953    733     0   0.00%
Sequence Length Distribution  14684      2     0   0.00%
N. Alexandrov
Quality of reads with FastQC
FastQC is a software tool for checking the quality of NGS reads, developed at the Babraham Institute, Cambridge, UK
Basic statistics (seq length, # reads, %GC)
Per Base Sequence Quality
warning: if the lower quartile for any base is less than 10, or if the median for any base is less than 25.
failure: if the lower quartile for any base is less than 5, or if the median for any base is less than 20.
Per base sequence quality: PASS 13774, WARN 130, FAIL 782, % FAIL 5.30%
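The warning and failure thresholds quoted above can be written out as a small check. This is a sketch of the rule as stated on the slide, not FastQC's own code; the function names and the example quartile and median values are made up for illustration.

```python
def per_base_quality_status(lower_quartiles, medians):
    """Apply the per-base quality thresholds quoted on the slide.
    Inputs are per-position lower quartiles and medians of base quality."""
    if any(q < 5 for q in lower_quartiles) or any(m < 20 for m in medians):
        return "FAIL"
    if any(q < 10 for q in lower_quartiles) or any(m < 25 for m in medians):
        return "WARN"
    return "PASS"

def percent_fail(n_pass, n_warn, n_fail):
    """Share of files that failed, in percent."""
    return 100.0 * n_fail / (n_pass + n_warn + n_fail)

# Hypothetical quality profiles for three read positions.
status_good = per_base_quality_status([30, 28, 12], [36, 34, 27])
status_warn = per_base_quality_status([30, 28, 8], [36, 34, 27])
status_fail = per_base_quality_status([30, 28, 12], [36, 34, 18])

# Counts reported for "Per base sequence quality" on the slide.
pct = percent_fail(13774, 130, 782)
```

With the reported counts, `pct` comes out at roughly 5.3, matching the % FAIL column.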
N. Alexandrov
Classification based on 5 sets of 200,000 random SNPs
[Figure: tree of the sequenced genotypes]
Groups: indica, aus/boro, basmati/sadri, intermediate, japonica (tropical japonica, temperate japonica)
Method: ML distance, unweighted NJ tree, consensus for 5 × 1000 bootstraps
Tai Shuai-shuai
Classification based on 5 sets of 200,000 random SNPs
IsoZymeVG Distribution
Chrom     Genic     mRNA  5'-UTR      CDS   Intron  3'-UTR  Intergenic     Total      Syn  Non-Syn  NULL  CDS Total  Non-Syn/Syn
Chr1     634912   630396   25880   291817   286601   26098     1252989   1887901   118095   173722     0     291817        1.471
Chr2     528417   524172   20087   243967   238738   21380     1013475   1541892    97306   146661     0     243967        1.507
Chr3     490402   487611   19899   223196   224129   20387      962304   1452706    88477   134719     0     223196        1.523
Chr4     730310   727473   19018   388220   301071   19164     1176274   1906584   160101   228115     4     388220        1.425
Chr5     489370   485848   13623   257327   200307   14591      867799   1357169   103723   153604     0     257327        1.481
Chr6     560506   557361   16943   280933   242635   16850     1023473   1583979   114625   166308     0     280933        1.451
Chr7     548266   546569   16210   280994   231797   17568      973670   1521936   115332   165662     0     280994        1.436
Chr8     582068   580181   16396   302785   244991   16009      998651   1580719   124025   178759     1     302785        1.441
Chr9     436037   434440   10692   222916   190025   10807      763771   1199808    90299   132617     0     222916        1.469
Chr10    476710   473603   11735   258013   192214   11641      806940   1283650   109451   148561     1     258013        1.357
Chr11    684803   681891   16642   354874   291049   19326     1148735   1833538   140772   214101     1     354874        1.521
Chr12    607336   603783   16549   319401   251103   16730     1055044   1662380   129296   190105     0     319401        1.470
ChrUn     19706    19706       0    12615     7091       0       26669     46375     5819     6796     0      12615        1.168
ChrSy     11463    11463       0     7913     3550       0       15043     26506     3846     4067     0       7913        1.057
Total   6800306  6764497  203674  3444971  2905301  210551    12084837  18885143  1401167  2043797     7    3444971        1.459
SNP characteristics on alignment to Nipponbare IRGSP-1.0
and MSU V7 gene models
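The last column of the table is the Non-Syn count divided by the Syn count. A short sketch reproducing it for a few rows, with the counts copied from the table; the dictionary layout and function name are only illustrative.

```python
# (Syn, Non-Syn) counts copied from the table for a few rows.
counts = {
    "Chr1": (118095, 173722),
    "Chr10": (109451, 148561),
    "ChrSy": (3846, 4067),
    "Total": (1401167, 2043797),
}

def nonsyn_syn_ratio(syn, nonsyn):
    """Ratio of non-synonymous to synonymous SNP counts."""
    return nonsyn / syn

ratios = {name: round(nonsyn_syn_ratio(s, n), 3) for name, (s, n) in counts.items()}
```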
18.9 M SNP sites after population calling by GATK with MAF >0.001
MSU V7.0 rice gene annotation for 55986 genes and 66338 mRNA processed to
1) remove all but the primary mRNA transcript and
2) select the gene models with the highest support in cases of overlapping gene models.
SNP characteristics are for 55107 out of 55986 gene models, and
those in pseudogenes or where the reference base is N are not reported.
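To make the MAF > 0.001 cutoff concrete, the sketch below computes a minor allele frequency for a biallelic site from allele counts. It is a simplified illustration, not GATK's procedure; the function names and the assumption of 3,000 fully called diploid samples (6,000 alleles) are hypothetical.

```python
def minor_allele_frequency(ref_count, alt_count):
    """MAF for a biallelic site: frequency of the rarer of the two alleles."""
    total = ref_count + alt_count
    if total == 0:
        return 0.0
    return min(ref_count, alt_count) / total

def passes_maf_filter(ref_count, alt_count, threshold=0.001):
    """Keep a site only if its MAF is strictly above the threshold."""
    return minor_allele_frequency(ref_count, alt_count) > threshold

# Assumption: 3,000 diploid samples with no missing calls, so 6,000 alleles per site.
N_ALLELES = 6000
keep_six = passes_maf_filter(N_ALLELES - 6, 6)    # MAF = 0.001, not above the cutoff
keep_seven = passes_maf_filter(N_ALLELES - 7, 7)  # MAF slightly above 0.001
```

Under these assumptions a variant would need to appear in at least 7 of 6,000 alleles to be retained.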
Xin Liu & Tai Shuai-shuai
Beyond 3K genome 1° analysis
• Forming IRIC – International Rice Informatics Consortium for data
analyses, curation, db, etc. (kickoff meeting at PAG-XXI, San Diego,
Jan 16, 2013)
• Scientific advisory panel with experts from large scale NGS projects,
database developers, sequencing centers, etc.
• Many partners for curation, db development, algorithms, population
genetics, etc.
iPlant AGI GigaDB
A*Star Bioinformatics USC KAUST
NIAS Cornell TGAC
MIPS Cirad IRD
CAS CAAS BGI
Academia Sinica MPI KZI
…
Phenomics on Sequenced Genetic Stocks: Up and coming
• 3 band multispectral sensor (400-800 nm)
• Infrared Canopy Temperature
• Ultrasonic Canopy Height
• Air Temperature, RH, PAR
• GPS -lat/lon, Sensor Height, Speed, (±2.5 cm)
• HD video
• Biomass, Canopy height, Canopy Temperature
• Growth, lodging, Chla, abiotic/biotic stress...
Paddy Scanner (Klassen et al.)
• Scans: 8 plots (5 x 6 hills) @ 100 plots per minute
• 3 m wide bunds
• GPS, Auto Steer
[Figure: scan pattern across plots 1 to 8]
Fields of View (single plot, sensors 100 cm above canopy)
Sensor     Swath (cm)   View Area (cm2)
Sonic      13 x 13       120
Spectral   25 x 54      1350
IRT 30°    47 x 54      2008
IRRI Field Scanner
• Multispectral Reflectance
• Infrared Canopy Temperature
• Ultrasonic Canopy Ht
• HD Video
• Ambient: T, RH, PAR
• Geotagged @ 2 cm resolution
• Plant Ht, Biomass
Phenotypic Correlation: NIR 760+ (April 8, 2013)
[Figure: four scatter plots, y-axis vs. x-axis]
• 0408 Sonic Ht (cm) vs. 0408 NIR760: R² = 0.83
• Ave Ht (cm) vs. 0408 Sonic Ht (m): R² = 0.71
• FWt (g) vs. Ave 0408 NIR760: R² = 0.88
• Ave Ht (cm) vs. Ave 0408 NIR760: R² = 0.88
International Rice Informatics Consortium
Initiating IRIC
Invitational side meeting at PAG-XXI
on January 16, 2013
• Invitations were sent to contacts at >50 organizations.
• Second meeting held at PAG 22 Tuesday, January 14, 2014
• Consortium agreement under revision
• Next meetings to be held at PAG-Asia May 19-21, 2014.
• IRIC workshop on program.
Contact Organizations
GRiSP Centers (4)
IRRI; CIAT; IRD; Cirad
Universities (14)
Arizona Genomics Institute; Cornell University; Federal University of Pelotas, Brazil; Huazhong AU, China; Kasetsart University, Thailand; Kyung Hee University, Korea; Louisiana State University; Michigan State University; Oregon State University; Perpignan University, France; UC-Riverside; University of Delaware; University of Queensland, Australia; Wageningen UR, Netherlands
Institutions (16)
Academia Sinica, Taiwan; CAAS, Beijing & Shenzhen; CAS, Beijing; Cold Spring Harbor Laboratory; EMBL-EBI, U.K.; EMBRAPA, Brazil; ICAR, India; INRA, France; Kunming Zoo Institute, China; MIPS, Germany; MPI-Tuebingen, Germany; NCGR-CAS, SIBS, Shanghai; NIAS, Japan; The Genome Analysis Centre, U.K.; USDA-Research; RDA, Korea; NCGR, Santa Fe, NM
Breeding Companies (7)
Bayer CropSciences; Biogemma; Mahyco; Mars Food Global; Pioneer; RiceTec; Syngenta
Foundations (5)
Gates Foundation; Sloan Foundation; NSF, U.S.A.; USDA-CREES; USAID, U.S.A.
Others (5)
GigaScience Journal; GigaDB.org; iPlant Collaborative, U.S.A.; Gramene; Plant & Trait Ontology
Tech Companies (3)
BGI-Shenzhen; Affymetrix; Pacific Biosciences
Community Input
Development of IRIC is timely and needed.
• IAIC (AIP) and others as examples for path forward
• Build on existing efforts, tools, etc
• Germplasm and genetic stocks as the foundation and entry point
• Develop a Rice Information Portal to cross-integrate and provide
access to different databases/datastores
– Need better name than “RIP”
• Community agreement on standards for annotation of datatypes
• Curate data objects using ontologies, IDs, microcitations that are
semantic web/web service enabled
• Need approaches and tools for integrating BigData
• Visualization and analysis tools for access and application
(especially for breeding)
Scope of IRIC
Managed information, communication, collaboration
portal for rice:
• Germplasm as focal entry point -- diversity
• Genome sequences, genome maps, other derived data products in
standardized format
• Gene and gene function information, QTL/marker libraries, gene
expression data, etc.
• Well-structured phenotypic data with environmental information
• Specialized bioinformatics tools and applications for rice
• DNA and/or seeds that can be ordered for research
• Repository of documents (research papers, manuals, protocols,
standards, training materials, etc.)
• Integration – links with other portals/databases
• Regular meetings, groups, joint research projects
IRIC portal content
• Sequences and analysis of 3,000 genomes*
– SNPs
– assemblies
– phylogenetic trees
– genes associated with traits
– regulatory motifs
– most significant variations
• Other available rice genome sequences (~2,000 rice entries in SRA)
• Sequences of rice microorganisms
• Sequences of other grasses (e.g. for C4 project)
• Genotyping results from GBS, 44K and 700K affy chips
• Phenotypic data
• Gene expression data
• Gene functions and networks
• Analysis tools
• Linked to rice seed database(s)
• Linked to other IRRI databases and portals
*Total amount of genotyping data: ~3K genomes × ~20M SNP sites = ~60B genotype calls
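The footnote's arithmetic, plus a rough idea of what it implies for storage, can be sketched as below. The bits-per-call encodings are hypothetical assumptions chosen for illustration; the slides do not say how the portal stores genotypes.

```python
N_GENOMES = 3_000          # ~3K sequenced genomes
N_SNP_SITES = 20_000_000   # ~20M SNP sites

total_calls = N_GENOMES * N_SNP_SITES  # one genotype call per genome per site

def storage_bytes(n_calls, bits_per_call):
    """Raw size of a genotype matrix for a given (hypothetical) encoding."""
    return n_calls * bits_per_call // 8

size_2bit = storage_bytes(total_calls, 2)   # assumption: packed 2-bit genotypes
size_8bit = storage_bytes(total_calls, 8)   # assumption: one byte per genotype
```

Even at one byte per call the uncompressed matrix is on the order of tens of gigabytes, which is why the encoding choice matters for a portal serving the full set.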
IRIC portal development team @ IRRI
Rolando Jay Santos
Victor Jun Ulat
F. Nikki Borja
Venice M. Juanillas
Jeffrey Detras
Roven R. Fuentes
Ramil P. Mauleon
Kenneth L. McNally
Nickolai A. Alexandrov
Dmytro Chebotarov
Millicent Sanciangco
Visionary input
Achim Dobermann
Technological advice
Marco van den Berg
Management team
Hei Leung
Ruaraidh Sackville Hamilton
Kenneth McNally
Ramil Mauleon
Nickolai Alexandrov
Conclusions
• Potentially over 30 million SNPs now discovered in rice
(3K and other projects)
• Advances in sequencing and phenotyping have immensely accelerated data generation.
• Integration of the data along with passport data,
pedigrees and models will add significant value and
lead to enhanced utilization of conserved germplasm.
• Establishing the International Rice Informatics
Consortium to integrate data across sources and
facilitate collaborations
• These efforts should lead to the ability to predict the
utility of an accession for breeding based on its
sequence.
Questions?