Supporting bioinformatics:
About BioPlatforms Australia
Established in 2007
We are funded through the Australian Government’s NCRIS program
We exist to support researchers to do research by providing access to life
science research infrastructure
Our infrastructure falls into 4 platforms: Genomics, Proteomics,
Metabolomics, Bioinformatics
We have a core team of 5 staff
We support ~17 facilities around the country
We provide capital support for instrumentation and operational support for
expertise
BioPlatforms strategic plan
Bioplatforms Australia is a national asset providing value for the Australian Life Science sector through the provision of networked capability in genomics, proteomics, metabolomics and bioinformatics.
Vision: Bioplatforms Australia will underpin initiatives to address significant national biomedical, agri-food and environmental research challenges and deliver long term social and economic returns to the nation through the provision of integrated biomolecular research infrastructure and the development of strategic partnerships.
Bioplatforms National Footprint
University of Queensland
University of NSW
ANU JCSMR
University of Melbourne
BioScience, Bio21
University of Queensland
The Australian Wine Research Institute
University of South Australia
Monash University
APAF MQ
University of Queensland
Genomics Metabolomics Proteomics Bioinformatics
University of Western Australia
Harry Perkins
AGRF
AGRF
AGRF
Ramaciotti Centre
KCCG
AGRF
University of Melbourne
Melbourne Bioinformatics
NCI
Bioplatforms Framework Strategy
We create open-data initiatives
through collaborative research
projects, which build critical
‘omic datasets that support
scientific challenges of national
importance.
Size, scale, complexity:
the life science challenge
BioPlatforms Australia provides access to sequencing capability to better understand anything that is alive
Human health: cancer, disease, genetics
Agriculture: crop and food sustainability, disease resistance
Biodiversity: environmental conservation, endangered species, Australian microbiome
We estimate ~30,000 health/biosciences researchers, representing 30% of research effort in Australia
There is more data being produced than can be analysed
Data deluge
50 in 5: Australian researchers plan to sequence 50 of Australia’s most endangered animals over the next five years
Earth BioGenome Project: a global effort to sequence the genetic code, or genomes of all 1.5 million known animal, plant, protozoan and fungal species on Earth
100,000 genomes: The project was established to sequence 100,000 genomes from around 85,000 NHS patients affected by a rare disease, or cancer
The Cancer Genome Atlas: molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types
Estimated # Australian biology researchers in 2018: 30,000 (arrows show the estimate in 5 years)
20,000 (→ 15,000) biology-focussed bioscience researchers: occasional users of bioinformatics web services, e.g. BLAST, Ensembl
7,000 (→ 12,000) data-intensive bioscience researchers: ‘omics data analysis is a critical contributor to the research outcomes, e.g. RNAseq analysis to identify upregulated genes in a broader research program
2,000 (→ 3,000) bioinformatics-intensive bioscience researchers: research is fully dependent on advanced use of bioinformatics, e.g. genomic cancer research, population genomics/agricultural genomics programs
1,000 (→ 1,500) bioinformaticians: research into/application of techniques & tool development, e.g. research generating a new tool or statistical method; core facilities applying complex analyses
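The shift between cohorts can be checked with a little arithmetic. The sketch below uses only the figures on this slide; the short cohort keys and the "data-dependent share" measure (everyone outside the biology-focussed cohort) are labels chosen here for illustration, not terms from the consultation.

```python
# Figures from the slide: (estimate for 2018, estimate in 5 years).
# The short cohort keys are illustrative labels, not official terms.
COHORTS = {
    "biology-focussed": (20_000, 15_000),
    "data-intensive": (7_000, 12_000),
    "bioinformatics-intensive": (2_000, 3_000),
    "bioinformaticians": (1_000, 1_500),
}


def total(when):
    """Sum all cohorts; when=0 is 2018, when=1 is five years later."""
    return sum(counts[when] for counts in COHORTS.values())


def data_dependent_share(when):
    """Fraction of researchers outside the biology-focussed cohort."""
    dependent = total(when) - COHORTS["biology-focussed"][when]
    return dependent / total(when)


if __name__ == "__main__":
    for when, label in enumerate(("2018", "in 5 years")):
        print(f"{label}: total {total(when):,}, "
              f"data-dependent share {data_dependent_share(when):.1%}")
```

On these estimates the total grows only from 30,000 to 31,500, while the share of researchers who depend on data-intensive analysis rises from about a third to just over half.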
Important Transitions
New Methods and a Growing User base
2017 National Consultation
Strong support for and participation in the consultation process received from the community, the reference group and ISAG.
● The initial market test meeting in Brisbane on 9/Aug had over 30 senior attendees who supported the need for the activity.
● The reference group meeting had strong attendance and robust discussion and shaped the consultation topics.
● Six regional meetings and a national community meeting (attended by about 200) were held as follows:
September 2017 October 2017 November 2017 December 2017
Testing of Categories, Completing Mini Audit
Clarifying scope, Consult ISAG
Develop the concept with Reference Group, ISAG, EMBL-ABR team and key researchers
Reference Group (in person), ISAG comment
Review with Dept (ELIXIR director attended)
Phase 2 Report writing
Socialisation through consultations in all states and at national conferences
● Perth 10-11/Oct
● Canberra 30/Oct
● Sydney 3/Nov
● Melbourne 8/Nov, 17/Nov
● ABACBS 14/Nov
● Adelaide 20/Nov
Key findings from the national consultation
Estimated # Australian biology researchers: 30,000, growing slowly in 5 years
biology-focussed bioscience researchers: from 20,000 to 15,000
data-intensive bioscience researchers: from 7,000 to 12,000
bioinformatics-intensive bioscience researchers: from 2,000 to 3,000
bioinformaticians: from 1,000 to 1,500
A compute and data intensive infrastructure for the advancement of bioinformatics research
A means by which datasets can be integrated by collaborators including across institutions
A service to which a researcher brings
their research goals, tools, pipelines
and data, that seamlessly integrates
with all the resources they might need
A means by which omics data can be included in data integration occurring in research translation and in research applications
Broad Challenges
● The continuous reskilling of the
bioscience research workforce
● The simple retention of data at the
volumes expected.
● The curation of data at the scale
and complexity envisaged
● The onshore use of large offshore
sources of primary reference data
● The availability for research use of
large non-research genomic
collections
● The complexity of integration of
genomic data with many other
data types
An Australian BioCommons
Motivation:
● Bioinformatics and new biomolecular technologies drive breakthrough research
● Data is exploding, compute intensity is rising fast and the techniques are undergoing rapid enhancement
● New forms of research infrastructure supply and consumption are being tried and adopted
● International investment is driving many of the technology and policy aspects of likely solutions
Meeting the challenge is beyond the means of any single actor in the Australian research sector.
Therefore:
We should cooperate in developing an Australian BioCommons
so we can deliver the maximum benefit possible
to Australian researchers and research outcomes
There is a global wave of investment
Study trip: broad conclusions
Global scale compute and data infrastructures are increasingly underpinning global scale research in life sciences
Cloud First is pervasive across EU and US
In the US (but not EU) true partnerships arising with cloud providers
The concept of Data Commons is very strong in the US, focused on data and method sharing
ELIXIR is doing a very good job of coordinating data infrastructure across Europe
ELIXIR/EBI compute strategy firmly cloud focussed - federated compute + data across EU
Federated approaches to data infrastructure are developing, and accessible
National Bioinformatics Infrastructures can deliver benefits across industry engagement
Galaxy is extremely well regarded as a community analysis platform in both Europe and the USA
Conclusions (December 2017)
Three key capabilities are missing
Capability I - A national omics analysis service providing:
A means to use standardised bioinformatics techniques through high level interfaces
Integrated with a regionally accessible support and training network
Providing direct access to underlying infrastructure for new technique developers
Capability II - Data Integration and Interrogation Facilities
One or more facilities at the infrastructure level - for data intensive computing on bioscience data and tools
Coupled with a critical mass of data science expertise versed in omics
Assigned by merit to support large team based research for extended periods (multi year)
Capability III - An Australian Bioscience Data Consortium
Policy development around rapidly emerging data asset issues;
The changing requirements on undergraduate and postgraduate training; and
Engagement with large scale omic resources onshore and offshore
Note: Today, we would add IV - A national solution for genome retention and access
We have an opportunity
eResearch $911M
Complex Biology $216M
In addition to existing operating
The Australian BioCommons
The BioCommons will comprise three components:
1. BioCommons Hub - providing governance, leadership, planning and management
2. BioCommons Services - providing the services and functionality
3. BioCommons Cloud - providing the necessary compute and data infrastructure
To be progressed through:
● A BPA led investment into the BioCommons Leadership and Services components
● The construction of a BioCommons Cloud with the assistance of the AAF, AARNet, ARDC (and its
nodes), NCI and Pawsey and inclusive of the best use of AWS, Google and Azure
● The expansion of the BioCommons with the participation of the above and Universities, Medical
Research Institutes, Publicly Funded Research Agencies and other government agencies, applying
the BioCommons in the areas of Agriculture, BioDiversity and Human Health.
Australian BioCommons Principles
A national focus on capabilities and communities
Partner internationally: participate in and contribute to larger critical mass efforts where possible; reuse
and improve rather than build anew
Build a software and expertise capability that will reduce duplication of infrastructure management in
Australia and allow efforts to be re-focussed on methods development and dissemination
Promote the development of, and build on, high throughput cloud infrastructure that is interoperable with
international (initially US and European) equivalents, using established, well supported software platforms
Streamline the exchange of tools, workflows, data and training and expertise both nationally and
internationally
BioCommons Hub
Leadership
Andrew Lonie
Rhys Francis
Steven Manos
Jeff Christiansen
BioCommons – current partnerships
Initiation and development
2017 2018 2019 2020 2021 2022 2023 2024
Domain Applications and Services
Resources and Facilities
Leadership and Governance
Continuing Pathfinding
BioCloud
BPA National & International Consultations
EMBL-ABR community building; ANDS, NeCTAR, RDS programs; BPA training, GVL, Galaxy services
Pathfinder Project
Bioplatforms & eResearch capabilities
Enduring National Research Infrastructure
Five technical activities/implementation studies
Human Genome Access and Archive
Interoperability with global data (Kids First)
Non-model Genome Assembly & Annotation
Improvements to BYOD (phylogenetics, instruments, CloudStor)
Apply a Pathfinder Cloud, use AWS, evaluate both: Accessible compute and storage
2019 Pathfinder Project - Informing the BioCommons
Five technical activities/implementation studies managed as deliverables of the Pathfinder Project
Human Genome Access and Archive
Interoperability with global data (Kids First)
Non-model Genome Assembly & Annotation
Highly accessible Tools and Workflows (phylogenetics, instruments, CloudStor)
BioCloud - on-prem and commercial cloud: Accessible compute and storage
What is assembly and annotation?
Genome assembly refers to the process of taking a large number of discrete and finite length DNA sequences (generated from various ‘short’ or ‘long’ sequencing technologies) and computationally putting them back together to create a representation of the original chromosomes from which the DNA originated. De novo genome assembly is the assembly process that needs to be undertaken for non-model organisms, because there is usually no prior knowledge of the source DNA sequence length, layout or composition for the particular species being examined.
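To make "computationally putting them back together" concrete, here is a deliberately tiny sketch of de novo assembly by greedy overlap merging. It is a toy under strong assumptions (error-free reads, one strand, no repeats) and is not how production assemblers work; the function names `overlap` and `greedy_assemble` and the example reads are invented for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b (0 if < min_len)."""
    start = 0
    while True:
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_a, best_b = 0, None, None
        for a in reads:
            for b in reads:
                if a is not b:
                    olen = overlap(a, b, min_len)
                    if olen > best_len:
                        best_len, best_a, best_b = olen, a, b
        if best_len == 0:
            break  # no remaining overlaps: the reads stay as separate contigs
        reads.remove(best_a)
        reads.remove(best_b)
        reads.append(best_a + best_b[best_len:])
    return reads


if __name__ == "__main__":
    # Four overlapping toy reads drawn from one short made-up sequence.
    toy_reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
    print(greedy_assemble(toy_reads))
```

Real genomes break every assumption above (sequencing errors, repeats, both strands, billions of bases), which is why assembly at that scale needs the memory and I/O capacity discussed elsewhere in this deck.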
Genome annotation is the process of identifying and labelling features on a genome assembly, including genes, repetitive elements, and promoter regions. An unannotated genome assembly is of very limited use, as it consists of vast stretches of pure, unlabelled DNA sequence, whereas the higher the quality of the annotation, the more we know about the genome features. This increases the utility of the genome in comparative genomics, functional genomics, and evolutionary and ecological genomics.
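As a minimal illustration of "identifying and labelling features", the sketch below marks simple open reading frames (an ATG followed by an in-frame stop codon) on the forward strand of a sequence. This is only one naive kind of feature; real annotation pipelines combine gene models, repeat finding and external evidence. The function name `find_orfs`, the `min_codons` threshold and the example sequence are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Label forward-strand ORFs as dicts with 0-based start and exclusive end."""
    seq = seq.upper()
    features = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] != "ATG":
                i += 3
                continue
            # Walk codon by codon until an in-frame stop codon is found.
            j = i + 3
            while j + 3 <= len(seq) and seq[j:j + 3] not in STOP_CODONS:
                j += 3
            if j + 3 > len(seq):
                break  # no in-frame stop before the sequence ends
            if (j - i) // 3 >= min_codons:
                features.append(
                    {"type": "ORF", "start": i, "end": j + 3, "frame": frame}
                )
            i = j + 3
    return features


if __name__ == "__main__":
    print(find_orfs("CCATGAAACCCGGGTAGTT"))
```

The output is a list of labelled coordinates on the assembly, which is the basic shape of any annotation: a feature type plus where it sits on the sequence.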
Expected outcomes
Researchers (including, but not limited to, the
Bioplatforms-sponsored Oz Mammals Genomics and
Genomics for Australian Plants consortia) undertaking
de novo assembly will utilise tools and/or pipelines that
have been deployed on the BioCloud (as opposed to
using tools/pipelines that are deployed on and duplicated
across many separate systems now).
Operators of the BioCloud will thoroughly understand
the usage patterns and computational requirements of
the infrastructure that should be provided for this
community, which will inform the design of the
BioCloud.
Pathfinding Activities (2019)
Communities and infrastructure services identified for common omic-based challenges:
e.g. Genome annotation; Multi-omics integration; Comparative Genomics; etc
Development of a Genome Annotation Infrastructure Roadmap for Australia
First iteration released for community comment last week
http://bit.ly/aus-genome-annotation – COMMENTS WELCOME!
Subsequent iterations following consultation with:
International entities operating genome annotation infrastructure
elsewhere (e.g. Ensembl, EBI)
Australian Infrastructure Providers (e.g. NCI, Pawsey, ARDC)
Final community agreed roadmap planned for November 2019, to inform
future BioCommons investment
Wildlife Genomics
Presented by
Dr Carolyn Hogg
Australasian Wildlife Genomics Group
SOLES
Bilby Genome Assembly
10x Chromium sequencing; Supernova; 100GB raw data
Artemis
7 attempts to assemble; all failed (Lustre file system issue)
NCI
350 hours wall-time; assembly failed (Lustre file system issue)
Microsoft Azure (US$1,100)
144 hours (including two restarts)
32 CPUs; 500GB RAM; 2.1TB storage
3.1 GB genome
Dell T640 computer (~$15-20k)
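The Azure figures above reduce to a few unit rates. The sketch below is back-of-envelope arithmetic only: it assumes the US$1,100 covers the full 144 hours on all 32 CPUs, and, because the slide does not state a currency for the workstation price, it treats the ~$15-20k as directly comparable to the Azure cost purely for illustration. It ignores staff time, power and storage after the run.

```python
# Figures from the slide.
AZURE_COST_USD = 1_100
WALL_HOURS = 144
CPUS = 32
# ASSUMPTION: workstation price treated as comparable to USD for illustration.
WORKSTATION_PRICE_RANGE = (15_000, 20_000)


def cost_per_hour():
    """Cloud cost per wall-clock hour of the assembly run."""
    return AZURE_COST_USD / WALL_HOURS


def cpu_hours():
    """Total CPU-hours if all CPUs were held for the whole run."""
    return WALL_HOURS * CPUS


def cost_per_cpu_hour():
    return AZURE_COST_USD / cpu_hours()


def equivalent_cloud_runs(workstation_price):
    """How many runs of this size cost the same as the workstation."""
    return workstation_price / AZURE_COST_USD


if __name__ == "__main__":
    print(f"cost per hour: {cost_per_hour():.2f}")
    print(f"CPU-hours: {cpu_hours():,}")
    print(f"cost per CPU-hour: {cost_per_cpu_hour():.3f}")
    low, high = (equivalent_cloud_runs(p) for p in WORKSTATION_PRICE_RANGE)
    print(f"workstation is roughly {low:.1f} to {high:.1f} cloud runs")
```

On these assumptions, one run costs a little under 8 per hour, and the workstation costs about as much as 14 to 18 assemblies of this size.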
Workforce transition and the infrastructure opportunity
Less field work, more dry lab
Skills shortage
Poor interactions with existing technology providers
Domain is a latecomer to the HPC party
Software not optimised
Skills don’t exist to rectify
How do we avoid the “box under the desk”?
Accelerate science
Pathfinder activities in short: Genomes; Global data; Annotation; Better BYOD; Pathfinder Cloud
Pathfinder Proposition
Initiate a high throughput facility suitable for the pathfinder. Analyse pathfinder use
cases to determine future BioCloud architecture and performance requirements.
Intended Outcomes
● A Pathfinder BioCloud capable of supporting data intensive biology (albeit
restricted in scale to the pathfinder)
● Improved skill sets in high throughput data intensive computing and multi-cloud
interoperation
● Comparative cost benefit analyses of alternative strategies for high throughput
multi-cloud at scale to inform establishment of BioCloud
Initiate by July 2019
A significantly revised BioCloud
V1 would not likely exist before
late 2020, so some multi-year
horizon is needed
An ability to expand will be
important
Exit when BioCloud V1
commissioned.
The Pathfinder Cloud may need
to be sustained for 3 years in the
case of exit strategies related to
a failure to proceed with the
BioCommons and BioCloud V1.
What’s all this look like for researchers?
Boxes
Under
Desks
Commercial
Cloud
Public
HPC
The Biocommons @ Pawsey
What problems are we trying to solve?
Just all of them…
Portability
Reproducibility & Provenance
Collaboration
Software dependencies
Ease of use
Performance (Python and apps with heavy I/O)
Data movement
The Biocommons @ Pawsey
Establishing the ‘Biocloud’
Software and Containerisation
Data Movement
Leveraging commercial cloud for hybrid work
Key challenges
How do we use infrastructure to remain competitive in the global research
landscape?
How do we keep up with global $?
What do we specialise in, what do we bring to the global table?
How do we deal with distance, internally and externally? Data movement?
How do we keep staff?
What’s Australia’s place in the biosciences global research infrastructure
context?
Take home messages
Life science hasn’t had the same interaction
with the evolution of HPC as other domains
BioCommons is looking for partners; there are
multiple ways to engage, please talk to us
How can infrastructure providers support
communities that aren’t computational
chemists and physicists?
Opportunity for better engagement between
researchers and infrastructure providers
Contact
Sarah Nisbet, Platforms and Engagement Manager
BioPlatforms Australia
0420 959278
Andrew Lonie, Director, Australian Bioinformatics Commons
University of Melbourne
+61 3 83441395
Existing community expertise
INTERNATIONAL
ENGAGEMENT
- Bioplatforms
- International
study tour
- ARDC
- ?
- ?
SERVICES
- Monash
- QFAB
- USyd
Informatics
SUPPORT
- ARDC
distributed
helpdesk
- QFAB
INFRASTRUCTURE
- ARDC
- AARNet
- AAF
- NCI
- Pawsey
- QCIF
- Intersect
- UoM
- Monash
TRAINING
- EMBL-ABR
- UoM
- QFAB
- Bioplatforms
- ABACBS
Qualities sought through the combined Consortia
05 Scalability
● Non partisan development of capability at national scale
● Providing both open-to-all and impactful merit based allocations
● Available for open ended expansion by participants for their own use
04 Critical mass
● Commitment to world class expertise teams in focussed groups
● Capacity to support concentrated technical capabilities
● Concomitant support for scaled up investments in human resources
03 Domain connectivity
● Strong connectivity to national research leadership
● Real time connectivity to International developments
● Understanding informatics in Human Health, Agri-Food and Biodiversity
● Currency in workflows, techniques, developments, data resources etc.
02 Technological competence
● Growing our world class expertise in delivery technologies including:
● OpenStack, AWS, Google and Azure competencies; and
● High Performance Compute, Data and Networking systems
01 Nationwide inclusion
● Engaging fit for purpose national, institutional or regional capabilities
● Involving elements that are co-located with academic excellence
● Building a federated national expertise and support network
US National Institutes of Health: Data Commons
TOPMed: WGS data from 120,000
individuals funded by National Heart,
Lung, and Blood Institute.
Genotype-Tissue Expression (GTEx) project: 714
donors and 11,688 RNA-seq samples across 53
tissue sites and 2 cell lines
Model Organism Databases: WormBase,
FlyBase, Zebrafish Information Network,
Saccharomyces, Mouse and Rat Genome
Databases.
Pilot Phase
ELIXIR: 5 platforms of shared services led by leading European scientists
Current Projects - Compute
Reference Genomes (1.5 to 3.5GB)
Monolithic pipelines; Heavy I/O
150GB (raw data); 500GB RAM; 2.5TB disk space; 24-32 CPUs
Resequenced Genomes
Monolithic/multi-thread pipelines
100-120GB (raw data); 96GB RAM; 1TB disk space; 24 CPUs
Transcriptomes
Multi-thread pipelines
100-120GB (raw data); 35GB RAM; 150GB disk space; 24 CPUs
Phylogenomics; RRS; DNA methylation; pathogen discovery
Small compute requirements; HPC suitable
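The resource profiles above lend themselves to a simple fit check against a candidate machine. The sketch below encodes the three quoted profiles, taking the upper end of each range and decimal units (2.5TB as 2,500GB); the `Profile` class, the `runnable_on` helper and the example node sizes are hypothetical and not part of any BioCommons tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    cpus: int
    ram_gb: int
    disk_gb: int


# Upper bounds of the ranges quoted on the slide (TB converted as 1TB = 1000GB).
PROFILES = [
    Profile("reference genome", cpus=32, ram_gb=500, disk_gb=2500),
    Profile("resequenced genome", cpus=24, ram_gb=96, disk_gb=1000),
    Profile("transcriptome", cpus=24, ram_gb=35, disk_gb=150),
]


def fits(profile, cpus, ram_gb, disk_gb):
    """True if a node with these resources meets every requirement."""
    return (
        profile.cpus <= cpus
        and profile.ram_gb <= ram_gb
        and profile.disk_gb <= disk_gb
    )


def runnable_on(cpus, ram_gb, disk_gb):
    """Names of the workloads that fit on a single node of this size."""
    return [p.name for p in PROFILES if fits(p, cpus, ram_gb, disk_gb)]


if __name__ == "__main__":
    # HYPOTHETICAL node: 24 CPUs, 128GB RAM, 2TB local disk.
    print(runnable_on(cpus=24, ram_gb=128, disk_gb=2000))
```

A check like this shows why reference genome assembly is the outlier: on a hypothetical 128GB node the resequencing and transcriptome workloads fit, while the reference genome profile is ruled out by memory alone.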