Supporting bioinformatics - HPC Advisory Council

Supporting bioinformatics:

About BioPlatforms Australia

Established in 2007

We are funded through the Australian Government’s NCRIS program

We exist to support researchers to do research by providing access to life science research infrastructure

Our infrastructure falls into 4 platforms: Genomics, Proteomics, Metabolomics, Bioinformatics

We have a core team of 5 staff

We support ~17 facilities around the country

We provide capital support for instrumentation and operational support for expertise

BioPlatforms strategic plan

Bioplatforms Australia is a national asset providing value for the Australian Life Science sector through the provision of networked capability in genomics, proteomics, metabolomics and bioinformatics.

Vision: Bioplatforms Australia will underpin initiatives to address significant national biomedical, agri-food and environmental research challenges and deliver long term social and economic returns to the nation through the provision of integrated biomolecular research infrastructure and the development of strategic partnerships.

Bioplatforms National Footprint

University of Queensland

University of NSW

ANU JCSMR

University of Melbourne

BioScience, Bio21


The Australian Wine Research Institute

University of South Australia

Monash University

APAF MQ


Genomics, Metabolomics, Proteomics, Bioinformatics

University of Western Australia

Harry Perkins

AGRF

AGRF

AGRF

Ramaciotti Centre

KCCG

AGRF


Melbourne Bioinformatics

NCI

Bioplatforms Framework Strategy

We create open-data initiatives through collaborative research projects, which build critical ‘omic datasets that support scientific challenges of national importance.

Size, scale, complexity: the life science challenge

BioPlatforms Australia provides access to sequencing capability to better understand anything that is alive

Human health: cancer, disease, genetics

Agriculture: crop and food sustainability, disease resistance

Biodiversity: environmental conservation, endangered species, Australian microbiome

We estimate ~30,000 health/biosciences researchers, representing 30% of research effort in Australia

There is more data being produced than can be analysed

Data deluge

50 in 5: Australian researchers plan to sequence 50 of Australia’s most endangered animals over the next five years

Earth BioGenome Project: a global effort to sequence the genetic code, or genomes, of all 1.5 million known animal, plant, protozoan and fungal species on Earth

100,000 genomes: The project was established to sequence 100,000 genomes from around 85,000 NHS patients affected by a rare disease, or cancer

The Cancer Genome Atlas: molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types

Estimated number of Australian biology researchers in 2018: 30,000

20,000 (→ 15,000) biology-focussed bioscience researchers: occasional users of bioinformatics web services, e.g. BLAST, Ensembl

7,000 (→ 12,000) data-intensive bioscience researchers: ‘omics data analysis is a critical contributor to the research outcomes, e.g. RNAseq analysis to identify upregulated genes in a broader research program

2,000 (→ 3,000) bioinf-intensive bioscience researchers: research is fully dependent on advanced use of bioinformatics, e.g. genomic cancer research, population genomics/agricultural genomics programs

1,000 (in 5 years → 1,500) bioinformaticians: research into/application of techniques and tool development, e.g. research generating a new tool or statistical method; core facilities applying complex analyses

Important Transitions

New Methods and a Growing User base

2017 National Consultation

Strong support for and participation in the consultation process received from the community, the reference group and ISAG.

● The initial market test meeting in Brisbane on 9/Aug had over 30 senior attendees who supported the need for the activity.

● The reference group meeting had strong attendance and robust discussion and shaped the consultation topics.

● Six regional meetings and a national community meeting (attended by about 200) were held as follows:

September 2017 October 2017 November 2017 December 2017

Testing of Categories, Completing Mini Audit

Clarifying scope, Consult ISAG

Develop the concept with Reference Group, ISAG, EMBL-ABR team and key researchers

Reference Group (in person), ISAG comment

Review with Dept (ELIXIR director attended)

Phase 2 Report writing

Socialisation through consultations in all states and at national conferences

● Perth 10-11/Oct

● Canberra 30/Oct

● Sydney 3/Nov

● Melbourne 8/Nov, 17/Nov

● ABACBS 14/Nov

● Adelaide 20/Nov

Key findings from the national consultation

Estimated number of Australian biology researchers: 30,000, growing slowly in 5 years

biology-focussed: from 20,000 to 15,000

data-intensive: from 7,000 to 12,000

bioinformatics-intensive: from 2,000 to 3,000

bioinformaticians: from 1,000 to 1,500

A compute and data intensive infrastructure for the advancement of bioinformatics research

A means by which datasets can be integrated by collaborators, including across institutions

A service to which a researcher brings their research goals, tools, pipelines and data, that seamlessly integrates with all the resources they might need

A means by which omics data can be included in data integration occurring in research translation and in research applications

Broad Challenges

● The continuous reskilling of the bioscience research workforce

● The simple retention of data at the volumes expected

● The curation of data at the scale and complexity envisaged

● The onshore use of large offshore sources of primary reference data

● The availability for research use of large non-research genomic collections

● The complexity of integration of genomic data with many other data types

An Australian BioCommons

Motivation:

● Bioinformatics and new biomolecular technologies drive breakthrough research

● Data is exploding, compute intensity is rising fast and the techniques are undergoing rapid enhancement

● New forms of research infrastructure supply and consumption are being tried and adopted

● International investment is driving many of the technology and policy aspects of likely solutions

Meeting the challenge is beyond the means of any single actor in the Australian research sector.

Therefore:

We should cooperate in developing an Australian BioCommons so we can deliver the maximum benefit possible to Australian researchers and research outcomes

There is a global wave of investment

Study trip: broad conclusions

Global scale compute and data infrastructures are increasingly underpinning global scale research in life sciences

Cloud First is pervasive across EU and US

In the US (but not the EU) true partnerships are arising with cloud providers

The concept of Data Commons is very strong in the US, focused on data and method sharing

ELIXIR is doing a very good job of coordinating data infrastructure across Europe

ELIXIR/EBI compute strategy firmly cloud focussed - federated compute + data across EU

Federated approaches to data infrastructure are developing, and accessible

National Bioinformatics Infrastructures can deliver benefits across industry engagement

Galaxy is extremely well regarded as a community analysis platform in both Europe and the USA

Conclusions (December 2017)

Three key capabilities are missing

Capability I - A national omics analysis service providing:

A means to use standardised bioinformatics techniques through high level interfaces

Integrated with a regionally accessible support and training network

Providing direct access to underlying infrastructure for new technique developers

Capability II - Data Integration and Interrogation Facilities

One or more facilities at the infrastructure level - for data intensive computing on bioscience data and tools

Coupled with a critical mass of data science expertise versed in omics

Assigned by merit to support large team based research for extended periods (multi year)

Capability III - An Australian Bioscience Data Consortium

Policy development around rapidly emerging data asset issues;

The changing requirements on undergraduate and postgraduate training; and

Engagement with large scale omic resources onshore and offshore

Note: Today, we would add IV - A national solution for genome retention and access

We have an opportunity

eResearch $911M

Complex Biology $216M

In addition to existing operating

The Australian BioCommons

The BioCommons will comprise three components:

1. BioCommons Hub - providing governance, leadership, planning and management

2. BioCommons Services - providing the services and functionality

3. BioCommons Cloud - providing the necessary compute and data infrastructure

To be progressed through:

● A BPA led investment into the BioCommons Leadership and Services components

● The construction of a BioCommons Cloud with the assistance of the AAF, AARNet, ARDC (and its nodes), NCI and Pawsey and inclusive of the best use of AWS, Google and Azure

● The expansion of the BioCommons with the participation of the above and Universities, Medical Research Institutes, Publicly Funded Research Agencies and other government agencies, applying the BioCommons in the areas of Agriculture, BioDiversity and Human Health.

Australian BioCommons Principles

A national focus on capabilities and communities

Partner internationally: participate in and contribute to larger critical mass efforts where possible; reuse and improve rather than build anew

Build a software and expertise capability that will reduce duplication of infrastructure management in Australia and allow efforts to be re-focussed on methods development and dissemination

Promote the development of, and build on, high throughput cloud infrastructure that is interoperable with international (initially US and European) equivalents, using established, well supported software platforms

Streamline the exchange of tools, workflows, data and training and expertise both nationally and internationally

BioCommons Hub

Leadership

Andrew Lonie

Rhys Francis

Steven Manos

Jeff Christiansen

BioCommons – current partnerships

Initiation and development

2017 2018 2019 2020 2021 2022 2023 2024

Domain Applications and Services

Resources and Facilities

Leadership and Governance

Continuing Pathfinding

BioCloud

BPA National & International Consultations

EMBL-ABR community building; ANDS, NeCTAR, RDS programs; BPA training, GVL, Galaxy services

Pathfinder Project

Bioplatforms & eResearch capabilities

Enduring National Research Infrastructure

Five technical activities/implementation studies

Human Genome Access and Archive

Interoperability with global data (Kids First)

Non-model Genome Assembly & Annotation

Improvements to BYOD (phylogenetics, instruments, CloudStor)

Apply a Pathfinder Cloud, use AWS, evaluate both

Accessible compute and storage

2019 Pathfinder Project - Informing the BioCommons

Five technical activities/implementation studies managed as deliverables of the Pathfinder Project

Highly accessible Tools and Workflows (phylogenetics, instruments, CloudStor)

BioCloud - on-prem and commercial cloud

Accessible compute and storage

What is assembly and annotation?

Genome assembly refers to the process of taking a large number of discrete and finite length DNA sequences (generated from various ‘short’ or ‘long’ sequencing technologies) and computationally putting them back together to create a representation of the original chromosomes from which the DNA originated. De novo genome assembly is the assembly process that needs to be undertaken for non-model organisms, because there is usually no prior knowledge of the source DNA sequence length, layout or composition for the particular species being examined.
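As a toy illustration of "putting reads back together", the sketch below merges short reads by repeatedly joining the pair with the longest suffix/prefix overlap. It is a minimal greedy sketch for intuition only: the function names, the `min_len` threshold and the example reads are all hypothetical, it assumes error-free reads from a single strand, and it is not how production assemblers (such as Supernova, mentioned later in this deck) work.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Greedily merge reads into contigs (toy example, error-free reads assumed)."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more: remaining reads are separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical reads sampled from one short sequence
contigs = greedy_assemble(["ATGGCGTA", "GCGTACGT", "TACGTTAGC"])
print(contigs)
```

Real genomes add sequencing errors, repeats, two strands and billions of bases, which is why de novo assembly needs the memory and I/O capacity discussed elsewhere in these slides.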

Genome annotation is the process of identifying and labelling features on a genome assembly, including genes, repetitive elements, and promoter regions. An unannotated genome assembly is of very limited use, as it consists of vast stretches of pure, unlabelled DNA sequence, whereas the higher the quality of the annotation, the more we know about the genome features. This increases the utility of the genome in comparative genomics, functional genomics, and evolutionary and ecological genomics.
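To make "identifying and labelling features" concrete, here is a deliberately simplified sketch that labels open reading frames (an ATG start codon followed in-frame by a stop codon) on the forward strand of a sequence. The function name, the `min_codons` threshold and the example sequence are hypothetical; real annotation pipelines combine gene models, repeat detection and experimental evidence, none of which is attempted here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Label forward-strand ORFs as features with 0-based, half-open coordinates.

    `min_codons` counts codons before the stop codon (start codon included).
    """
    features = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == "ATG":
                j = i
                # walk codon by codon until a stop codon or the end of the sequence
                while j + 3 <= len(seq) and seq[j:j + 3] not in STOP_CODONS:
                    j += 3
                found_stop = j + 3 <= len(seq)
                if found_stop and (j - i) // 3 >= min_codons:
                    features.append(
                        {"type": "ORF", "start": i, "end": j + 3, "frame": frame}
                    )
                    i = j + 3  # continue scanning after this ORF
                    continue
            i += 3
    return features


# Hypothetical example sequence
for feature in find_orfs("CCATGAAACCCGGGTAACC"):
    print(feature)
```

The output is a list of labelled coordinates on top of the raw sequence, which is the basic shape of an annotation regardless of how sophisticated the method behind it is.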

Expected outcomes

Researchers (including, but not limited to, the Bioplatforms sponsored Oz Mammals Genomics and Genomics for Australian Plants consortia) undertaking de novo assembly will utilise tools and/or pipelines that have been deployed on the BioCloud (as opposed to using tools/pipelines that are deployed on and duplicated across many separate systems, as happens now).

Operators of the BioCloud will thoroughly understand the usage patterns and computational requirements of the infrastructure that should be provided for this community, which will inform the design of the BioCloud.

http://www.bioplatforms.com/oz-mammals/

http://www.bioplatforms.com/australian-plants/

Pathfinding Activities (2019)

Communities and infrastructure services identified for common omic-based challenges:

e.g. Genome annotation; Multi-omics integration; Comparative Genomics; etc

Development of a Genome Annotation Infrastructure Roadmap for Australia

First iteration released for community comment last week

http://bit.ly/aus-genome-annotation – COMMENTS WELCOME!

Subsequent iterations following consultation with:

International entities operating genome annotation infrastructure elsewhere (e.g. Ensembl, EBI)

Australian Infrastructure Providers (e.g. NCI, Pawsey, ARDC)

Final community agreed roadmap planned for November 2019, to inform future BioCommons investment

Wildlife Genomics

Presented by

Dr Carolyn Hogg

Australasian Wildlife Genomics Group

SOLES

Bilby Genome Assembly

10x Chromium sequencing; Supernova; 100GB raw data

Artemis: 7 attempts to assemble, all failed (Lustre file system issue)

NCI: 350 hours wall-time; assembly failed (Lustre file system issue)

Microsoft Azure (US$1,100): 144 hours (including two restarts)

32 CPUs; 500GB RAM; 2.1TB storage

3.1 GB genome

DELL T640 computer (~$15-20k)

Workforce transition and the infrastructure opportunity

Less field work more dry lab

Skills shortage

Poor interactions with existing technology providers

Domain is a late comer to HPC party

Software not optimized

Skills don’t exist to rectify

How do we avoid the “box under the desk”?

Accelerate science

Genomes Global data

Annotation Better BYOD

Pathfinder Cloud

Pathfinder Proposition

Initiate a high throughput facility suitable for the pathfinder. Analyse pathfinder use cases to determine future BioCloud architecture and performance requirements.

Intended Outcomes

● A Pathfinder BioCloud capable of supporting data intensive biology (albeit restricted in scale to the pathfinder)

● Improved skill sets in high throughput data intensive computing and multi-cloud interoperation

● Comparative cost benefit analyses of alternative strategies for high throughput multi-cloud at scale to inform establishment of BioCloud

Initiate by July 2019

A significantly revised BioCloud V1 would not likely exist before late 2020, so some multi-year horizon is needed

An ability to expand will be important

Exit when BioCloud V1 commissioned.

The Pathfinder Cloud may need to be sustained for 3 years in the case of exit strategies related to a failure to proceed with the BioCommons and BioCloud V1.

What does all this look like for researchers?

Boxes Under Desks

Commercial Cloud

Public HPC

The Biocommons @ Pawsey

What problems are we trying to solve?

Just all of them…

Portability

Reproducibility & Provenance

Collaboration

Software dependencies

Ease of use

Performance (Python and apps with heavy I/O)

Data movement


Establishing the ‘BioCloud’

Software and Containerisation

Data Movement

Leveraging commercial cloud for hybrid work


Key challenges

How do we use infrastructure to remain competitive in the global research landscape?

How do we keep up with global $?

What do we specialise in, what do we bring to the global table?

How do we deal with distance, internally and externally? Data movement?

How do we keep staff?

What’s Australia’s place in the biosciences global research infrastructure context?

Take home messages

Life science hasn’t had the same interaction with the evolution of HPC as other domains

BioCommons is looking for partners; there are multiple ways to engage, please talk to us

How can infrastructure providers support communities that aren’t computational chemists and physicists?

Opportunity for better engagement between researchers and infrastructure providers

Contact

Sarah Nisbet, Platforms and Engagement Manager

BioPlatforms Australia

[email protected]

0420 959278

Andrew Lonie, Director, Australian Bioinformatics Commons


[email protected]

+61 3 83441395

Existing community expertise

INTERNATIONAL ENGAGEMENT

- Bioplatforms

- International study tour

- ARDC

- ?

- ?

SERVICES

- Monash

- QFAB

- USyd Informatics

SUPPORT

- ARDC distributed helpdesk

- QFAB

INFRASTRUCTURE

- ARDC

- AARNet

- AAF

- NCI

- Pawsey

- QCIF

- Intersect

- UoM

- Monash

TRAINING

- EMBL ABR

- UoM

- QFAB

- Bioplatforms

- ABACBS

Qualities sought through the combined Consortia

05 Scalability

● Non partisan development of capability at national scale

● Providing both open-to-all and impactful merit based allocations

● Available for open ended expansion by participants for their own use

04 Critical mass

● Commitment to world class expertise teams in focussed groups

● Capacity to support concentrated technical capabilities

● Concomitant support for scaled up investments in human resources

03 Domain connectivity

● Strong connectivity to national research leadership

● Real time connectivity to international developments

● Understanding informatics in Human Health, Agri-Food and Biodiversity

● Currency in workflows, techniques, developments, data resources etc.

02 Technological competence

● Growing our world class expertise in delivery technologies including:

● OpenStack, AWS, Google and Azure competencies; and

● High Performance Compute, Data and Networking systems

01 Nationwide inclusion

● Engaging fit for purpose national, institutional or regional capabilities

● Involving elements that are co-located with academic excellence

● Building a federated national expertise and support network

US National Institutes of Health: Data Commons

TOPMed: WGS data from 120,000 individuals, funded by the National Heart, Lung, and Blood Institute.

Genotype-Tissue Expression project: 714 donors and 11,688 RNA-seq samples across 53 tissue sites and 2 cell lines

Model Organism Databases: WormBase, FlyBase, Zebrafish Information Network, Saccharomyces, Mouse and Rat Genome Databases.

Pilot Phase

ELIXIR: 5 platforms of shared services led by leading European scientists

Current Projects - Compute

Reference Genomes (1.5 to 3.5GB)

Monolithic pipelines; Heavy I/O

150GB (raw data); 500GB RAM; 2.5TB disk space; 24-32 CPUs

Resequenced Genomes

Monolithic/multi-thread pipelines

100-120GB (raw data); 96GB RAM; 1TB disk space; 24 CPUs

Transcriptomes

Multi-thread pipelines

100-120GB (raw data); 35GB RAM; 150GB disk space; 24 CPUs

Phylogenomics; RRS; DNA methylation; pathogen discovery

Small compute requirements; HPC suitable
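The resource figures above can be captured as a small lookup to check which workloads a given machine could host. This is only a sketch: the `Profile` class and `workloads_that_fit` function are hypothetical names, the lower bound is used wherever the slide quotes a range, and terabytes are converted to gigabytes at 1 TB = 1000 GB as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    cpus: int      # lower bound of the CPU range quoted on the slide
    ram_gb: int
    disk_gb: int   # assumes 1 TB = 1000 GB


# Figures copied from the "Current Projects - Compute" slide
PROFILES = [
    Profile("reference genome", cpus=24, ram_gb=500, disk_gb=2500),
    Profile("resequenced genome", cpus=24, ram_gb=96, disk_gb=1000),
    Profile("transcriptome", cpus=24, ram_gb=35, disk_gb=150),
]


def workloads_that_fit(cpus, ram_gb, disk_gb):
    """Return the names of the workload profiles a machine of this size satisfies."""
    return [
        p.name
        for p in PROFILES
        if cpus >= p.cpus and ram_gb >= p.ram_gb and disk_gb >= p.disk_gb
    ]


# Hypothetical node: 32 CPUs, 128 GB RAM, 2 TB of scratch disk
print(workloads_that_fit(cpus=32, ram_gb=128, disk_gb=2000))
```

A check like this only covers capacity; it says nothing about the heavy I/O behaviour of monolithic pipelines noted on the slide, which is a separate consideration when choosing a platform.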
