
Canadian Genomics Cloud WHITE PAPER

CONTACT Email [email protected]

VERSION v1.0, Tuesday February 13, 2018


Contributors

Marc Fiume, PhD., CEO, DNAstack

Jim Vlasblom, PhD., CTO, DNAstack

Miroslav Cupak, MSc., Senior Engineer, DNAstack

Jyotheeswar Arvind Manickavasagar, MSc., Engineer, DNAstack

Adrian Thorogood, B.A.&Sc., B.C.L./LL.B., Centre of Genomics and Policy


Table of Contents

Background
  Opportunity
  The Rise of Genomics Data
  Existing Infrastructures
  Cloud Computing
Platform
  Features
  Principles
  Security, Privacy, and Compliance
  Encryption and Locality
Collaborators
  Work With Us


Background

OPPORTUNITY

There is an enormous and timely opportunity for Canada to implement a national precision medicine initiative that will deliver breakthrough discoveries and safer, more effective, and cost-efficient treatments for millions of people affected by genetic diseases in Canada and around the world. In order to seize this opportunity, Canada’s genome scientists need new technical infrastructure that is purpose-built for scalability, security, cost-efficiency, and accessibility.

THE RISE OF GENOMICS DATA

The cost of genome sequencing has decreased exponentially (faster than Moore’s Law) since the completion of the $1.3B USD Human Genome Project in 2002, thanks to the commercialization of sequencing technologies led by Illumina. In 2018, it costs around $1,000 USD to perform Whole Genome Sequencing (WGS) on a human individual using Illumina’s HiSeq X platform, with the expectation that the company’s new NovaSeq platform will enable $100 WGS [1].
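As a rough check on the comparison with Moore’s Law, the sketch below computes the average halving time implied by the two figures quoted above ($1.3B USD in 2002 and about $1,000 USD in 2018). It assumes a steady exponential decline between those two points and uses a two-year halving period as the Moore’s Law benchmark. Both are simplifying assumptions rather than figures from this paper, and the function name is illustrative. Note also that the first figure is a whole-project cost, so the comparison is indicative only.

```python
import math


def halving_time_years(start_cost: float, end_cost: float, years: float) -> float:
    """Average time for a cost to halve, assuming steady exponential decline."""
    halvings = math.log2(start_cost / end_cost)
    return years / halvings


# Figures quoted in the text above.
HGP_COST_USD = 1.3e9      # Human Genome Project, completion year given as 2002
WGS_COST_USD = 1_000.0    # whole genome sequencing in 2018
ELAPSED_YEARS = 2018 - 2002

# Assumption: Moore's Law taken as one halving of cost roughly every two years.
MOORE_HALVING_YEARS = 2.0

sequencing_halving = halving_time_years(HGP_COST_USD, WGS_COST_USD, ELAPSED_YEARS)

print(f"Implied halving time for sequencing cost: {sequencing_halving:.2f} years")
print(f"Moore's Law benchmark (assumed): {MOORE_HALVING_YEARS:.2f} years")
```

Under these assumptions the cost halves roughly every 0.8 years, well under the two-year benchmark.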

Commodification of WGS has resulted in global expansion of genomics investigations into the molecular blueprints of humans and other species of importance to the point where genomics has become one of the most computationally intensive sciences. By 2025, up to 2 billion individuals will have their genomes sequenced, generating up to 40 exabytes of data, surpassing the volume of content expected to be uploaded onto Twitter and YouTube combined [2].
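The upper-bound figures above imply an average storage footprint per genome. The short calculation below works it out, assuming decimal units (1 exabyte = 10^18 bytes, 1 gigabyte = 10^9 bytes). The variable names are illustrative.

```python
EXABYTE = 10**18   # bytes, decimal (SI) definition
GIGABYTE = 10**9   # bytes, decimal (SI) definition

# Upper-bound projections for 2025 quoted in the text above.
genomes_sequenced = 2_000_000_000
total_bytes = 40 * EXABYTE

total_gb = total_bytes // GIGABYTE
per_genome_gb = total_bytes // genomes_sequenced / GIGABYTE

print(f"Total volume: {total_gb:.1e} GB")
print(f"Average per genome: {per_genome_gb:.0f} GB")
```

In other words, 40 exabytes is 4 x 10^10 GB, or about 20 GB of stored data per sequenced individual at the upper bound.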

EXISTING INFRASTRUCTURES

Despite having strong scientific leadership in genomics, and increased capacity for WGS through Canada’s Genomics Enterprise (CGEn), Canada has lacked the powerful, robust, secure, scalable, and sustainable technical infrastructure required to meet the unprecedented demands for data storage and computation that will support the genomics initiatives of the future.

Genome scientists currently rely on a disjointed collection of home-brew, proprietary, locally-controlled, antiquated, and institutionally-siloed software solutions that cannot support large-scale genomics initiatives for a number of reasons:

[1] Illumina Introduces the NovaSeq Series - a New Architecture Designed to Usher in the $100 Genome. https://www.illumina.com/company/news-center/press-releases/press-release-details.html?newsid=2236383

[2] Big Data: Astronomical or Genomical? http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195

“I SKATE TO WHERE THE PUCK IS GOING TO BE, NOT WHERE IT HAS BEEN.”

— Wayne Gretzky


1. Data sharing is inefficient and compromises security. Data is shared between organizations over file-based transfer protocols (e.g. FTP), which require large volumes of data to be downloaded over the internet onto a local hard drive before they can be used. This propagates redundant copies of the data and consumes expensive network bandwidth and storage capacity. By repeatedly transferring data out of a single secured environment, stewards of sensitive, identifiable genomics information lose the ability to oversee access by bona fide researchers or audit their use of the data.

2. Access to data and infrastructure isn’t democratized. Centralization of data and infrastructure in a small number of institutions restricts both access by the wider community and its growth. Institutional authorization controls often hinder access by external users or preclude it altogether. Collaborative projects usually bring together researchers who are not affiliated with a single institution and who may not be traditional scientists (e.g. researchers from clinical diagnostic laboratories, industry, other fields of study, or citizen scientists).

3. Methods are not reproducible. Data processing workflows are developed in-house and hardcoded to run on specific local hardware architectures, limiting their reproducibility and confounding the comparison of results. The resulting data are not easily comparable, limiting the reuse of valuable information and delaying their interpretation and applications to human health.

4. Local infrastructures are disconnected from large and growing resources on the cloud. New tools, data, and services become available all the time and are being hosted on cloud platforms where they can be analyzed using managed software services, including machine learning frameworks. Local infrastructures are disconnected from these repositories and services, making it challenging for researchers using them to remain competitive.

5. The model is unsustainable. Many local infrastructures are already oversubscribed and at capacity. Centralization of resources in a small number of institutions creates an unsustainable imperial model with single points of failure that cannot scale indefinitely. By contrast, federated infrastructures composed of many interoperable systems underpin large-scale systems like the internet and allow for sustainable, organic growth.

6. They are expensive to build and maintain. Home-brewed, siloed, local infrastructures quickly become obsolete and are costly to build and maintain. Commercial cloud providers, by contrast, have substantially more resources and expertise in building such platforms.

[Figure: Cost of Genome Sequencing over Time. Data Source: https://www.genome.gov/pages/der/Sequencing_Costs_Table_July_2017.xlsx]

Continued inertial investment in the development of local hardware and software installations by Canada’s genome scientists is a significant distraction that consumes valuable time, money, and expertise.

CLOUD COMPUTING

At the same time, companies like Google, Microsoft, Amazon, and IBM have made major investments in infrastructure for cloud computing, an emerging paradigm that uses the storage and processing resources of remote data centres accessed securely over the internet, and that is used to deliver high-volume services like Netflix, iTunes, and YouTube. Cloud computing frees its users of the substantial costs, inefficiencies, and burdens associated with maintaining hard infrastructure on-premises while providing on-demand access to massive compute capacity and a wide range of managed services. Major cloud providers, including those listed above, have recently launched data centres in Canada.

Renewed thinking around the design and implementation of technical infrastructure is needed to enable a vibrant ecosystem of Canadian researchers, clinical laboratories, healthcare systems, pharmaceutical companies, and other stakeholders to successfully capitalize on emerging opportunities in genomics. By building infrastructure that better shares data, tools, and compute resources, we stand to vastly increase the value of each genome sequenced and drive new insights. To that end, Canadian leaders have come together to create the Canadian Genomics Cloud, a cloud-based technology platform designed to power the next generation of genomics and precision medicine in Canada.

[Figure: Data Volumes by 2025. Expected range of data volumes (GB, axis from 0 to 4E+10) for Twitter, YouTube, and Genomics. Data Source: Big Data: Astronomical or Genomical? http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195]


Platform

The Canadian Genomics Cloud (CGC) is an integrated software platform to manage, analyze and share genome sequence and clinical data. This public cloud computing platform, the first of its kind in Canada, gives every scientist in the country unfettered access to award-winning technology empowering precision medicine and other applications in genome research.

The Canadian Genomics Cloud is based on a set of principles (see Principles) that collectively overcome systemic design flaws of local infrastructures (see Existing Infrastructures) while delivering the latest technologies at unprecedented scale and cost efficiency. The Canadian Genomics Cloud can be accessed at genomicscloud.ca.

The platform was architected by Canadian leaders with decades of experience in genomics, sequencing, cloud computing, software, security, and policy and is intended to democratize access to best-in-class infrastructure while respecting national and provincial requirements for data privacy and security.

FEATURES

Sequencer Integration
Securely stream genomics data (including FASTQ, BAM, VCF) directly from high-throughput sequencers into the platform and automate analyses.

[Figure: Canadian Genomics Cloud application platform, linking pharma, hospitals, clinics, and academic users with genome sequencers.]

“THE WORLD NEEDS MORE CANADA”

— Barack Obama


Storage
High-volume, encrypted storage for any data type. Data redundancy and disaster recovery by default, optional cost-efficient archival options.

Bioinformatics
Execute any workflow defined in WDL at virtually any scale. Dynamically deploy pipelines across tens of thousands of available compute cores. Support for CWL coming soon.

Data Sharing
Share data privately between invited collaborators, widely across a research community, or publicly over industry standard protocols.

Visualization
Explore genomics data interactively through an integrated genome browser and other visualization tools.

Multiple Interfaces
Interact with the platform through a web browser, or programmatically via software libraries and command line interfaces.

Machine Learning
Deploy the latest machine learning models on hardware accelerated Tensor Processing Units (TPUs).

COMING SOON

EHR Integration
Connect to clinical records through major EHR systems, including Canada’s leading providers.

Collaboratories
Seamlessly link together data and tools from multiple organizations.

Hybrid Cloud
Store, analyze, and share datasets between multiple cloud platforms, powered by interoperable and industry standard APIs.

PRINCIPLES

Accessibility
CGC democratizes access to a broad range of stakeholders (any place, expertise, disease-focus, and sector), including scientists from academia, clinical practice, industry, and government.

THE CANADIAN GENOMICS CLOUD PROVIDES ON-DEMAND ACCESS TO LEADING DATA ANALYSIS WORKFLOWS DEVELOPED BY BROAD INSTITUTE, VERILY, AND SENTIEON


Collaboration
CGC facilitates seamless sharing of data, tools, and infrastructure between collaborating organizations. Compliance with open bioinformatics standards, like WDL and the Common Workflow Language (CWL), guarantees reproducibility inside and outside the platform.

Openness
CGC is based on open standards, including those its leadership co-develops with, for example, the Global Alliance for Genomics and Health. CGC aims to comply with open standards where they exist and connect to other open and interoperable systems.

Cost Efficiency
CGC drastically decreases upfront and ongoing investment into technical infrastructure (e.g. hard drives, compute capacity) and engineering (e.g. software development). Pay-as-you-go pricing charges users for exactly what they use, with no idle waste.

Scalability
CGC is built on massively scalable storage and compute architectures to meet exponential growth in Canadian genomics. CGC enables dynamic provisioning of massive compute capacity and optimizes storage, network bandwidth, and computation consumption while improving collaboration and security.

Security, Privacy, and Compliance
The CGC platform is designed to meet the largest set of federal and provincial regulations on data privacy and security. All data handled by the platform is encrypted at rest and in flight and, unless otherwise noted (see Exceptions to Data Sovereignty), is stored and processed in Canadian data centres.

SECURITY, PRIVACY, AND COMPLIANCE

Disclaimer: This document does not provide any legal advice or contractual guarantees. Please obtain legal advice to confirm that your use of the platform complies with consents and any institutional, provincial or national laws. To inquire about security, privacy, or compliance of the platform, please email [email protected].

The Canadian Genomics Cloud aims to service the needs of Canadian genome scientists from research institutions, clinical laboratories, hospitals, and industry.


Analysis and sharing of human genomic and health-related data in the Canadian Genomics Cloud may attract the application of Canadian privacy laws. The platform is being developed to satisfy the largest set of federal, provincial, and territorial data privacy and security requirements possible. Currently, the Canadian Genomics Cloud is suitable for research use only.

Canada has a publicly funded health care system. Instead of a single national plan, healthcare insurance plans are administered at provincial and territorial levels. The provincial and territorial governments are responsible for the management, organization and delivery of health care services for their residents.

The Canadian Genomics Cloud complies with the Personal Information Protection and Electronic Documents Act (PIPEDA), Canada’s federal privacy law governing the private sector. Canadian organizations using the platform may also need to comply with other privacy laws. Many provinces and territories also have their own public and private sector laws, as well as health privacy laws that apply to custodians of personal health information (e.g., Newfoundland and Labrador, Nova Scotia, New Brunswick, Ontario, and Alberta). PIPEDA will continue to apply unless the provincial law has been deemed to offer substantially similar protection. The Canadian Genomics Cloud can assist users of the platform to determine their compliance obligations under PIPEDA and numerous other Canadian privacy laws.

We intend to implement robust practices for privacy and security above and beyond legal requirements applicable in Canada, the United States (e.g., HIPAA), and the European Union (e.g., the General Data Protection Regulation). These robust safeguards will provide additional confidence for Canadian organizations hosting data on the platform, and facilitate both national and international health research collaborations.

The Canadian Genomics Cloud and its members can work with user organizations on a case by case basis to develop policies and tools (e.g., service contracts and consent forms) for cloud-based research that satisfy the highest legal and ethical standards.

ENCRYPTION AND LOCALITY

The platform employs 256-bit Advanced Encryption Standard (AES-256) for data at rest and end-to-end SSL/TLS encryption for data in flight. Users of the platform can also supply their own encryption keys to enhance protection from unauthorized third parties.
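To make the idea of user-supplied keys and encrypted transit more concrete, here is a minimal sketch using only the Python standard library. This paper does not describe how the platform accepts keys, so the dictionary field names and both function names below are hypothetical. The sketch assumes a service that takes a base64-encoded 256-bit key together with its SHA-256 digest, a pattern some cloud storage services use for customer-supplied keys. It generates key material and configures a TLS client context. It does not perform AES encryption itself.

```python
import base64
import hashlib
import secrets
import ssl


def generate_customer_key() -> dict:
    """Create a random 256-bit key plus encodings a service might ask for."""
    key = secrets.token_bytes(32)  # 32 bytes = 256 bits, the AES-256 key size
    digest = hashlib.sha256(key).digest()
    return {
        # Hypothetical field names; not a documented platform API.
        "algorithm": "AES256",
        "key_b64": base64.b64encode(key).decode("ascii"),
        "key_sha256_b64": base64.b64encode(digest).decode("ascii"),
    }


def make_tls_context() -> ssl.SSLContext:
    """Client-side TLS settings: verify certificates, refuse pre-1.2 protocols."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


if __name__ == "__main__":
    material = generate_customer_key()
    print("key fingerprint (SHA-256, base64):", material["key_sha256_b64"])
    print("minimum TLS version:", make_tls_context().minimum_version.name)
```

The digest lets a service confirm it received the intended key without storing the key itself. Anyone supplying their own keys also takes on responsibility for keeping them safe, since data encrypted under a lost key cannot be recovered.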

The Canadian Genomics Cloud enables users to store and compute their data on servers located within Canada. This increases accessibility and performance for Canadian organizations, and helps them comply with specialized data residency requirements.
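An organization with a data residency requirement might audit where its resources are stored. The sketch below shows one way to do that, under stated assumptions: the region identifiers are examples of Canadian regions published by major cloud providers, the resource names are invented for illustration, and this paper does not say which provider or regions the platform uses.

```python
from typing import Dict, List

# Example identifiers for Canadian regions from public cloud providers.
# Assumption: illustrative only; not a statement of where the platform runs.
CANADIAN_REGIONS = frozenset({
    "northamerica-northeast1",  # Google Cloud, Montréal
    "ca-central-1",             # Amazon Web Services, Canada (Central)
    "canadacentral",            # Microsoft Azure, Canada Central
    "canadaeast",               # Microsoft Azure, Canada East
})


def find_non_resident(resources: Dict[str, str]) -> List[str]:
    """Return the names of resources whose region is outside Canada."""
    return sorted(
        name
        for name, region in resources.items()
        if region.lower() not in CANADIAN_REGIONS
    )


# Hypothetical inventory mapping resource name to storage region.
inventory = {
    "cohort-a-reads": "northamerica-northeast1",
    "cohort-a-variants": "ca-central-1",
    "shared-index": "us-central1",
}

print(find_non_resident(inventory))  # ['shared-index']
```

A check like this only covers data at rest. It says nothing about where network traffic travels.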

THE CANADIAN GENOMICS CLOUD REMOVES CRITICAL BARRIERS CANADIAN GENOME SCIENTISTS FACE WHEN DOING THEIR WORK. TOO OFTEN, THE BEST IDEAS ARE NEVER EXPLORED BECAUSE THE INFRASTRUCTURE THEY NEED IS TOO DIFFICULT, EXPENSIVE, OR TIME CONSUMING TO BUILD AND MANAGE.


Organizations from outside Canada may also opt to keep their data on Canadian servers. Canada has strong and comprehensive federal, provincial and territorial data privacy and security laws that apply across the private, public and health sectors. Indeed, the European Commission has recognized PIPEDA as providing an adequate level of protection for personal data transferred from the European Union to Canadian commercial organizations.

While we are constantly striving to offer flexible data residency options, there are still some limited exceptions. Exceptions include the following services:

• Variant Store

• Search Application

Moreover, if users wish to provide access to the platform outside Canada (e.g., United States), or to certain geographical locations within Canada, this may necessarily involve transit of encrypted network traffic out of the country.


Collaborators

The Canadian Genomics Cloud brings together leaders in genomics, sequencing, cloud computing, software, security, and policy from public and private sectors with a common mission to develop a robust technical platform to enable large-scale genomics and precision medicine initiatives in Canada. Collaborators are expected to have a presence in Canada, agree to the Canadian Genomics Cloud principles, and make a strong commitment to support Canadian genomics initiatives.

WORK WITH US

We are proud to support Canadian and international organizations from public or private sectors to advance genomics and precision medicine. To partner with us, contact [email protected].