Genomics and BigData - case study
![Page 1: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/1.jpg)
Genomics in Big Data: genomic data from a big data perspective
Permissions: you are free to blog or live-blog about this presentation as long as you attribute the work to its authors
Chang Bum Hong
![Page 2: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/2.jpg)
Today's topics: why genomics needs the cloud; Cloud Computing; Cloud in Genomics; and the elements of an HPC cloud needed for Big Data analysis, seen through case studies.
![Page 3: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/3.jpg)
Of 1,588 respondents, 54% said they would like to have their own genome analyzed.
Nature, 2010
![Page 4: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/4.jpg)
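“It will bring the greatest benefit to basic medical research and clinical medicine,
and personalized medicine will become commonplace within 5-10 years, but the shortage of computing and software
is a major obstacle.”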
Nature, 2010
![Page 5: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/5.jpg)
![Page 6: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/6.jpg)
Data Size
![Page 7: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/7.jpg)
Need for Public Data
![Page 8: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/8.jpg)
Various Software
http://seqanswers.com/wiki/Software
![Page 9: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/9.jpg)
Complicated Pipeline
![Page 10: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/10.jpg)
Computing resources
Fusaro VA, Patil P, Gafni E, Wall DP, Tonellato PJ (2011) Biomedical Cloud Computing With Amazon Web Services. PLoS Comput Biol 7(8): e1002147
![Page 11: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/11.jpg)
Linux skill
Nature Biotech (2006)
![Page 12: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/12.jpg)
To the Clinic - License, HIPAA
![Page 13: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/13.jpg)
To the Clinic - License, HIPAA
![Page 14: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/14.jpg)
To the Clinic - License, HIPAA
![Page 15: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/15.jpg)
Cloud Computing and HPC (High Performance Computing)
![Page 16: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/16.jpg)
Syncing computers and smartphones, enjoying photos/music/video, sharing large files
The cloud as we usually encounter it and think of it
![Page 17: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/17.jpg)
The cloud we will talk about today…
![Page 18: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/18.jpg)
The cloud we will talk about today…
Journal of Biomedical Informatics (2013)
![Page 19: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/19.jpg)
Beyond hardware virtualization: expansion into a wide range of services
List of Amazon Web Services services
![Page 20: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/20.jpg)
Beyond hardware virtualization: expansion into a wide range of services
List of Google services
![Page 21: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/21.jpg)
What hardware virtualization gives us
You can choose the OS you want. You can build a custom OS image.
![Page 22: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/22.jpg)
Securing reproducibility; easy setup of analysis environments
Freeze “my analysis environment + data + scripts” as is and share it with others
![Page 23: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/23.jpg)
What hardware virtualization gives us
Add a disk to a running server. Turn the contents of a disk into an image and share it.
![Page 24: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/24.jpg)
What hardware virtualization gives us
You can create as many servers as you want, with the specifications you want.
![Page 25: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/25.jpg)
What hardware virtualization gives us
Another storage service: Object Storage
http://whatis.techtarget.com/reference/Object-storage-Fast-Guide
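As a rough illustration of the object storage idea on this slide, the following Python sketch models a store as a flat namespace of keys mapped to whole objects plus metadata. `ObjectStore` and its methods are hypothetical names invented here, not the API of any real object storage service, and the FASTQ contents are made up.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ObjectStore:
    """Toy in-memory object store: a flat namespace of key -> (bytes, metadata)."""

    _objects: dict = field(default_factory=dict)

    def put(self, key: str, data: bytes, **metadata: str) -> str:
        # Objects are written whole; the content hash acts as a checksum tag.
        etag = hashlib.sha256(data).hexdigest()
        self._objects[key] = (data, {**metadata, "etag": etag})
        return etag

    def get(self, key: str) -> bytes:
        return self._objects[key][0]

    def head(self, key: str) -> dict:
        # Metadata can be read without fetching the object body.
        return dict(self._objects[key][1])

    def list_keys(self, prefix: str = "") -> list:
        # "Directories" are only a naming convention: listing filters keys by prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))


# Hypothetical keys and contents, for illustration only.
store = ObjectStore()
store.put("run42/sample1.fastq", b"@read1\nACGT\n+\nIIII\n", sample="sample1")
store.put("run42/sample2.fastq", b"@read2\nTTGA\n+\nIIII\n", sample="sample2")
print(store.list_keys("run42/"))
print(store.head("run42/sample1.fastq")["sample"])
```

The point of the sketch is the access pattern: whole objects addressed by key over a flat namespace, with metadata stored next to the data, rather than files in a mounted block device.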
![Page 26: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/26.jpg)
What hardware virtualization gives us
Another storage service: Object Storage
![Page 27: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/27.jpg)
All of this is Cloud Software: open-source cloud platforms for building IaaS
OpenStack, Eucalyptus, OpenNebula, CloudStack
![Page 28: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/28.jpg)
OpenStack Architecture
![Page 29: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/29.jpg)
HPC Clustering
![Page 30: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/30.jpg)
Cloud in Genomics
![Page 31: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/31.jpg)
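Cloud in Genomics: concrete cloud examples in genomics

| Category | Service | Description | KT |
| --- | --- | --- | --- |
| IaaS | Amazon Web Services | Actively supports PaaS and SaaS providers; indirect support through public datasets | ucloud biz |
| IaaS | Google Cloud | Computing and storage support | ucloud biz |
| IaaS | NeCTAR Research Cloud | OpenStack-based private cloud for researchers | ucloud biz |
| SaaS | DNANexus | Provides NGS data analysis pipelines | g-Analysis |
| SaaS | Seven Bridges Genomics | Provides NGS data analysis pipelines | g-Analysis |
| SaaS | GotCloud | Provides NGS data analysis pipelines | g-Analysis |
| SaaS | Globus Genomics | Provides Galaxy on AWS | g-Galaxy |
| SaaS | GenomeSpace | Provides storage-based bioinformatics services | g-Storage |
| PaaS | CloudMan | Storage-based support for bioinformatics tools | |
| PaaS | SeqWare | Provides a base platform for genome analysis | |
| PaaS | StarCluster | Provides an AWS-based HPC cluster computing environment | |
| PaaS | CycleComputing | Provides an AWS-based HPC cluster computing environment | |
| PaaS | Google Genomics, BigQuery | Provides APIs for NGS data analysis | g-Insight |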
![Page 32: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/32.jpg)
Elements of an HPC Cloud needed for Big Data analysis, seen through case studies
![Page 33: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/33.jpg)
http://seqware.github.io/docs/
![Page 34: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/34.jpg)
Elements of an HPC Cloud - Sequencers
http://seqware.github.io/docs/
[Diagram labels: LIMS; Object Storage; High Speed File Transfer; IaaS HPC; Private Cloud; OpenStack…; Job Schedule; Linux…; Bioinformatics; Hadoop and Database]
![Page 35: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/35.jpg)
LIMS for NGS
Clarity LIMS from GenoLogics
Galaxy LIMS for NGS, Bioinformatics (2013)
![Page 36: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/36.jpg)
LIMS for NGS
Galaxy LIMS https://bitbucket.org/jelle/galaxy-central-tron-lims/
![Page 37: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/37.jpg)
Data Upload - online
![Page 38: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/38.jpg)
Data Upload - offline Import/Export
![Page 39: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/39.jpg)
Elements of an HPC Cloud - Storage
![Page 40: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/40.jpg)
http://www.genomespace.org/
GenomeSpace is a cloud-based interoperability framework to support integrative genomics analysis through an easy-to-use Web interface.
![Page 41: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/41.jpg)
Elements of an HPC Cloud - HPC
![Page 42: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/42.jpg)
Before Cloud Computing
Purchase request -> quotation… -> secure a server room -> install hardware -> configure software -> final testing -> outdated hardware, outdated data
![Page 43: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/43.jpg)
NeCTAR
![Page 44: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/44.jpg)
StarCluster and CloudMan
http://star.mit.edu/cluster/ https://wiki.galaxyproject.org/CloudMan/AWS/GettingStarted
![Page 45: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/45.jpg)
Elements of an HPC Cloud - Pipeline
![Page 46: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/46.jpg)
DNANexus and HGSC
![Page 47: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/47.jpg)
DNANexus and HGSC
![Page 48: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/48.jpg)
DNANexus and HGSC
![Page 49: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/49.jpg)
DNANexus and HGSC
![Page 50: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/50.jpg)
Elements of an HPC Cloud - Web Service
![Page 52: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/52.jpg)
Google Genomics API
http://googleresearch.blogspot.co.uk/2014/02/google-joins-global-alliance-for.html
Interoperability: One API, Many Apps
![Page 53: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/53.jpg)
Google Genomics Examples
Extracting specific read information for a specific sample using the API
![Page 54: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/54.jpg)
Google Genomics Examples
A Genome Browser built with the API and JavaScript
![Page 55: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/55.jpg)
Elements of an HPC Cloud - Query Engine
![Page 56: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/56.jpg)
Google BigQuery
![Page 57: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/57.jpg)
    # Compute the Ti/Tv ratio for BRCA1.
    SELECT
      transitions,
      transversions,
      transitions/transversions AS titv
    FROM (
      SELECT
        SUM(IF(mutation IN ('A->G', 'G->A', 'C->T', 'T->C'),
               INTEGER(num_snps), INTEGER(0))) AS transitions,
        SUM(IF(mutation IN ('A->C', 'C->A', 'G->T', 'T->G',
                            'A->T', 'T->A', 'C->G', 'G->C'),
               INTEGER(num_snps), INTEGER(0))) AS transversions,
      FROM (
        SELECT
          CONCAT(reference_bases, CONCAT(STRING('->'), alternate_bases)) AS mutation,
          COUNT(alternate_bases) AS num_snps,
        FROM
          [google.com:biggene:1000genomes.variants1kG]
        WHERE
          contig = '17'
          AND position BETWEEN 41196312 AND 41277500
          AND vt = 'SNP'
        GROUP BY
          mutation
        ORDER BY
          mutation));
Google BigQuery with plot
    # `sql` holds the query text; `billing_project` is the project that is billed.
    library(bigrquery)
    result <- query_exec(project = "google.com:biggene", dataset = "1000genomes",
                         query = sql, billing = billing_project)
Ti/Tv ratio in BRCA1
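To make the Ti/Tv calculation concrete outside of BigQuery, here is a minimal Python sketch of the same classification the query performs: purine-to-purine and pyrimidine-to-pyrimidine substitutions count as transitions, everything else as transversions. The function name `titv_ratio` and the toy input are assumptions made for illustration; the input is not real BRCA1 data.

```python
# The four transition substitutions, matching the first IN (...) list in the query.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def titv_ratio(snps):
    """Return (transitions, transversions, ratio) for (ref, alt) single-base pairs."""
    transitions = 0
    transversions = 0
    for ref, alt in snps:
        if ref == alt:
            raise ValueError(f"not a substitution: {ref}->{alt}")
        if (ref, alt) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    # Avoid dividing by zero when no transversions were observed.
    ratio = transitions / transversions if transversions else float("nan")
    return transitions, transversions, ratio


# Hypothetical toy input, not real variant calls.
example_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
print(titv_ratio(example_snps))
```

Unlike the query, which first groups SNPs by mutation type and sums the counts, the sketch counts one SNP at a time; the resulting ratio is the same.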
![Page 58: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/58.jpg)
    # Count the variation for each sample including phenotypic traits
    SELECT
      samples.genotype.sample_id AS sample_id,
      gender,
      population,
      super_population,
      COUNT(samples.genotype.sample_id) AS num_variants_for_sample,
      SUM(IF(samples.af >= 0.05, INTEGER(1), INTEGER(0))) AS common_variant,
      SUM(IF(samples.af < 0.05 AND samples.af > 0.005, INTEGER(1), INTEGER(0))) AS middle_variant,
      SUM(IF(samples.af <= 0.005 AND samples.af > 0.001, INTEGER(1), INTEGER(0))) AS rare_variant,
      SUM(IF(samples.af <= 0.001, INTEGER(1), INTEGER(0))) AS very_rare_variant,
    FROM
      FLATTEN([google.com:biggene:1000genomes.variants1kG], genotype) AS samples
    JOIN
      [google.com:biggene:1000genomes.sample_info] p
    ON
      samples.genotype.sample_id = p.sample
    WHERE
      samples.vt = 'SNP'
      AND (samples.genotype.first_allele > 0
        OR samples.genotype.second_allele > 0)
    GROUP BY
      sample_id,
      gender,
      population,
      super_population
    ORDER BY
      sample_id;
Google BigQuery with R
    library(ggplot2)
    ggplot(result, aes(x = population, y = common_variant, fill = super_population)) +
      geom_boxplot() + ylab("Count of common variants per sample") +
      ggtitle("Common Variants (Minimum Allelic Frequency 5%)")
Variant type
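The allele frequency bins used in the per-sample query can be sketched in Python as follows. The thresholds (5%, 0.5%, 0.1%) and bin names are taken from the query; the function names and the example frequencies are hypothetical and only serve to show where the boundary values fall.

```python
from collections import Counter


def frequency_class(af):
    """Map an allele frequency to the bin names used in the query."""
    if af >= 0.05:
        return "common_variant"
    if af > 0.005:
        return "middle_variant"
    if af > 0.001:
        return "rare_variant"
    return "very_rare_variant"


def count_by_class(allele_frequencies):
    """Count how many variants fall into each frequency bin."""
    return Counter(frequency_class(af) for af in allele_frequencies)


# Hypothetical allele frequencies for the variants carried by one sample.
example_afs = [0.30, 0.05, 0.02, 0.005, 0.003, 0.001, 0.0004]
print(count_by_class(example_afs))
```

Note that the boundaries are asymmetric, as in the query: exactly 5% counts as common, while exactly 0.5% counts as rare and exactly 0.1% as very rare.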
![Page 59: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/59.jpg)
Google BigQuery with RStudio
Markdown/Knit HTML
![Page 60: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/60.jpg)
Publish your R code with git and RPubs
https://github.com/ http://www.rpubs.com/
![Page 61: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/61.jpg)
Cost
Pricing: data storage $0.026 per GB/month; data query $0.005 per GB.
For the 1000 Genomes data: data storage $0.026 per GB/month * 5,500 GB = $143/month (about 157,000 KRW/month); data query (allele frequency query) $0.005/GB * 647 GB = about $3.2 (about 3,600 KRW).
All allele frequencies in 1000 Genomes computed in under 30 seconds.
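The cost arithmetic on this slide can be reproduced with a few lines of Python. The two unit prices and the data sizes come from the slide; the exchange rate `KRW_PER_USD` is an assumed round figure chosen for illustration and is not stated in the slides.

```python
STORAGE_USD_PER_GB_MONTH = 0.026  # from the slide
QUERY_USD_PER_GB = 0.005          # from the slide
KRW_PER_USD = 1100                # assumption: illustrative exchange rate


def monthly_storage_cost(stored_gb):
    """Monthly storage cost in USD for a dataset of the given size."""
    return stored_gb * STORAGE_USD_PER_GB_MONTH


def query_cost(scanned_gb):
    """Cost in USD of one query that scans the given amount of data."""
    return scanned_gb * QUERY_USD_PER_GB


storage_usd = monthly_storage_cost(5500)  # 1000 Genomes data size on the slide
query_usd = query_cost(647)               # data scanned by the allele frequency query

print(f"storage: ${storage_usd:.2f}/month (~{storage_usd * KRW_PER_USD:,.0f} KRW)")
print(f"query:   ${query_usd:.3f} (~{query_usd * KRW_PER_USD:,.0f} KRW)")
```

Storage is billed on what is kept each month, while a query is billed on what it scans, so a one-off scan of a few hundred gigabytes costs far less than keeping the whole dataset online.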
![Page 62: Genomics and BigData - case study](https://reader031.vdocuments.net/reader031/viewer/2022020122/54c1fed84a7959f14f8b45a7/html5/thumbnails/62.jpg)
Conclusion
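At least up to this moment,
there is no all-in-one gift set.
Every gift set also contains snacks that taste bad.
Fill your box with the snacks that fit your current situation.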