Clouds: All fluff and no substance?


DESCRIPTION

Keynote given at BOSC 2010. Does the hype surrounding clouds match the reality? Can we use them to solve the problems of provisioning IT services to support next-generation sequencing?

TRANSCRIPT

  • 1. Clouds: All fluff and no substance?
    • Guy Coates
    • Wellcome Trust Sanger Institute
    • [email_address]

4. Outline

  • Background
  • Cloud: Where are we at?
  • Good fit: Web services
  • Bad fit: HPTC compute
  • Better fit...?
  • Data management
  • Collaboration
  • Grids

12. The Sanger Institute

  • Funded by Wellcome Trust.
    • 2nd largest research charity in the world.
    • ~700 employees.
    • Based in Hinxton Genome Campus, Cambridge, UK.
  • Large-scale genomic research.
    • Sequenced 1/3 of the human genome (largest single contributor).
    • We have active cancer, malaria, pathogen and genomic variation / human health studies.
  • All data is made publicly available.
    • Websites, ftp, direct database access, programmatic APIs.

16. DNA sequencing

17. Economic Trends:

  • The cost of sequencing halves every 12 months.
    • cf. Moore's Law.
  • The Human Genome Project:
    • 13 years.
    • 23 labs.
    • $500 million.
  • A human genome today:
    • 3 days.
    • 1 machine.
    • $10,000.
    • Large centres are now doing studies with 1000s and 10,000s of genomes.
  • Changes in sequencing technology are going to continue this trend.
    • Next-next generation sequencers are on their way.
    • The $500 genome is probable within 5 years.

24. The scary graph
  • [Graph: sequencing output over time, annotated with instrument upgrades and peak yearly capillary sequencing.]

25. Managing Growth

  • We have exponential growth in storage and compute.
    • Storage/compute doubles every 12 months.
      • 2009: ~7 PB raw.
  • Gigabase of sequence ≈ gigabyte of storage.
    • 16 bytes per base for sequence data.
    • Intermediate analysis typically needs 10x the disk space of the raw data.
  • Moore's law will not save us (see the sketch below).
    • Transistor/disk density: doubling time T_d = 18 months.
    • Sequencing cost: T_d = 12 months.
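
To make the mismatch concrete, here is a back-of-the-envelope sketch (mine, not from the slides) of how far a 12-month doubling time outruns an 18-month one:

```python
# Illustrative only: sequencing demand doubles every 12 months,
# transistor/disk density (Moore's law) roughly every 18 months.
SEQ_DOUBLING_MONTHS = 12
MOORE_DOUBLING_MONTHS = 18

for years in range(1, 6):
    months = 12 * years
    demand = 2 ** (months / SEQ_DOUBLING_MONTHS)      # growth in data to store/process
    capacity = 2 ** (months / MOORE_DOUBLING_MONTHS)  # growth in affordable capacity
    print(f"Year {years}: demand x{demand:.1f}, capacity x{capacity:.1f}, "
          f"shortfall x{demand / capacity:.1f}")
```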

28. Cloud: Where are we at?

29. What is cloud?

  • Informatician's view:
    • On demand, virtual machines.
    • Root access, total ownership.
    • Pay-as-you-go model.
  • Upper management view:
    • Free compute we can use to solve all of the hard problems thrown up by new sequencing.
      • (8 cents/hour is almost free, right...?)
    • Twatter/friendface use it, so it must be good.

30. Hype Cycle
  • [Hype-cycle diagram, annotated: "Awesome!", "Lost in the clouds...", "Just works...", "Victory!" — where are we on the curve?]

34. Where are we?

  • We currently have three areas of activity:
    • Web presence
    • HPTC workload
    • Active Data Warehousing

35. Ensembl

  • Ensembl is a system for genome annotation and data visualisation.
  • Data visualisation (Web Presence)
    • www.ensembl.org
    • Provides web / programmatic interfaces to genomic data.
    • 10k visitors / 126k page views per day.
  • Compute Pipeline (HPTC Workload)
    • Takes a raw genome and runs it through a compute pipeline to find genes and other features of interest.
    • Ensembl at Sanger/EBI provides automated analysis for 51 vertebrate genomes.
    • Software is open source (Apache license); data is free to download.
  • We have done cloud experiments with both the website and the pipeline.

41. Ensembl Website

43. Web Presence

  • Ensembl has a worldwide audience.
  • Historically, website performance was not great.
    • Pages were quite heavyweight.
    • Not properly cached etc.
  • The web team spent a long time re-designing the code to make it more streamlined.
    • Greatly improved performance.
  • Coding can only get you so far.
    • If we want the website to be responsive, we need low latency.
    • A canna' change the laws of physics.
      • ~240 ms round-trip time.
    • We need a set of geographically dispersed mirrors.

47. uswest.ensembl.org

  • Traditional mirror: real machines in a co-lo facility in California.
  • Hardware was initially configured on site.
    • 16 servers, SAN storage, SAN switches, SAN management appliance, Ethernet switches, firewall, out-of-band management etc.
  • Shipped to the co-lo for installation.
    • Sent a person to California for 3 weeks.
    • Spent 1 week getting stuff into/out of customs.
      • ****ing FCC paperwork!
  • Additional infrastructure work.
    • VPN between UK and US.
  • Incredibly time-consuming.
    • Really don't want to end up having to send someone on a plane to the US to fix things.

50. Usage

  • Geo-IP database to point people to the nearest mirror.
  • US-West currently takes ~1/3 of total Ensembl web traffic.
    • Latency down from XXX ms to XX ms.

52. Usage

53. What has this got to do with clouds?

54. useast.ensembl.org

  • We want an east-coast US mirror to complement our west-coast mirror.
  • Built the mirror in AWS.
    • Initially a proof of concept / test-bed for virtual co-location.
    • Plan for production real soon now.

57. Building a mirror on AWS

  • No physical hardware.
    • Work can start as soon as we enter our credit card numbers...
  • Some software development / sysadmin work needed (see the sketch below).
    • Preparation of OS images, software stack configuration.
    • The west-coast mirror was built as an extension of the Sanger internal network via VPN; the AWS images are built as standalone systems.
  • Significant amount of tuning required.
    • Initial MySQL performance was pretty bad, especially for the large Ensembl databases (~1 TB).
    • Lots of people are doing Apache/MySQL on AWS, so there is a good amount of best practice available.
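
As an illustration of the "standalone AWS image" approach, here is a minimal sketch using the boto3 library; the AMI ID, key pair, security group and instance type are hypothetical placeholders, not the actual Ensembl configuration:

```python
import boto3

# Minimal sketch: launch one web-server node from a pre-built mirror image.
# The AMI ID, key pair and security group below are placeholders, not real resources.
ec2 = boto3.resource("ec2", region_name="us-east-1")

instances = ec2.create_instances(
    ImageId="ami-xxxxxxxx",          # pre-baked OS image with the Apache/MySQL stack
    InstanceType="m1.xlarge",        # instance class is an assumption
    MinCount=1,
    MaxCount=1,
    KeyName="ensembl-mirror-key",    # hypothetical key pair
    SecurityGroups=["ensembl-web"],  # hypothetical security group allowing HTTP
)
print("Launched:", instances[0].id)
```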

61. Does it work?

62. Is it cost effective?

  • Lots of misleading cost statements are made about cloud.
    • "Our analysis only cost $500."
    • "$0.085/hr."
  • What are we comparing against?
    • Doing the analysis once? Continually?
    • Buying a $2000 server?
    • Leasing a $2000 server for 3 years?
    • Using $150 of time at your local supercomputing facility?
    • Buying a $2000 server but having to build a $1M datacentre to put it in?
  • Requires the dreaded Total Cost of Ownership (TCO) calculation.
    • Hardware + power + cooling + facilities + admin/developers etc.
      • Incredibly hard to do.

68. Let's do it anyway...

  • Comparing costs to the co-lo is simpler.
    • Power and cooling costs are all included.
    • Admin costs are the same, so we can ignore them.
      • The same people are responsible for both.
  • Cost for the co-location facility:
    • $120,000 hardware + $51,000/yr co-lo fees.
    • $91,000 per year (3-year hardware lifetime); see the arithmetic below.
  • Cost for AWS:
    • $77,000 per year.
  • Result: estimated 16% cost saving.
    • A good saving, but it is not free!
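
The comparison above reduces to one line of arithmetic; a minimal sketch with the figures from the slides (straight-line 3-year amortisation of the hardware is an assumption):

```python
# Amortise the co-lo hardware over its 3-year lifetime and compare with AWS.
hardware = 120_000        # one-off hardware cost ($)
colo_fees = 51_000        # co-lo hosting ($/year)
aws = 77_000              # AWS estimate ($/year)

colo_per_year = hardware / 3 + colo_fees   # 40,000 + 51,000 = 91,000 $/year
saving = 1 - aws / colo_per_year           # ~0.15 (the slides round this to ~16%)
print(f"Co-lo: ${colo_per_year:,.0f}/yr  AWS: ${aws:,.0f}/yr  saving: {saving:.1%}")
```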

72. Additional Benefits

  • No need to deal with real hardware.
    • Faster implementation.
    • No need to ship servers or deal with US customs.
  • Free hardware upgrades.
    • As faster machines become available we can take advantage of them immediately.
    • No need to get tin decommissioned / re-installed at the co-lo.
  • Website + code are packaged together.
    • Can be conveniently given away to end users in a ready-to-run config.
    • Simplifies configuration for other users wanting to run Ensembl sites.
    • Configuring an Ensembl site is non-trivial for non-informaticians.
      • CVS, MySQL setup, Apache configuration etc.

77. Added benefits

78. Downsides

  • Packaging the OS images and code did take longer than expected.
    • Most of the web-code refactoring to make it mirror-ready had been done for the initial real co-lo.
  • This needs to be re-done for every Ensembl release.
    • Now part of the Ensembl software release process.
  • Management overhead does not necessarily go down.
    • But it does change.

79. Going forward

  • Expect mirror to go live later this year.
    • Far-east Amazon availability zone is also of interest.
      • No timeframe so far.
  • Virtual Co-location concept will be useful for a number of other projects.
    • Other Sanger websites?
  • Disaster recovery.
    • E.g. replicate critical databases / storage into AWS.

80. Hype Cycle
  • [Hype-cycle diagram with the web-services work marked on the curve.]

81. Ensembl Pipeline

  • HPTC element of Ensembl.
    • Takes raw genomes and lays annotation on top.

82. Compute Pipeline
  • [Slide showing a page of raw DNA sequence (TCCTCTCTTTATTT...).]

83. Raw Sequence -> Something useful

84. Example annotation

85. Gene Finding
  • [Diagram: DNA fed into HMM prediction, alignment with known proteins, alignment with fragments recovered in vivo, and alignment with other genes and other species.]

86. Compute Pipeline

  • Architecture:
    • OO Perl pipeline manager.
    • Core algorithms are in C.
    • 200 auxiliary binaries.
  • Workflow (see the sketch below):
    • The investigator describes the analysis at a high level.
    • The pipeline manager splits the analysis into parallel chunks.
      • Typically 50k-100k jobs.
    • Sorts out the dependencies and then submits jobs to a DRM.
      • Typically LSF or SGE.
    • Pipeline state and results are stored in a MySQL database.
  • Workflow is embarrassingly parallel.
    • Integer, not floating point.
    • 64-bit memory addressing is nice, but not required.
      • 64-bit file access is required.
    • Single-threaded jobs.
    • Very IO intensive.
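
The actual pipeline manager is OO Perl; purely as an illustration of the "split into chunks, submit to a DRM" pattern, here is a minimal Python sketch (the queue name, job script and chunk size are hypothetical):

```python
import subprocess

# Simplified sketch of the chunk-and-submit pattern described above.
# In the real pipeline, job state would also be recorded in MySQL.
CHUNK_SIZE = 1000
contigs = [f"contig_{i}" for i in range(50_000)]   # hypothetical input list

chunks = [contigs[i:i + CHUNK_SIZE] for i in range(0, len(contigs), CHUNK_SIZE)]

for n, chunk in enumerate(chunks):
    # Each chunk becomes one LSF job submitted with bsub.
    cmd = ["bsub", "-q", "normal", "-J", f"genebuild_{n}",
           "-o", f"logs/genebuild_{n}.out",
           "run_analysis.pl", "--contigs", ",".join(chunk)]
    subprocess.run(cmd, check=True)
```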

92. Running the pipeline in practice

  • Requires a significant amount of domain knowledge.
  • Software install is complicated.
    • Lots of Perl modules and dependencies.
    • Apache wrangling if you want to run a website.
  • Need a well-tuned compute cluster.
    • The pipeline takes ~500 CPU days for a moderate genome.
      • Ensembl chewed up 160k CPU days last year.
    • Code is IO-bound in a number of places.
      • Typically need a high-performance filesystem: Lustre, GPFS, Isilon, Ibrix etc.
    • Need a large MySQL database.
      • 100 GB-TB MySQL instances, with a very high query load generated from the cluster.

96. Why Cloud?

  • Provides a good example for testing HPTC capabilities of the cloud.

97. Why Cloud?

  • Proof of concept
    • Is HPTC even possible on cloud infrastructures?
  • Coping with the big increase in data
    • Will we be able to provision new machines / datacentre space to keep up?
    • What happens if we need to out-source our compute?
    • Can we be in a position to shift peaks of demand to cloud facilities?

100. Expanding markets

  • There are going to be lots of new genomes that need annotating.
    • Sequencers are moving into small labs and clinical settings.
    • Limited informatics / systems experience.
      • Typically postdocs/PhD students who have a real job to do.
    • They may want to run the genebuild pipeline on their data, but they may not have the expertise to do so.
  • We have already done all the hard work of installing the software and tuning it.
    • Can we package up the pipeline and put it in the cloud?
  • Goal: the end user should simply be able to upload their data, insert their credit-card number, and press GO.

102. Porting HPTC code to the cloud

  • Software stack / machine image.
    • Creating images with the software is reasonably straightforward.
    • No big surprises.
  • Queuing system.
    • The pipeline requires a queueing system (LSF/SGE).
    • Getting them to run took a lot of fiddling.
    • Machines need to find each other once they are inside the cloud (see the sketch below).
    • Building an automated self-discovering cluster takes time.
      • Hopefully others can re-use it.
  • MySQL databases.
    • Lots of best practice on how to do that on EC2.
  • But it took time, even for experienced systems people.
    • (You will not be firing your system administrators just yet!)
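
For illustration, a minimal sketch of one way nodes can find each other inside EC2, using the instance metadata service and instance tags via boto3; the Role tag convention and region are assumptions, not a description of our actual setup:

```python
import urllib.request
import boto3

# Sketch: a freshly booted worker looks up its own address and the cluster master.
# Assumes instances are tagged Role=master or Role=worker at launch (hypothetical).
def my_private_ip():
    # EC2 instance metadata service.
    with urllib.request.urlopen(
            "http://169.254.169.254/latest/meta-data/local-ipv4", timeout=2) as r:
        return r.read().decode()

def find_master_ip(region="us-east-1"):
    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.describe_instances(
        Filters=[{"Name": "tag:Role", "Values": ["master"]},
                 {"Name": "instance-state-name", "Values": ["running"]}])
    for reservation in resp["Reservations"]:
        for inst in reservation["Instances"]:
            return inst["PrivateIpAddress"]
    return None

if __name__ == "__main__":
    print("I am", my_private_ip(), "- master is", find_master_ip())
```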

107. The big problem...

  • Data:
    • Moving data into the cloud is hard.
    • Doing stuff with data once it is in the cloud is also hard.
    • If you look closely, most successful cloud projects have small amounts of data (10-100 MB).

111. Moving data is hard

  • Tools:
    • Commonly used tools (FTP, ssh/rsync) are not suited to wide-area networks.
    • WAN tools: gridFTP, FDT, Aspera.
  • Data transfer rates (gridFTP/FDT via our 2 Gbit/s site link):
    • Cambridge -> EC2 East coast: 12 Mbytes/s (96 Mbits/s).
    • Cambridge -> EC2 Dublin: 25 Mbytes/s (200 Mbits/s).
    • 11 hours to move 1 TB to Dublin; 23 hours to move 1 TB to the East coast (see the arithmetic below).
  • What speed should we get?
    • Once we leave JANET (the UK academic network), finding out what the connectivity is and what we should expect is almost impossible.
  • Are our disks fast enough?
    • Do you have fast enough disks at each end to keep the network full?
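
A quick sanity check of the transfer times quoted above (decimal terabytes assumed):

```python
# Reproduce the 1 TB transfer-time figures from the measured rates.
TB = 1e12  # bytes (decimal terabyte)

for destination, rate_mb_s in [("EC2 Dublin", 25), ("EC2 East coast", 12)]:
    seconds = TB / (rate_mb_s * 1e6)
    print(f"1 TB to {destination} at {rate_mb_s} MB/s: {seconds / 3600:.1f} hours")
```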

116. Networking

  • How do we improve data transfers across the public internet?
    • The CERN approach: don't.
    • Dedicated networking has been put in between CERN and the T1 centres which receive all of the CERN data.
  • Our collaborations are different.
    • We have relatively short-lived and fluid collaborations (1-2 years, many institutions).
    • As more labs get sequencers, our potential collaborators also increase.
    • We need good connectivity to everywhere.

120. Moving data in the cloud

  • Compute nodes need to be able to see the data.
    • No viable global filesystems on EC2.
    • NFS has poor scaling at the best of times.
    • EC2 has poor inter-node networking: with >8 NFS clients, everything stops.
  • The cloud way: store data in S3 (see the sketch below).
    • Web-based object store.
      • Get, put, delete objects.
    • Not POSIX.
      • Code needs re-writing / forking.
    • Limitations: cannot store objects > 5 GB.
  • Nasty hacks:
    • Subcloud: a commercial product that allows you to run a POSIX filesystem on top of S3.
      • Interesting performance, and you are paying by the hour...
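
For reference, a minimal sketch of the get/put/delete object model using the boto3 library; the bucket and key names are hypothetical, and at the time of the talk single objects were capped at 5 GB:

```python
import boto3

# Minimal sketch of the S3 object-store model: put, get, delete - no POSIX semantics.
# Bucket and key names are hypothetical.
s3 = boto3.client("s3")

# "put": upload a file as a single object (objects, not files/directories).
s3.upload_file("lane1.bam", "ensembl-pipeline-data", "runs/lane1.bam")

# "get": download it again; there is no seek/mmap/partial write as with a filesystem.
s3.download_file("ensembl-pipeline-data", "runs/lane1.bam", "/tmp/lane1.bam")

# "delete": remove the object.
s3.delete_object(Bucket="ensembl-pipeline-data", Key="runs/lane1.bam")
```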

123. Compute architecture
  • [Diagram contrasting two architectures: CPUs on a fat network sharing a POSIX global filesystem with a batch scheduler, vs. CPUs with local storage on a thin network using Hadoop/S3 data stores.]

124. Elephant in the room

125. Why not use map-reduce?

  • Re-writing apps to use S3 or Hadoop/HDFS is a real hurdle.
    • Nobody wants to re-write existing applications.
      • They already work on our compute farm.
    • Not an issue for new apps.
    • But Hadoop apps do not exist in isolation.
    • The barrier to entry seems much lower for filesystems.
      • We have a lot of non-expert users (this is a good thing).
  • Am I being a reactionary old fart?
    • 15 years ago, clusters of PCs were not real supercomputers.
    • ...then Beowulf took over the world.
  • Big difference: porting applications between the two architectures was easy.
    • MPI/PVM etc.
  • Will the market provide traditional compute clusters in the cloud?

129. Hype cycle
  • [Hype-cycle diagram with the HPTC work marked on the curve.]

130. Where are we?

  • You cannot take an existing data-rich HPTC app and expect it to work.
    • The IO architectures are too different.
  • There is some re-factoring going on for the Ensembl pipeline.
    • Currently on a case-by-case basis.
    • For the less data-intensive parts.

132. Shared data archives

133. Past Collaborations
  • [Diagram: several sequencing centres sending data to a central sequencing centre + DCC.]

134. Future Collaborations
  • [Diagram: sequencing centres 1, 2A, 2B and 3 sharing data via federated access. Collaborations are short term: 18 months-3 years.]

135. International Cancer Genome Project

  • Many cancer mutations are rare.
    • Low signal-to-noise ratio.
  • How do we find the rare but important mutations?
    • Sequence lots of cancer genomes.
  • International Cancer Genome Project.
    • A consortium of sequencing and cancer research centres in 10 countries.
  • Aim of the consortium:
    • Complete genomic analysis of 50 different tumor types (50,000 genomes).

136. Genomics Data
  • [Diagram: data size per genome, ranging from unstructured flat files used by sequencing informatics specialists to structured databases used by clinical researchers and non-informaticians: intensities / raw data (2 TB), alignments (200 GB), sequence + quality data (500 GB), variation data (1 GB), individual features (3 MB).]

137. Sharing Unstructured data

  • Large data volumes, flat files.
  • Federated access.
    • Data is not going to be in one place.
    • A single institute will have data distributed for DR / worldwide access.
      • Some parts of the data may be on cloud stores.
  • Controlled access.
    • Many archives will be public.
    • Some will have patient-identifiable data.
    • Plan for it now.

142. Dark Archives

  • Storing data in an archive is not particularly useful.
    • You need to be able to access the data and do something useful with it.
  • Data in current archives is dark.
    • You can put/get data, but cannot compute across it.
    • Is data in an inaccessible archive really useful?

144. Last week's bombshell

  • We want to run our pipeline across 100 TB of data currently in the EGA/SRA.
  • We will need to de-stage the data to Sanger and then run the compute.
    • An extra 0.5 PB of storage and 1,000 cores of compute.
    • 3-month lead time.
    • ~$1.5M capex.

148. Cloud / Computable archives

  • Can we move the compute to the data?
    • Upload the workload onto VMs.
    • Put the VMs on compute that is attached to the data.

  • [Diagram: VMs placed on compute attached to the data stores.]

150. Practical Hurdles

151. Where does it live?

  • Most of us are funded to hold data, not to fund everyone else's compute costs too.
    • Now we need to budget for raw compute power as well as disk.
    • Implement virtualisation infrastructure, billing etc.
      • Are you legally allowed to charge?
      • Who underwrites it if nobody actually uses your service?
  • Strongly implies the data has to be held with a commercial provider.
    • Amazon etc. already have billing infrastructures; why not use them?
    • Directly exposed to costs.
      • Is the service cost-effective?

155. Identity management

  • Which identity management system to use for controlled access?
    • Culture shock.
    • Lots of solutions:
      • OpenID, Shibboleth (Aspis), Globus/X.509 etc.
  • What features are important?
    • How much security?
    • Single sign-on?
    • Delegated authentication?
  • Finding consensus will be hard.

160. Networking:

  • We still need to get data in.
    • Fixing the internet is not going to be cost-effective for us.
  • Fixing the internet may be cost-effective for big cloud providers.
    • It is core to their business model.
    • All we need to do is get data into Amazon, and then everyone else can get the data from there.
  • Do we invest in fast links to Amazon?
    • It changes the business dynamic.
    • We have effectively tied ourselves to a single provider.

163. Summary

164. Acknowledgements

  • Phil Butcher
  • ISG Team
    • James Beal
    • Gen-Tao Chiang
    • Pete Clapham
    • Simon Kelley
  • 1k Genomes Project
    • Thomas Keane
    • Jim Stalker
  • Cancer Genome Project
    • Adam Butler
    • John Teague

171. Backup