large scale data management in chipster workflow environment

1. Large scale data management in Chipster workflow environment
Aleksi Kallio, CSC IT Center for Science
BOSC 2012, July 13th, Long Beach

2. Background
Chipster is an environment for biological data analysis
Developed since 2005, under GPL license
Aimed at users without awesome computer skills
Originally for microarray data (moderate data sizes)
Extended for NGS data

3. Focus
How an architecture built on usability was remodelled to handle NGS-scale datasets without sacrificing usability (too much)

4. What you can do with Chipster 2.0
NGS: ChIP-seq, RNA-seq, miRNA-seq, MeDIP-seq, CNA-seq, DNA-seq
Microarrays: gene expression, miRNA expression, protein expression, aCGH, SNP, integration of different data types

5. Architecture
Message passing, thick client

6. Design principles of Chipster 1.x
Client stores primary copies of data files
Data files are immutable at server side
Data handling at server side is invisible to the user
Client orchestrates work

7. Design principles of Chipster 1.x

8. New design principles
There are no primary copies of data
Data files are immutable at server side
Data handling at server side is mostly invisible
Client orchestrates work

9. In practice?
Examples of how the new principles were implemented

10. Data is moved only when needed
Client keeps track of data with so-called sessions
Session = a bag of annotated links to copies of data files
If computing nodes and file brokers share a filesystem, they use it to access data directly
Data is moved when the user expects it and when it does not interfere with the user's flow
  In some cases not possible

11. Data is moved only when needed

12. Data is moved only when needed

13. Server quota handling with sessions
Session is the basic unit of quota management
Why?
  Users tend to keep their sessions clutter-free
  Files inside the sessions have similar lifespans

14. Server quota handling with sessions

15. Administering server disk space
Management web console allows administrators to keep track of jobs and disk space usage
Disk space usage shown per user and session

16. Administering server disk space

17. Other large data functionalities
Shared directory mapping
  For e.g. making institution-wide network shares directly visible in Chipster
  Presented to the user (almost) like a local filesystem
  Mapping rules allow e.g. matching usernames to home directories and user groups to group directories

18. Other large data functionalities
Continuous session backups
  Server OR client crash will not lose data
Genomic annotation management
  For genome browser
  Client suggests making a partial copy for a better browsing experience
  You can also make a complete copy, allowing offline usage

19. Release schedule
2.0 was released Jan 2012
  Annotation management, local data source transparency, virtual machine distribution, continuous backups
3.0 in autumn 2012
  Remote sessions, remote transparency, management console
3.1 in late 2012
  Long running jobs, intelligent scheduling

20. Acknowledgements
CSC IT Center for Science: Taavi Hupponen, Petri Klemelä, Massimiliano Gentile, Kimmo Mattila, Ari-Matti Saren, Eija Korpelainen
Finnish Red Cross Blood Service: Jarno Tuimala
VU University Amsterdam: Ilari Scheinin

21. Thank you!
Meet us at poster 5 in BOSC or posters J40 and U64 in ISMB
Tech track talk TT37 on Tuesday 15:30
Browse to http://chipster.csc.fi and http://chipster.sourceforge.net
Contact us at [email protected]
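
Slides 10, 13 and 15 describe a session as a bag of annotated links to copies of data files, as the basic unit of quota management, and as the level at which disk usage is reported per user. The following is only a minimal Python sketch of that idea, not Chipster's actual data model. The names `DataLink`, `Session`, `server_bytes` and `usage_per_user_and_session`, and the `"local"` / `"server"` location labels, are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class DataLink:
    """One annotated link to a copy of a data file (hypothetical structure)."""
    location: str        # assumed labels: "local" (client) or "server"
    path: str            # where this copy can be found
    size_bytes: int      # size of the copy
    annotation: str = "" # free-form note attached to the link


@dataclass
class Session:
    """A bag of annotated links; no copy is treated as the primary one."""
    owner: str
    name: str
    links: List[DataLink] = field(default_factory=list)

    def server_bytes(self) -> int:
        # Only copies held at the server count towards the server quota.
        return sum(l.size_bytes for l in self.links if l.location == "server")


def usage_per_user_and_session(sessions: List[Session]) -> Dict[str, Dict[str, int]]:
    """Summarise server disk usage per user and per session."""
    usage: Dict[str, Dict[str, int]] = defaultdict(dict)
    for s in sessions:
        usage[s.owner][s.name] = s.server_bytes()
    return dict(usage)


# Example: the same file is linked twice, once locally and once at the server.
demo = Session(
    owner="alice",
    name="rna-seq",
    links=[
        DataLink("local", "/home/alice/reads.bam", 1000, "original upload"),
        DataLink("server", "reads.bam", 1000, "server-side copy"),
    ],
)
print(usage_per_user_and_session([demo]))
```

Counting quota per session rather than per file fits the reasoning on slide 13: files in one session tend to share a lifespan, so removing a session frees all of its server-side copies together.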
