Big Data Repository for Structural Biology: Challenges and Opportunities, by Piotr Sliz
TRANSCRIPT
Big Data Repository for Structural Biology:
Challenges and Opportunities
Piotr Sliz, PhD
SBGrid: http://sbgrid.org
SBGrid Data Bank: http://data.sbgrid.org
Twitter: @SBGrid
YouTube: SBGridTV
SBGrid Consortium
• Support Center at Harvard Medical School
• 300 Research Groups
• 13 Countries
• Long Term Sustainability: Membership Fee
Harvard Medical School
SBGrid supports compilation, installation and upgrades of ~300 scientific applications
• Several Software Categories (EM, NMR, X-ray, Comp Chem, etc.)
• Multiple versions of most applications
• OS X (10.6-10.10) and Linux (CentOS 5-7) support
• No additional end-user configuration required
Software always works = more time for research
Core Mission:
• Grid Computing (Open Science Grid VO + Grid Portal)
• General Research Infrastructure (Boston Area)
• Training (workshops, software cataloguing, webtales)
• Webinars at youtube.com/SBGridTV
• Developer Resources
• Advocating for Open Source Software
Morin et al. Shining Light into Black Boxes. Science, 2012.
Other Activities:
Additional Publications
Primary Citation:
Other Citations:
New Opportunity: Data
anonymous SBGrid member 1: “we cannot find the original frames for many of our structures (move from X to Y), including recent high impact projects. What do you recommend that we do?”
anonymous SBGrid member 2: “I was able to locate the data directory but I must have done a good job cleaning up the disk space before I left: usually there are only two .img files left in the data directory, the 1st and the last image of a full run.”
Lack of Storage Support for Diffraction Images
derive reproduce improve correct
• Stokes-Rees, I., Levesque, I., Murphy, F.V., Yang, W., Deacon, A., and Sliz, P. (2012). Adapting federated cyberinfrastructure for shared data collection facilities in structural biology. J Synchrotron Radiat 19, 462–467.
• Terwilliger, T.C., and Bricogne, G. (2014). Continuous mutual improvement of macromolecular structure models in the PDB and of X-ray crystallographic software: the dual role of deposited experimental data. Acta Crystallogr. D Biol. Crystallogr. 70, 2533–2543.
• Terwilliger, T.C. (2014). Archiving raw crystallographic data. Acta Crystallogr. D Biol. Crystallogr.
• Guss, J.M., and McMahon (2014). How to make deposition of images a reality. Acta Crystallogr. D Biol. Crystallogr. 70, 2520–2532.
Focus on Primary Data
SBGrid Data Bank. Pilot: May 1st, Production: June 1st, 2015
EZID
Dataset Lock BIODBCORE-000683
re3data.org
Data Mining and Annotation
Web Interface
Related Datasets
Depositors:
URL: data.sbgrid.org
Dataset Landing Page
DataCite Schema
CC0 License
Download Dataset URL
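To make the landing-page items above concrete, here is a minimal sketch of a DataCite-style metadata record for a deposited dataset. It is an illustration only, not the SBGrid Data Bank's actual record format: the DOI, creator, and title are hypothetical placeholders, and the `REQUIRED` tuple lists the properties that recent versions of the DataCite Metadata Schema treat as mandatory.

```python
# Illustrative only: all values below are hypothetical placeholders.
# The property names follow the DataCite Metadata Schema's mandatory properties.
REQUIRED = ("identifier", "creators", "titles", "publisher",
            "publicationYear", "resourceTypeGeneral")


def missing_properties(record):
    """Return the mandatory properties that are absent or empty in a record."""
    return [key for key in REQUIRED if not record.get(key)]


record = {
    "identifier": {"identifierType": "DOI", "identifier": "10.0000/EXAMPLE"},
    "creators": [{"creatorName": "Doe, Jane"}],
    "titles": [{"title": "X-ray diffraction images for an example structure"}],
    "publisher": "SBGrid Data Bank",
    "publicationYear": "2015",
    "resourceTypeGeneral": "Dataset",
    # CC0 dedication, matching the license named on the slide.
    "rightsList": [{
        "rights": "CC0 1.0 Universal",
        "rightsURI": "https://creativecommons.org/publicdomain/zero/1.0/",
    }],
}

print(missing_properties(record))
```

A check like this could run before a DOI is requested, so that incomplete records are caught before the identifier is minted.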
Data Access Alliance:
• Make data easily accessible for reprocessing
• Minimize project cost
• Increase redundancy
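Redundant copies are only useful if they can be shown to be identical. One common way to check this is a checksum manifest, sketched below under the assumption that each site holds a dataset as a plain directory of files. The function names and the demo file names are hypothetical and are not part of any Data Access Alliance tooling.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large diffraction images need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(dataset_dir):
    """Map each file's path (relative to the dataset root) to its SHA-256 digest."""
    root = Path(dataset_dir)
    return {p.relative_to(root).as_posix(): sha256_of(p)
            for p in sorted(root.rglob("*")) if p.is_file()}


def differing_files(manifest_a, manifest_b):
    """List files that are missing from one copy or whose digests disagree."""
    names = set(manifest_a) | set(manifest_b)
    return sorted(n for n in names if manifest_a.get(n) != manifest_b.get(n))


# Demo with two temporary "replicas"; file names and contents are made up.
with tempfile.TemporaryDirectory() as site_a, tempfile.TemporaryDirectory() as site_b:
    for site in (site_a, site_b):
        Path(site, "frame_0001.img").write_bytes(b"first frame")
    Path(site_b, "frame_0002.img").write_bytes(b"second frame")
    demo_diff = differing_files(build_manifest(site_a), build_manifest(site_b))

print(demo_diff)
```

Comparing manifests rather than the files themselves means only the small manifest has to travel between sites.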
Challenges
• Dataset Size (APIs, Data Access Alliance)
• Journal + Data Automation: automated embargo release, cross-referencing, coordination/communication with journals, data vs. journal citations
• Metrics: dataset deposition rates, data use (DAA membership vs. direct downloads), dataset quality (Level 0-2), data citations
• Master Format: OME-TIFF vs. DataCite vs. DataVerse schema
• Transition to a Research Data Management Software
• ORCID integration and adoption
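As a rough illustration of what automated embargo release might involve, the sketch below selects datasets that are due for release. The release rule (release when the embargo date has passed or the linked article has been published) is an assumption made for this example, not a stated SBGrid Data Bank policy, and the record fields are hypothetical.

```python
from datetime import date

# Hypothetical records: field names and values are illustrative only.
datasets = [
    {"id": "example-1", "embargo_until": date(2015, 9, 1), "article_published": False},
    {"id": "example-2", "embargo_until": date(2016, 1, 1), "article_published": True},
    {"id": "example-3", "embargo_until": date(2016, 1, 1), "article_published": False},
]


def due_for_release(records, today):
    """Return ids of datasets whose embargo has expired or whose article is out."""
    return [r["id"] for r in records
            if r["article_published"] or today >= r["embargo_until"]]


print(due_for_release(datasets, date(2015, 10, 1)))
```

In practice the `article_published` flag would have to come from the journal, which is where the coordination with journals listed above comes in.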
Opportunities
• Better support for ~300 structural biology laboratories: compliance, reproducibility
• Integration with PDB and other repositories
• Other data types in addition to X-ray diffraction
Thank you
Stephanie Socias
Pete Meyer
Merce Crosas