UT Research Data Repository
Chris Jordan
UT Research Cyberinfrastructure
Storage Committee Chair
Outline
• UTRC Introduction/Current Status
• Research Data Requirements
• Current TACC storage infrastructure (Corral)
• New UTRC capabilities
• External services and partnerships
• Research and UTRC future
UT Research Cyberinfrastructure
• Collaborative effort initiated by Dr. Ken Shine, Vice Chancellor for Health
• Jay Boisseau (TACC), Brian Herman (UTHSCSA) co-chairs
• Assessment of research CI needs across system campuses
• Data Storage emerged as highest priority/biggest unmet need
UTRC Proposal
• Approved by UT Regents November 2010
• Expanded Lonestar 4 for HPC needs
• Establish dedicated 10 Gb/s research network to all campuses
• Develop replicated, 5PB Research Data Repository
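To give a rough feel for what a dedicated 10 Gb/s link means for moving research data, the arithmetic can be sketched as below. This is a back-of-the-envelope illustration, not a measurement of the UTRC network. The function name `transfer_hours` and the `efficiency` parameter are hypothetical, and sizes are assumed to be decimal terabytes (10^12 bytes).

```python
def transfer_hours(size_tb: float, link_gbps: float = 10.0,
                   efficiency: float = 1.0) -> float:
    """Estimate hours needed to move size_tb terabytes over a network link.

    Assumptions (not from the slides): decimal units (1 TB = 1e12 bytes,
    1 Gb/s = 1e9 bits/s), and `efficiency` is the fraction of the nominal
    line rate actually achieved (1.0 means a perfectly saturated link).
    """
    if size_tb < 0:
        raise ValueError("size_tb must be non-negative")
    if link_gbps <= 0 or not (0 < efficiency <= 1):
        raise ValueError("link_gbps must be > 0 and efficiency in (0, 1]")
    bits = size_tb * 1e12 * 8                      # bytes -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    for tb in (1, 100, 5000):                      # 5000 TB = 5 PB
        print(f"{tb:>5} TB: {transfer_hours(tb):10.1f} hours at line rate")
```

At the full line rate, 1 TB takes about 800 seconds. Real transfers usually run below line rate, which is what the `efficiency` parameter is meant to model.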
Storage Committee Activities
• Proposed iterative approach with pilot deployment in late 2011
• 1st half of 2011 spent on requirements and architecture development
• Released RFP in June
• Vendor selected in August
• Installation in October
• Initial users ~December
Sidebar: Why “The Cloud” is not the answer
• Cloud storage costs = $1000s/TB/year
• Often not as reliable as advertised (Google, Amazon have both had major issues)
• Restrictive interfaces, lack of high-performance access
• Issues with institutional control, security integration, etc
Pilot UTRDR Deployment
• 5PB Raw storage in each of two installations
• Main installation at TACC added to existing data infrastructure
• Mirror installation at Arlington for replication
• High level of redundancy within each installation
– Power supplies to storage controllers and servers
Research Data Requirements
• Persistent Storage is just the beginning
• High reliability/availability is key
• Complex, evolving security needs
• Importance of Collaboration
• Data Applications and Services
• Data Management and Analysis
• Also, it has to be cheap (or free)
Research Data Security
• HIPAA Compliance is a major goal of the UTRDR effort
• But HIPAA is just the beginning
• Intellectual property and research confidentiality issues are more fine-grained
• Long-term issues of availability/usability
• Tiers of access, change over time
Example Application Areas
• Biology
– Biodiversity (natural history collections)
– Phylogenetics
• Health Sciences
– Medical Imaging
– High-throughput sequencing
• Social Sciences
– Economic and social analysis
TACC Corral Architecture
• Emphasis on large-scale storage, highly flexible service infrastructure
• Fast networks and heterogeneous systems = malleable service and storage platform
• Allows integration of UTRC hardware into an existing infrastructure
• Near-transparent migration for existing users
• Expansion improves reliability and availability
Corral Hardware and Services
• 1.2 Petabytes DataDirect SATA Disk
• 16 Dell Servers
• ~300 TB of heterogeneous disks and servers
• High-Performance Parallel File System, multiple databases, iRODS data management, replication to tape archive
• Multiple levels of access control
• Supports almost any imaginable data need
iRODS at TACC
• Distributed/Replicated data management
• Corral, Ranch, and offsite storage systems
• Extensible metadata support
• Policy/Rule-based automation and enforcement
• Used for sophisticated data management needs
• Provides wide variety of interfaces
Current Corral Usage
• >30 Data Allocations & Collections
• 350 Users at TACC and UT
• >500 External users accessing collections
• >500TB Research and Reference Data
• Data of all types and disciplines:
– Plant specimens and ‘omics, MRI, GIS, Simulations, Fish and Pottery, Economics and Medicine
Added Capabilities w/ UTRDR
• Synchronous replication
• Very high availability (weather, comet strikes)
• Tiers of storage and data management
• Huge performance boost (>80GB/sec)
• Accessibility from all UT System campuses
• HIPAA Compliance
UTRDR Pilot Access
• Accelerated access for early adopters
• Allows us to shake out bugs, assess readiness for production
• Helps to develop requirements present and future
• Research network performance assessment
• Expect to open to all UT System researchers early 2012
UTRDR Long-term sustainability
• After pilot phase, storage will be free to all PIs up to some small limit (5TB?)
• Additional storage will be available for cost-recovery fee per TB
• Currently only trying to recoup costs on an annual basis
• Long-term preservation costs are TBD but are of major interest
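The free-tier-plus-cost-recovery model above can be sketched as a small calculation. This is only an illustration: the 5 TB free limit is the tentative figure from the slide, while the function name `annual_storage_fee` and the $100/TB/year rate are hypothetical placeholders, since no actual rate is given.

```python
def annual_storage_fee(allocated_tb: float,
                       free_tb: float = 5.0,
                       rate_per_tb: float = 100.0) -> float:
    """Yearly fee for a PI's allocation under a free-tier + per-TB model.

    free_tb:     storage provided at no charge (tentative 5 TB per the slide)
    rate_per_tb: cost-recovery rate in dollars per TB per year
                 (HYPOTHETICAL value, not from the source)
    """
    if allocated_tb < 0 or free_tb < 0 or rate_per_tb < 0:
        raise ValueError("all arguments must be non-negative")
    billable_tb = max(0.0, allocated_tb - free_tb)  # only TB above the free tier
    return billable_tb * rate_per_tb


if __name__ == "__main__":
    for tb in (3, 5, 12):
        print(f"{tb:>3} TB allocated -> ${annual_storage_fee(tb):,.2f} per year")
```

Because the fee is a fixed per-TB figure, a PI could plug the published rate into a calculation like this when budgeting storage in a grant proposal.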
Fee-based Research Storage
• 2 Major types of service:
– Simple storage (iSCSI, SCP/FTP) based on per-TB/year costs
– Application services (databases, web applications, data management, etc)
• Provides fixed, relatively low costs that can be written into grant proposals
• Can include both disk and tape + offsite storage
• Long-term model for UTRDR
Existing/Upcoming Partnerships
• University of Alaska
• UC Berkeley
• University of North Texas Libraries
• Texas Digital Library
• University of Florida
• Indiana University
• NSF XSEDE – 15 Institutions
UTRC Plan 2012-2013
• Initial production in early 2012
• Design assessment and adjustment based on initial experiences
• Expansion proposal mid-2012
• Significant expansion likely late 2012/early 2013
• Ongoing assessment and design adjustments integral to the process
TACC Storage Research
• Data upload and ingest processes
• Storage reliability and management
• Data Integrity/Long-term planning
• Automated data management applications
• Wide-area storage and replication efforts in the NSF XSEDE project
Acknowledgements
• Dr. Ken Shine – UT System
• Dr. Patricia Hurn – UT System
• Jay Boisseau and Brian Herman
• Jerry York – UTHSCSA
• UTRC Storage Committee
– Brian Grimm, Kevin Granhold, Huapei Chen, Wayne Mueller, Bill Sanns
• And many, many others
Q&A