PRACE WP4 – Distributed Systems Management
Riccardo Murri, CSCS – Swiss National Supercomputing Centre
PRACE WP4
• WP4 is the “Distributed Systems Management” activity
  – User administration and accounting
  – Distributed data management
  – Trust between sites and security
  – Monitoring of distributed resources
  – Resource management and allocation
  – Grid access
• Provides tools for:
  – Consistent management of the Tier-0 systems
  – Smooth interoperation of the PRACE infrastructure with the national, regional or institutional HPC services
  – Seamless access for users to the PRACE infrastructure
WP4 defines the PRACE Middleware Stack
• To connect all PRACE systems in a coherent whole
  – e.g., uniform interfaces for job submission and data transfer
• Yet allow users to take advantage of the diversity of PRACE systems
  – Do not make all machines look alike: they have different characteristics that computational jobs can exploit
• Iterative process to define the middleware stack
  – Second release in June, currently being deployed
  – Work towards final release starts now!
Middleware selection
• HPC ecosystem integration
  – Wide range of tools to adapt to the manifold needs of users
  – Standards compliance a must
  – Client software must be readily available at end-user sites
• Leverage DEISA experience
  – DEISA has been running a distributed supercomputing infrastructure for several years
  – Strong cooperation between the two projects
• User-centric view
  – Assessments also based on survey and user feedback
The PRACE Infrastructure, today
• Six prototype systems
  – At BSC (Spain), CEA (France), CSC (Finland), FZJ (Germany), HLRS (Germany), SARA (The Netherlands)
  – Diversity of computational architectures
• Private high-speed interconnect
  – 10 Gb/s dedicated links
  – Shared with the DEISA supercomputing federation
    • Leverage DEISA experience and tools in running it
  – Star topology, with hub in Frankfurt a.M.
• Public Internet access
  – Through the European GÉANT R&E network
Two-tiered structure
• All PRACE services exposed to both the private network and the Internet
• Flexibility in service setup
  – Cater for diversity in system capabilities and site policies
• “Inner circle” on the private network
  – Geared towards high-speed transfers and strong integration of PRACE services
• “Outer circle” on the public Internet
  – Secures access to the PRACE services
  – “Door nodes”: bastion hosts that act as gateways to the PRACE private network
A vision for access to PRACE systems
• X.509-based authentication and authorization
  – Uniform authentication for all PRACE services
    • UNICORE, SSH, GridFTP/RFT, …
  – Provides encryption and confidentiality of all network communications
  – Widely adopted standard
  – EUGridPMA/IGTF provides trust anchor
• Globus GSI support
  – Single Sign-On: seamlessly mix several Grid services
  – The Grid standard
Data movement
• GridFTP supported at all sites
  – Standard for Grid high-speed data transfers
    • Supported by all major Grid infrastructures: DEISA, EGEE, TeraGrid, OSG, GridAustralia, ...
  – “Door nodes” act as gateways between the private and public networks
  – Command-line and graphical clients available for all major OSes
• Globus RFT for automated file transfer
  – Unattended and reliable transfer of large files
  – One server serves both the private and the public network
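As an illustration of the command-line GridFTP client mentioned above, here is a minimal Python sketch that only builds a `globus-url-copy` command line; it does not run it. The host name `door.example.org`, the file paths, and the helper name `gridftp_copy_command` are hypothetical and not taken from the PRACE setup. The choice of four parallel streams is likewise an assumption, not a PRACE recommendation.

```python
import shlex


def gridftp_copy_command(source_url, dest_url, parallel_streams=4):
    """Build a globus-url-copy command line for a single GridFTP transfer.

    The command is returned as an argument list; nothing is executed here.
    """
    if parallel_streams < 1:
        raise ValueError("parallel_streams must be at least 1")
    return [
        "globus-url-copy",
        "-vb",                          # show bytes transferred and transfer rate
        "-p", str(parallel_streams),    # number of parallel data connections
        source_url,
        dest_url,
    ]


# Hypothetical door-node host and paths, for illustration only.
cmd = gridftp_copy_command(
    "gsiftp://door.example.org/scratch/run42/output.dat",
    "file:///home/alice/output.dat",
)
print(shlex.join(cmd))
```

Running the resulting command would require a GridFTP client installation and a valid X.509 proxy credential; passing the list to `subprocess.run` is one way to execute it.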
Job submission and control: command-line
• Command-line access fully supported
  – Still the most popular way of accessing HPC systems
• Direct access to local batch execution and scheduling systems
  – Exploit system-specific features
• SSH provides command-line access
  – Well-known secure protocol
  – X.509 and GSI authentication supported
  – “Door node” provides gateway between private and public network
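The gateway role of a door node can be sketched with a jump-host connection. The Python snippet below builds (but does not run) a stock OpenSSH command using the `-J` (ProxyJump) option. This is an assumption for illustration: it does not cover the X.509/GSI authentication named on the slide, and the host names, user name, and helper name `ssh_via_door_node` are all hypothetical.

```python
def ssh_via_door_node(user, door_node, target_host):
    """Build an OpenSSH command that reaches target_host through door_node.

    Uses the ProxyJump option (-J) of stock OpenSSH; GSI-enabled SSH
    clients are outside the scope of this sketch.
    """
    jump = f"{user}@{door_node}"
    return ["ssh", "-J", jump, f"{user}@{target_host}"]


# Hypothetical names: a public door node in front of a host on the private network.
ssh_cmd = ssh_via_door_node("alice", "door.example.org", "hpc-login.internal.example")
print(" ".join(ssh_cmd))
```

The design point is that the user addresses only the target system, while the door node relays the connection between the public and the private network.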
Job submission and control: UNICORE 6
• UNICORE 6
  – UNIform interface to COmputing REsources
    • Both command-line and Eclipse-based rich graphical client
    • Extensible: use the API or embed GridBeans into the rich client to create submission interfaces tailored to specific needs
    • X.509 authentication
  – Workflow engine
    • Can coordinate data transfer with job execution, and stage files to the target system
  – Several data transfer protocols supported
    • Can be used as an alternative data transfer system
• Does much more than just this!
  – See http://www.unicore.eu/ for reference
Resource monitoring
• User-level testing with INCA
  – Executes tests as an unprivileged user and reports status
    • e.g., to verify that a certain application is available
  – Aggregates results and reports in a color-coded web page
  – Different views with X.509 authorization
• Network performance testing
  – Iperf data
  – Shared with DEISA
• Monitoring features still actively discussed and developed
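To make the idea of a user-level availability test concrete, the following Python sketch checks whether an executable can be found on the current user's `PATH` and maps the outcome to a color. It is not the INCA reporter API. The function names, the result layout, and the green/red color scheme are assumptions made for illustration.

```python
import shutil

# Assumed color scheme for a status page; not taken from INCA.
STATUS_COLORS = {"pass": "green", "fail": "red"}


def check_application(name):
    """Report whether an executable called `name` is visible to the current user."""
    path = shutil.which(name)  # None when the executable is not on PATH
    status = "pass" if path else "fail"
    return {
        "test": f"{name} available",
        "status": status,
        "path": path,
        "color": STATUS_COLORS[status],
    }


def summarize(results):
    """Count passed and failed tests, as an aggregated view would."""
    counts = {"pass": 0, "fail": 0}
    for result in results:
        counts[result["status"]] += 1
    return counts


# Hypothetical application name that should not exist on any system.
missing = check_application("no-such-application-xyz-12345")
print(missing["status"], missing["color"])
```

Because the check runs with the privileges of an ordinary account, it reports what a user would actually see rather than what an administrator has installed.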
User management and Accounting
• Centralized user management
  – One account to access all PRACE services
  – Compatible with the DEISA system
    • Might merge into a single database in the future
• Web client for accounting/reporting
  – X.509 authorization with different authorization levels: user, project manager, site manager
  – Developed by DEISA and in daily production use for years
Thank you!
• WP4 will start working on the new release of the middleware stack shortly
  – What features would you like to see added?
  – What other use cases do you think should be supported on PRACE Tier-0 systems?
  – What modifications would best support your use of the PRACE Tier-0 systems?
• We value your input!
References: WP4 deliverables so far
• D4.1.1 – “Requirement analysis for Tier-0 systems management”
• D4.1.2 – “Report on existing Tier-0 systems management solutions”
• D4.1.3 – “Deployment of initial software stack to selected sites”
• D4.2.2 – “Deployment of enhanced solutions”