TRANSCRIPT
Session E: PRS
PRS
Richard Gass, Intel
Agenda
Sessions:
(A) Introduction 8:30-9:00
(B) Hadoop 9:00-10:00
Break 10:00-10:15
Hadoop 10:15-11:30
Lunch 11:30-12:30
(C) Pig 12:30-1:30
Break 1:30-1:45
(D) Tashi 1:45-3:30
Break 3:30-3:45
(E) PRS 3:45-5:00
I. Overview
II. Plans/Status
III. User View
IV. Administration
V. Installation
VI. Summary
Overview
Open Cirrus Stack
Compute + network + storage resources
Power + cooling
Management andcontrol subsystem
Physical Resource set (PRS) service
Credit: John Wilkes (HP)
Open Cirrus Stack
PRS service
Eucalyptus Tashi/HDFS NFS storage service
Experiment
PRS clients, each with their own “physical data center”
Open Cirrus Stack
PRS service
Eucalyptus Tashi/HDFS NFS storage service
Experiment
Virtual cluster Virtual cluster
Virtual clusters
Open Cirrus Stack
PRS service
Eucalyptus Tashi/HDFS NFS storage service
Experiment
Virtual cluster Virtual cluster
BigData App
Hadoop
1. Application running
2. On Hadoop
3. On Tashi virtual cluster
4. On a PRS
5. On real hardware
Web Service
Open Cirrus Stack - PRS
• PRS service goals
– Provide mini-datacenters to users
– Isolate mini-datacenters from each other
• PRS service approach
– Allocate sets of physically co-located nodes, isolated inside VLANs
• Initial PRS implementation from HP
• Re-write from Intel (in collaboration with HP), contributed to the Apache Software Foundation
PRS service
Further Motivation
• Enable innovation in virtualization
• Allow running without virtualization overhead
– Necessary for predictable QoS (e.g. avoiding cache interference)
Goals
• Reduce complexity in allocating physical resources
• Gain user confidence
– Show users that we can efficiently allocate/deallocate resources
• Stop the squatting
– Incentives
• HP’s Tycoon (economic model)
• Simple points scheme for good behavior
• Early return
Responsibilities of PRS
• Isolate domains (VLAN)
• Provision system software (PXE)
• Provide platform control – on/off (IPMI)
• Provide boot debug (IPMI)
VLAN
• Virtual LAN technology allows a single physical network to appear as several isolated networks
– Ethernet packets are tagged with a VLAN id
– Switches and NICs enforce the policies associated with each VLAN
• By associating PRS domains with different VLANs, they can be isolated from each other
• The PRS system provides the interfaces necessary to abstract switch configuration programming across multiple switch vendors
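The isolation property described above can be sketched in a few lines of Python. This is an illustrative model only, not the PRS switch-abstraction code: the Switch class and its method names are my own, and real enforcement happens in switch and NIC hardware, not software.

```python
# Toy model of VLAN-based domain isolation: a switch maps each port to a
# VLAN id, and traffic only flows between ports tagged with the same id.
class Switch:
    def __init__(self):
        self.port_vlan = {}  # port name -> VLAN id

    def assign(self, port, vlan_id):
        """Tag a switch port with a VLAN id (one PRS domain per VLAN)."""
        self.port_vlan[port] = vlan_id

    def can_communicate(self, port_a, port_b):
        """Packets tagged with different VLAN ids never cross domains."""
        return self.port_vlan.get(port_a) == self.port_vlan.get(port_b)

sw = Switch()
sw.assign("g1", 300)  # domain A
sw.assign("g2", 300)  # domain A
sw.assign("g3", 301)  # domain B

assert sw.can_communicate("g1", "g2")      # same domain: traffic allowed
assert not sw.can_communicate("g1", "g3")  # different domains: isolated
```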
PXE
• Enables provisioning of OS image over the network
• On machine boot, the NIC firmware contacts a PXE server via the DHCP process for the appropriate kernel and initrd to load
• Once loaded, the init scripts in the initrd can pull the filesystem to the machine
• In our environment, we download the desired filesystem to a ramdisk from an NFS server
– enabling very rapid provisioning (30 seconds or less) while leaving the host filesystem undisturbed
(PXE: Preboot eXecution Environment)
IPMI
• Defines a standardized, abstracted, message-based interface to intelligent platform management hardware
• Defines standardized records for describing platform management devices and their characteristics
• Enables cross-platform management software
(IPMI: Intelligent Platform Management Interface)
Status/Plans
PRS Roadmap
• Stage 1
– Manages all cluster hardware
– Handles resource provisioning
– Provides interfaces for VLAN definition/programming
– Administrator is still in the allocation decision-making loop
• Stage 2
– Introduces a request queue and primitive scheduler
– Admin may still be in the loop, definitely for special cases
– Enables provisioning of OS to local disk
– Enables conversion of virtual disks to physical
• Stage 3
– Incentives module added (Tycoon)
– Tashi integration
Some History
• Previous prototype developed at HP Labs
• Focus on economic model
• Nice web interface, which will be available upon reconvergence of the code
User View
PRS Roles
• Admin: root of all authority
– Controls the physical resources
• User: requests domains
– Controls the domain, once allocated
Domains
• A domain is the unit of PRS isolation
• A simple domain is a set of compute nodes gathered into a single VLAN
• Nodes are allocated from pools of available resources
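The relationship between domains, VLANs, and the free pool can be sketched as a small Python model. This is not the PRS data model; the class and method names are illustrative, loosely mirroring the createDomain/requestNodes operations shown later in the deck.

```python
# Illustrative model: domains draw nodes from a shared free pool and are
# each tied to one VLAN id. Names here are assumptions, not PRS internals.
class DomainError(Exception):
    pass

class PRSModel:
    def __init__(self, free_nodes):
        self.free = set(free_nodes)      # pool of available compute nodes
        self.domains = {}                # name -> (vlan_id, set of nodes)

    def create_domain(self, name, vlan_id):
        # The slides note createDomain may fail if the name already exists.
        if name in self.domains:
            raise DomainError("domain already exists: " + name)
        self.domains[name] = (vlan_id, set())

    def request_nodes(self, name, count):
        vlan_id, nodes = self.domains[name]
        if count > len(self.free):
            raise DomainError("not enough free nodes")
        for _ in range(count):
            nodes.add(self.free.pop())   # move node from pool into domain

prs = PRSModel(["r1r1u%d" % i for i in range(25, 30)])  # 5 free nodes
prs.create_domain("mini", vlan_id=300)
prs.request_nodes("mini", 2)
```

After this runs, the "mini" domain holds 2 nodes on VLAN 300 and 3 nodes remain in the free pool.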
The PRS Interface
• Users and admins currently interact with the PRS system through a command-line interface
• This interface both:
– Queries and updates records in the PRS database
– Wraps the various commands that must be issued to effect changes in the cluster
• PRS is currently a centralized system; users log into the PRS manager to issue commands
– An RPC interface is planned for the near future
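A command-line surface like the one the following slides document could be sketched with Python's argparse (the deck notes the client is Python). The flag names below are taken from the slides, but how the real prs client parses them is an assumption.

```python
# Sketch of a prs-style CLI parser; a subset of the documented flags only.
import argparse

def build_parser():
    p = argparse.ArgumentParser(prog="prs")
    p.add_argument("--verbose", action="store_true", help="be verbose")
    p.add_argument("--nodeName", help="specify node")
    p.add_argument("--createDomain", metavar="NAME")
    p.add_argument("--requestNodes", action="store_true")
    p.add_argument("--domain", help="domain to operate on")
    p.add_argument("--count", type=int, help="number of nodes")
    return p

# Example: the equivalent of
#   prs --requestNodes --domain mini --count 4
args = build_parser().parse_args(
    ["--requestNodes", "--domain", "mini", "--count", "4"])
```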
PRS Usage
Usage: prs <options>
Standard options:
--help [show this help message and exit]
--version [show program's version number and exit]
--verbose [be verbose]
Common options:
--nodeName <name> [Specify node]
--switchPort <port> [Specify switchport switchname:portnum]
Common admin options:
--userName <name> [Specify user name]
--uid <UID> [Specify user id]
Image Management Interface
--addImage <img> [Add image to PRS]
--delImage <img> [Delete image]
User Allocation Interface
--createDomain <name>
– May fail if name already exists
--submitDomainRequest <name>
--destroyDomain --domain <name>
--requestNodes --domain <name> [--count <N>] [--nodeName <name>] [--cores <n> …]
– Add the requested nodes to the domain
--assignImage <kernel> <image>
– Assign image to resource
--associateNewVlan --domain <name>
– Allocate an unused VLAN number to the domain
--createReservation <YYYYMMDD> <YYYYMMDD>
– Specify duration of node reservation, where the start time may be “ASAP”
--reservationNotes “notes”
--updateReservation
Admin Allocation Interface
--allocateNode [Assign node to a user]
--releaseNode [Release node allocation]
--vlanIsolate <vlanid> [Specify vlan for isolation]
Hardware Control
--hardware [Make hardware call]
--powerStatus [Get power status]
--rebootNode [Reboot node (soft)]
--powerCycle [Power cycle (hard)]
--powerOff [Power off node]
--powerOn [Power on node]
Query Interface
--showReservations [Show current node reservations]
--showResources [Show available resources to choose from]
--procs <N> [Filter by number of processors]
--clock <N> [Filter by processor clock]
--memory <N> [Filter by amount of memory (bytes)]
--cpuflags “flags” [Filter by CPU flags]
--cores <N> [Filter by number of cores]
--showPxeImages [Show available PXE images to choose from]
--showPxeImageMap [Show PXE image host mapping]
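The --showResources filters above amount to attribute matching over a node inventory. A minimal sketch, assuming a node record shape of my own invention (the real PRS stores this in MySQL):

```python
# Illustrative --showResources-style filtering over an inventory of nodes.
# The dictionary fields are assumptions, not the PRS database schema.
NODES = [
    {"name": "n1", "procs": 2, "cores": 8, "memory": 16 << 30,
     "cpuflags": {"vmx", "sse2"}},
    {"name": "n2", "procs": 1, "cores": 4, "memory": 8 << 30,
     "cpuflags": {"sse2"}},
]

def show_resources(nodes, procs=None, cores=None, memory=None, cpuflags=()):
    """Return names of nodes meeting every given minimum/flag filter."""
    out = []
    for n in nodes:
        if procs is not None and n["procs"] < procs:
            continue
        if cores is not None and n["cores"] < cores:
            continue
        if memory is not None and n["memory"] < memory:
            continue
        if not set(cpuflags) <= n["cpuflags"]:   # all flags must be present
            continue
        out.append(n["name"])
    return out

# e.g. the analogue of: prs --showResources --cores 8 --cpuflags "vmx"
print(show_resources(NODES, cores=8, cpuflags=["vmx"]))  # → ['n1']
```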
Administration Interface
--admin [Enter admin mode]
--addPxeImage [Add PXE image to database]
--enableHostPort [Enable a switch port]
--disableHostPort [Disable a switch port]
--removeVlan <vlanId> [Remove vlan from all switches]
--createVlan <vlanId> [Create a vlan on all switches]
--addNodeToVlan <vlanId> [Add node to a vlan]
--removeNodeFromVlan <vlanId> [Remove node from a vlan]
--setNativeVlan <vlanId> [Configure native vlan]
--restoreNativeVlan [Restore native vlan]
--removeAllVlans [Remove all vlans from a switchport]
--sendSwitchCommand “<command>” [Send raw switch command, BE CAREFUL]
--interactiveSwitchConfig “<switchname>” [Interactively configure a switch]
--showSwitchConfig <nodename> [Show switch config for node]
Administration
Typical Workflow
1. Admin queries available systems
2. Admin requests systems with the desired user configuration
– e.g. cores, memory, image, duration
3. Request goes in the queue
4. PRS locates resources and provides a list to the admin/Tashi
5. Admin/Tashi moves VMs to free the resources
– Add node to blacklist and tell Hadoop to reload
6. PRS allocates resources
– Provides estimated time to get resources
– User can query the status
– PRS sends notification when allocated
7. PRS reclaims resources and adds them back into their respective pools
– User may extend the time period before expiration
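Steps 3 and 6 of the workflow above, the request queue and the eventual allocation, can be sketched as a simple FIFO. The queueing policy here (strict head-of-line, stop when the head cannot be satisfied) is my assumption; the slides only say a "primitive scheduler" arrives in Stage 2.

```python
# Minimal sketch of the workflow's queue-and-allocate steps.
from collections import deque

def process(free_nodes, requests):
    """Serve (user, count) requests FIFO from a pool of free nodes."""
    queue = deque(requests)          # step 3: request goes in the queue
    allocations = {}
    while queue and free_nodes:
        user, count = queue[0]
        if count > len(free_nodes):  # head request can't be satisfied yet
            break
        queue.popleft()
        # step 6: PRS allocates resources to the user
        allocations[user] = [free_nodes.pop() for _ in range(count)]
    return allocations, list(queue)

allocs, pending = process(["n1", "n2", "n3"], [("alice", 2), ("bob", 2)])
# alice gets 2 nodes; bob's request stays queued until nodes are reclaimed
```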
After allocation
• A returned PRS node is typically untrusted
– Update the system to default settings
• Clean the physical node by PXE booting a reset image
• Restore all settings to defaults (address, IPMI passwords)
• Repartition and format disks
• (Option) Trust images from some users
– No re-format needed
• Clean network configuration (VLAN)
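The cleanup policy above branches on whether the user's image is trusted. A small sketch of that decision, with step names of my own; the actual reset image and its ordering are not specified in the deck:

```python
# Sketch of the return-to-pool cleanup plan for a reclaimed PRS node.
# Step strings are illustrative labels, not real PRS commands.
def cleanup_steps(image_trusted=False):
    steps = [
        "pxe-boot reset image",
        "restore default settings (address, IPMI passwords)",
        "clean network configuration (VLAN)",
    ]
    if not image_trusted:
        # Untrusted images: wipe local storage before reuse.
        steps.insert(2, "repartition and format disks")
    return steps

assert "repartition and format disks" in cleanup_steps()
assert "repartition and format disks" not in cleanup_steps(image_trusted=True)
```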
Example: Minicluster
./prs --addimage hardy-rgass-testing:hardy:8.03
./prs --assignimage hardy-rgass-testing --nodename r1r1u25
./prs --allocatenode --nodename r1r1u25 --username rgass --reservationDuration 30 --vlanisolate 300 --notes “Practice allocation”
./prs --addnodetovlan 300 --nodename r1r1u25
./prs --hardware --rebootnode --nodename r1r1u25
Example: CloudConnect 1
• Network isolate a rack of machines and PXE boot them with a user’s kernel and initrd
• Create a VM that acts as a SSH gateway and a NAT for the private cluster
• Dynamically configure switches to support the networking experiment
[Figure: CloudConnect network topology — rack regions A-D connected via 1 Gb/s, 100 Mb/s, and 4 Gb/s switches with a 4x1Gb trunk link; VLAN #1: Electrical, VLAN #2: Optical; legend: M = manager, plus server and switch symbols]
Example: CloudConnect 2
for i in r1r1u12 r1r1u13 r1r1u14 r1r1u15; do
  ./prs --admin --setnativevlan 300 -n ${i}
  ./prs --admin --addnodetovlan 800 -n ${i}
  ./prs --admin --addnodetovlan 801 -n ${i}
  ./prs --admin --addnodetovlan 802 -n ${i}
done
./prs --admin --switchport sw0-r1r1 --sendswitchcommand "config;interface range ethernet g(25-28); spanning-tree disable"
./prs --admin --switchport sw0-r1r1 --sendswitchcommand "config;interface ethernet g25;switchport mode trunk;exit"
./prs --admin --switchport sw0-r1r1 --sendswitchcommand "config;interface ethernet g26;switchport mode trunk;exit"
./prs --admin --switchport sw0-r1r1 --sendswitchcommand "config;interface ethernet g27;switchport mode trunk;exit"
./prs --admin --switchport sw0-r1r1 --sendswitchcommand "config;interface ethernet g28;switchport mode trunk;exit"
./prs --admin --switchport sw0-r1r1:25 --setnativevlan 802 -v
./prs --admin --switchport sw0-r1r1:26 --setnativevlan 804 -v
./prs --admin --switchport sw0-r1r1:27 --setnativevlan 806 -v
./prs --admin --switchport sw0-r1r1:28 --setnativevlan 808 -v
for i in $(seq 12 16); do
  ./prs --hardware --rebootnode -n r1r1u${i}
done
Future Work
• Integration with Tashi…
– Would enable free exchange of resources between the Tashi pool and the free pool
[Figure: PRS allocation walkthrough, step 1 — the PRS client (administrator or cluster manager) queries the PRS server for available resources; the PRS server queries its DB to locate them; results are sent back to the client, and the user chooses machine attributes and submits a request for the resources for some time period. Example result list:
Node 1 : 8 core, 16G memory, 6TB disk, 30 day
Node 2 : 8 core, 16G memory, 6TB disk, 30 day
Node 3 : 8 core, 16G memory, 6TB disk, 90 day
Node 4 : 8 core, 16G memory, 6TB disk, 1 day
Node 5 : 8 core, 8G memory, 2TB disk, 90 day
Node 6 : 8 core, 8G memory, 2TB disk, 90 day
Node 7 : 8 core, 8G memory, 2TB disk, 90 day
Node 8 : 8 core, 8G memory, 2TB disk, 90 day
Node 9 : 8 core, 8G memory, 2TB disk, 90 day
Node 10: 8 core, 8G memory, 2TB disk, 30 day
…]
[Figure: PRS allocation walkthrough, step 2 — the request (R1) enters the PRS server's request queue.]
[Figure: PRS allocation walkthrough, step 3 — PRS processes the request and identifies physical machines that satisfy it.]
[Figure: PRS allocation walkthrough, step 4 — PRS sends a request to Tashi to free the selected nodes; Tashi moves virtual machines off of them.]
[Figure: PRS allocation walkthrough, step 5 — Tashi notifies PRS that migration of the virtual machines has completed; PRS allocates the physical machines to the requesting user and isolates them from the network using VLANs; the user's virtual disk image is converted to a PXE image, PRS reboots the physical machines with that image set, and they boot up with it.]
[Figure: PRS allocation walkthrough, step 6 — PRS updates the reservation database; the PRS client queries the server for the allocation; the user connects to the machines and starts running experiments.]
Installation
Necessary Components
• DHCP server
• PXE server
• NFS server
• DNS server (optional)
• Configurable switches
– New switch types may require new PRS modules
• Hardware access method
– e.g. IPMI
– IP-addressable PDUs enable rescue if IPMI becomes compromised
Internals
Notes on Current Software
• PRS client code is Python 2.5
• PRS database implemented in MySQL
– Reachable through the python-MySQLdb interface
• pExpect used for switch configuration
• User information currently obtained through LDAP
Summary
PRS
• PRS lays the foundation of the Open Cirrus software stack
– Eases management of multiple projects in a single cluster
• PRS enables partitioning clusters into isolated domains of physical resources
• The current implementation allows rapid provisioning of system software
• The PRS code base is open-source software, available through the Tashi project in the Apache Incubator
– Contributions welcome