Practice and Challenges from Building Infrastructure-as-a-Service 朱可 [email protected]

Post on 18-Dec-2014

DESCRIPTION

This is an invited industry presentation on cloud computing for NCSC2012 (China National Conference on Social Computing). It summarizes what we learned from developing and operating an Infrastructure-as-a-Service platform in a highly scalable manner. The service described here runs inside the corporation as a kind of dogfood: engineers use it in their daily work.

TRANSCRIPT

Page 1: Practice and challenges from building IaaS

Practice and Challenges from

Building Infrastructure-as-a-Service

朱可 [email protected]

Page 2: Practice and challenges from building IaaS

Disclaimer

● Representing personal opinion only

Page 3: Practice and challenges from building IaaS

IaaS in Our Development Lab

$ ./iaas-deploy-vms -i centos63 -n 100

● Virtual machine
● Block storage
● Virtual machine template
● VLAN
● Static IP address
● Virtual desktop

Page 4: Practice and challenges from building IaaS

The Machinery

Rack: 20+ nodes, 2 rack switches

Node: 16 cores, 192 GB RAM, 1.6 TB disk

Page 5: Practice and challenges from building IaaS

Quick Stats

● 5,800 VMs provisioned in 2 months
● 700+ individual visitors per month
● 50,000+ requests to web services per day
– Fewer than 40% of requests are sent by humans

Page 6: Practice and challenges from building IaaS

Design for Failure

● “Failure is not an option, it's a requirement.”
● Things will crash
– Linux kernel panic
– Defunct process
– File system suddenly becomes read-only
● Hardware fails in some way every week
– Broken disk
– Flaws in CPU
– Network adapter speed fluctuates among 10/100/1000 Mbps

Page 7: Practice and challenges from building IaaS

Event In Red: Failure

Page 8: Practice and challenges from building IaaS

Nov 14 00:39:27 r007x072 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it
Nov 14 00:39:35 r007x072 kernel: e1000e: eth3 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
Nov 14 00:39:35 r007x072 kernel: e1000e 0000:1a:00.1: eth3: 10/100 speed: disabling TSO
Nov 14 00:39:35 r007x072 kernel: bonding: bond1: link status definitely up for interface eth3.
Nov 14 00:39:36 r007x072 kernel: e1000e: eth3 NIC Link is Down
Nov 14 00:39:36 r007x072 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it

Unqualified Network Cables

Flakiness

Analysis

Page 9: Practice and challenges from building IaaS

[root@r007x072 ~]# cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)

Bonding Mode: adaptive load balancing
Primary Slave: None
Currently Active Slave: eth2
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth2
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1b:21:98:2a:4c
Slave queue ID: 0

Slave Interface: eth3
MII Status: up
Link Failure Count: 1627
Permanent HW addr: 00:1b:21:98:2a:4d
Slave queue ID: 0
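The bonding status above is where the flaky link shows up: eth3 has a Link Failure Count of 1627 while eth2 has 0. A minimal sketch of how such a check might be automated follows. The function names and the threshold of 100 are illustrative assumptions, not part of the original tooling.

```python
import re

def link_failure_counts(bonding_text):
    """Map each slave interface name to its 'Link Failure Count'."""
    counts = {}
    current = None
    for line in bonding_text.splitlines():
        line = line.strip()
        m = re.match(r"Slave Interface:\s*(\S+)", line)
        if m:
            current = m.group(1)
            continue
        m = re.match(r"Link Failure Count:\s*(\d+)", line)
        if m and current is not None:
            counts[current] = int(m.group(1))
    return counts

def flaky_slaves(bonding_text, threshold=100):
    """Return slaves whose failure count reaches the (hypothetical) threshold."""
    counts = link_failure_counts(bonding_text)
    return sorted(name for name, n in counts.items() if n >= threshold)

# Excerpt of the /proc/net/bonding/bond1 output shown on the slide.
SAMPLE = """\
Slave Interface: eth2
MII Status: up
Link Failure Count: 0

Slave Interface: eth3
MII Status: up
Link Failure Count: 1627
"""

if __name__ == "__main__":
    print(flaky_slaves(SAMPLE))
```

On a real node the text would come from reading /proc/net/bonding/bond1 instead of the embedded sample.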

Page 10: Practice and challenges from building IaaS

Keep Simple and Robust

● “I have 4 letters for you: KISS (Keep it simple and stupid)”

● Complex system === hazardous system
● Just enough fault tolerance
– Reboot the machine if it goes wrong
– Log out of the iSCSI session and log in again
– Mini toolkit to fix a broken DM (device mapper) table

Page 11: Practice and challenges from building IaaS

Example: Stateless OS

● Mount root partition in RAM
– Think about how you install Ubuntu or Fedora
● Fix problems by reboot only

[root@r009x090 ~]# df -h
Filesystem           Size  Used Avail Use% Mounted on
/dev/mapper/live-rw  7,9G  1,5G  6,4G  19% /
tmpfs                 71G  4,0K   71G   1% /dev/shm
/dev/sda2            7,9G  1,4G  6,2G  18% /var/log
/dev/sda4            1,6T  183G  1,4T  12% /iaas/local-storage

Page 12: Practice and challenges from building IaaS

P2P based Socialized Communication

● Bots “talk” to each other
● Any bot can be re-run in seconds when things go wrong

Page 13: Practice and challenges from building IaaS

Robust Application

● A number of roles in the distributed system each do their own job
– Bot, manager, watch dog, ZooKeeper, agent, HBase, Hadoop, etc.

[Diagram: regular bot, watch dog, and manager bot alongside ZooKeeper; HBase RegionServers on top of HDFS Datanodes. ZooKeeper logo: http://zookeeper.apache.org/images/zookeeper_small.gif]

Page 14: Practice and challenges from building IaaS

Dedicated Network-accessible Services

● NTP (controversial in VMs but good enough)
● ZooKeeper
– Node presence
– Configuration data
– Leader election
● HBase: store schema-less data
● Rsyslog: centralize logs
● Web service: accepts HTTP requests only

Page 15: Practice and challenges from building IaaS

Scale-out Architecture For Growth

● Single namespace for global infrastructure
– v525400ffffff.region-a.cloud.xx.ibm.com/service-foo
● Multi-region for geo-distribution
● Use cache when possible
● Share nothing; each node is autonomous
● Leader election (elect a new manager if the former one dies)
● Collect metrics
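The leader-election bullet can be illustrated with a small in-memory model. The sketch below assumes the common ZooKeeper recipe in which every candidate registers an ephemeral sequential node and the lowest sequence number is the leader; the slides do not say which recipe was actually used, and the `Election` class is a hypothetical stand-in, not a ZooKeeper client.

```python
import itertools

class Election:
    """In-memory stand-in for ephemeral sequential nodes (not a ZooKeeper client)."""

    def __init__(self):
        self._seq = itertools.count()
        self._members = {}  # sequence number -> candidate name

    def join(self, name):
        """Register a candidate and return its sequence number."""
        seq = next(self._seq)
        self._members[seq] = name
        return seq

    def leave(self, seq):
        """Model a candidate dying: its ephemeral node disappears."""
        self._members.pop(seq, None)

    def leader(self):
        """The live candidate with the lowest sequence number leads."""
        if not self._members:
            return None
        return self._members[min(self._members)]

if __name__ == "__main__":
    election = Election()
    first = election.join("manager-a")   # hypothetical bot names
    election.join("manager-b")
    print(election.leader())
    election.leave(first)                # the former manager dies
    print(election.leader())
```

When the current manager's node vanishes, the next candidate in sequence takes over without any extra coordination, which fits the "just enough fault tolerance" theme of the earlier slides.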

Page 16: Practice and challenges from building IaaS

Requirements Grow and Shrink Faster than HW Can Be Purchased

● “I need 200 large VMs this afternoon and will terminate all of them tomorrow.”

Page 17: Practice and challenges from building IaaS

Storage Is Never Enough

● Workaround: recycle unused files
– Move low-hit virtual images out of the hot zone
– Set up SLAs to limit availability (provide redundancy only when necessary)
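As a rough illustration of moving low-hit images out of the hot zone, the sketch below picks images whose hit count falls under a cutoff, coldest first. The hit counts, the cutoff, and every image name except centos63 are made up for the example; the slides do not describe the actual selection rule.

```python
def cold_images(hit_counts, min_hits):
    """Return image names with fewer than min_hits hits, coldest first."""
    cold = [(hits, name) for name, hits in hit_counts.items() if hits < min_hits]
    return [name for hits, name in sorted(cold)]

if __name__ == "__main__":
    # Hypothetical hit counts per virtual image over some period.
    hits = {"centos63": 420, "image-b": 3, "image-c": 0, "image-d": 57}
    print(cold_images(hits, min_hits=10))
```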

Page 18: Practice and challenges from building IaaS

Metrics Collection is Critical

● “Gathering, storing, and displaying metrics should be considered a mission-critical part of your infrastructure.”*

● Measurement shows whether a change is a performance boost (or downgrade)

(* From chapter 3 of the book “Web Operations”)

Page 19: Practice and challenges from building IaaS

Example #1: Fix Side Effect of the Leap Second

● The latest leap second occurred at the end of June 2012

tgtd: work_timer_evt_handler(89) failed to read from timerfd, Resource temporarily unavailable

/var/log/messages grows too fast
Job distribution between bots takes 10 times longer

# service ntpd stop; date -s "`date`"; service ntpd start

Page 20: Practice and challenges from building IaaS

Example #2: Recycle Unused Resources

@zhukecdl Our analysis of your VM instance(s) shows that CPU utilization and network traffic in the past 48 hours have dropped below 2% and 10 MB.

Instance ID              CPU Time (s)  CPU Rate (%)  TX (MB)
r007.x072.17897.u51393   337.3         0.20          0

We would strongly urge you to consider recycling your instance(s) so that others can make use of these resources.

If you do not contact the administrator before 2011-08-16 17:00 +0800, the instance r007.x072.17897.u51393 will be recycled.

Regards,
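The rule behind this notice can be sketched as a simple threshold check. The 2% and 10 MB limits come from the notice itself; the `Usage` record, the function names, and the use of TX alone as the traffic measure are assumptions made for illustration.

```python
from dataclasses import dataclass

CPU_RATE_LIMIT_PERCENT = 2.0  # from the notice: CPU utilization below 2%
TRAFFIC_LIMIT_MB = 10.0       # from the notice: network traffic below 10 MB

@dataclass
class Usage:
    """Usage of one instance over the observation window (48 hours in the notice)."""
    instance_id: str
    cpu_rate_percent: float
    tx_mb: float  # assumption: TX is used as the traffic measure

def is_idle(usage):
    """An instance is idle when both CPU rate and traffic are under the limits."""
    return (usage.cpu_rate_percent < CPU_RATE_LIMIT_PERCENT
            and usage.tx_mb < TRAFFIC_LIMIT_MB)

def recycle_candidates(usages):
    """Return the IDs of instances whose owners should be notified."""
    return [u.instance_id for u in usages if is_idle(u)]

if __name__ == "__main__":
    usages = [
        Usage("r007.x072.17897.u51393", 0.20, 0),   # values from the notice
        Usage("hypothetical-busy-instance", 35.0, 1200.0),
    ]
    print(recycle_candidates(usages))
```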

Page 21: Practice and challenges from building IaaS

Automated Operation (and More)

● Goals
– Upgrade all components daily
– One administrator for 1k systems
– No working overtime
● Tools
– Chef (Ruby-based)
– SmartCloud portfolio
● Process
– Run benchmarks against the system every week
– Stay in the office until a build break is fixed

Page 22: Practice and challenges from building IaaS

Benchmark the System Often

● Measure the effect of your performance tweaks
● Tools
– Netperf
– Apache JMeter

Benchmarking network infrastructure
# netserver
# netperf -H 10.10.1.97 -l 43200 -t TCP_CRR &
# netperf -H 9.123.127.227 -l 43200 -t TCP_CRR &

Page 23: Practice and challenges from building IaaS

Infrastructure as Code

● Building network-accessible services
● Integrating these services

[root@beijing-mn03 ~]# virsh list
 Id Name                 State
----------------------------------
  1 hbm1                 running
  2 bj-jenkins           running
  3 hjt                  running
  4 webservice-1         running
  5 hnn2                 running
  6 bugzilla             running
  8 hslave07             running
  9 hslave08             running
 10 hslave09             running
 11 hslave10             running
 12 hslave11             running
 13 ScannerSlackware     running

Page 24: Practice and challenges from building IaaS

Real-time Feedback by Tracing Logs

● Manager X: “I need a daily success-rate report on VM deployments from department Y, today.”
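A report like the one Manager X asks for reduces to counting outcomes per department once deploy events have been extracted from the logs. The sketch below assumes each event has already been parsed into a (department, succeeded) pair; the log format and the parsing step are not shown in the slides, so that input shape is hypothetical.

```python
from collections import defaultdict

def success_rates(events):
    """Compute the deploy success rate per department.

    events: iterable of (department, succeeded) pairs, where succeeded is a bool.
    Returns a dict mapping department -> fraction of successful deployments.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for department, succeeded in events:
        totals[department] += 1
        if succeeded:
            successes[department] += 1
    return {dept: successes[dept] / totals[dept] for dept in totals}

if __name__ == "__main__":
    # Hypothetical events for one day.
    day = [("Y", True), ("Y", True), ("Y", False), ("Z", True)]
    for dept, rate in sorted(success_rates(day).items()):
        print(f"department {dept}: {rate:.0%} of deployments succeeded")
```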

Page 25: Practice and challenges from building IaaS

Visualize Traces Via Timeline

Page 26: Practice and challenges from building IaaS

Summary

● Keep it simple and robust
● Scale-out architecture
● Automated operation