Make Sure Your Applications Crash
DESCRIPTION
Presentation for PyCon 2012 about application reliability.
TRANSCRIPT
![Page 1: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/1.jpg)
Make Sure Your Applications Crash
Moshe Zadka
![Page 2: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/2.jpg)
True story
![Page 3: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/3.jpg)
Python doesn't crash
Memory managed, no direct pointer arithmetic
![Page 4: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/4.jpg)
...except it does
C bugs, untrapped exceptions, infinite loops, blocking calls, thread deadlocks, inconsistent
resident state
![Page 5: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/5.jpg)
Recovery is important
"[S]ystem failure can usually be considered tobe the result of two program errors[...] the
second, in the recovery routine[...]"
![Page 6: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/6.jpg)
Crashes and inconsistent data
A crash results in data from an arbitrary program state.
![Page 7: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/7.jpg)
Avoid storage
Caches are better than master copies.
![Page 8: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/8.jpg)
Databases
Transactions maintain consistency
Databases can crash too!
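A minimal sketch of the first point, using the standard-library `sqlite3` module. The `accounts` table and the `transfer`, `balances` and `make_db` helpers are made up for illustration; the `crash` flag simulates a failure between the two halves of an update.

```python
import sqlite3


def make_db(path=":memory:"):
    """Create a tiny example database (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                     [("a", 100), ("b", 0)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount, crash=False):
    """Move `amount` between two rows inside a single transaction."""
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src))
        if crash:
            # Simulated failure between the two halves of the update.
            raise RuntimeError("simulated crash")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst))


def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

After a failed `transfer`, `balances` should show the rows exactly as they were before the call: the half-finished update is rolled back instead of being left behind.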
![Page 9: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/9.jpg)
Atomic operations
File rename
![Page 10: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/10.jpg)
Example: Counting
    import os

    def update_counter():
        with open("counter.txt") as fp:
            counter = int(fp.read().strip())
        counter += 1
        # If there is a crash before this point,
        # no changes have been made.
        with open("counter.txt.tmp", "w") as fp:
            fp.write("%d\n" % counter)
        # If there is a crash before this point,
        # only the temp file has been modified.
        # The following is an atomic operation.
        os.rename("counter.txt.tmp", "counter.txt")
![Page 11: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/11.jpg)
Efficient caches, reliable masters
Mark inconsistency of cache
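One way to read this slide, sketched under assumptions: the cache is flagged with a marker file before it is rewritten, and the flag is cleared only once the write has finished. The `.dirty` suffix, the JSON cache layout and both function names are hypothetical.

```python
import json
import os


def rebuild_cache(master_path, cache_path):
    """Recompute the cache (here: a sum of integers) from the master copy."""
    marker = cache_path + ".dirty"
    # 1. Flag the cache as untrustworthy before touching it.
    open(marker, "w").close()
    with open(master_path) as fp:
        total = sum(int(line) for line in fp if line.strip())
    with open(cache_path, "w") as fp:
        json.dump({"total": total}, fp)
    # 2. Only a completely written cache gets its flag cleared.
    os.remove(marker)


def read_total(master_path, cache_path):
    """Return the cached total, rebuilding if the cache is flagged or missing."""
    marker = cache_path + ".dirty"
    if os.path.exists(marker) or not os.path.exists(cache_path):
        rebuild_cache(master_path, cache_path)
    with open(cache_path) as fp:
        return json.load(fp)["total"]
```

A crash in the middle of `rebuild_cache` leaves the marker in place, so the next reader discards whatever is in the cache and goes back to the master.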
![Page 12: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/12.jpg)
No shutdown
Crash in testing
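A sketch of what crashing in testing could look like: the test never asks the worker to shut down, it just kills it and checks that the data is still readable. The worker script, the `crash_test` name and the timings are illustrative assumptions; the worker uses `os.replace` so the same sketch also runs on Windows.

```python
import os
import subprocess
import sys
import tempfile
import time

# Worker: increments a counter file forever, via temp file + atomic replace.
WORKER = r"""
import os, sys
path = sys.argv[1]
while True:
    with open(path) as fp:
        counter = int(fp.read().strip())
    with open(path + ".tmp", "w") as fp:
        fp.write("%d\n" % (counter + 1))
    os.replace(path + ".tmp", path)
"""


def crash_test(rounds=5, run_for=0.2):
    """Kill the worker abruptly several times; the counter must stay valid."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "counter.txt")
    with open(path, "w") as fp:
        fp.write("0\n")
    last = 0
    for _ in range(rounds):
        proc = subprocess.Popen([sys.executable, "-c", WORKER, path])
        time.sleep(run_for)
        proc.kill()  # no clean shutdown, the process just disappears
        proc.wait()
        with open(path) as fp:
            value = int(fp.read().strip())  # raises if the file is torn
        assert value >= last, "counter went backwards"
        last = value
    return last
```

Because the only way the worker ever stops is by being killed, the crash path is the one that gets exercised on every run.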
![Page 13: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/13.jpg)
Availability
If data is consistent, just restart!
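A minimal restart loop, as a sketch only: `supervise` and its parameters are hypothetical names, and a real deployment would more likely rely on an existing process supervisor.

```python
import subprocess
import sys
import time


def supervise(argv, max_restarts=None, delay=0.0):
    """Run argv and restart it whenever it exits with a non-zero status.

    Returns the number of restarts that were performed.
    """
    restarts = 0
    while True:
        status = subprocess.call(argv)
        if status == 0:
            # Clean exit: nothing left to do.
            return restarts
        if max_restarts is not None and restarts >= max_restarts:
            return restarts
        restarts += 1
        time.sleep(delay)
```

For example, `supervise([sys.executable, "worker.py"], delay=1.0)` would keep a (hypothetical) `worker.py` running until it exits cleanly. This is only safe because the worker is assumed to leave consistent data behind whenever it dies.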
![Page 14: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/14.jpg)
Improving availability
Limit impact
Fast detection
Fast start-up
![Page 15: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/15.jpg)
Vertical splitting
Different execution paths, different processes
![Page 16: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/16.jpg)
Horizontal splitting
Different code bases, different processes
![Page 17: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/17.jpg)
Watchdog
Monitor -> Flag -> Remediate
![Page 18: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/18.jpg)
Watchdog principles
Keep it simple, keep it safe!
![Page 19: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/19.jpg)
Watchdog: Heartbeats
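    # In a Twisted process
    import os
    from twisted.internet import task

    def beat():
        # Create the heartbeat file if needed, then bump its mtime;
        # opening in append mode alone does not update the timestamp.
        open('beats/my-name', 'a').close()
        os.utime('beats/my-name', None)

    task.LoopingCall(beat).start(30)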
![Page 20: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/20.jpg)
Watchdog: Get time-outs
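    import glob
    import time

    def getTimeout():
        # Each file in hearts/ holds the allowed silence, in seconds,
        # for one heart. The result maps that file's path to the oldest
        # acceptable heartbeat time.
        timeout = dict()
        now = time.time()
        for heart in glob.glob('hearts/*'):
            with open(heart) as fp:
                beat = int(fp.read().strip())
            timeout[heart] = now - beat
        return timeout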
![Page 21: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/21.jpg)
Watchdog: Mark problems
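    import glob
    import os

    def markProblems():
        timeout = getTimeout()
        for heart in glob.glob('beats/*'):
            name = os.path.basename(heart)
            mtime = os.path.getmtime(heart)
            # getTimeout() keys its result by the path under hearts/.
            cutoff = timeout.get(os.path.join('hearts', name))
            problem = os.path.join('problems', name)
            if (cutoff is not None and mtime < cutoff
                    and not os.path.isfile(problem)):
                with open(problem, 'w') as fp:
                    fp.write('watchdog')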
![Page 22: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/22.jpg)
Watchdog: check solutions
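    import glob
    import os
    import subprocess
    import time

    def checkSolutions():
        now = time.time()
        problemTimeout = now - 30
        for problem in glob.glob('problems/*'):
            mtime = os.path.getmtime(problem)
            # A problem left unresolved for 30 seconds triggers a restart.
            if mtime < problemTimeout:
                subprocess.call(['restart-system'])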
![Page 23: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/23.jpg)
Watchdog: Loop
    # Watchdog
    import time

    while True:
        markProblems()
        checkSolutions()
        time.sleep(1)
![Page 24: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/24.jpg)
Watchdog: accuracy of
Custom checkers can manufacture problems
![Page 25: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/25.jpg)
Watchdog: reliability of
Use cron for main loop
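A sketch of a single watchdog pass meant to be started by cron instead of looping on its own. The function names, the `beats` directory default and the 90-second threshold are assumptions made for this example.

```python
import os
import sys
import time


def stale_heartbeats(beats_dir, max_age, now=None):
    """Return names of heartbeat files not touched within max_age seconds."""
    now = time.time() if now is None else now
    stale = []
    for name in sorted(os.listdir(beats_dir)):
        mtime = os.path.getmtime(os.path.join(beats_dir, name))
        if now - mtime > max_age:
            stale.append(name)
    return stale


def main(beats_dir="beats", max_age=90):
    # One pass, then exit: cron supplies the loop, so a crash in the
    # watchdog itself costs at most a single run.
    stale = stale_heartbeats(beats_dir, max_age)
    for name in stale:
        sys.stderr.write("stale heartbeat: %s\n" % name)
    return 1 if stale else 0


# Hypothetical crontab entry (the script path is a placeholder) that runs
# a script calling sys.exit(main()) once a minute:
#   * * * * * /usr/bin/python3 /opt/watchdog/check_once.py
```

Cron cannot schedule more often than once a minute, so this trades detection speed for a main loop that cannot get stuck.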
![Page 26: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/26.jpg)
Watchdog: reliability of
Use software/hardware watchdogs
![Page 27: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/27.jpg)
Conclusions
Everything crashes -- plan for it
![Page 28: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/28.jpg)
Questions?
![Page 29: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/29.jpg)
Welcome to the back-up slides
Extra! Extra!
![Page 30: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/30.jpg)
Example: Counting on Windows
    import os

    def update_counter():
        with open("counter.txt") as fp:
            counter = int(fp.read().strip())
        counter += 1
        # If there is a crash before this point,
        # no changes have been made.
        with open("counter.txt.tmp", "w") as fp:
            fp.write("%d\n" % counter)
        # If there is a crash before this point,
        # only the temp file has been modified.
        os.remove("counter.txt")
        # At this point, the state is inconsistent*
        # The following is an atomic operation
        os.rename("counter.txt.tmp", "counter.txt")
![Page 32: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/32.jpg)
Example: Counting on Windows (Recovery)
    import os

    def recover():
        if not os.path.exists("counter.txt"):
            # The permanent file has been removed.
            # Therefore, the temp file is valid.
            os.rename("counter.txt.tmp", "counter.txt")
![Page 33: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/33.jpg)
Example: Counting with versions
    import os

    def update_counter():
        files = [int(name.split('.')[-1])
                 for name in os.listdir('.')
                 if name.startswith('counter.')]
        last = max(files)
        with open('counter.%s' % last) as fp:
            counter = int(fp.read().strip())
        counter += 1
        # If there is a crash before this point,
        # no changes have been made.
        with open("tmp.counter", 'w') as fp:
            fp.write("%d\n" % counter)
        # If there is a crash before this point,
        # only the temp file has been modified.
        os.rename('tmp.counter', 'counter.%s' % (last+1))
        os.remove('counter.%s' % last)
![Page 35: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/35.jpg)
Example: Counting with versions (cleanup)
    # This is not a recovery routine, but a cleanup routine.
    # Even in its absence, the state is consistent.
    import os

    def cleanup():
        files = [int(name.split('.')[-1])
                 for name in os.listdir('.')
                 if name.startswith('counter.')]
        files.sort()
        files.pop()  # keep the newest version
        for n in files:
            os.remove('counter.%d' % n)
        if os.path.exists('tmp.counter'):
            os.remove('tmp.counter')
![Page 36: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/36.jpg)
Correct ordering
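    import time

    # rs is the redis.Redis connection created by the runner process.
    def activate_due():
        scheduled = rs.smembers('scheduled')
        now = time.time()
        for el in scheduled:
            due = int(rs.get(el+':due'))
            if now < due:
                continue
            # Add to 'activated' first: a crash in the middle leaves the
            # element in both sets rather than in neither.
            rs.sadd('activated', el)
            rs.delete(el+':due')
            rs.srem('scheduled', el)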
![Page 37: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/37.jpg)
Correct ordering (recovery)
    def recover():
        # Anything in both sets was interrupted in mid-activation.
        inconsistent = rs.sinter('activated', 'scheduled')
        for el in inconsistent:
            rs.delete(el+':due') #*
            rs.srem('scheduled', el)
![Page 38: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/38.jpg)
Example: Key/value stores
    0.log:
        ['add', 'key-0', 'value-0']
        ['add', 'key-1', 'value-1']
        ['add', 'key-0', 'value-2']
        ['remove', 'key-1']
        . . .
    1.log:
        . . .
    2.log:
        . . .
![Page 40: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/40.jpg)
Example: Key/value stores (utility functions)
    # Get the level of a file
    def getLevel(s):
        return int(s.split('.')[0])

    # Get all files of a given type
    def getType(files, tp):
        return [(getLevel(s), s) for s in files if s.endswith(tp)]
![Page 41: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/41.jpg)
Example: Key/value stores (classifying files)
    # Get all relevant files
    import os

    def relevant(d):
        files = os.listdir(d)
        mlevel, master = max(getType(files, '.master'))
        logs = getType(files, '.log')
        logs.sort()
        # The newest master, followed by every log written after it.
        return [master] + [log for llevel, log in logs if llevel > mlevel]
![Page 42: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/42.jpg)
Example: Key/value stores (reading)
    import json

    # Read in a single file
    def update(result, fp):
        for line in fp:
            val = json.loads(line)
            if val[0] == 'add':
                result[val[1]] = val[2]
            else:
                result.pop(val[1], None)

    # Read in several files
    def read(files):
        result = dict()
        for fname in files:
            try:
                with open(fname) as fp:
                    update(result, fp)
            except ValueError:
                # A line cut short by a crash is not valid JSON;
                # stop reading that file there.
                pass
        return result
![Page 44: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/44.jpg)
Example: Key/value stores (writer class)
    import json

    class Writer(object):
        def __init__(self, level):
            self.level = level
            self.fp = None
            self._next()
        def _next(self):
            # Start a new log file one level up.
            self.level += 1
            if self.fp:
                self.fp.close()
            name = '%3d.log' % self.level
            self.fp = open(name, 'w')
            self.rows = 0
        def write(self, value):
            self.fp.write(json.dumps(value) + '\n')
            self.fp.flush()
            self.rows += 1
            if self.rows > 200:
                self._next()
![Page 46: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/46.jpg)
Example: Key/value stores (storage class)
    # The actual data store abstraction.
    class Store(object):
        def __init__(self, d='.'):
            files = relevant(d)
            self.result = read(files)
            level = getLevel(files[-1])
            self.writer = Writer(level)
        def get(self, key):
            return self.result[key]
        def add(self, key, value):
            # Log first, then update the in-memory copy.
            self.writer.write(['add', key, value])
            self.result[key] = value
        def remove(self, key):
            self.writer.write(['remove', key])
            self.result.pop(key, None)
![Page 47: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/47.jpg)
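Example: Key/value stores (compression code)
    # This should be run periodically
    # from a different thread
    import json

    def compress(d):
        # Leave out the newest file: the writer may still be appending to it.
        files = relevant(d)[:-1]
        if len(files) < 2:
            return
        result = read(files)
        master = getLevel(files[-1]) + 1
        with open('%3d.master.tmp' % master, 'w') as fp:
            for key, value in result.items():
                towrite = ['add', key, value]
                fp.write(json.dumps(towrite) + '\n')
        # relevant() only looks at names ending in '.master', so the
        # result stays invisible to readers while it carries '.tmp'.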
![Page 48: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/48.jpg)
Vertical splitting: Example
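    import os
    import socket

    def forking_server():
        s = socket.socket()
        s.bind(('', 8080))
        s.listen(5)
        while True:
            client, address = s.accept()
            newpid = os.fork()
            if newpid == 0:
                # Child: serve this one client, then exit.
                f = client.makefile('w')
                f.write("Sunday, May 22, 1983 "
                        "18:45:59-PST")
                f.close()
                os._exit(0)
            # Parent: the child owns the connection now.
            client.close()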
![Page 49: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/49.jpg)
Horizontal splitting: front-end
    # Process one
    from twisted.internet import reactor
    from twisted.python import filepath
    from twisted.web import resource, server

    class SchedulerResource(resource.Resource):
        isLeaf = True
        def __init__(self, filepath):
            resource.Resource.__init__(self)
            self.filepath = filepath
        def render_PUT(self, request):
            uuid, = request.postpath
            content = request.content.read()
            child = self.filepath.child(uuid)
            child.setContent(content)
            return b''

    fp = filepath.FilePath("things")
    r = SchedulerResource(fp)
    s = server.Site(r)
    reactor.listenTCP(8080, s)
    reactor.run()
![Page 50: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/50.jpg)
Horizontal splitting: scheduler
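    # Process two
    import os
    import time
    import redis

    rs = redis.Redis(host='localhost', port=6379, db=9)
    while True:
        # Each file in things/ is named after a uuid and holds its due time.
        for uuid in os.listdir("things"):
            fname = os.path.join("things", uuid)
            with open(fname) as fp:
                when = int(fp.read().strip())
            rs.set(uuid+':due', when)
            rs.sadd('scheduled', uuid)
            os.remove(fname)
        time.sleep(1)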
![Page 51: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/51.jpg)
Horizontal splitting: runner
    # Process three
    import time
    import redis

    rs = redis.Redis(host='localhost', port=6379, db=9)
    recover()
    while True:
        activate_due()
        time.sleep(1)
![Page 52: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/52.jpg)
Horizontal splitting: message queues
No direct dependencies
![Page 53: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/53.jpg)
Horizontal splitting: message queues: sender
    # Process four
    import pika
    import redis

    rs = redis.Redis(host='localhost', port=6379, db=9)
    params = pika.ConnectionParameters('localhost')
    conn = pika.BlockingConnection(params)
    channel = conn.channel()
    channel.queue_declare(queue='active')
    while True:
        activated = rs.smembers('activated')
        finished = set(rs.smembers('finished'))
        for el in activated:
            if el in finished:
                continue
            channel.basic_publish(
                exchange='', routing_key='active', body=el)
            # Recorded only after publishing: a crash in between
            # re-sends the message rather than losing it.
            rs.sadd('finished', el)
![Page 55: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/55.jpg)
Horizontal splitting: message queues: receiver
    # Process five
    # It is possible to get "dups" of bodies.
    # Application logic should deal with that.
    import syslog
    import pika

    params = pika.ConnectionParameters('localhost')
    conn = pika.BlockingConnection(params)
    channel = conn.channel()
    channel.queue_declare(queue='active')

    def callback(ch, method, properties, el):
        syslog.syslog('Activated %s' % el)

    # Consume from the queue declared above. This is the pika 0.x call;
    # pika 1.0 and later spell it
    # basic_consume(queue='active', on_message_callback=callback, auto_ack=True).
    channel.basic_consume(callback, queue='active', no_ack=True)
    channel.start_consuming()
![Page 56: Make Sure Your Applications Crash](https://reader033.vdocuments.net/reader033/viewer/2022060109/555672e6d8b42abc5a8b4e38/html5/thumbnails/56.jpg)
Horizontal splitting: point-to-point
Use HTTP (preferably, REST)
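As an illustration, a client for the PUT front-end shown earlier might look like the sketch below. It uses only the standard library; the function names and the URL layout (`<base_url>/<uuid>`, body holding the due time as an integer) are assumptions based on that front-end.

```python
import urllib.request


def build_schedule_request(base_url, uuid, due):
    """Build a PUT request whose body is the due time in whole seconds."""
    url = "%s/%s" % (base_url.rstrip("/"), uuid)
    body = str(int(due)).encode("ascii")
    return urllib.request.Request(url, data=body, method="PUT")


def schedule(base_url, uuid, due, timeout=5):
    """Send the request and return the HTTP status code."""
    req = build_schedule_request(base_url, uuid, due)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

A PUT to a URL named after the uuid is idempotent, so a sender that crashed before seeing the response can simply send the same request again after restarting.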