Make Sure Your Applications Crash Moshe Zadka


Post on 16-May-2015






Presentation for PyCon 2012 about application reliability.


Page 1: Make Sure Your  Applications Crash

Make Sure Your Applications Crash

Moshe Zadka

Page 2: Make Sure Your  Applications Crash

True story

Page 3: Make Sure Your  Applications Crash

Python doesn't crash

Memory managed, no direct pointer arithmetic

Page 4: Make Sure Your  Applications Crash

...except it does

C bugs, untrapped exceptions, infinite loops, blocking calls, thread deadlock, inconsistent resident state

Page 5: Make Sure Your  Applications Crash

Recovery is important

"[S]ystem failure can usually be considered to be the result of two program errors[...] the second, in the recovery routine[...]"

Page 6: Make Sure Your  Applications Crash

Crashes and inconsistent data

A crash results in data from an arbitrary program state.

Page 7: Make Sure Your  Applications Crash

Avoid storage

Caches are better than master copies.

Page 8: Make Sure Your  Applications Crash


Transactions maintain consistency

Databases can crash too!
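As a rough illustration of how a transaction keeps stored data consistent when the program dies mid-update, here is a minimal sketch using Python's standard `sqlite3` module. The table name `counter`, the helper names, and the `fail` flag are assumptions made for this example; they are not from the talk.

```python
import sqlite3

def open_db(path):
    """Open the database and make sure the single counter row exists."""
    conn = sqlite3.connect(path)
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS counter "
            "(id INTEGER PRIMARY KEY, value INTEGER)")
        conn.execute("INSERT OR IGNORE INTO counter (id, value) VALUES (1, 0)")
    return conn

def increment(conn, fail=False):
    """Increment inside a transaction; an exception rolls the change back."""
    with conn:  # commits on success, rolls back if the block raises
        conn.execute("UPDATE counter SET value = value + 1 WHERE id = 1")
        if fail:
            raise RuntimeError("simulated crash before commit")

def current(conn):
    """Return the committed counter value."""
    return conn.execute("SELECT value FROM counter WHERE id = 1").fetchone()[0]
```

An update that fails part-way leaves the previously committed value in place, which is the property the slide relies on.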

Page 9: Make Sure Your  Applications Crash

Atomic operations

File rename

Page 10: Make Sure Your  Applications Crash

Example: Counting

import os

def update_counter():
    fp = open("counter.txt")
    s = fp.read()
    fp.close()
    counter = int(s.strip())
    counter += 1
    # If there is a crash before this point,
    # no changes have been done.
    fp = open("counter.txt.tmp", 'w')
    fp.write('%d\n' % counter)
    fp.close()
    # If there is a crash before this point,
    # only a temp file has been modified
    # The following is an atomic operation
    os.rename("counter.txt.tmp", "counter.txt")
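For readers on Python 3, the same write-then-rename idea might look like the sketch below. The `path` parameter, the return value, and the `os.fsync` call are additions for this example and do not appear on the slide; `os.replace` is used because it overwrites an existing destination on every platform.

```python
import os

def update_counter(path="counter.txt"):
    """Increment the integer stored in `path` using write-then-rename."""
    with open(path) as fp:
        counter = int(fp.read().strip())
    counter += 1
    tmp = path + ".tmp"
    with open(tmp, "w") as fp:
        fp.write("%d\n" % counter)
        fp.flush()
        os.fsync(fp.fileno())  # ask the OS to push the temp file to disk first
    # A crash before this line leaves only the temp file modified.
    os.replace(tmp, path)
    return counter
```

Whether the extra `fsync` is worth its cost depends on how much durability the counter needs; the consistency argument on the slide holds without it.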

Page 11: Make Sure Your  Applications Crash

Efficient caches, reliable masters

Mark inconsistency of cache
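One way to mark a cache as inconsistent is a marker file that exists only while the cache is being rewritten. The sketch below assumes a JSON cache file and a hypothetical `rebuild` callable that recomputes the value from the master copy; both are illustrative choices, not the talk's implementation.

```python
import json
import os

def write_cache(cache_path, value):
    """Mark the cache inconsistent, rewrite it, then clear the mark."""
    dirty = cache_path + ".dirty"
    open(dirty, "w").close()            # 1. flag: cache may be inconsistent
    with open(cache_path, "w") as fp:   # 2. possibly slow, non-atomic update
        json.dump(value, fp)
    os.remove(dirty)                    # 3. cache is consistent again

def load_cache(cache_path, rebuild):
    """Return the cached value, rebuilding it if it is missing or marked dirty."""
    dirty = cache_path + ".dirty"
    if os.path.exists(dirty) or not os.path.exists(cache_path):
        value = rebuild()               # fall back to the reliable master
        write_cache(cache_path, value)
        return value
    with open(cache_path) as fp:
        return json.load(fp)
```

A crash between steps 1 and 3 leaves the marker behind, so the next start discards the cache instead of trusting it.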

Page 12: Make Sure Your  Applications Crash

No shutdown

Crash in testing
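A test can exercise the crash path directly by killing a worker process instead of shutting it down. The sketch below is one possible harness: the worker script, the `crash_test` name, and the run time are assumptions for illustration.

```python
import os
import subprocess
import sys
import tempfile
import time

# Worker: increments a counter file forever using write-then-rename.
WORKER = r'''
import os, sys
path = sys.argv[1]
while True:
    with open(path) as fp:
        n = int(fp.read().strip())
    with open(path + ".tmp", "w") as fp:
        fp.write("%d\n" % (n + 1))
    os.replace(path + ".tmp", path)
'''

def crash_test(run_seconds=0.2):
    """Start a worker, kill it abruptly, and return the counter it left behind."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "counter.txt")
    with open(path, "w") as fp:
        fp.write("0\n")
    proc = subprocess.Popen([sys.executable, "-c", WORKER, path])
    time.sleep(run_seconds)
    proc.kill()  # no clean shutdown: SIGKILL on POSIX, TerminateProcess on Windows
    proc.wait()
    with open(path) as fp:
        return int(fp.read().strip())  # raises ValueError if the file is corrupt
```

If the counter file ever fails to parse after the kill, the update routine is not crash-safe.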

Page 13: Make Sure Your  Applications Crash


If data is consistent, just restart!

Page 14: Make Sure Your  Applications Crash

Improving availability

Limit impact

Fast detection

Fast start-up

Page 15: Make Sure Your  Applications Crash

Vertical splitting

Different execution paths, different processes

Page 16: Make Sure Your  Applications Crash

Horizontal splitting

Different code bases, different processes

Page 17: Make Sure Your  Applications Crash


Monitor -> Flag -> Remediate

Page 18: Make Sure Your  Applications Crash

Watchdog principles

Keep it simple, keep it safe!

Page 19: Make Sure Your  Applications Crash

Watchdog: Heartbeats

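## In a Twisted process
import os
from twisted.internet import task

def beat():
    open('beats/my-name', 'a').close()
    # Opening for append does not by itself change the mtime of an existing file
    os.utime('beats/my-name', None)

task.LoopingCall(beat).start(30)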

Page 20: Make Sure Your  Applications Crash

Watchdog: Get time-outs

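import glob
import time

def getTimeout():
    timeout = dict()
    now = time.time()
    for heart in glob.glob('hearts/*'):
        # The number in the file is subtracted from now,
        # giving the oldest acceptable heartbeat time
        beat = int(open(heart).read().strip())
        timeout[heart] = now-beat
    return timeout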

Page 21: Make Sure Your  Applications Crash

Watchdog: Mark problems

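import os

def markProblems():
    timeout = getTimeout()
    for heart in glob.glob('beats/*'):
        name = os.path.basename(heart)
        mtime = os.path.getmtime(heart)
        # Use the bare name: getTimeout() keys live under hearts/,
        # and problem files belong directly under problems/
        problem = os.path.join('problems', name)
        deadline = timeout[os.path.join('hearts', name)]
        if (mtime < deadline and
            not os.path.isfile(problem)):
            fp = open(problem, 'w')
            fp.write('watchdog')
            fp.close()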

Page 22: Make Sure Your  Applications Crash

Watchdog: check solutions

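import subprocess

def checkSolutions():
    now = time.time()
    problemTimeout = now-30
    for problem in glob.glob('problems/*'):
        mtime = os.path.getmtime(problem)
        if mtime < problemTimeout:
            # The call name was lost in extraction; subprocess.call is assumed
            subprocess.call(['restart-system'])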

Page 23: Make Sure Your  Applications Crash

Watchdog: Loop

## Watchdog
while True:
    markProblems()
    checkSolutions()
    time.sleep(1)

Page 24: Make Sure Your  Applications Crash

Watchdog: accuracy of

Custom checkers can manufacture problems
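A custom checker can feed the same problems directory that the heartbeat check uses. The sketch below shows one hypothetical checker that flags low disk space; the function names, the `disk-space` problem name, and the threshold are assumptions for illustration.

```python
import os
import shutil

def flag_problem(name, reason, problems_dir="problems"):
    """Create a problem file unless one already exists; return True if created."""
    os.makedirs(problems_dir, exist_ok=True)
    path = os.path.join(problems_dir, name)
    if os.path.isfile(path):
        return False  # keep the old file so its mtime still marks first detection
    with open(path, "w") as fp:
        fp.write(reason)
    return True

def check_disk_space(path=".", min_free_bytes=100 * 1024 * 1024,
                     problems_dir="problems"):
    """Hypothetical checker: flag a problem when free disk space runs low."""
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        return flag_problem("disk-space", "checker: %d bytes free" % free,
                            problems_dir)
    return False
```

Leaving an existing problem file untouched matters because the remediation step decides when to act from the file's modification time.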

Page 25: Make Sure Your  Applications Crash

Watchdog: reliability of

Use cron for main loop
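With cron driving the loop, the watchdog itself becomes a short script that makes a single pass and exits, so a crash in one run cannot stop later runs. A minimal sketch, assuming a hypothetical `run_once` helper and an illustrative crontab path:

```python
import traceback

def run_once(steps):
    """Run each watchdog step once; return the number of steps that failed."""
    failures = 0
    for step in steps:
        try:
            step()
        except Exception:
            # A failing step must not stop the others; cron retries next minute.
            traceback.print_exc()
            failures += 1
    return failures

# Hypothetical crontab entry (the script path is an assumption):
# * * * * * /usr/bin/python3 /opt/watchdog/run_once.py
#
# The script would end with something like:
#     run_once([markProblems, checkSolutions])
```

Cron's one-minute granularity is coarser than the one-second sleep in the loop version, which is the trade-off for not having to keep the watchdog process alive.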

Page 26: Make Sure Your  Applications Crash

Watchdog: reliability of

Use software/hardware watchdogs

Page 27: Make Sure Your  Applications Crash


Everything crashes -- plan for it

Page 28: Make Sure Your  Applications Crash


Page 29: Make Sure Your  Applications Crash

Welcome to the back-up slides

Extra! Extra!

Page 30: Make Sure Your  Applications Crash

Example: Counting on Windows

import os

def update_counter():
    fp = open("counter.txt")
    s = fp.read()
    fp.close()
    counter = int(s.strip())
    counter += 1
    # If there is a crash before this point,
    # no changes have been done.
    fp = open("counter.txt.tmp", 'w')
    fp.write('%d\n' % counter)
    fp.close()
    # If there is a crash before this point,
    # only a temp file has been modified
    os.remove("counter.txt")
    # At this point, the state is inconsistent*
    # The following is an atomic operation
    os.rename("counter.txt.tmp", "counter.txt")

Page 32: Make Sure Your  Applications Crash

Example: Counting on Windows (Recovery)

def recover():
    if not os.path.exists("counter.txt"):
        # The permanent file has been removed
        # Therefore, the temp file is valid
        os.rename("counter.txt.tmp", "counter.txt")

Page 33: Make Sure Your  Applications Crash

Example: Counting with versions

def update_counter():
    files = [int(name.split('.')[-1])
             for name in os.listdir('.')
             if name.startswith('counter.')]
    last = max(files)
    counter = int(open('counter.%s' % last).read().strip())
    counter += 1
    # If there is a crash before this point,
    # no changes have been done.
    fp = open("tmp.counter", 'w')
    fp.write('%d\n' % counter)
    fp.close()
    # If there is a crash before this point,
    # only a temp file has been modified
    os.rename('tmp.counter', 'counter.%s' % (last+1))
    os.remove('counter.%s' % last)

Page 35: Make Sure Your  Applications Crash

Example: Counting with versions (cleanup)

# This is not a recovery routine, but a cleanup
# routine.
# Even in its absence, the state is consistent
def cleanup():
    files = [int(name.split('.')[-1])
             for name in os.listdir('.')
             if name.startswith('counter.')]
    files.sort()
    files.pop()
    for n in files:
        os.remove('counter.%d' % n)
    if os.path.exists('tmp.counter'):
        os.remove('tmp.counter')
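The versioned scheme stays consistent without recovery because a reader only ever trusts the highest-numbered file. A Python 3 sketch of such a reader follows; the `read_counter` name and `directory` parameter are assumptions for this example.

```python
import os

def read_counter(directory="."):
    """Return the value stored in the highest-numbered counter.N file."""
    versions = []
    for name in os.listdir(directory):
        prefix, _, suffix = name.rpartition(".")
        if prefix == "counter" and suffix.isdigit():
            versions.append(int(suffix))
    last = max(versions)  # raises ValueError if no counter file exists
    with open(os.path.join(directory, "counter.%d" % last)) as fp:
        return int(fp.read().strip())
```

If a crash lands between the rename and the remove, two versions exist for a while, and the reader still returns the newer one.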

Page 36: Make Sure Your  Applications Crash

Correct ordering

def activate_due():
    scheduled = rs.smembers('scheduled')
    now = time.time()
    for el in scheduled:
        due = int(rs.get(el+':due'))
        if now < due:
            continue
        rs.sadd('activated', el)
        rs.delete(el+':due')
        # redis-py names this command srem
        rs.srem('scheduled', el)

Page 37: Make Sure Your  Applications Crash

Correct ordering (recovery)

def recover():
    inconsistent = rs.sinter('activated', 'scheduled')
    for el in inconsistent:
        rs.delete(el+':due') #*
        rs.srem('scheduled', el)
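The ordering argument can be checked without a Redis server by simulating a crash after each write. The sketch below uses `FakeSets`, a tiny in-memory stand-in written for this example (it is not redis-py), and a hypothetical `crash_after` parameter.

```python
class FakeSets:
    """In-memory stand-in for the few Redis calls used here (not redis-py)."""

    def __init__(self):
        self.sets = {"scheduled": set(), "activated": set()}
        self.keys = {}

    def sadd(self, name, el):
        self.sets[name].add(el)

    def srem(self, name, el):
        self.sets[name].discard(el)

    def sinter(self, a, b):
        return self.sets[a] & self.sets[b]

    def delete(self, key):
        self.keys.pop(key, None)


class Crash(Exception):
    """Raised to simulate the process dying between two writes."""


def activate(rs, el, crash_after=None):
    """The three writes from activate_due, with an optional simulated crash."""
    steps = [
        lambda: rs.sadd("activated", el),
        lambda: rs.delete(el + ":due"),
        lambda: rs.srem("scheduled", el),
    ]
    for i, step in enumerate(steps, 1):
        step()
        if crash_after == i:
            raise Crash("simulated crash after step %d" % i)


def recover(rs):
    """Finish any activation that was interrupted part-way."""
    for el in list(rs.sinter("activated", "scheduled")):
        rs.delete(el + ":due")
        rs.srem("scheduled", el)
```

Because the element is added to `activated` first, any interrupted activation shows up in both sets, which is exactly what `recover` looks for.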

Page 38: Make Sure Your  Applications Crash

Example: Key/value stores

0.log:
["add", "key-0", "value-0"]
["add", "key-1", "value-1"]
["add", "key-0", "value-2"]
["remove", "key-1"]
. . .

1.log: . . .


Page 39: Make Sure Your  Applications Crash

. . .

Page 40: Make Sure Your  Applications Crash

Example: Key/value stores (utility functions)

## Get the level of a file
def getLevel(s):
    return int(s.split('.')[0])

## Get all files of a given type
# `files` is read as a module-level list of file names
def getType(tp):
    return [(getLevel(s), s) for s in files if s.endswith(tp)]

Page 41: Make Sure Your  Applications Crash

Example: Key/value stores (classifying files)

## Get all relevant files
def relevant(d):
    global files  # getType() reads this module-level list
    files = os.listdir(d)
    mlevel, master = max(getType('.master'))
    logs = getType('.log')
    logs.sort()
    return [master] + [log for llevel, log in logs if llevel > mlevel]

Page 42: Make Sure Your  Applications Crash

Example: Key/value stores (reading)

import json

## Read in a single file
def update(result, fp):
    for line in fp:
        val = json.loads(line)
        if val[0] == 'add':
            result[val[1]] = val[2]
        else:
            del result[val[1]]

## Read in several files
def read(files):
    result = dict()
    for fname in files:
        try:
            update(result, open(fname))
        except ValueError:
            # An unparsable line ends the reading of this file
            pass
    return result
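The `except ValueError` clause is what makes a log with a half-written last line safe to read: entries before the damaged line are kept, and the damaged line is dropped. The Python 3 sketch below demonstrates this with in-memory streams; `read_streams` and the sample data are assumptions for this example.

```python
import io
import json

def update(result, fp):
    """Apply every log entry in fp to result, in order."""
    for line in fp:
        val = json.loads(line)
        if val[0] == "add":
            result[val[1]] = val[2]
        else:
            del result[val[1]]

def read_streams(streams):
    """Like read(), but takes open file-like objects instead of file names."""
    result = {}
    for fp in streams:
        try:
            update(result, fp)
        except ValueError:
            pass  # stop at the first unparsable line of this log
    return result

complete = io.StringIO('["add", "k0", "v0"]\n["add", "k1", "v1"]\n')
torn = io.StringIO('["remove", "k1"]\n["add", "k2", "v')  # crash mid-write
state = read_streams([complete, torn])
```

Here `state` ends up holding only `k0`: `k1` was added and then removed, and the incomplete entry for `k2` is ignored.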

Page 44: Make Sure Your  Applications Crash

Example: Key/value stores (writer class)

class Writer(object):
    def __init__(self, level):
        self.level = level
        self.fp = None
        self._next()

    def _next(self):
        self.level += 1
        if self.fp:
            self.fp.close()
        name = '%3d.log' % self.level
        self.fp = open(name, 'w')
        self.rows = 0

    def write(self, value):
        self.fp.write(json.dumps(value) + '\n')
        self.fp.flush()
        self.rows += 1
        if self.rows > 200:
            self._next()

Page 46: Make Sure Your  Applications Crash

Example: Key/value stores (storage class)

## The actual data store abstraction.
class Store(object):
    def __init__(self, d='.'):
        files = relevant(d)
        self.result = read(files)
        level = getLevel(files[-1])
        self.writer = Writer(level)

    def get(self, key):
        return self.result[key]

    # add and remove log the change first, then keep the
    # in-memory view in step with the log
    def add(self, key, value):
        self.writer.write(['add', key, value])
        self.result[key] = value

    def remove(self, key):
        self.writer.write(['remove', key])
        self.result.pop(key, None)

Page 47: Make Sure Your  Applications Crash

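Example: Key/value stores (compression code)

## This should be run periodically
# from a different thread
def compress(d):
    files = relevant(d)[:-1]
    if len(files) < 2:
        return
    result = read(files)
    # The new master covers every log up to and including files[-1];
    # using that level keeps the still-active log visible to relevant()
    master = getLevel(files[-1])
    fp = open('%3d.master.tmp' % master, 'w')
    for key, value in result.items():
        towrite = ['add', key, value]
        fp.write(json.dumps(towrite) + '\n')
    fp.close()
    # The slide ends here; renaming the .tmp file to .master
    # would publish the new master atomically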

Page 48: Make Sure Your  Applications Crash

Vertical splitting: Example

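import os
import socket

def forking_server():
    s = socket.socket()
    s.bind(('', 8080))
    s.listen(5)
    while True:
        client, addr = s.accept()
        newpid = os.fork()
        if newpid == 0:
            # Child process: serve this one client, then exit
            f = client.makefile('w')
            f.write("Sunday, May 22, 1983 "
                    "18:45:59-PST")
            f.close()
            os._exit(0)
        # Parent: drop its copy of the connection and keep accepting
        client.close()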

Page 49: Make Sure Your  Applications Crash

Horizontal splitting: front-end

## Process one
from twisted.internet import reactor
from twisted.python import filepath
from twisted.web import resource, server

class SchedulerResource(resource.Resource):
    isLeaf = True

    def __init__(self, filepath):
        resource.Resource.__init__(self)
        self.filepath = filepath

    def render_PUT(self, request):
        uuid, = request.postpath
        content = request.content.read()
        child = self.filepath.child(uuid)
        child.setContent(content)
        return b''

fp = filepath.FilePath("things")
r = SchedulerResource(fp)
s = server.Site(r)
reactor.listenTCP(8080, s)
reactor.run()

Page 50: Make Sure Your  Applications Crash

Horizontal splitting: scheduler

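## Process two
import os
import time
import redis

rs = redis.Redis(host='localhost', port=6379, db=9)
while True:
    for uuid in os.listdir("things"):
        # The file name is the uuid chosen by the front-end
        fname = os.path.join("things", uuid)
        when = int(open(fname).read().strip())
        rs.set(uuid+':due', when)
        rs.sadd('scheduled', uuid)
        os.remove(fname)
    time.sleep(1)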

Page 51: Make Sure Your  Applications Crash

Horizontal splitting: runner

## Process three
rs = redis.Redis(host='localhost', port=6379, db=9)
recover()
while True:
    activate_due()
    time.sleep(1)

Page 52: Make Sure Your  Applications Crash

Horizontal splitting: message queues

No direct dependencies

Page 53: Make Sure Your  Applications Crash

Horizontal splitting: message queues: sender

## Process four
import pika
import redis

rs = redis.Redis(host='localhost', port=6379, db=9)
params = pika.ConnectionParameters('localhost')
conn = pika.BlockingConnection(params)
# The next two calls were garbled in extraction;
# they are reconstructed from pika's usual channel setup
channel = conn.channel()
channel.queue_declare(queue='active')
while True:
    activated = rs.smembers('activated')
    finished = set(rs.smembers('finished'))
    for el in activated:
        if el in finished:
            continue
        channel.basic_publish(
            exchange='',
            routing_key='active',
            body=el)
        rs.sadd('finished', el)

Page 55: Make Sure Your  Applications Crash

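Horizontal splitting: message queues: receiver

## Process five
# It is possible to get "dups" of bodies.
# Application logic should deal with that
import syslog
import pika

params = pika.ConnectionParameters('localhost')
conn = pika.BlockingConnection(params)
# The next two calls were garbled in extraction;
# they are reconstructed from pika's usual channel setup
channel = conn.channel()
channel.queue_declare(queue='active')

def callback(ch, method, properties, el):
    syslog.syslog('Activated %s' % el)

# pika releases before 1.0 take (callback, queue=..., no_ack=...);
# 1.0 and later use on_message_callback= and auto_ack= instead
channel.basic_consume(callback, queue='active', no_ack=True)
channel.start_consuming()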
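One simple way for application logic to cope with duplicate bodies is to remember which ones have already been handled. The sketch below keeps that memory in a Python set; `DedupingHandler` is a hypothetical helper, and a real deployment would likely need a persistent record (for example a Redis set) so that the memory survives a crash.

```python
class DedupingHandler:
    """Run `action` at most once per message body (in-memory only)."""

    def __init__(self, action):
        self.action = action
        self.seen = set()

    def __call__(self, body):
        if body in self.seen:
            return False       # duplicate delivery: ignore it
        self.action(body)
        self.seen.add(body)    # mark only after the action succeeded
        return True
```

Marking a body as seen only after the action succeeds means a crash in between can cause a repeat, so the action itself should still be safe to run twice.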

Page 56: Make Sure Your  Applications Crash

Horizontal splitting: point-to-point

Use HTTP (preferably, REST)
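As an illustration of point-to-point HTTP, a client could schedule an item by sending a PUT whose path is the item's uuid and whose body is the due time, matching what the front-end resource shown earlier accepts. The helper name, base URL, and timeout below are assumptions for this sketch.

```python
import urllib.request

def build_schedule_request(base_url, uuid, when):
    """Build (but do not send) a PUT request carrying the due time as the body."""
    url = "%s/%s" % (base_url.rstrip("/"), uuid)
    data = str(int(when)).encode("ascii")
    return urllib.request.Request(url, data=data, method="PUT")

# Sending it would look like:
#     urllib.request.urlopen(build_schedule_request(...), timeout=5)
```

Because PUT to a fixed URL is idempotent, a client that is unsure whether its request arrived can simply send it again.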