Download - Sea Amsterdam 2014 November 19
![Page 1: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/1.jpg)
GoDataDrivenPROUDLY PART OF THE XEBIA GROUP
Real time data driven applications
Giovanni LanzaniData Whisperer
Using Python + pandas as back end
![Page 2: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/2.jpg)
Who am I?
2008-2012: PhD Theoretical Physics
2012-2013: KPMG
2013-Now: GoDataDriven
![Page 3: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/3.jpg)
GoDataDriven
Feedback
@gglanzani
![Page 4: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/4.jpg)
GoDataDriven
Real-time, data driven app?
• No store and retrieve;
• Store, {transform, enrich, analyse} and retrieve;
• Real-time: retrieve is not a batch process;
• App: something your mother could use:
SELECT attendees FROM pydataberlin2014 WHERE password = '1234';
![Page 5: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/5.jpg)
GoDataDriven
Get insight about event impact
![Page 6: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/6.jpg)
GoDataDriven
Get insight about event impact
![Page 7: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/7.jpg)
GoDataDriven
Get insight about event impact
![Page 8: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/8.jpg)
GoDataDriven
Get insight about event impact
![Page 9: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/9.jpg)
GoDataDriven
Is it Big Data?
• Raw logs are in the order of 40TB;
• We use Hadoop for storing, enriching and pre-processing;
• (10 nodes, 24TB per nodes)
![Page 10: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/10.jpg)
GoDataDriven
Challenges
1. Privacy;2. Huge pile of data;3. Real-time retrieval;
4. Some real-time analysis.
![Page 11: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/11.jpg)
GoDataDriven
1. Privacy
![Page 12: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/12.jpg)
GoDataDriven
3. Real-time retrieval
• Harder than it looks;
• Large data;
• Retrieval is by giving date, center location + radius.
![Page 13: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/13.jpg)
GoDataDriven
4. (Some) real-time analysis
![Page 14: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/14.jpg)
GoDataDriven
Architecture
AngularJS app.py
helper.py
REST
Front-end Back-end
JSON
![Page 15: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/15.jpg)
GoDataDriven
JS - 1
![Page 16: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/16.jpg)
GoDataDriven
JS - 2
![Page 17: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/17.jpg)
Flask
from flask import Flaskapp = Flask(__name__)
@app.route("/hello")def hello(): return "Hello World!"
if __name__ == "__main__": app.run()
![Page 18: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/18.jpg)
GoDataDriven
app.py example
@app.route('/api/<postcode>/<date>/<radius>', methods=['GET'])@app.route('/api/<postcode>/<date>', methods=['GET'])def datapoints(postcode, date, radius=1.0): ... stats, timeline, points = helper.get_json(postcode, date, radius) return … # returns a JSON object for AngularJS
![Page 19: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/19.jpg)
GoDataDriven
data example
date hour id_activity postcode hits delta sbi
2013-01-01 12 1234 1234AB 35 22 1
2013-01-08 12 1234 1234AB 45 35 1
2013-01-01 11 2345 5555ZB 2 1 2
2013-01-08 11 2345 5555ZB 55 2 2
![Page 20: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/20.jpg)
GoDataDriven
Levenshtein distancedef levenshtein(s, t): if len(s) == 0: return len(t) if len(t) == 0: return len(s) cost = 0 if s[-1] == t[-1] else 1 return min(levenshtein(s[:-1], t) + 1, levenshtein(s, t[:-1]) + 1, levenshtein(s[:-1], t[:-1]) + cost)
levenshtein("search", “engine”) # 6 levenshtein("Amsterdam", “meetup") # 7levenshtein("Lplein", “Leidseplein") # 5
![Page 21: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/21.jpg)
GoDataDriven
Aside
• Memoize every recursive algorithm in Python!from cytoolz import functoolz
@functoolz.memoizedef memo_levenshtein(s, t): if len(s) == 0: return len(t) if len(t) == 0: return len(s) cost = 0 if s[-1] == t[-1] else 1 return min(levenshtein(s[:-1], t) + 1, levenshtein(s, t[:-1]) + 1, levenshtein(s[:-1], t[:-1]) + cost)
![Page 22: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/22.jpg)
GoDataDriven
Aside
• Memoize the algorithm if you’re using Python!a = “Marco Polo str.”b = “Marco Polo straat”%timeit memo_levenshtein(a, b) # 213ns%timeit levenshtein(a, b) # 1.84 µs
![Page 23: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/23.jpg)
GoDataDriven
helper.py example
def get_json(postcode, date, radius): ...
lat, lon = get_lat_lon(postcode) postcodes = get_postcodes(postcode, radius) data = get_data(postcodes, dates)
stats = get_statistics(data, sbi) timeline = get_timeline(data, sbi)
return stats, timeline, data.to_json(orient='records')
![Page 24: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/24.jpg)
GoDataDriven
helper.py example
def get_statistics(data, sbi): sbi_df = data[data.sbi == sbi] # select * from data where sbi = sbi hits = sbi_df.hits.sum() # select sum(hits) from … delta_hits = sbi_df.delta.sum() # select sum(delta) from … if delta_hits: percentage = (hits - delta_hits) / delta_hits else: percentage = 0
return {"sbi": sbi, "total": hits, "percentage": percentage}
![Page 25: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/25.jpg)
GoDataDriven
helper.py exampledef get_timeline(data, sbi): df_sbi = data.groupby([“date”, “hour", “sbi"]).aggregate(sum) # select sum(hits), sum(delta) from data group by date, hour, sbi return df_sbi
![Page 26: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/26.jpg)
GoDataDriven
helper.py exampledef get_json(postcode, date, radius): ... lat, lon = get_lat_lon(postcode) postcodes = get_postcodes(postcode, radius) dates = date.split(';') data = get_data(postcodes, dates)
stats = get_statistics(data) timeline = get_timeline(data, dates)
return stats, timeline, data.to_json(orient='records')
![Page 27: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/27.jpg)
GoDataDriven
Who has my data?
• First iteration was a (pre)-POC, less data (3GB vs 500GB);
• Time constraints;
• Oeps:
import pandas as pd...source_data = pd.read_csv("data.csv", …)...def get_data(postcodes, dates): result = filter_data(source_data, postcodes, dates) return result
![Page 28: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/28.jpg)
GoDataDriven
Advantage of “everything is a df ”
Pro:
• Fast!!
• Use what you know
• NO DBA’s!
• We all love CSV’s!
Contra:
• Doesn’t scale;
• Huge startup time;
• NO DBA’s!
• We all hate CSV’s!
![Page 29: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/29.jpg)
GoDataDriven
If you want to go down this path
• Set the dataframe index wisely;
• Align the data to the index:
• Beware of modifications of the original dataframe!source_data.sort_index(inplace=True)
![Page 30: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/30.jpg)
GoDataDriven
If you want to go down this path
The reason pandas is faster is because I came up with a better algorithm
![Page 31: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/31.jpg)
GoDataDriven
New architecture
data = get_data(db, postcodes, dates)data = get_data(postcodes, dates)
database.py
Data
psycopg2
AngularJS app.py
helper.py
REST
Front-end Back-end
JSON
![Page 32: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/32.jpg)
GoDataDriven
Handling geo-datadef get_json(postcode, date, radius): """ ... """ lat, lon = get_lat_lon(postcode) postcodes = get_postcodes(postcode, radius) dates = date.split(';') data = get_data(postcodes, dates)
stats = get_statistics(data) timeline = get_timeline(data, dates)
return stats, timeline, data.to_json(orient='records')
![Page 33: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/33.jpg)
GoDataDriven
Issues?!
• With a radius of 10km, in Amsterdam, you get 10k postcodes. You need to do this in your SQL:
• Index on date and postcode, but single queries running more than 20 minutes.
SELECT * FROM datapoints WHERE date IN date_array
AND postcode IN postcode_array;
![Page 34: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/34.jpg)
GoDataDriven
Postgres + Postgis (2.x)
PostGIS is a spatial database extender for PostgreSQL. Supports geographic objects allowing location queries
SELECT *FROM datapointsWHERE ST_DWithin(lon, lat, 1500)AND dates IN ('2013-02-30', '2013-02-31');-- every point within 1.5km -- from (lat, lon) on imaginary dates
![Page 35: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/35.jpg)
Other db’s?
![Page 36: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/36.jpg)
GoDataDriven
Steps to solve it
1. Align data on disk by date;2. Use the temporary table trick:
3. Lose precision: 1234AB→1234
CREATE TEMPORARY TABLE tmp (postcodes STRING NOT NULL PRIMARY KEY);INSERT INTO tmp (postcodes) VALUES postcode_array;
SELECT * FROM tmp JOIN datapoints d ON d.postcode = tmp.postcodes WHERE d.dt IN dates_array;
![Page 37: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/37.jpg)
GoDataDriven
Take home messages
1. Geospatial problems are “hard” and can kill your queries;
2. Not everybody has infinite resources: be smart and KISS!
3. SQL or NoSQL? (Size, schema)
![Page 38: Sea Amsterdam 2014 November 19](https://reader034.vdocuments.net/reader034/viewer/2022042820/55d1c066bb61eb83478b465c/html5/thumbnails/38.jpg)
GoDataDriven
We’re hiring / Questions? / Thank you!
Giovanni LanzaniData Whisperer