PyData Paris 2015 - Track 4.2 Jean Maynier
![Page 1: PyData Paris 2015 - Track 4.2 Jean Maynier](https://reader034.vdocuments.net/reader034/viewer/2022042818/55aabbe71a28ab54138b45f3/html5/thumbnails/1.jpg)
Python, SQLAlchemy and Scrapy at Kpler
Jean Maynier
CTO
![Page 2: PyData Paris 2015 - Track 4.2 Jean Maynier](https://reader034.vdocuments.net/reader034/viewer/2022042818/55aabbe71a28ab54138b45f3/html5/thumbnails/2.jpg)
Business: commodities markets
• Domain: trading / maritime transport
• Customers: physical traders / analysts
• Needs: follow cargo flows and their impact on prices
![Page 3: PyData Paris 2015 - Track 4.2 Jean Maynier](https://reader034.vdocuments.net/reader034/viewer/2022042818/55aabbe71a28ab54138b45f3/html5/thumbnails/3.jpg)
Methodology: 4 data layers
• Commercial database
• “Meta” AIS (10 networks)
• Port authorities
• Gas inventories
The first “Cargo-Tracking” system, bridging the gap between LNG Shipping and Trading.
![Page 4: PyData Paris 2015 - Track 4.2 Jean Maynier](https://reader034.vdocuments.net/reader034/viewer/2022042818/55aabbe71a28ab54138b45f3/html5/thumbnails/4.jpg)
Why Python?
• Simple, fast to prototype
• Data libraries (SQLAlchemy, pandas, scikit-learn)
• Data-oriented community
• Scrapy: best scraping framework
Step 1
![Page 5: PyData Paris 2015 - Track 4.2 Jean Maynier](https://reader034.vdocuments.net/reader034/viewer/2022042818/55aabbe71a28ab54138b45f3/html5/thumbnails/5.jpg)
Scrapy
• Simplifies crawling
• Reusable extractors (XPath, CSS selectors)
• PaaS: Scrapinghub
• Blacklisting: Tor or Crawlera for IP rotation
Step 2
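The "reusable extractors" bullet can be illustrated with a minimal sketch. To stay runnable without dependencies it uses the standard library's `xml.etree.ElementTree` and its limited XPath subset rather than Scrapy's own selectors. The markup, the `FIELD_XPATHS` mapping and the `extract_rows` helper are all hypothetical and do not come from the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical port-call table, standing in for a scraped page.
PAGE = """
<table>
  <tr class="call"><td class="vessel">Vessel A</td><td class="eta">2015-04-03</td></tr>
  <tr class="call"><td class="vessel">Vessel B</td><td class="eta">2015-04-05</td></tr>
</table>
"""

# One declarative mapping per source: field name -> XPath relative to a row.
FIELD_XPATHS = {
    "vessel": "td[@class='vessel']",
    "eta": "td[@class='eta']",
}


def extract_rows(markup, row_xpath, field_xpaths):
    """Apply the same extraction logic to any source, given its XPaths."""
    root = ET.fromstring(markup.strip())
    rows = []
    for row in root.findall(row_xpath):
        item = {}
        for field, xpath in field_xpaths.items():
            node = row.find(xpath)
            # Missing cells become None instead of raising.
            item[field] = node.text.strip() if node is not None and node.text else None
        rows.append(item)
    return rows


items = extract_rows(PAGE, ".//tr[@class='call']", FIELD_XPATHS)
```

Only the mapping changes from one source to the next; the extraction function is shared. In a Scrapy spider the same kind of mapping would typically be applied through `response.xpath()` or `response.css()`.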
![Page 6: PyData Paris 2015 - Track 4.2 Jean Maynier](https://reader034.vdocuments.net/reader034/viewer/2022042818/55aabbe71a28ab54138b45f3/html5/thumbnails/6.jpg)
Process raw data
• Clean and normalize data
  • Positions
  • Port calls
  • Contracts
  • Prices
• Python PubSub system
• Persist data: SQLAlchemy & PostgreSQL
Step 3
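A minimal sketch of how cleaning and a publish/subscribe step might fit together. The slides do not describe Kpler's PubSub system, so the in-process `PubSub` class, the topic names and the raw field names (`vessel`, `lat`, `lon`, `ts`) are assumptions. A plain list stands in for the SQLAlchemy/PostgreSQL persistence layer.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Position:
    vessel_id: str
    lat: float
    lon: float
    timestamp: datetime


def normalize_position(raw):
    """Turn a raw scraped record (hypothetical field names) into a Position."""
    lat = float(raw["lat"])
    lon = float(raw["lon"])
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError(f"coordinates out of range: {lat}, {lon}")
    ts = datetime.fromtimestamp(int(raw["ts"]), tz=timezone.utc)
    return Position(raw["vessel"].strip().upper(), lat, lon, ts)


class PubSub:
    """Minimal in-process publish/subscribe bus keyed by topic name."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        for handler in self._subscribers[topic]:
            handler(message)


bus = PubSub()
stored = []  # stands in for the database layer (assumption)

# Raw records are normalized, then republished for the persistence step.
bus.subscribe("raw.position", lambda raw: bus.publish("position", normalize_position(raw)))
bus.subscribe("position", stored.append)

bus.publish(
    "raw.position",
    {"vessel": " abc123 ", "lat": "51.95", "lon": "4.05", "ts": "1427976000"},
)
```

Decoupling the steps through topics means a new consumer, for example an enrichment model, can subscribe to `position` without touching the scraper or the normalizer.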
![Page 7: PyData Paris 2015 - Track 4.2 Jean Maynier](https://reader034.vdocuments.net/reader034/viewer/2022042818/55aabbe71a28ab54138b45f3/html5/thumbnails/7.jpg)
Enrich data
Done
• Vessel dock/undock events
• Predict next destination
• Find buyer/seller
• Basic routing
Todo / WIP
• Estimate price
• Routing with weather
• Portfolio optimisation
• Fleet optimisation
Step 4
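One plausible way to derive dock/undock events from a position stream is sketched below: a vessel counts as docked when it is close to a berth and nearly stationary, and an event is emitted on each change of state. The talk does not say how Kpler does it, so the rule, the thresholds, the coordinates and the `dock_events` helper are all assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def dock_events(positions, port, radius_km=2.0, max_speed_knots=0.5):
    """Yield ("dock" | "undock", timestamp) on each change of docked state.

    positions: iterable of (timestamp, lat, lon, speed_knots), time-ordered.
    port: (lat, lon) of the berth. Thresholds are illustrative assumptions.
    """
    docked = False
    for ts, lat, lon, speed in positions:
        near = haversine_km(lat, lon, *port) <= radius_km
        now_docked = near and speed <= max_speed_knots
        if now_docked != docked:
            yield ("dock" if now_docked else "undock", ts)
            docked = now_docked


PORT = (51.95, 4.05)  # hypothetical berth coordinates
TRACK = [
    (0, 52.20, 3.50, 12.0),  # approaching at sea
    (1, 51.96, 4.06, 3.0),   # inside the radius but still moving
    (2, 51.95, 4.05, 0.1),   # stopped at the berth: dock
    (3, 51.95, 4.05, 0.0),   # still docked, no new event
    (4, 51.97, 4.00, 6.0),   # under way again: undock
]
events = list(dock_events(TRACK, PORT))
```

The same `haversine_km` helper is also the usual starting point for basic routing, since it gives the shortest distance between two waypoints over open water.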
![Page 8: PyData Paris 2015 - Track 4.2 Jean Maynier](https://reader034.vdocuments.net/reader034/viewer/2022042818/55aabbe71a28ab54138b45f3/html5/thumbnails/8.jpg)
Architecture
Websites sources
API sources
Scraping: Python/Scrapy
Cache
Elasticsearch
Data
ETL + models: Python
Web
REST API: Scala
client
mobile client
Excel
![Page 9: PyData Paris 2015 - Track 4.2 Jean Maynier](https://reader034.vdocuments.net/reader034/viewer/2022042818/55aabbe71a28ab54138b45f3/html5/thumbnails/9.jpg)
Metrics trends
Present
• DBs < 100 GB
• 50 sources
• 500 vessels
• Positions every 3 min
Future
• 1-10 TB
• 100 sources
• 10k to 100k vessels
• Positions every 30 s
Performance problem!
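A short worked calculation makes the scale of the change concrete. It uses only the vessel counts and reporting intervals from the slide. It assumes every vessel reports at exactly the stated interval and ignores all other data sources, so it is a rough estimate.

```python
def positions_per_day(vessels, interval_seconds):
    """Position messages received per day for a fleet reporting at a fixed interval."""
    return vessels * (86_400 // interval_seconds)


present = positions_per_day(500, 3 * 60)      # 500 vessels, every 3 min
future_low = positions_per_day(10_000, 30)    # 10k vessels, every 30 s
future_high = positions_per_day(100_000, 30)  # 100k vessels, every 30 s

print(f"present:     {present:>12,} positions/day")
print(f"future low:  {future_low:>12,} positions/day ({future_low // present}x)")
print(f"future high: {future_high:>12,} positions/day ({future_high // present}x)")
```

Under these assumptions the load grows from 240,000 positions per day to between 28.8 million and 288 million, a factor of 120 to 1,200.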
![Page 10: PyData Paris 2015 - Track 4.2 Jean Maynier](https://reader034.vdocuments.net/reader034/viewer/2022042818/55aabbe71a28ab54138b45f3/html5/thumbnails/10.jpg)
Batch to data streaming
• Parallelization
• Granularity: item vs. source
• Handle failure, no data loss, monitoring
• Akka Streams, Spark Streaming, Celery, Storm
Step 5
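The difference between item and source granularity can be sketched with the standard library alone. Here each item is processed and retried independently, and items that still fail are kept in a dead-letter list, so one malformed record neither fails the whole source nor gets silently dropped. This uses `concurrent.futures`, not any of the frameworks listed above. The `process_item` step and the record layout are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def process_item(item):
    """Hypothetical per-item step; raises on malformed input."""
    if "lat" not in item:
        raise ValueError("missing latitude")
    return {**item, "lat": float(item["lat"])}


def process_with_retry(item, attempts=3):
    """Retry one item a few times; return (result, None) or (None, error)."""
    last_error = None
    for _ in range(attempts):
        try:
            return process_item(item), None
        except Exception as exc:  # a real pipeline would narrow this
            last_error = exc
    return None, last_error


def run_source(items, workers=4):
    """Process a source item by item, in parallel, without losing failures."""
    done, dead_letters = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with items.
        for item, (result, error) in zip(items, pool.map(process_with_retry, items)):
            if error is None:
                done.append(result)
            else:
                dead_letters.append((item, repr(error)))
    return done, dead_letters


SOURCE = [{"id": 1, "lat": "51.9"}, {"id": 2}, {"id": 3, "lat": "43.3"}]
done, dead_letters = run_source(SOURCE)
```

With source granularity, the failure of item 2 would abort or roll back all three records. Here items 1 and 3 go through, and item 2 is kept with its error for monitoring and later replay.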
![Page 11: PyData Paris 2015 - Track 4.2 Jean Maynier](https://reader034.vdocuments.net/reader034/viewer/2022042818/55aabbe71a28ab54138b45f3/html5/thumbnails/11.jpg)
Storm at Kpler: POC
Websites sources
API sources
Scraping Python/scrapy
ETL + models
Position
Dock/undock
Custom prices
Match prices
Next destination
Trades
Cargo volumes