Shipping Data Science Products!
Turning raw data into valuable services
BudapestBI Forum 2015
License: CC By Attribution
Ian Ozsvald @IanOzsvald ModelInsight.io
[email protected] @IanOzsvald BudapestBI Forum October 2015
Who Am I?
● “Industrial Data Science” for 15 years
● Data Product Builder
● O'Reilly Author
● Teacher at PyCons
Who are you?
● Type A(nalysis) or B(uilding)
● Robert Chang - “Doing Data Science at Twitter”
What frustrations do we share?
● Lack of useful data
● Biggest time sink - cleaning & transforming
● Conservative management
● How can we derisk projects?
● Medium Data
● luckily we have Wes in the room
Which projects succeed?
● Explain existing data (visualisation!)
● Automate repetitive/slow processes (higher accuracy, more repeatable)
● Augment data to make new data (e.g. for search engines and ML)
● Predict the future (e.g. replace human intuition or use subtler relationships)
Why is it valuable?
Visualising data
● Most data isn't interesting...
● Requires human curation + detective skills to get the good stuff
● Couple a researcher + a business person
Medical data (anti-allergy)
Perceived complexity might make sign-off more difficult...
Medical data (anti-allergy)
Predict using:
● food
● alcohol
● pollen
● pollution
● location
● cats
● ...
Extracting data from binary files
● Copy/pasting PDF/PNG data is laborious
● How can we scale it?
● textract/Tika - unified interface
● Specialised tools e.g. Sovren
● This might take months!
Augmenting data
● Identifying people, places, brands, sentiment
● “i love my apple phone”
● Context-sensitive (e.g. movies vs products)
● Build custom machine-learned tools
● Augment job titles
● Reconcile the same order in 2 tables
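The last bullet, reconciling the same order across 2 tables, can be sketched with nothing but the standard library. This is only an illustration: the `order_id` column, the sample rows and the "digits only" normalisation rule are assumptions made up for the example, not something taken from the talk.

```python
import re


def normalise_order_id(raw):
    """Reduce an order reference to a comparable key (hypothetical rule)."""
    # Keep the digits only, then drop leading zeros: "ORD-00123" -> "123".
    digits = re.sub(r"\D", "", raw)
    return digits.lstrip("0")


def reconcile(table_a, table_b):
    """Pair rows from two tables whose order ids normalise to the same key."""
    index_b = {normalise_order_id(row["order_id"]): row for row in table_b}
    matched, unmatched = [], []
    for row in table_a:
        partner = index_b.get(normalise_order_id(row["order_id"]))
        if partner is None:
            unmatched.append(row)  # keep these for a human to review
        else:
            matched.append((row, partner))
    return matched, unmatched


# Hypothetical sample data: the same order written two different ways.
billing = [{"order_id": "ORD-00123", "total": 40.0},
           {"order_id": "ORD-00456", "total": 15.5}]
shipping = [{"order_id": "ord 123", "city": "Budapest"}]

matched, unmatched = reconcile(billing, shipping)
```

Returning the unmatched rows rather than dropping them keeps the failures visible, which is usually where the dirty-data stories start.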
Machine Learning
● PyMC (Markov Chain Monte Carlo)
Please cite these projects! (it helps their funding)
Debugging Machine Learning?
● Thoughts from you?
● No obvious tools to show me:
● these examples were well-fitted
● these always wrongly-fitted
● these always uncertain
● No data-diagnostics to validate inputs (e.g. for Logistic Regression)
● No visualisers for most of the models
● Your hard-won knowledge -> new debug tools? (PLEASE!)
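One possible shape for the missing "show me which examples are well-fitted, wrongly-fitted or uncertain" tool is sketched below. It assumes you already have, for each example, the labels predicted across several resampled training runs (for instance repeated cross-validation); the function name `fit_report` and the 0.2/0.8 cut-offs are hypothetical choices for illustration, not an existing library API.

```python
from collections import Counter


def fit_report(true_labels, repeated_predictions, low=0.2, high=0.8):
    """Label each example 'well-fitted', 'wrongly-fitted' or 'uncertain'.

    repeated_predictions[i] holds the labels predicted for example i
    across several resampled training runs.
    """
    report = []
    for truth, predictions in zip(true_labels, repeated_predictions):
        # Fraction of runs in which this example was predicted correctly.
        hit_rate = sum(p == truth for p in predictions) / len(predictions)
        if hit_rate >= high:
            report.append("well-fitted")
        elif hit_rate <= low:
            report.append("wrongly-fitted")
        else:
            report.append("uncertain")
    return report


# Hypothetical predictions for three examples over five runs each.
truths = [1, 0, 1]
runs = [[1, 1, 1, 1, 1],   # always right
        [1, 1, 1, 1, 1],   # always wrong
        [1, 0, 1, 0, 1]]   # flips between runs
report = fit_report(truths, runs)
summary = Counter(report)
```

The "wrongly-fitted" and "uncertain" groups are the ones worth reading by hand; they often point at label errors or missing features rather than at the model.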
Debugging Machine Learning?
Roelof Pieters PyDataLondon2015
Delivery: Keep It Simple (Stupid!)
● We're (probably) not publishing the best result
● Debuggability is key - 3am Sunday CTO beeper alert is no time for complexity
● “cult of the imperfect” Watson-Watt
● Dumb models + clean data beat other combinations
Don't Kill It!
● Your data is missing, it is poor and it lies
● Missing data kills projects!
● Log everything!
● Make data quality tools & reports
● More data -> desynchronisation
● R&D != Engineering
● Discovery-based
● Success and failure equally useful
engarde
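A data quality report does not need to be elaborate to be useful. The sketch below counts missing cells per column in a CSV using only the standard library; it is not engarde's API, and the column names, sample rows and the set of "missing" markers are assumptions chosen for the example.

```python
import csv
import io


def missing_value_report(csv_text, missing_markers=("", "NA", "null")):
    """Count missing cells per column so gaps are visible before modelling."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name in reader.fieldnames:
            value = row[name]
            # DictReader yields None for cells missing from short rows.
            if value is None or value.strip() in missing_markers:
                counts[name] += 1
    return n_rows, counts


# Hypothetical sample: two missing ages and one missing city.
sample = "user,age,city\nalice,34,London\nbob,,Budapest\ncarol,NA,\n"
n_rows, counts = missing_value_report(sample)
```

Running a report like this on every new data delivery, and logging the result, makes it much easier to notice when a source quietly degrades.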
Internal deployment
● CSVs/Reports
● Database updates
● IPython Notebook (not secure though!)
Deploying live systems
● Spyre (locked-down)
● Microservices
● Flask is my go-to tool
● Swagger docs
● (git pull / fabric / provisioned machines)
● Docker + Amazon ECS
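To make the microservice idea concrete, here is a minimal sketch of a JSON health-check endpoint. It deliberately swaps Flask for a bare WSGI callable from the standard library so it runs without any dependencies (Flask applications are WSGI applications too); the `/health` route and the `call` helper are hypothetical names invented for this example.

```python
import json
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """Tiny WSGI service: GET /health returns a JSON status document."""
    if environ.get("PATH_INFO") == "/health":
        status, payload = "200 OK", {"status": "ok"}
    else:
        status, payload = "404 Not Found", {"error": "unknown route"}
    body = json.dumps(payload).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]


def call(path):
    """Invoke the app in-process, the way a unit test would."""
    environ = {}
    setup_testing_defaults(environ)  # fills in a valid minimal WSGI environ
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], json.loads(body)


health = call("/health")
missing = call("/nope")
```

For a quick local run, `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` serves the same callable; in production you would more likely put Flask and a proper WSGI server in front of the model code.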
Python Deployment
● Make Python modules (setup.py)
● python setup.py develop # symlink
● Unit tests + coverage
● Use a config system (e.g. github.com/ianozsvald/python_template_with_config)
● Keep Separation of Concerns!
● “12 Factor App” useful ideas
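One way to read the "config system" and "12 Factor App" bullets together is to keep settings in environment variables and load them in a single place. The sketch below is an assumption about how that might look, not the approach used in python_template_with_config; the variable names `APP_ENV`, `DATABASE_URL` and `APP_DEBUG` and their defaults are hypothetical.

```python
import os


def load_config(environ=None):
    """Build a settings dict from environment variables with safe defaults."""
    # Accepting a mapping makes the loader easy to unit test.
    environ = os.environ if environ is None else environ
    return {
        "env": environ.get("APP_ENV", "development"),
        "database_url": environ.get("DATABASE_URL", "sqlite:///dev.db"),
        "debug": environ.get("APP_DEBUG", "0") == "1",
    }


# Hypothetical production environment passed in explicitly.
config = load_config({"APP_ENV": "production",
                      "DATABASE_URL": "postgres://db/prod"})
defaults = load_config({})
```

Because the rest of the code only sees the returned dict, the same module runs unchanged on a laptop, in tests and on a provisioned machine.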
Some common gotchas
● MySQL UTF8 is 3 byte by default #sigh
● JavaScript months are 0-based (not 1)
● Never compromise on datetimes (ISO 8601)
● iOS NSDate's epoch is 2001
● Windows CP1252 text (strongly prefer UTF8)
● MongoDB no_cursor_timeout=True
● GitHub's 100MB file limits (new Large File Storage)
● Never throw data away! Never overwrite original data! Always transform it (e.g. Luigi)
● Data duplication bites you in the end...
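Several of these gotchas are easy to guard against with a few small helpers. The sketch below uses only the standard library; the function names are hypothetical, and the checks are illustrative rather than exhaustive.

```python
from datetime import datetime, timedelta, timezone

# NSDate's reference date is 2001-01-01 00:00:00 UTC, not the Unix 1970 epoch.
NSDATE_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)


def nsdate_to_iso8601(seconds_since_2001):
    """Convert an NSDate time interval to an ISO 8601 UTC timestamp."""
    return (NSDATE_EPOCH + timedelta(seconds=seconds_since_2001)).isoformat()


def js_month_to_python(js_month):
    """JavaScript Date months run 0-11; Python's datetime uses 1-12."""
    return js_month + 1


def needs_utf8mb4(text):
    """True if text has characters that take 4 bytes in UTF-8.

    MySQL's 3-byte "utf8" charset cannot store these; utf8mb4 can.
    """
    return any(ord(ch) > 0xFFFF for ch in text)


def cp1252_to_utf8(raw_bytes):
    """Re-encode Windows CP1252 bytes as UTF-8."""
    return raw_bytes.decode("cp1252").encode("utf-8")


stamp = nsdate_to_iso8601(86400)          # one day after the NSDate epoch
january = js_month_to_python(0)           # JavaScript's month 0 is January
emoji_needs_mb4 = needs_utf8mb4("\U0001F600")
quoted = cp1252_to_utf8(b"\x93hi\x94")    # CP1252 curly quotes around "hi"
```

Converting at the boundary, as soon as data enters the system, means everything downstream can assume UTF-8 text and timezone-aware ISO 8601 timestamps.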
(Perhaps) Avoid Big Data
● Don't be in a rush - 50,000 lines of good data will beat a pile of Bad Big Data
● 244GB RAM EC2+many Xeons $2.80/hr
“Data Science Delivered”
● New mini project / pamphlet
● Includes dirty data strategies, ways to debug ML, thoughts on managing projects - 15 yrs experience (please critique and file bugs!)
● https://github.com/ianozsvald/data_science_delivered
● Please give me your feedback
Closing
● Tell me your dirty data stories, perhaps in a Ruin Pub? (I am automating some of this)
● Takehome - Keep it clean, keep it simple
● Come talk on your projects at our PyDataLondon monthly meetup or start your own!