data pipelines for small, messy and tedious data

Data Pipeline Architect. Data Pipelines: For small, messy and tedious data. Vladislav Supalov, 27th October 2016


Page 1: Data pipelines for small, messy and tedious data

Data Pipeline Architect

Data Pipelines
For small, messy and tedious data.

Vladislav Supalov, 27th October 2016

Page 2: Data pipelines for small, messy and tedious data

How to tell if this talk is for you?

● Big Data
  ○ Pretty fascinating
  ○ “Good problem to have”
● Most companies
  ○ Not quite there
  ○ Should not start at this level
● This is for you, if you are close to the data at a
  ○ Startup
  ○ Growing company
  ○ Established company which is about to start an initiative
● Working with a new CDO, CAO, Head of BI

Page 3: Data pipelines for small, messy and tedious data

I want to help you achieve better results!

● What will help you to deal with …?
  ○ small data (not much is needed to be valuable)
  ○ messy data (multiple data sources, no overview)
  ○ tedious-to-handle data (multiple data sources, lots of manual work)
● “Use <tech X> in <way Y> and you will be fine”. Nope.
  ○ Just dealing with data is not a magic bullet
  ○ This will not guarantee good results for your company
  ○ You might get lucky of course. That’s not a safe bet.
● How can we improve your chances? Reduce risk.
  ○ Focus on what matters

Page 4: Data pipelines for small, messy and tedious data

Jumping to tech, we would dive too deep, too early.

● What people tend to think about first:
  ○ Dashboards
  ○ Tools
  ○ Technical solutions, best practices & tricks
● That’s tactics
● We should not jump into implementation details right away.
● Let’s not.

Page 5: Data pipelines for small, messy and tedious data

The Craft of Designing & Building Data Pipelines
Should start with understanding the business.

Page 6: Data pipelines for small, messy and tedious data

Hi, I’m Vladislav!

● Data background
  ○ Machine learning, computer vision, data mining
● Fascination with DevOps
  ○ Efficient, reliable infrastructure setups
  ○ Monitoring, automation, processes
● Currently: Co-founding a startup - Pivii Technologies
  ○ Startup, accelerated by Axel Springer Plug and Play
  ○ Artificial intelligence for content marketing
  ○ AI, ML, CV, data!
  ○ pivii.co
● Previously: Building a data engineering consulting business
  ○ datapipelinearchitect.com

vsupalov

Page 7: Data pipelines for small, messy and tedious data

Preferred consulting situation:

● Mobile application marketing agency
  ○ Not necessarily huge data
  ○ Very valuable and worthwhile (from a certain point)
● “We built prototype analytics tools in-house and they are mostly functional”
  ○ “We have seen the value!”
  ○ But they are painful to work with and broken
  ○ “Time and money are still being wasted.”
● Tools were created out of an actual need
  ○ Organically, with little planning
  ○ “How can we do better?”
  ○ “Where do we go from here?”

Page 8: Data pipelines for small, messy and tedious data

Common Success Pattern: Business Value was Created.
They have already achieved visible and measurable impact for the company.
Or they have gotten VERY close to doing so and are thinking about ROI.

Page 9: Data pipelines for small, messy and tedious data

Business first. Tech follows.

● Key to successful data projects
  ○ Especially with limited resources
  ○ And small data
● Technical decisions should be informed by business needs and goals
● Handling data is a very small part of the whole
  ○ Straightforward once business needs are clear
● It starts with the mindset
  ○ Don't consider data plumbing in isolation

Page 10: Data pipelines for small, messy and tedious data

Key: being conscious and deliberate about the intention of creating business value.
Let’s take a brief detour.

Page 11: Data pipelines for small, messy and tedious data

Consider sword fighting.

● A great samurai sword master
● 1584 - 1645
● Miyamoto Musashi
  ○ Martial artist
  ○ Tactician
  ○ Strategist
  ○ Artist
  ○ Sculptor
  ○ Calligrapher
  ○ Writer
  ○ Philosopher
  ○ ...

Images: Miyamoto Musashi, self-portrait, http://sv-musashi1.com/about_Musashi.htm, Musashi Miyamoto with two Bokken, http://www.akinokai.org/images/Images.htm?Musashi.jpg

Page 12: Data pipelines for small, messy and tedious data


“The primary thing when you take a sword in your hands is your intention to cut the enemy, whatever the means.”

- Miyamoto Musashi, The Book of Five Rings

Page 13: Data pipelines for small, messy and tedious data

“Whenever you parry, hit, spring, strike or touch the enemy’s cutting sword, you must cut the enemy in the same movement.”

- Miyamoto Musashi, The Book of Five Rings

Page 14: Data pipelines for small, messy and tedious data

“It is essential to attain this. If you think only of hitting, springing, striking or touching the enemy, you will not be able actually to cut him.”

- Miyamoto Musashi, The Book of Five Rings

Page 15: Data pipelines for small, messy and tedious data

“More than anything, you must be thinking of carrying your movement through to cutting him. You must thoroughly research this.”

- Miyamoto Musashi, The Book of Five Rings

Page 16: Data pipelines for small, messy and tedious data

The goal of sword fighting is to cut the opponent.

● Stating this makes it seem very obvious.
  ○ Why the effort and emphasis?
● It’s not. Even for aspiring practitioners.
  ○ Results suffer.
● Mindset is essential for mastery
● The core advice (to my understanding):
  ○ Attain, cultivate and apply a goal-oriented mindset
  ○ Aim every step you take towards the goal

Page 17: Data pipelines for small, messy and tedious data

Back to the world of data-handling businesses!

● When working with company data
  ○ Before starting out on a project
  ○ Understand what you want and can achieve
  ○ Aim to create a positive impact on the business
  ○ Make it a constant, conscious goal
● The main tasks to do so are:
  ○ Understand the business
  ○ Understand the people
    ■ It’s about communication
  ○ Understand current processes
  ○ Be prepared to learn and revise

Page 18: Data pipelines for small, messy and tedious data

Use this process when approaching a new project:

● Qualify client/project
  ○ Does it make sense to get involved?
  ○ Is it evident that we can create value?
● Perform conversations/interviews
  ○ Find out more about the context
    ■ company, status, goals, limitations...
  ○ Learn from first-hand experience
● Summarize information, learnings and plans in writing
  ○ Roadmap document
  ○ Depicting the situation and ways forward

Page 19: Data pipelines for small, messy and tedious data

Is there potential for a good fit?
Do budget, topic and goals seem in order?

Page 20: Data pipelines for small, messy and tedious data

Qualifying considerations. Learning about the client and project.

● What are you working on?
● What part of the project would you like help with?
● What needs to happen to make this a success for you?
● Why was this project started? What are the business goals?
● Is there an event that triggered it?
● Why especially now?
● What’s the budget? (ballpark estimate)
● When are you looking to get started?

Page 21: Data pipelines for small, messy and tedious data

Still good? Let’s start a business relationship.
Initial research and planning. Roadmapping consulting package.

Page 22: Data pipelines for small, messy and tedious data

Four people to talk to:

● Project owner
  ○ We want this person to be successful
● Business owner or C-level perspective
  ○ Knows what’s best for the business
  ○ “What could the CEO ask you in the hallway?”
● Data wrangler - tales from the trenches
  ○ Insights into day-to-day business and data details
● Engineering side
  ○ Current tech stack
  ○ Information on constraints and preferences
  ○ Last touches
● Conversation focus, questions and duration vary from person to person.

Page 23: Data pipelines for small, messy and tedious data

Interviews completed, situation understood and put into writing.

● After a bit of focused communication, we have a great foundation!
  ○ Project motivation
  ○ Business goals
  ○ Who should benefit
  ○ How to make it happen
● Different perspectives on the project and business.
● Time for tech!
  ○ Context clear (goals, constraints)
● Best case:
  ○ Very few choices left to make

Page 24: Data pipelines for small, messy and tedious data

Here’s what I would have told myself when starting out:

● Learn about the company
  ○ Easier with fresh eyes
● Understand the business
  ○ Multiple perspectives
● Keep the goal in mind
  ○ Helps you learn the right things
  ○ Cultivate a business mindset (help earn more/lose less)
  ○ Aim for results
    ■ I will not stop saying this anytime soon :)
● Have a process laid out

Page 25: Data pipelines for small, messy and tedious data

Finally: Tactical Advice Which Fits the Remaining Time.
That’s the right proportion :)

Page 26: Data pipelines for small, messy and tedious data

Don’t roll your own home-baked scripts.

● "Quick and easy" isn't
● Uniqueness is bad, boring is good
  ○ Learning curve for others
  ○ Original author leaving
  ○ Maintenance time, tricky bugs, code duplication
  ○ Unexpected failure modes
● Extensibility?
● Growth?
● Metadata?

Page 27: Data pipelines for small, messy and tedious data

You should know about workflow engines.

● Workflow = “[..] orchestrated and repeatable pattern of business activity [..]” [1]
● Data flow = “bunch of data processing tasks with inter-dependencies” [2]
● Pipelines of batch jobs
  ○ complex, long-running
● Dependency management
● Reusability of intermediate steps
● Logging and alerting
● Failure handling
● Monitoring
● Lots of effort went into them (Broken data? Crashes? Partial failures?)

[1] https://en.wikipedia.org/wiki/Workflow
[2] Elias Freider, 2013, “Luigi - Batch Data Processing in Python”
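To make the workflow engine features above more concrete, here is a minimal sketch in plain Python. It is not how Luigi or any real engine is implemented. The function `run_pipeline` and the task names are hypothetical, chosen only to illustrate dependency management, reuse of finished intermediate steps, alerting and partial-failure handling. It assumes Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(tasks, dependencies, done=None):
    """Run tasks in dependency order (hypothetical toy engine).

    tasks:        dict mapping task name -> zero-argument callable
    dependencies: dict mapping task name -> set of task names it needs
    done:         names of tasks whose output already exists
    Returns (completed, failed, skipped) as lists of task names.
    """
    done = set(done or ())
    completed, failed, skipped = [], [], []
    blocked = set()  # tasks that failed or depend on a failed task

    # Every task appears in the graph, even if it has no dependencies.
    graph = {name: set(dependencies.get(name, ())) for name in tasks}
    for name in TopologicalSorter(graph).static_order():
        if name in done:
            continue  # reuse the existing intermediate result
        if graph[name] & blocked:
            blocked.add(name)  # partial failure: do not run on broken input
            skipped.append(name)
            continue
        try:
            tasks[name]()
        except Exception as error:
            print(f"ALERT: task {name!r} failed: {error}")  # stand-in for alerting
            blocked.add(name)
            failed.append(name)
        else:
            done.add(name)
            completed.append(name)
    return completed, failed, skipped


def broken_clean():
    raise ValueError("broken data")


# Hypothetical example pipeline: "clean" fails, so "report" is skipped,
# while the independent "archive" task still runs.
example_tasks = {
    "extract": lambda: None,
    "clean": broken_clean,
    "report": lambda: None,
    "archive": lambda: None,
}
example_dependencies = {
    "clean": {"extract"},
    "report": {"clean"},
    "archive": {"extract"},
}
result = run_pipeline(example_tasks, example_dependencies)
```

Even this toy version has to decide what happens downstream of a failed step and how to avoid redoing finished work. Real workflow engines have already put a lot of effort into those decisions, which is the argument against home-baked scripts.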

Page 28: Data pipelines for small, messy and tedious data

If in doubt, try Luigi.

● Spotify
  ○ Lots of data!
  ○ 10k+ Hadoop jobs every day [1]

● Battle-hardened
  ○ Open-sourced in 2012
  ○ Has been used in production by large companies for a while

● Python
● Modular & extensible
● Dependency graph
● Not just for data tasks

[1] Erik Bernhardsson, 2013, “Building Data Pipelines with Python and Luigi”

Page 29: Data pipelines for small, messy and tedious data

Usually worthwhile pipeline properties:

● Keep it small and lean
● Make learning and iterating easy
  ○ Changes should be cheap to accommodate (in both time and money)
● Build something to start learning
● Get data into one place
● Don’t reinvent the wheel
  ○ The tools are out there
  ○ ETL and workflow engines
● Create quick positive results, be efficient (lazy)
  ○ Many small improvements everywhere
  ○ Instead of solving everything for one group
  ○ More bang-for-the-buck
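As an illustration of "get data into one place" for small data, here is a minimal sketch using only the Python standard library. Everything specific in it is an assumption made for the example: the `installs` table, the `day` and `installs` columns, the source names and the sample numbers are invented and do not come from the talk.

```python
import csv
import io
import sqlite3


def load_csv(connection, csv_file, source):
    """Append rows from one CSV source to a shared table, tagged with their origin."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS installs (source TEXT, day TEXT, installs INTEGER)"
    )
    rows = [
        (source, row["day"], int(row["installs"]))
        for row in csv.DictReader(csv_file)
    ]
    connection.executemany("INSERT INTO installs VALUES (?, ?, ?)", rows)
    connection.commit()
    return len(rows)


# Two hypothetical exports that would otherwise live in separate spreadsheets.
ad_network = io.StringIO("day,installs\n2016-10-01,120\n2016-10-02,95\n")
app_store = io.StringIO("day,installs\n2016-10-01,80\n")

connection = sqlite3.connect(":memory:")  # a file path would make it persistent
load_csv(connection, ad_network, "ad_network")
load_csv(connection, app_store, "app_store")

# Once the data sits in one place, cross-source questions become a single query.
totals = connection.execute(
    "SELECT day, SUM(installs) FROM installs GROUP BY day ORDER BY day"
).fetchall()
```

A single SQLite file is often enough to start learning from small data. The storage choice can be revisited later, once the business needs are clearer.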

Page 30: Data pipelines for small, messy and tedious data

In conclusion:

● Don’t dive into tactics right away
● Aim to create business value
  ○ Make it a conscious goal
● Understand the business, people and processes
  ○ This will take some time. It’s a good investment.
  ○ Have a process yourself
  ○ Tech choices will follow
● Try to make it easy to learn and iterate
● Get data in one place
● Don’t go with home-baked scripts
● Consider workflow engines
  ○ Luigi in particular

Page 31: Data pipelines for small, messy and tedious data

Thanks! Want to learn more?

“What questions to ask? Am I missing something?”
For your future interviews and planning:

I want to share my seed-question lists with you!

Just drop me your email address at: http://datapipelinearchitect.com/datanatives/