
6 Rules for Designing Recursive DataFlows
By Tyler Pugliese

WELCOME

Who am I and why am I here?

Tyler Pugliese
Fastly (Mar 2013)

BI Architect (Major Domo)

Fastly’s Edge Cloud Platform

Fastly’s Customers

“Spotify’s users expect immediate access to their favorite songs, podcasts and playlists, at home, work, and anywhere in between. Fastly is a critical part of our toolkit, helping us deliver an amazing listening experience. Fastly helps us innovate on content delivery and ensure high quality performance for our users around the world.”

Niklas Gustavsson, Principal Engineer at Spotify.

“I’m a huge fan of Fastly. On election night, we have 100,000 requests per second, and Fastly performed flawlessly - we had no problem at all.”

Nick Rockwell, CTO New York Times

“When you do a Super Bowl ad, over 100 million people are tuning in. For us, working with Fastly was all about setting up the website to be scalable and secure, and to deliver a really seamless experience for anybody that came to the site that day. We had tons of traffic, and the site held up and scaled exactly how we hope it would thanks to Fastly.”

Michael Dubin, CEO, Dollar Shave Club

Fastly by the Numbers

• 64 Net Promoter Score

• 70M+ Lines of Edge Code Deployed Monthly

• 400B+ Daily Internet Requests

Fastly’s Domo Ecosystem

• DataFlows: 280 DataFlows built from 820 DataSets (from 2 inputs up to 40+)

• Sources: 35 Sources, 2B Rows (1.8B from the Data Warehouse)

• Data Cards: 2,500 Cards over 115 dashboards (3 have 60+)

Fastly’s Domo Challenges

• External System Empowerment Imbalance vs. Domo

• Larger DataSets invite more and longer scrutiny, and longer runtimes

• Over-reliance on one giant DataFlow’s output; we prefer smaller, more manageable, logic-specific DataFlows

• Data Lineage only looks forward; we wish it looked backward

Before Recursive DataFlow:

After Recursive DataFlow:

How does a DataFlow reduce the time of an Update by 99%?

A ‘Dynamic’ Solution - Last 5 Days Append
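As a rough illustration, a “last 5 days” Dynamic Input could be defined in a MySQL-style Domo DataFlow along these lines (the table and column names are hypothetical, not Fastly’s actual schema):

-- Dynamic Input sketch: pull only the trailing 5-day window (assumed schema)
SELECT *
FROM `accounting_transactions`
WHERE `transaction_date` >= DATE_SUB(CURDATE(), INTERVAL 5 DAY);

Because this window always overlaps the previous runs, the same transaction can arrive several times; the de-duplication step in Rule #5 resolves that.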

What Makes a DataFlow Recursive?

RDFs: a set of interdependent DataFlows connected by the reliability (trust) of their data, the dimensions of that data, and the frequency of their updates.

Recursion reuses the output DataSets of these DataFlows as input DataSets, redefining those connections.

Recursive DataFlows 101

Reliability
Your final output should be as trusted as your first input

Frequency
Your business logic should define your recursion

Dimensionality
Your data should be more than a single point in time

6 Rules for Designing Recursive DataFlows
1. Start with Data you trust

2. Define a Successful Update

3. Focus on your Frequency (DataFlow #1)

4. Lean on your Legacy (DataFlow #2)

5. Stop! Append, De-Duplicate and Iterate

6. End with Data you trust

“How do we trust this Data?”

-Everyone

“Trust means doing what you say you’re going to do… over and over again…”

-My Boss

#1. Start with Data You Trust
For Fastly:
• Verified Transaction Data from Accounting
• Every day we pull the past 5 days of data - Dynamic Input
• Every 15th day we pull ALL objects - Historic Input

For your organization:
• Verified outputs of complex DataFlows
• Large DataSets with constant refreshes
• Entire imports that require a new dimension

#2. Define a Successful Update
For Fastly:
• Our Dynamic Input updates with a Line Unique Key & Date
• Our Historic Input doesn’t fail after 14 hours
• A null value means no new transactions

For your organization:
• A new time period, a new historic set of records
• Salesforce Accounts were updated and pushed to Domo
• A purchase order was successfully processed
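One way to verify a successful update in a MySQL-style DataFlow is to inspect Domo’s batch columns (the same `_BATCH_ID_` and `_BATCH_LAST_RUN_` columns used for composite keys later in this deck); the input table name here is an assumption:

-- Row count delivered by the most recent batch; zero rows can be legitimate
-- (e.g. no new transactions), so alert on expectations, not on emptiness
SELECT `_BATCH_ID_`,
       MAX(`_BATCH_LAST_RUN_`) AS `last_run`,
       COUNT(*) AS `row_count`
FROM `dynamic_input`
GROUP BY `_BATCH_ID_`
ORDER BY `last_run` DESC
LIMIT 1;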

Pause: Known Domo “Oh no…”s
• Outputs of DataFlows can’t be their own inputs

• Naming things is hard, tagging our outputs is key

• There are multiple ways to solve problems, this is one solution

• Focus on the recursion, not supplementary logic

• Start simple, then iterate (make sure it works!)

Frequency / Dynamic

The new record of a new time / event partition

Our 5 Day Update

Legacy / Historic

The old records with all time / event data

Our All Time Update

Recursive DataFlows 201

Recursive DataFlows: Naming and Visual Help
• Slides will be available afterwards; focus on process

• Mnemonics and color labeling:
  • Update
  • Static
  • Dashed Boxes = Replaceable
  • Solid Boxes = Final

Frequency

Legacy

#3. Focus on your Frequency Output (DataFlow #1: Frequency)

#3. Create a copy of your current state of Data (DataFlow #1: Frequency)

#3. Turn your Arc into an Input (DataFlow #1: Frequency)

#3. Build the framework of logic and process (DataFlow #1: Frequency)

#3. Initialize our Production Output (DataFlow #1: Frequency)

#3. Review our DataFlow’s temporary state (DataFlow #1: Frequency)

#4. Lean on your Legacy (DataFlow #2: Legacy)

#4. Create a copy of our Production so it can be an Input! (DataFlow #2: Legacy)

#4. Replace the Arc Output with the Legacy Output (DataFlow #1: Frequency)

#4. Finalize our RDF
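In outline, the finalized RDF wires up like this (a sketch with illustrative DataSet names; Domo forbids a DataFlow’s output being its own input, which is why the Legacy copy exists):

-- DataFlow #1 (Frequency): inputs  = `dynamic_input` (last 5 days)
--                                  + `legacy_output` (copy of Production)
--                          output  = `production_output`
-- DataFlow #2 (Legacy):    input   = `production_output`
--                          output  = `legacy_output`
-- The copy made by DataFlow #2 is what breaks the direct input/output cycle.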

#5. Stop! Append, De-Duplicate and Iterate
• Recursions: Go! | Loops: Stop!

• First, Append ALL! Then Shared Rows!

• Determine duplicate Data, De-Duplicate your Data

• Iterate! (Some Variant Examples)

#5. Revisit our Recursion: How do we Stop! this from looping?

#5. Dynamic = Update, Production = Update, Legacy = Stop! (DataFlow #2: Legacy)

#5. Frequency Updates with our Dynamic Input (DataFlow #1: Frequency)

#5. Updated Visualization of Triggered Updates

#5. Append smartly! (DataFlow #1: Frequency)
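A minimal sketch of the “Append ALL” step inside DataFlow #1, using the illustrative names above:

-- Append everything first; duplicates are removed in the next step
SELECT * FROM `legacy_output`
UNION ALL
SELECT * FROM `dynamic_input`;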

#5. Defining Duplicate Duplicates
• For Fastly Netsuite:
  • Line Unique Key - only one Transaction Detail should exist

• Composite Keys:
  • Identifier + _BATCH_ID_ + _BATCH_LAST_RUN_
  • Salesforce Account ID + SysModStamp
  • Identifier + Yesterday’s Date

• Sanity Check:
  • Production Output # of rows == Historic Input
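Keeping only the newest row per Line Unique Key could then look like this in MySQL (again with assumed table and column names); the final query mirrors the row-count sanity check above:

-- De-duplicate: keep the most recent version of each Line Unique Key
SELECT a.*
FROM `appended` a
JOIN (
    SELECT `line_unique_key`,
           MAX(`_BATCH_LAST_RUN_`) AS `last_run`
    FROM `appended`
    GROUP BY `line_unique_key`
) latest
  ON  a.`line_unique_key`  = latest.`line_unique_key`
  AND a.`_BATCH_LAST_RUN_` = latest.`last_run`;

-- Sanity check: Production row count should equal the Historic Input
SELECT (SELECT COUNT(*) FROM `production_output`) =
       (SELECT COUNT(*) FROM `historic_input`) AS `row_counts_match`;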

#5. Iterate! Make our Legacy Include Historic Refresh! (DataFlow #2: Legacy)

#5. More Reliability! I trust this data! (DataFlow #2: Legacy)

#5. Double Update, Double Resiliency

#6. Ending with Data We Trust

Parallel Thinking
Label everything, simultaneously

Trust
Consistency is reliability is trustworthy

Automated Updates
When your inputs update, so too should your RDFs

Fastly Successes with RDF
• Netsuite Saved Search Import
  • ~500k Rows + ~30k Rows a Month
  • Critical Financial Data
  • Was: 13+ Hours, failed often, no error explanation
  • Now: < 13 min, 2 Imports, 2 DataFlows

• Fastly Service Configuration Imports
  • 1.7M Rows + 3k Rows a Day
  • Saves time and stress with incremental updates
  • Allows us to see changes in configurations over time

Take Home Work from Professor Pugliese:
• Build your own Dynamic DataSet:
  • Define a Successful Update (Time / Event)
  • Create an additional Output DataSet Object from a DataFlow
  • Make the update the focus of your DataFlows, not everything!

• Categorically manage your DataFlows and DataSets:
  • Use Tags to define Input Sources
  • Use Nomenclature to define types of DataFlows

• Think broader and smarter about your Domo and how it’s utilized in your company

Key Takeaways
• Process time is a function of DataSet size, input connector, and Domo; DataSet size is the one you have the most control over.

• Recursive DataFlows are a challenging yet rewarding way to improve the consistency of, and trust in, your DataSets.

• Multiple, logically distinct DataFlows give you more control and insight, but they burden your mental capacity; weigh that trade-off against one giant DataFlow that does everything.

--Outro Slide
CASE
  WHEN `presentation_rating` = 'Amazing'
   AND COUNT(DISTINCT `key_takeaways`) > 0
  THEN 'Applause'
  ELSE 'Questions?'
END

THANK YOU