scalding at etsy

Upload: dan-mckinley

Post on 27-Jan-2015

Category:

Technology


DESCRIPTION

A description of how Scalding came to be used for analysis at Etsy.

TRANSCRIPT

Page 1: Scalding at Etsy
Page 2: Scalding at Etsy

So hey everybody, my name is Dan McKinley

Page 3: Scalding at Etsy

I’m visiting from LA

Page 4: Scalding at Etsy

I worked for Etsy for 6.5 years, mostly from Brooklyn. In an office considerably less sparse than this one, I assure you. Mea culpa, that’s “worked” in the past tense. I quit to join a startup last month. After signing up to give this talk. But I left on very good terms so I’m still doing it.

Page 5: Scalding at Etsy

This talk’s about Scalding, and how we wound up using it at Etsy.

Page 6: Scalding at Etsy

When I was writing this talk this passage from Douglas Adams kept popping into my brain. I do feel like we had scalding thrust upon us at Etsy, rather than choosing it intentionally. Which is not the same as saying that I was personally unhappy with it, exactly. I was not. This is the character that went on to try to insult every being in the cosmos in alphabetical order. So I’m not sure if it was intended as an allegory about the scala community.

Page 7: Scalding at Etsy

The first thing I wanted to do was give an overview of how Etsy uses scalding now.

Page 8: Scalding at Etsy

This is hopefully the only strata-esque slide in the talk. Don’t run for the exits or anything. What I want to communicate with it is that in the abstract, we aggregate logs from the live site and put them on hdfs. Then from there we crunch them to build internal tooling and features. For live features we’re putting job outputs into mysql shards; for backend tools we typically use a BI database (vertica) to fill the same need.

Page 9: Scalding at Etsy

Scalding gets used at all points on the hadoop side. Parsing logs, generating recommendations and ranking datasets, and business intelligence are all either done in Scalding or will be ported to Scalding very shortly.

Page 10: Scalding at Etsy

There are a bunch of ways that people use analytics at Etsy. The way you get your answers depends on the kind of question you’re asking.

Page 11: Scalding at Etsy

I’ll go through some examples. This is a simple one. Let’s say you just want to know how many shops open up a day.

Page 12: Scalding at Etsy

That’s a pretty common question. And so somebody’s thought of it way before you, and they’ve put it on a dashboard. So you can just go look at the dashboard.

Page 13: Scalding at Etsy

Another kind of question is one about how an A/B test you’re running is doing.

Page 14: Scalding at Etsy

We do a lot of A/B testing at Etsy, so much so that we’ve built our own A/B analyzer frontend called Catapult. So for most questions relating to variants in A/B tests you can go to that.

Page 15: Scalding at Etsy

Then there are slightly more complicated questions. Like, how many of the top sellers sell vintage goods? Maybe you’re the first person to ever ask such a question.

Page 16: Scalding at Etsy

But people have thought of questions that are kind of similar to it before. And in most of those cases you can go ask the BI database.

Page 17: Scalding at Etsy

And then there are questions that are even farther out there. Cases where you’re probably the first person to ask not just this specifically, but you’re also probably the first person to ask any question even similar to it. Like this one. Etsy gets traffic to items that are sold. How often could we redirect that traffic to items that have close tags and titles?

Page 18: Scalding at Etsy

That’s the kind of thing you’d use scalding to answer today. We have the data in theory, but we haven’t normalized it and put it in BI. Or maybe it’s too big to fit in BI.

Page 19: Scalding at Etsy

A very common kind of novel question relates to debugging A/B tests.

Page 20: Scalding at Etsy

We do a ton of that with scalding too.

Page 21: Scalding at Etsy

I conceptualize our data universe as having three domains.

Page 22: Scalding at Etsy

There are questions we’ve anticipated, questions we didn’t anticipate, and then there are permanent systems.

Page 23: Scalding at Etsy

Like I said, we have tooling support for the first domain. And we use scalding for the second two.

Page 24: Scalding at Etsy

That’s questions where the data needed to get an answer is in a relatively raw form, which I’ll wave my hands and call analysis. And then we also build features and systems with scalding, which is more like what I’d call “engineering.” We do work for ranking, for recommendations, and so on in scalding.

Page 25: Scalding at Etsy

Let me give you some idea of how big of a thing this is.

Page 26: Scalding at Etsy

It’s pretty big, I guess. When I quit we had about 800 scalding jobs in source control. And if everyone is like me, there are probably twice as many in working directories, not committed. Only about 90 of those, though, run as part of our nightly batch process.

Page 27: Scalding at Etsy

58 people had written scalding jobs

Page 28: Scalding at Etsy

And 14 of them figured out how to use Algebird. Etsy’s engineering team, by the way, is like 150 programmers.

Page 29: Scalding at Etsy

This histogram showing how many jobs people have written is about what you’d expect. There’s a small group of people like me who have written a ton of jobs. And most people have written one or two jobs.

Page 30: Scalding at Etsy

And the way it breaks down across the domains is like this. Most of the people using scalding are using it to answer analytics questions. The experts tend to be the people building systems with scalding.

Page 31: Scalding at Etsy

So why would we pick scalding?

Page 32: Scalding at Etsy

Well, we didn’t really pick it on purpose. It was an accident.

Page 33: Scalding at Etsy

To explain how that accident happened I guess I first have to explain how we got started with analytics.

Page 34: Scalding at Etsy

And that was kind of an accident too. We didn’t necessarily set out to build something to replace Google Analytics.

Page 35: Scalding at Etsy

What we did do was buy an advertising startup called Adtuitive back in 2009.

Page 36: Scalding at Etsy

And those guys brought something with them called cascading.jruby. For our purposes you can consider this to be pretty close to Pig, but using JRuby.

Page 37: Scalding at Etsy

This is a really simple example of a job written in cascading.jruby. Hopefully you’ll just believe me that the Java equivalent would be Byzantine.

Page 38: Scalding at Etsy

The thing we wanted to get out of that acquisition was this feature. Paid promoted listings that you see when you search on Etsy. In the beginning we pretty much just wanted to build whatever we needed to have this.

Page 39: Scalding at Etsy

But to do that we needed things like impression logging and frontend feedback. So we started collecting event beacons from our frontend.

Page 40: Scalding at Etsy

And shipped those beacon logs to hdfs and turned them into event logs.

Page 41: Scalding at Etsy

And we sessionized the event logs and made visit logs out of them.

Page 42: Scalding at Etsy

That decision to make a table for visits, with a row per user session, turned out to be important. Our data is stored as serialized sequences of events inside cascading tuples.
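
The talk doesn't show how the sessionizing was done. As a minimal sketch of the general idea only, written with plain Scala collections rather than Cascading, the snippet below groups events by user and starts a new visit whenever the gap between consecutive events is too long. The Event and Visit types, the field names, and the 30-minute cutoff are all assumptions for illustration, not Etsy's actual schema or settings.

```scala
object Sessionize {
  // Hypothetical event shape; the real log schema is not shown in the talk.
  final case class Event(userId: String, timestampMs: Long, eventType: String)

  // One row per user session, holding that session's events in order.
  final case class Visit(userId: String, events: Vector[Event])

  // Assumed inactivity cutoff (30 minutes); not a figure from the talk.
  val SessionGapMs: Long = 30L * 60L * 1000L

  def sessionize(events: Seq[Event]): Vector[Visit] =
    events
      .groupBy(_.userId)
      .toVector
      .sortBy(_._1) // deterministic output order by user id
      .flatMap { case (userId, userEvents) =>
        val ordered = userEvents.sortBy(_.timestampMs).toVector
        // Append to the current visit, or open a new one when the gap is too large.
        val sessions = ordered.foldLeft(Vector.empty[Vector[Event]]) { (acc, e) =>
          acc.lastOption match {
            case Some(current) if e.timestampMs - current.last.timestampMs <= SessionGapMs =>
              acc.init :+ (current :+ e)
            case _ =>
              acc :+ Vector(e)
          }
        }
        sessions.map(evts => Visit(userId, evts))
      }
}
```

The useful property is the output shape: once each visit is a single row carrying its whole event sequence, questions about what happened before or after something within a session can be answered by looking at one row.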

Page 43: Scalding at Etsy

So even though we just wanted this feature, well, what the hell did we just do. We just started building an analytics system I guess.

Page 44: Scalding at Etsy

The next thing we knew we had a proprietary tool for analyzing AB tests. Go figure.

Page 45: Scalding at Etsy

By 2013 we definitely had our own giant analytics stack. It was built, racked, and debugged. And it was right about then that scalding blew the whole thing to smithereens.

Page 46: Scalding at Etsy

The thing that caused this was that we had hired Avi Bryant, who some of you may know as one of the authors of scalding. And something of a group theory crank. And just an all-around amazing smart guy.

Page 47: Scalding at Etsy

And as an amazing smart guy, when Avi joined Etsy he had some cover to get a little rogue with things.

Page 48: Scalding at Etsy

And what he did with that cover was that he added scalding to the build. And then he started trying to make things with it. Etsy’s not bureaucratic in any way I understand the word. But in theory there’s supposed to be at least some discussion before you start using a new framework. That didn’t happen at all with Scalding.

Page 49: Scalding at Etsy

And immediately after this, he up and quit. So the force of his intellect and personality doesn’t explain scalding’s runaway success. If that’s all it was about everyone would have stopped using it the minute he left. But the opposite of that happened.

Page 50: Scalding at Etsy

About a year ago we had this giant cascading.jruby system, which was starting to get mature.

Page 51: Scalding at Etsy

But by last October the official policy was to rewrite the few pieces that were left in Scalding.

Page 52: Scalding at Etsy

There’s a technical reason this happened, which I think is interesting, but at the same time it’s pretty simple.

Page 53: Scalding at Etsy

I think it’s simple enough that I can show it to you in a couple of examples. Let’s say that we want to count how many visits searched for any given search term.

Page 54: Scalding at Etsy

In other words we want to find every search and every visit, and produce a table like this. Search terms to the number of visits that entered them.

Page 55: Scalding at Etsy

The cascading.jruby job is really simple and straightforward. It looks like this. Don’t worry about understanding it or anything, the point is that it’s short and easy.

Page 56: Scalding at Etsy

And the equivalent scalding job is also really short and simple.

Page 57: Scalding at Etsy

Conceptually they’re both just doing this.

Page 58: Scalding at Etsy

You unroll the search events, then you grab the search terms out of them, then you just group and count.
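
The code from the slides isn't part of this transcript. As a rough sketch of those three steps, here is the same shape in plain Scala collections rather than Scalding itself; in a real job the flatMap and the grouping would run on a Scalding pipe instead of a Seq. The Event and Visit types and the "search" event name are hypothetical stand-ins for the real visit log schema.

```scala
object SearchCounts {
  // Hypothetical shapes: a visit is one row holding its ordered events.
  final case class Event(eventType: String, query: Option[String] = None)
  final case class Visit(visitId: String, events: Vector[Event])

  // Search term -> number of visits that searched for it.
  def visitsPerTerm(visits: Seq[Visit]): Map[String, Long] =
    visits
      // Unroll the search events and grab the terms; distinct so that a
      // visit searching the same term twice still counts once.
      .flatMap { v =>
        v.events.collect { case Event("search", Some(q)) => q }.distinct
      }
      // Group and count.
      .groupBy(identity)
      .map { case (term, hits) => term -> hits.size.toLong }
}
```

Because the unrolling is a map-side operation and the count is a single grouping, this shape needs only one shuffle, which matches the single mapreduce step described in the talk.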

Page 59: Scalding at Etsy

And both scalding and cascading.jruby manage to factor that into one mapreduce step. And in this case they both perform identically.

Page 60: Scalding at Etsy

But you can start to see the difference if you add just one more layer of complexity. Let’s say that we wanted to count up the search terms again, but this time relate them to purchases that happen after them in visits.

Page 61: Scalding at Etsy

Like this. We want a table showing how many visits searched for a thing, and another column giving how many of those visits bought something.

Page 62: Scalding at Etsy

In this case the scalding job is not that much more complicated. It’s still just about this long.

Page 63: Scalding at Etsy

And scalding manages to get this done in one mapreduce step again. It’s just unrolling the searches out of the visits like it was before, and grouping with a sum.
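
Again the slide's code isn't in this transcript, so what follows is only a sketch of the idea in plain Scala collections, not the job shown in the talk. An ordinary function walks the events of one visit and emits, per search term, a pair of counters: one for the visit, and one that is set only if a purchase came later in that visit. Grouping by term and summing the pairs gives both columns in one pass. The types, the "search" and "purchase" event names, and the function names are assumptions.

```scala
object SearchConversions {
  // Hypothetical shapes: a visit is one row holding its ordered events.
  final case class Event(eventType: String, query: Option[String] = None)
  final case class Visit(visitId: String, events: Vector[Event])

  // Plays the role of the inline user-defined function: for each distinct
  // term in the visit, emit (term, (1, 1 if a purchase follows the search else 0)).
  def termsWithPurchaseFlag(visit: Visit): Vector[(String, (Long, Long))] = {
    val indexed = visit.events.zipWithIndex
    val lastPurchaseIdx = indexed.collect { case (Event("purchase", _), i) => i }.lastOption
    // Position of the first search for each term within this visit.
    val firstSearchIdx: Map[String, Int] = indexed
      .collect { case (Event("search", Some(q)), i) => (q, i) }
      .groupBy(_._1)
      .map { case (q, hits) => q -> hits.map(_._2).min }
    firstSearchIdx.toVector.sortBy(_._1).map { case (q, i) =>
      val bought = if (lastPurchaseIdx.exists(_ > i)) 1L else 0L
      (q, (1L, bought))
    }
  }

  // Search term -> (visits that searched it, visits that then purchased).
  def visitsAndPurchasesPerTerm(visits: Seq[Visit]): Map[String, (Long, Long)] =
    visits
      .flatMap(v => termsWithPurchaseFlag(v))
      .groupBy(_._1)
      .map { case (term, rows) =>
        // Grouping with a sum: add the counter pairs component-wise.
        term -> rows.map(_._2).foldLeft((0L, 0L)) { case ((v, p), (dv, dp)) => (v + dv, p + dp) }
      }
}
```

The ordering logic lives entirely inside the per-visit function, so no join is needed; that only works because each visit already arrives as a single row containing its events in order.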

Page 64: Scalding at Etsy

The jruby job, on the other hand, no longer fits on the slide. It’s in this gist if anyone wants to look at it.

Page 65: Scalding at Etsy

I can show you what it does schematically. You make two branches, one for the searches and one for the purchases. Then you cross join them and filter that shit down. And then you wind up with a branch for conversions per search term and a branch for visits per term, and you join those back together to get your answer.
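
The gist itself isn't reproduced here. To make the schematic concrete, this is a sketch of the same branch, join, filter, and rejoin plan using plain Scala collections, not the cascading.jruby code. The flat Event shape (with a visit id and a position within the visit) and the event names are assumptions.

```scala
object JoinBasedConversions {
  // Hypothetical flat event rows: each carries its visit id and its position in the visit.
  final case class Event(visitId: String, position: Int, eventType: String, query: Option[String] = None)

  // Search term -> (visits that searched it, visits that then purchased).
  def visitsAndPurchasesPerTerm(events: Seq[Event]): Map[String, (Long, Long)] = {
    // Branch 1: searches as (visitId, position, term).
    val searches = events.collect { case Event(v, pos, "search", Some(q)) => (v, pos, q) }
    // Branch 2: purchases as (visitId, position).
    val purchases = events.collect { case Event(v, pos, "purchase", _) => (v, pos) }

    // Pair every search with every purchase, then filter down to pairs
    // from the same visit where the purchase comes after the search.
    val convertedVisits = for {
      (sv, spos, term) <- searches
      (pv, ppos) <- purchases
      if sv == pv && ppos > spos
    } yield (term, sv)

    // One branch for conversions per term, one for visits per term.
    val conversionsPerTerm = convertedVisits.distinct
      .groupBy(_._1)
      .map { case (term, rows) => term -> rows.size.toLong }
    val visitsPerTerm = searches
      .map { case (v, _, q) => (q, v) }
      .distinct
      .groupBy(_._1)
      .map { case (term, rows) => term -> rows.size.toLong }

    // Join the two branches back together; terms with no conversions get 0.
    visitsPerTerm.map { case (term, n) => term -> (n, conversionsPerTerm.getOrElse(term, 0L)) }
  }
}
```

Every grouping, the pairing step, and the final join would each need data shuffled between machines in a distributed setting, which is where the extra mapreduce steps come from.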

Page 66: Scalding at Etsy

So the pure cascading.jruby solution is more complicated. And it also turns out to be a lot slower. Cascading doesn’t have a query optimizer, and this might be a lot closer if it did. But it doesn’t, so the jruby job winds up running in many more mapreduce steps and takes like eight times longer.

Page 67: Scalding at Etsy

If we go back to the scalding code for a second

Page 68: Scalding at Etsy

This here is the feature that killed cascading.jruby. We just wrote a cascading user-defined function without even having to realize that that’s what we were doing.

Page 69: Scalding at Etsy

Now it’s not impossible to fix this in cascading.jruby, or in other frameworks that don’t give you easy access to UDFs.

Page 70: Scalding at Etsy

You can indeed go write a cascading operation to do the same thing and use it from those.

Page 71: Scalding at Etsy

But in reality, even though it comes up constantly, nobody wants to do that. You have to change files, and you have to change programming languages. Those hurdles are enough to make people write slower jobs.

Page 72: Scalding at Etsy

For example we had one job that was a major resource problem in JRuby, which was taking seven hours to run every night. Someone rewrote it in scalding in a day or two and got it down to 20 minutes. The problem wasn’t that anything was impossible in cascading.jruby. The point is merely that scalding makes doing it the right way feel natural.

Page 73: Scalding at Etsy

So easy user defined functions swept all before them.

Page 74: Scalding at Etsy

But I don’t think scalding is all peaches and cream.

Page 75: Scalding at Etsy

You could say we only have two complaints.

Page 76: Scalding at Etsy

This is a talk about scalding. So I’m going to spare you my list of cascading gripes. You probably have your own. I will say that if you do, using a DSL on top of cascading doesn’t help with any of them.

Page 77: Scalding at Etsy

Very flippantly, this is basically the problem. Scala is too far from what most of our engineers are using on a daily basis. It’s too weird. I assure you Kellan’s not this crotchety in reality. And he’s probably mad at me for paraphrasing this from memory.

Page 78: Scalding at Etsy

I firmly believe that analytics is for everyone. I don’t mean statistical modeling, or machine learning, or things like that. But I do think that asking straightforward questions about the thing you’re tasked with building should be for everyone.

Page 79: Scalding at Etsy

What I mean by that is, let’s say we have a project to do.

Page 80: Scalding at Etsy

Etsy’s a relatively enlightened place, by software industry standards anyway. So everyone gets some time at the beginning and the end of that project to do quote-unquote “analysis.” It's "thinking time." And the stuff called "work" gets done in the middle.

Page 81: Scalding at Etsy

But I think this more accurately describes reality. We’re all still carrying the baggage of 20th century software around with us. So analysis up front, which you’d do to see if you can make a case for doing the feature at all, feels like you’re not working. And the stuff in the middle feels like you’re really making progress. Even if it’s progress on something that could never actually work.

Page 82: Scalding at Etsy

That’s how it is everywhere, more or less. This is the social framework everybody’s working inside of. So as somebody who really believes that analytics up front is powerful, I want to give everyone the best chance possible.

Page 83: Scalding at Etsy

And scala is just too different from what other Etsy programmers are using day to day. Don’t mistake this as me saying they’re not smart enough, because they are. And it's not that learning FP wouldn't be good for everyone, because I think it would be. And it's not that functional programming is fundamentally too hard, or anything like that. It’s just a statement of fact. Most programmers I know are not experienced with functional programming, and scala shares many functional idioms.

Page 84: Scalding at Etsy

So the analysis process winds up looking like this. Between asking the question and getting an answer there’s this weird period in the middle where you have to learn a bunch of category theory. Sure it’s good for them, or something. But it’s also going to stop them from getting their answer.

Page 85: Scalding at Etsy

Ideally things would look more like this.

Page 86: Scalding at Etsy

So someone should go build that.

Page 87: Scalding at Etsy