
SDS PODCAST

EPISODE 279: EMBEDDING DATA SCIENCE IN BUSINESS

http://www.superdatascience.com/279

Kirill Eremenko: This is episode 279 with Head of Data Science at

Scribd, Kevin Perko.

Kirill Eremenko: Welcome to the SuperDataScience podcast. My name

is Kirill Eremenko, Data Science Coach and Lifestyle

Entrepreneur. Each week we bring you inspiring

people and ideas to help you build your successful

career in data science. Thanks for being here today

and now, let's make the complex, simple.

Kirill Eremenko: This episode is brought to you by our very own data

science conference, DataScienceGO 2019. There are

plenty of data science conferences out there.

DataScienceGO is not your ordinary data science

event. This is a conference dedicated to career

advancement. We have three days of immersive talks,

panels and training sessions designed to teach,

inspire, and guide you. There are three separate career

tracks involved, so whether you're a beginner, a

practitioner or a manager you can find a career track

for you and select the right talks to advance your

career.

Kirill Eremenko: We're expecting 40 speakers, that’s four, zero, 40

speakers to join us for DataScienceGO 2019. And just

to give you a taste of what to expect, here are some of

the speakers that we had in the previous years:

Creator of Makeover Monday Andy Kriebel, AI Thought

Leader Ben Taylor, Data Science Influencer Randy Lao,

Data Science Mentor Kristen Kehrer, Founder of Visual

Cinnamon Nadieh Bremer, Technology Futurist Pablos

Holman, and many, many more.


Kirill Eremenko: This year we will have over 800 attendees from

beginners to data scientists to managers and leaders.

So there will be plenty of networking opportunities

with our attendees and speakers, and you don't want

to miss out on that. That's the best way to grow your

data science network and grow your career. And as a

bonus there will be a track for executives. So if you're

an executive listening to this, check this out. Last year

at DataScienceGO X, which is our special track for

executives, we had key business decision makers from

Ellie Mae, Levi Strauss, Dell, Red Bull, and more.

Kirill Eremenko: So whether you're a beginner, practitioner, manager or

executive, DataScienceGO is for you. DataScienceGO

is happening on the 27th, 28th, 29th of September

2019 in San Diego. Don't miss out. You can get your

tickets at www.datasciencego.com. I would personally

love to see you there, network with you and help

inspire your career or progress your business into the

space of data science. Once again, the website is

www.datasciencego.com, and I'll see you there.

Kirill Eremenko: Welcome back to the SuperDataScience podcast, ladies

and gentlemen. Today I've got a super exciting guest,

another speaker who will be joining us for

DataScienceGO 2019, at the end of September, this

year. If you haven't gotten your tickets yet, check out

www.datasciencego.com. Today we have Kevin Perko.

Kevin is the head of data science at Scribd, and he is

leading a team of approximately 13 data scientists,

between San Francisco, and Toronto. We had a

fantastic chat today, so here are a couple things that

you will take away from this conversation.


Kirill Eremenko: You will learn what it's like to be a data science

manager, or a data science leader, and what it's like to

manage a team, and more so two teams, in two

different locations, and how that is different to actually

doing the technical work. If you're thinking of

progressing as a data scientist to a data science

manager, or to a head of data science, this will be very

valuable for you. Also, you'll learn about the Book

Genome Project, that they're doing at Scribd, which is

a very exciting undertaking. You'll learn what it's like

when a company sees data science as a product, as

opposed to an auxiliary function.

Kirill Eremenko: If you're a business owner or an executive, you'll learn

a very valuable concept of decentralized, or embedded

teams, versus core data science teams. What's the

difference when your data scientists or machine

learning experts are embedded throughout your

organization, versus when they're in one core

centralized team of data scientists, what are the

advantages and disadvantages of each approach, and

what stage of the business should you be doing each

one in, and what should you be aiming for.

Kirill Eremenko: Finally, if you are in Toronto, or San Francisco, and

you are looking for a job or considering a new role in

data science, then stay tuned for this podcast, because

Kevin will announce that they're hiring, and you might

just like this company, and might just want to check

them out. On that note, very exciting podcast coming

up. Can't wait for you to check it out. Let's get straight

into it. Without further ado, I bring to you, Kevin

Perko, Head of Data Science at Scribd.


Kirill Eremenko: Welcome back to the SuperDataScience podcast, ladies

and gentlemen. Super excited to have you on the show

here today, with my lovely guest, Kevin Perko, calling

in from San Francisco. Kevin, how are you doing?

Kevin Perko: Doing great. I'm doing great.

Kirill Eremenko: It was fun chatting just now about like, a book, and

you haven't written one yet. If you were to write a

book, what would it be about?

Kevin Perko: Oh, that's a great question. If I was going to write a

book, I think I would focus on kind of how

interdisciplinary data science is, and how that is really

kind of what makes it come alive. You've got elements

from psychology, you've got these general things

around just being curious, and you've got to really

program, and build models, and sort of represent in

the world, and I think all of those things kind of come

together in this nice sort of like, systems thinking,

complex systems type fields of study, that people don't

usually study who do data science. I also think it's

why people who do study something like physics, which is

literally the building blocks to the universe, tend to do

really well in data science.

Kevin Perko: I think my book would try to capture more of these

elements and kind of interweaving them, and showing

how these things are building on each other, and why

neural networks are kind of something that's really

interesting, comes out of something from the '70s really,

even before that. It's not like a new thing, but just to

give people a sense of understanding on how

everything is interrelated, and it's all towards


understanding how we model these things, and that

while people like to talk about AI, there isn't really

anything that approaches general intelligence yet. Still

really mapping these functions to output values.

Kevin Perko: I think understanding the systems in which these

operate is really, really interesting.

Kirill Eremenko: Very, very true. Do you feel that data science kind of

came together as a chain of development ... not even

chain, like a group of developments in different fields.

You know, there's elements of data science that come

from economics, there's elements that come from

physics, as you mentioned, there's elements that come

from neural networks and IT, there's elements that

come from mathematics, even biology. Some of the

statistical apparatus, especially in R, originally came

from AB testing, and random sampling in biology, or in

medicine.

Kirill Eremenko: Do you have this feeling that data science kind of right

now it's a separate science, there's arguments to

support that, but originally it independently grew in all

these different fields?

Kevin Perko: Right, right, absolutely. I think a good corollary for

this is that if I was going to recommend what

somebody should study, I would rather see them study

computational biology, mathematics, physics, as

opposed to data science itself, because then you're

kind of removing yourself from the actual subject

you're studying, and data science is always applied.

We're never just, at least in industry, thinking about

how to make [inaudible 00:08:06] to sound more


efficient, think about how to apply it to solve a

problem. You come up from a computational X area,

that's really what you're going to be doing.

Kevin Perko: I see sometimes people come out of these sorts of generic

data science programs, like, I want to go do NLP, and

it's like, what problems do you want to solve with that,

like why do you care about having this tool, so you can

leverage it for solving a problem, whether it's in health

care, or physics, or business. I think that's where it

gets really exciting, is when you mix those applied

fields together. Somebody, I'm kind of remembering

here, somebody was studying glaciology and they were

actually applying data science methods, and they were

able to map how glaciers are moving, where people

previously hadn't been able to. It's like, that's where

data science really shines.

Kevin Perko: That's where it gets really exciting. Yeah, I think that

that's kind of like my ... My thing is that I almost think

it shouldn't ... It can't be a separate thing. It has to be

in all of these things, because it can help all of these

various fields move forward faster, as opposed to just

itself.

Kirill Eremenko: Wow, very interesting perspective. Applied data

science, great way to get started into the field. I guess

if you combine it with something that you're

passionate about, something that like somebody who's

doing glaciology has to be excited about glaciers, and

there has to be some story behind it, why they're doing

it, I guess if you do it that way, you get the extra boost

of seeing how applying data science to this field that


you're very interested in, can make massive progress

and massive impact in that field.

Kevin Perko: Absolutely, absolutely. I think that's really where

people drive breakthroughs, is when they bring a

couple different fields together. Data science is a great

one that you can bring it to almost any field. It can

help you rather infer, compute, figure out what is the

true structure of all of these different areas, and that's

really powerful. There's not a lot of fields that do that,

but if you're just focusing on where do I run around

and how to apply data science and algorithms to, you

get a lot of interesting things. You see a lot of the voice

to face, or the deepfakes, and all this stuff.

Kevin Perko: There's people that, well, there's social media, and I

can get a lot of press if I do this thing that's going to

freak people out. Then that's what happens, and we

end up building something that kind of scares people

about AI, and also has a debatable social value, rather

than like really pursuing trying to build up

breakthroughs in hard sciences, which is really

exciting and really valuable to the world. That's where

I see the trade off.

Kirill Eremenko: What's your story? How did you get into the space of

data science? What did you study?

Kevin Perko: I actually studied finance. For me, data science really

happened to me. I was always interested in numbers,

and thinking about numbers, and I had picked up

programming when I was younger, sort of on and off.

Then in school, kind of switched over to like, I really

just want to do banking, because I love the stock


market because it had so many numbers associated

with it, and all this future value of money, and all

these kinds of things that are really interesting. I

didn't have ... For me, it didn't click, like oh, I should

do computer science yet. Then I got out and I was like,

I definitely should have done computer science.

Kevin Perko: Just ended up, I was like I just got to work at a tech

company, which I did. Did a variety of roles there. I got

into building an application. It was powered by data

though, so I got to interact with the data. I was

building what we call ETL pipelines now, but nobody

really had a name for it then. [crosstalk 00:11:38]-

Kirill Eremenko: [crosstalk 00:11:38], right?

Kevin Perko: Hmm?

Kirill Eremenko: Extract, transform, load.

Kevin Perko: Exactly, exactly. Nobody really knew what the next

thing, nobody was like, let's do analysis on top of it.

We did a little, like a very light statistical analysis. I

did a little work with SEO, because we had more of a

long tail application. From there I basically knew that I

wanted to do more of this, but I still didn't really have

a name for it. People are thinking it was FP&A, which

is definitely not what it was, because it was much

more computer science oriented. Roughly around this

time, Facebook started to come out. The term data

scientist got popularized, but it was only for PhDs at

this point, for the most part.

Kevin Perko: They were solving these really, really massive problems

at scale, that didn't previously exist. They also had a


ton of users, and so all these unique problems that

most start ups didn't have, they couldn't really do this.

I kind of went from there to the next company. I again

did something similar, but I was closer to the analytics

this time. That kind of gave me the freedom to do all

these analyses, finally get into building some models,

doing some fraud modeling, some graph analysis, and

that's really where I was like, "Ah, this is incredible."

Kevin Perko: Booting up Gephi for the first time, and loading a graph in

there, and really seeing the representation of these

relationships, and how you could walk down the node,

and see how people are related and how fraud circles

form, fascinating stuff. This kind of hooked me and

then I was like, I again need to do more. I'm going in

the right area, even though I'm not really sure what it

is now. It's finally called data science, really diving in

to learning Python, everything else I need to.

Kevin Perko: From there then I worked for a gaming company. I

was, all right, it's like in a lab. It's like a science lab for

running experiments. Really interesting. Don't

necessarily feel that great, but you learn a ton about

how people respond very quickly to incentives, and

game play function, and game play economies, and all

of these really interesting areas. That's kind of in my

path and then from there I've continued on at Scribd.

For me, it was kind of this route that I was sort of on,

and I didn't know it. Then the industry just showed

up, and I was like, "This is exactly what I want to do."

Kirill Eremenko: That's awesome. Right place at the right time.

Kevin Perko: Absolutely.


Kirill Eremenko: Yeah, very interesting story. You've been now at Scribd

for what, like over five years?

Kevin Perko: That's right, five and a half years.

Kirill Eremenko: That's really cool. You started off as a data scientist, data

science manager, and now you're head of data science.

Tell us what that feels like.

Kevin Perko: It's great, it's great. I mean, it's exciting both to

grow in a company and to watch the company grow while

you're there. It's been a total mindset shift when you're

going in and doing the ground level work versus

having a team of people. We're in San Francisco and

Toronto, in terms of the data science team, and that's

just ... kind of have to ... Most of my career has been

sort of figuring out how to do things while I'm doing

them, and so managing a team is no different. You

really have to sort of change your job every six months

to a year. Nobody tells you that you're supposed to do

that, but you definitely are. Otherwise, you're going to

get stuck. [crosstalk 00:14:44]-

Kirill Eremenko: What do you mean by change a job?

Kevin Perko: What I mean is like, as a data scientist, you're really

thinking about the models, and the business problems

you're solving, and as a manager now you have to think

about how you help people solve those problems, and

what the communication around that looks like, and

how you're setting expectations, and what you're

delivering. Then once you're kind of managing the

whole team, you have to think like, what are we not

even thinking about, what's the culture, how do I kind

of delegate, so I have more people on the team who are


aligned with me and thinking the same way, and I can

be a multiplier effect, because I can't be everywhere

anymore.

Kevin Perko: Most of my day is kind of like sitting in meetings from

10:30 to 3:30, very typical day, and whether I'm doing

interviewing, or meeting with other PMs, or meeting

with other executives, all of those things kind of add

up, plus one on ones for the team, and so the days just

kind of fly by, so I can't really be there providing any

sort of technical leadership. I have to build that out on

the teams so the team has some senior people who can

do that. These sorts of things are like, okay, well

now I had to change my job. Previously I was much

more involved in this. Now I'm not involved at all.

Kevin Perko: Now I'm working with the team in Toronto, really

making sure that they get up and running, and we're

working on newer things, like we're working on

building a machine learning platform internally. Now

we're going to use some tools for this. We're not going

to write the whole thing ourselves. That's like a whole

new area. Okay, okay, now we really have to think

about this, and we really want to focus on getting

everybody more into the full stack data science side.

We've always sort of had the full stack data science

term that we've used internally, of like how we think

about we kind of go end to end, but this is like we

want to go, take that to the next level where we're

working with Scala, and we're really being able to

productionalize anything at any point. Really kind of

pushing the team in that direction, to enable new

opportunities for us.


Kirill Eremenko: Very cool. How big is the team right now?

Kevin Perko: The team including myself is 13 people right now.

Kirill Eremenko: Oh, okay, gotcha. 13 across Toronto was it, and San

Francisco?

Kevin Perko: That's right, Toronto and San Francisco.

Kirill Eremenko: Very cool. I think it would be a good segue or a time to

mention a few things about Scribd, I guess. Tell us a

bit, what is Scribd, and what kind of products and services

does that company offer?

Kevin Perko: Right. Scribd is a reading subscription service. It's

$8.99 a month, and you get access to books, audio

books, sheet music, articles, as well as user uploaded

content, which could be really anything, letters of

recommendations, people's physics theses that they've

published, game strategy guides, and just a wide

collection of content that people have decided to upload

on the internet, and so Scribd enables you to get

access to that.

Kirill Eremenko: Oh, nice. What kind of data would you be working

with, or does your team work with on a daily basis?

Kevin Perko: We really work with I think of it like a couple different

types of data. One is sort of like the application level

data, which is who's paying us, where are they from,

all of that kind of demographic type information, what

devices are they on, etc ... Then you have this sort of

user interaction event stream data of what did they do,

what did we show them, and how did they interact

with that, and how does that mix with what we know

about whether or not they're logged in, or logged out,


or a paying subscriber, what they've done in the past,

what other people have done. That's kind of like one

part of it, and then the other part is understanding the

content.

Kevin Perko: All of the books, and audio books, and user generated

content that we have that people have uploaded, really

understanding what that is, what language is that in,

what categories are they. For books, publishers

typically provide us with categories, whereas for

documents, users do not provide us with any

information, so it's up to us to decide, okay, what is

this document actually about, and how should we use

that information when we're building a search index,

to serve search results, or showing

recommendations.

Kirill Eremenko: Okay, very, very diverse. Two diverse areas, user

interaction, and understanding how they use the

platform, and also understanding the content. What

would a typical project look like for your data science

team?

Kevin Perko: That's a great question. I'm actually going to just use a

project that we're doing right now. Somebody identified

our success metric, which is our target for our GBM.

For search, for re-ranking items, once all the candidate

sets are generated, so think rows of books, audio

books, documents, then it kind of goes into this GBM,

and it decides how should I actually rank these items,

within each module. Today a lot of that is, after all the

routing and candidate generation happens, it's all

based on historical data for the most part.
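A minimal sketch of the re-ranking step described above, assuming a tiny gradient boosting machine built from one-split trees (stumps) and trained on squared error. The feature names (click rate, completion rate), the success values, and the item IDs are hypothetical, and this is not Scribd's actual model; it only illustrates how a GBM trained on historical data can order the candidates inside one module.

```python
from statistics import mean

def fit_stump(rows, residuals):
    """Return the one-feature threshold split with the lowest squared error."""
    best = None
    for j in range(len(rows[0])):
        for t in sorted({row[j] for row in rows}):
            left = [r for row, r in zip(rows, residuals) if row[j] <= t]
            right = [r for row, r in zip(rows, residuals) if row[j] > t]
            if not left or not right:
                continue  # a split must put rows on both sides
            lv, rv = mean(left), mean(right)
            err = (sum((r - lv) ** 2 for r in left)
                   + sum((r - rv) ** 2 for r in right))
            if best is None or err < best[0]:
                best = (err, j, t, lv, rv)
    return None if best is None else best[1:]

def fit_gbm(rows, targets, n_rounds=20, learning_rate=0.5):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = mean(targets)
    preds = [base] * len(targets)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(targets, preds)]
        stump = fit_stump(rows, residuals)
        if stump is None:
            break  # no feature can split the rows any further
        j, t, lv, rv = stump
        stumps.append(stump)
        preds = [p + learning_rate * (lv if row[j] <= t else rv)
                 for row, p in zip(rows, preds)]
    return base, stumps, learning_rate

def predict(model, row):
    base, stumps, learning_rate = model
    return base + sum(learning_rate * (lv if row[j] <= t else rv)
                      for j, t, lv, rv in stumps)

def rerank(model, candidates):
    """Order (item_id, features) pairs by predicted success metric, best first."""
    ordered = sorted(candidates, key=lambda c: predict(model, c[1]), reverse=True)
    return [item_id for item_id, _ in ordered]

# Hypothetical history: [click rate, completion rate] -> success metric.
history = [[0.1, 0.2], [0.4, 0.1], [0.7, 0.9], [0.9, 0.8]]
success = [0.0, 0.2, 0.8, 1.0]
model = fit_gbm(history, success)

# One module's candidate set, already generated upstream.
module = [("audiobook_b", [0.05, 0.15]),
          ("book_a", [0.95, 0.7]),
          ("doc_c", [0.6, 0.95])]
ranked = rerank(model, module)
```

The target column is the piece Kevin's team is changing: swapping in a better success metric means retraining the same model against new `success` values, while the ranking code stays as it is.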


Kevin Perko: Our best approximation is to try to understand how

those interactions correlate with retention. For our

business, that's what we want to optimize for right

now. One of the data scientists said, "You know, this

previous success metric, we did really great work on it,

and I think we can make it better." They kind of

mapped out the project, what should that be. They did

like the whole analysis, they presented to the team,

they got a bunch of feedback, they continued to

improve the success metric, and they're continuing

now to get it into production. Once they do that, then

the next step will be to retrain the GBM, so that we

can actually see, is this better, because obviously it's

easy to say, "Offline looks better." Like we've reduced

our main [inaudible 00:20:24], but that doesn't really

mean anything if we didn't make the user experience

better.
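One common way to check whether a retrained model actually improved the user experience, rather than only an offline error number, is an online A/B comparison of retention. The sketch below assumes a two-proportion z-test. The source does not say which test Scribd uses, and the subscriber counts are hypothetical.

```python
from math import erf, sqrt

def two_proportion_z(retained_a, n_a, retained_b, n_b):
    """Two-sided z-test for a difference in retention rate between groups A and B."""
    p_a, p_b = retained_a / n_a, retained_b / n_b
    pooled = (retained_a + retained_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)

# Hypothetical counts: control keeps 400 of 1000 subscribers,
# the retrained ranker keeps 450 of 1000.
z, p_value = two_proportion_z(400, 1000, 450, 1000)
```

A small p-value here would suggest the retention lift is unlikely to be noise, which is the kind of evidence an offline metric cannot give on its own.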

Kevin Perko: That's kind of why I say, "There's not really a typical

project, but this would be a good representation of

like, okay, there are some clear variables that you

want to optimize for." Maybe somebody is giving you

the project, or you're creating it, and then you need to

kind of go down, break them down, figure out how to

represent them. A pretty collaborative environment, so

you're going to go present. You're definitely going to

take some feedback. I think that always sort of

hardens the project, gets you to question your

assumptions, and then you've got to go and write the

code to get it shipped out, so we can actually use it in

the product.


Kirill Eremenko: Okay, got you. GBM is Gradient Boosting Machine. Is

that right?

Kevin Perko: Right.

Kirill Eremenko: Why do you use a GBM in this specific example? Is

there any reasons for that?

Kevin Perko: You know I would say honestly there's not a great

reason. We sort of inherited this model. There was a

previous search team, and the model was in place, we trained

it, and it worked the best for doing this. Kind of our bigger

contributions have been to improve the success metric

that it gets trained against.

Kirill Eremenko: Okay, gotcha. With a team of 13 data scientists, including

yourself, do you find that you have multiple projects

going on at the same time? How many projects is the

team involved in approximately?

Kevin Perko: There's so many people, it's hard for me to even pull

that number out of the air. There is a lot going on at

any one time.

Kirill Eremenko: How do you keep track of everything?

Kevin Perko: Yeah, that's a great question. We have a couple

support structures for that. We have squads. People

that are working on product facing squads, they have

somebody that they're working with, like a product

manager, and a technical project manager, who are

working on what's the task flow, what are we shipping,

what are the deadlines on all that kind of stuff. We get

some similar apparatus on the search and

recommendations teams, so that I don't have to be

responsible for all of it, because it's too much for one


person to make sure everything is on track, and all the

deadlines are being met. That really helps a lot.

Kevin Perko: The other thing is to just ... beyond individual projects,

is having higher level goals that you're ... or higher

level targets for the quarter that you want to move

towards. Those are easier to check in on rather than a

specific, okay, did we analyze this test, did we learn

from this test.

Kirill Eremenko: Okay, gotcha. You probably have like managers in the

team as well who take on some of the responsibility

that then report to you?

Kevin Perko: Right, right. We've got a manager out in Toronto.

Kirill Eremenko: Gotcha. Okay, okay. Very interesting. You mentioned

you guys are hiring at this stage, so if anybody

listening is interested, what's the best way for them to

apply?

Kevin Perko: Yes, we're hiring in San Francisco and Toronto, and

the best way to apply is to go to the jobs page, I would

say. Just in the cover letter mention that you listened

to the podcast, and I'll see that. I actually review all of

the applications that come in. I'm very passionate

about hiring the best people I possibly can. I'm

reviewing all the applications that come in. I can kind

of take more risks, and really see if somebody is

showing something that someone else who was looking

for a very specific profile, may not be able to pick up

on.

Kirill Eremenko: Nice. By listening to this podcast obviously people are

already ahead of the game.


Kevin Perko: Exactly.

Kirill Eremenko: Okay, cool. Well, thank you for that, and guys, girls,

everybody, ladies and gentlemen listening, if you're

interested and you're in Toronto or San Francisco, make

sure to check out Scribd. Let's shift gears a little bit.

You're coming, which is very exciting, I'm excited to

announce this to our listeners, you're coming to

DataScienceGO this September, 2019, on the 27th, 28th,

29th of September, and you're doing a Keynote. Super

pumped about that. Congrats. I can't wait to hear your

Keynote and to meet you in person over there.

Kevin Perko: Thank you, thank you. I'm definitely looking forward to

that. This will be my first Keynote, so it's a very

exciting experience for me as well.

Kirill Eremenko: That's awesome. Tell us what is this Keynote going to

be about? Can you give us like a quick, I don't know,

maybe preview or some spoilers about what you're

going to be talking about?

Kevin Perko: Well, I can't give any spoilers of course. In terms of a

preview, what I'm going to focus on is kind of two

things. I want to get people generally excited about

what's happening in data science, as well as how

that's intersecting with what we're doing in Scribd. I

think one of the best ways I can do that is to talk

about an initiative we have internally, around learning

how to represent our content better, which we're

calling the Book Genome. That's really obviously

taking from like the Music Genome from Pandora,

from way back, and applying that to books. It's scaled

what some companies have done. I don't know if


anybody's used the term, book genome, but we really

want to think about how we represent our content.

Kevin Perko: I want to talk about how we're doing, how that's going

to enable really amazing things for our users, and for

data science in general, as well as how that intersects

with like a curiosity culture. Eric Colson at Stitch Fix,

totally has written lots of very good articles on this,

and I really am trying to bring that into my team, into

my organization, and intersect these things, because

there's so much opportunity in data science, that

there's no way that top down you can see all the

opportunities and correctly allocate all the resources.

Kevin Perko: You want people on the ground, being curious, asking

questions, saying, "Hey, I actually have a couple extra

hours, and I'm going to see if this variable is correlated

with this variable, or if I can map this out with a

regression, or a neural network, or whatever it

happens to be, and if we can learn something new,

and I really believe that that'll add much more value to

the business than us trying to pick the best projects

every single time."

Kirill Eremenko: All right. I haven't heard of this music genome project.

Can you tell us, what is the end goal of the book

genome? What does it look like?

Kevin Perko: The end goal is for us to really understand books on a

deep level. When you talk about a book, you talk about

books that you enjoy. You say things like, "It moved

really slow," or maybe it was really dense, like very ...

when I say [inaudible 00:26:51] Slavic words, lots of

technical jargon going on. You don't necessarily say,


like, "Okay, well, it was like a front list book." For

anybody who's not familiar with that, that's a book

that's come out in the last year. Publishers, they care

a lot about that. That's where a lot of their money

comes from. We think about a lot of this internally in

all of these things, but readers don't think about that

necessarily.

Kevin Perko: They're thinking about, "I'm reading a book that people

are talking about. I'm reading a book that is relevant

in the media, or that my friends recommended, or

that's a murder mystery and I love murder mysteries.

It has these elements that I like." We want to take

those, when people are saying these kind of

ambiguous and vague words, these elements that I

like, well, what are those elements, is it dystopia. 1984

is definitely a dystopia, so if you read that, what are

you interested in learning. If we can represent dystopia

as an embedding, how can we relate that to other

books, and then understand that you're not just going

to read dystopias. You'll have a very depressed outlook

on the world if you do that.
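To make the idea of relating an embedding to other books concrete, here is a minimal sketch using cosine similarity. The three-dimensional vectors and the placeholder titles are hypothetical. Real embeddings are learned, have many more dimensions, and their axes are not individually interpretable.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(title, embeddings):
    """Rank every other title by similarity to the given one, closest first."""
    query = embeddings[title]
    others = [(cosine(query, vec), name)
              for name, vec in embeddings.items() if name != title]
    return [name for _, name in sorted(others, reverse=True)]

# Hypothetical embeddings for illustration only.
embeddings = {
    "1984": [0.9, 0.1, 0.0],
    "dystopian_novel_b": [0.8, 0.2, 0.0],
    "cookbook_c": [0.0, 0.1, 0.9],
}
neighbours = most_similar("1984", embeddings)
```

A pure nearest-neighbour lookup like this only surfaces more of the same, which is why Kevin goes on to say it is not enough by itself.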

Kevin Perko: That's just like a ... not a thing, because lots of

recommender systems, they want to find similar

items, but we need to introduce this serendipity. It's

really going to become like a sequence type model,

because people, even if you read a data science book

and you're getting into data science, you don't only

read data science books, because that again will kind

of drain your brain power there. You have to sort of

recharge with something else, whether it's a biography,

or a science fiction book. When you read those, not


only do they kind of go together in a sequence, but you

have specific elements you like about your science

fiction books.

Kevin Perko: To you it's less about science fiction, and maybe it's

more about dystopia plus science fiction, plus a

futuristic setting. We want to be able to represent that

in words that we can both share with our readers, on

why they were recommended this book, and what we

know about this book, and to help them find other

books. Whereas today you may browse by genre,

perhaps in the future you could browse by something

more stylistic like books set in London, or fast-paced

books, or easy reads for the weekend.
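A small sketch of how tags expressed in words could both drive a recommendation and explain it to the reader. It assumes a simple Jaccard overlap between the tags a reader likes and each book's tags. The catalog, titles, and scoring rule are hypothetical, not Scribd's method.

```python
def explain_recommendations(liked_tags, catalog, top_n=2):
    """Return (title, shared_tags) pairs, ranked by Jaccard overlap with liked_tags."""
    scored = []
    for title, tags in catalog.items():
        shared = liked_tags & tags
        union = liked_tags | tags
        score = len(shared) / len(union) if union else 0.0
        scored.append((score, title, sorted(shared)))
    # Highest overlap first; break ties alphabetically by title.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(title, shared) for score, title, shared in scored[:top_n] if score > 0]

# Hypothetical catalog keyed by placeholder titles.
catalog = {
    "book_x": {"dystopia", "science fiction", "futuristic setting"},
    "book_y": {"science fiction", "set in London"},
    "book_z": {"cooking", "easy read"},
}
liked = {"dystopia", "science fiction", "futuristic setting"}
picks = explain_recommendations(liked, catalog)
```

The shared tags returned with each title are what a reader could be shown as the reason for the recommendation, and the same tags could back a browse page such as books set in London.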

Kirill Eremenko: Gotcha. For instance, like your example with science

fiction, somebody might be interested in like they're

picking up science fiction book after science fiction

book, but really deep down inside what they like might

be a certain type of character, like the lead character

has a certain background, or they are passionate

about certain things, or the manner that they ... how

they are heroic, or things like that. Really, the reader

might not even know this about themselves. They just

happen to be picking up these books, and liking them

based on other people's recommendations. You can't

really express that in words.

Kirill Eremenko: I guess what I was going to ask is, are you going to

look for this information from people? Are you going to

get people to complete a quick survey after they finish

a book, what did they like about it? Or are you going

to have natural language processing, some

AI, or machine learning, that's going to go through the


book, and actually look for these gems, or these

parameters inside, autonomously?

Kevin Perko: Right, right. The current approach that we're thinking

is given that we get a lot of good publisher data, we'll

start to build it. This includes some kind of human

curated keywords, like dystopia, that's associated with

1984. We can start to train on those words, and kind

of build, and understand how that represents across,

we'll call them words, because we want to kind of get it

more into a tree. It's much more of a graph system. We

don't want to think of it as a flat system. Dystopia has

a relationship to the environment, and cooking, so it's

not very related to cooking, but if you just have a flat

group bank of words, it doesn't really mean anything,

but when you start putting them in a graph, and it's a

little bit more directed, then oh, you can see cooking is

way over here, and you've got your werewolf romance

way over here, and those things aren't really related.

Kevin Perko: Actually your dystopia which could kind of go either

way, is maybe much closer to this hypothetical

werewolf romance, for whatever reason. Being able to

understand those things is much more valuable,

because that's how people think about the books.

They're not putting these hard boundaries on them,

like we tend to do when we model them out. We're like,

oh, this is cooking, or that's not that, and so they

would never want that. It's like, okay, well, the world is

a little bit more complicated and subtle than that. By

bringing this out, we'll really be able to get at the heart

of what people want.
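The contrast between a flat bank of words and a graph of tags can be sketched in a few lines. This is only an illustration under stated assumptions: the tags, the edges between them, and the `tag_distance` helper are all hypothetical, the edges are treated as undirected for simplicity, and nothing here describes the system Scribd actually built.

```python
from collections import deque

# Hypothetical tag graph: an edge links two tags that an editor or a model
# considers related. Edges are listed in both directions (undirected).
TAG_GRAPH = {
    "dystopia": ["science fiction", "dark themes"],
    "science fiction": ["dystopia", "futuristic setting"],
    "futuristic setting": ["science fiction"],
    "dark themes": ["dystopia", "werewolf romance"],
    "werewolf romance": ["dark themes", "romance"],
    "romance": ["werewolf romance", "easy weekend reads"],
    "easy weekend reads": ["romance", "cooking"],
    "cooking": ["easy weekend reads"],
}


def tag_distance(graph, start, goal):
    """Return the number of hops between two tags (breadth-first search).

    Returns None when the goal cannot be reached from the start tag.
    """
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        tag, hops = queue.popleft()
        for neighbour in graph.get(tag, []):
            if neighbour == goal:
                return hops + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return None


near = tag_distance(TAG_GRAPH, "dystopia", "werewolf romance")
far = tag_distance(TAG_GRAPH, "dystopia", "cooking")
```

In a flat list every tag is equally unrelated to every other tag. With edges, "dystopia" ends up fewer hops from "werewolf romance" than from "cooking", which is the kind of relationship described above.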


Kevin Perko: I think you kind of brought it up, it's going to be a two

step process. We're going to be bootstrapping it. We

haven't planned on doing a survey, but that's a great

idea. Honestly, I might steal that.

Kirill Eremenko: Sure.

Kevin Perko: Because like you're saying, we don't necessarily have

the language to represent the things that we want to

today, so we're going to have to go figure out what

that's going to look like. It makes a lot of sense it'll be

a collaboration with the data we get, the data we're

able to acquire, how we're able to learn things

internally as well as what our users tell us.

Kirill Eremenko: Gotcha. Is it going to be similar to the Netflix

recommender system?

Kevin Perko: I would say, "No." At a high level all recommender

systems have this ... they share similarities. Given that

we're in the process of building, and I wouldn't really

be able to say, I think that the bigger goal of extracting

the metadata, and learning how to represent it, that's

very similar to what Netflix did. I think they actually

had like rooms of people watching movies at one point,

like labeling them. We're not there yet to have rooms of

people reading books. It also takes a lot longer time, so

I'm not sure if that's feasible. We're going to continue

to try to increase our sophistication, so yes, I'm sure

we'll be using similar methods that Netflix has

pioneered.

Kirill Eremenko: Okay, very interesting. Yeah. It looks like you're going

to have a lot of algorithms that you're going to be

trying out. What's your view on that? How's your


approach going to be? Which model, which algorithm

is going to be the best? Are you just going to try out a

lot of things, or do you already have some things in

mind?

Kevin Perko: Yeah, that's a great question. I feel like I sort of have

two views. One is that I'm agnostic. If you use TF-IDF,

and that represents the problem, and solves it, then

you should always use the simplest tool for the job. My

second view is that a lot of the things we're seeing with

these kind of next generation language models, that's

coming out with like BERT, and [Inaudible 00:33:34],

and I haven't even had enough time to dig into them,

as much as I'd like, but I can see that their ability to

represent language is incredible, as well as

OpenAI's.

Kevin Perko: A model they only released a small version of it, that

was ... I believe it was writing articles, it did too good

of a job of producing fake news basically, so they

didn't want to release the full model, but then they

understood within a certain amount of time, people

would be able to recreate it. They're just sort of buying

some time hopefully, before they unleash this thing on

the world. Which is nice to see somebody having a

thoughtfulness, that hey, this thing could actually be

used for bad things.

Kevin Perko: I think a lot of those models will definitely come in

here, because they will enable us to represent things

in really interesting ways, that we may not think

about. I think the simpler approach is nicer in the

sense that it lets you actually say, "Hey, we extracted

this part of the book, and that means this." That's


really valuable, that interpretability piece. That being

said, neural networks are starting to get that. People

are doing active research. They're starting to say,

"Okay, this is actually what it learned, this is how it

represented it, this is your pixels that it took out and

learned."
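The appeal of the simpler approach, being able to point at the exact words that drove a result, can be shown with a term-weighting baseline. This is a minimal sketch, assuming the simple tool in question is something like TF-IDF (term frequency times inverse document frequency). The `tf_idf` function and the three blurbs are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF weights for a list of tokenised documents.

    Returns one dict per document mapping term -> weight, using
    tf = count / document length and idf = log(N / document frequency).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Made-up book blurbs, for illustration only.
blurbs = [
    "a dystopia where the state watches every citizen".split(),
    "a cookbook where every recipe takes ten minutes".split(),
    "a romance where a werewolf falls for a baker".split(),
]
scores = tf_idf(blurbs)
top_term_first_blurb = max(scores[0], key=scores[0].get)
```

Words that occur in every blurb ("a", "where") get a weight of zero, while words unique to one blurb get the highest weights, so each score can be traced back to a specific word in a specific book.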

Kevin Perko: Then you start to understand, oh, this is why when we

turn a bus on its side, now it may think that it's a

zebra instead of a bus, because it just learned like two

pixels in the image, and so there's a huge risk that

when things change slightly you get very, very wrong

outcomes from these neural network type models.

That's why I like this idea of having a mix of us really

deeply understand the model, not as sophisticated

plus something that's really pushing the edge, and

they can also act as like a check on each other. You

can sort of see when the bus is on its side, or if a book

is clearly about romance, and this is saying, "It's

science fiction," and we have people look at it and it's

like, oh, this is science fiction. Then we understand

what's going on.
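The idea of two models acting as a check on each other can be sketched as a simple disagreement filter. Everything below is hypothetical: the keyword lists, the blurbs, and the `COMPLEX_PREDICTIONS` dictionary, which stands in for the output of a more sophisticated model that is not implemented here.

```python
def keyword_genre(text, keywords):
    """Interpretable baseline: pick the genre whose keyword list matches most words.

    Returns None when no keyword matches at all.
    """
    words = set(text.lower().split())
    hits = {genre: len(words & set(terms)) for genre, terms in keywords.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None


def flag_disagreements(books, complex_predictions, keywords):
    """Return titles where the baseline and the complex model disagree."""
    flagged = []
    for title, blurb in books.items():
        simple = keyword_genre(blurb, keywords)
        if simple is not None and simple != complex_predictions.get(title):
            flagged.append(title)
    return flagged


# Hypothetical keyword lists, blurbs, and stand-in outputs of a complex model.
KEYWORDS = {
    "romance": ["love", "heart", "wedding"],
    "science fiction": ["spaceship", "robot", "galaxy"],
}
BOOKS = {
    "Book A": "a love story that ends with a wedding",
    "Book B": "a robot pilots a spaceship across the galaxy",
}
COMPLEX_PREDICTIONS = {"Book A": "science fiction", "Book B": "science fiction"}

needs_review = flag_disagreements(BOOKS, COMPLEX_PREDICTIONS, KEYWORDS)
```

Only the titles where the two models disagree go to a person, who can then decide which model was wrong and why.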

Kirill Eremenko: Wow. Very cool. Well, if anybody wants to find out how

this story ends, DataScienceGO 2019, end of

September, in San Diego. That's where you can catch

Kevin. I wanted to ask you, Kevin, you mentioned

neural networks. What's your view in terms of the

work you guys do ... There's a lot of ... especially in the

part of understanding the content, I'm assuming

there's a lot of working with text, and language

processing. What is your view on neural networks

versus machine learning approaches?


Kevin Perko: I think that for the most part, they complement each

other, and that really, neural networks use a lot of

machine learning. They're not these separate worlds of

things. When you're setting up a neural network,

people have kind of said it's much more like

differentiable programming. It's like a config file,

especially if you're working with Keras, you're sort of

setting up, okay, like what are my activation units,

how many layers do I want. You're deciding these

things and it's like, what are you deciding when you're

thinking about this. Okay, well you're thinking about

maybe a linear model, or a logistic model, in terms of

how you want to represent a thing.

Kevin Perko: The difference is that what you're thinking about is

one part of the model. You're not thinking about the

whole model anymore. The neural network kind of

takes all that. It adds its hidden layers, and it does

extra things that aren't really represented here, but

you're kind of guiding it, so you're more of a guide

rather than like, oh, rather than logistic regression, I

learn these features, however I learn them, and I put

them in all and it gives me something very

interpretable. Outputting probabilities, which are very

understandable and that's what the model is, versus

neural networks just trying to kind of map something

really probably non-linear, and understanding that

without ...

Kevin Perko: It's not going to give you that nice interpretability

component yet, but it uses the same I would say

mathematical approaches under the hood. Then it

kind of adds on its own layer. I think that like I was


saying, they really complement each other, and there's

no like, this is better than this. It just depends on the

use case. The truth is in industry most of the time you

don't actually need anything neural networks. Like I

was saying, it's better to stay on the old stuff that

people have proven out, that works really well, that

you can actually communicate with, because it's really

hard to talk to somebody about neural networks given

their ... It's like, all the machine learning stuff

combined into this other box, and then put that inside

another box, and then you kind of ship that out.

Kevin Perko: Then people ask you, "Well, how did this decision get

made?," and you don't really have a good answer for

them. Whereas if you're using random forest, or

logistic or linear regression, you can say something

much more confident about, "Oh hey, this is how this

model made this policy or this decision, and I really

understand what that means, and what it's trained on.

We can debate if that's right or wrong." This is how we

go there. That doesn't exist with neural networks. That's

why I think they're a balance, when you think about

traditional machine learning techniques.
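What "you can say how this model made this decision" looks like for logistic regression can be sketched directly: the prediction is a weighted sum passed through a sigmoid, so each feature's contribution can be read off. The coefficients, feature names, and the `explain_logistic` helper below are invented for illustration and are not taken from any real model.

```python
import math


def explain_logistic(weights, bias, features):
    """Return (probability, per-feature contributions to the log-odds)."""
    contributions = {name: weights[name] * value for name, value in features.items()}
    log_odds = bias + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-log_odds))  # sigmoid
    return probability, contributions


# Hypothetical, hand-picked coefficients for "will this reader finish the book?"
WEIGHTS = {
    "same_genre_as_last_read": 1.2,
    "page_count_hundreds": -0.3,
    "author_read_before": 0.8,
}
BIAS = -0.5

probability, contributions = explain_logistic(
    WEIGHTS,
    BIAS,
    {"same_genre_as_last_read": 1, "page_count_hundreds": 4, "author_read_before": 0},
)
```

Each entry in `contributions` says how far one feature pushed the log-odds up or down, which is the kind of answer a stakeholder can debate. A deep network offers no equally direct readout.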

Kevin Perko: Same thing with support vector machines. Given it's a

margin-based classifier, you pretty much understand

how it's making these decisions. Whereas with

something like neural networks, you really ... That's

kind of the core thing today, you don't. I think in the

future, people are going to sort of break through that

wall and we will understand these decisions, well

enough anyway that people will get much more

confidence in the models. That's proving to be


increasingly important, as these things get

incorporated, like doing facial recognition for all sorts

of use cases. When a model's impacting sentencing

guidelines, you really want to have a lot of

interpretability behind that model.

Kevin Perko: These are things that I definitely worry about, that

people use these kinds of tools without understanding

like, oh wow, people, there is a lot of ambiguousness

between how this model is working, and there's lots of

opportunity for this to go awry, when you don't have a

good kind of interpretability, and a good transparency

layer. I think that was sort of a big thing for data

science in general is to get much better at that,

especially as data science permeates all parts of

business, and culture. People want to know, "Hey, how

did this happen? If we're going to delegate this to an

algorithm, how did it make the decision?"

Kevin Perko: In the past it was just, if we can make a good decision,

then we'll go do it. In the future it's like, if we can

make a good decision that we can explain, and people

will agree with it, we'll go do it. Sometimes we'll make

a less good, but perhaps a more societally fair decision

that people agree with. We'll have the ability to adjust

the knob and do that, whereas today we may not.

Kirill Eremenko: That's a whole explainable AI. [inaudible 00:39:56]

becoming more of a trend we're seeing that even this

year, more questions are being raised, more companies

or agencies, government agencies including, are asking

the question, "Is this explainable AI? Do we know how

it's making these decisions?," because as you

mentioned, with data science becoming more and more


part of our daily lives, and society, there's so much

that can go wrong in terms of recognition of even facial

recognition, and any kind of associated racism that

can be incorporated in that, or sexism, and when you

can explain how the model works, you can point that.

When you can't explain, then you've got a whole

different can of worms that you're going to open.

Kirill Eremenko: A lot of it also comes, especially in neural networks,

comes from labeled data. Like, the AI might be the

neural network is ... just the architecture is very

neutral, but then the data that it was labeled already

has some kind of bias, so has some sort of

discrimination in it. Then the AI learns that, and try to go

in there and make it unlearn that if you can't get ...

You don't know which neuron responds ... correlates

to which features. It's pretty insane.

Kevin Perko: Exactly. Exactly. That's a great point, that the

algorithms are just representing a bias, and when we

have bias as society, that is represented in the data

sets. The algorithms don't ... they're amoral. They

don't know that that's not the ideal outcome. They

actually think that's the outcome they're supposed to

learn and reinforce.

Kirill Eremenko: Yeah. Then you've got that whole trend. Have you seen

those images when people take like a stop sign, and

they put some stickers on it, and self driving car

doesn't recognize it as a stop sign anymore.

Kevin Perko: I have not, but that does not surprise me at all,

because I see those self driving cars around San

Francisco all the time, and they really struggle.


Kirill Eremenko: Oh wow. Where is it ... I haven't been in San Francisco

for a while. What company is that through, Uber or

self driving Ubers?

Kevin Perko: Typically what I'm seeing are the Cruise vehicles.

Kirill Eremenko: Okay. What do they do?

Kevin Perko: Cruise, I think GM bought Cruise, and-

Kirill Eremenko: Oh okay, gotcha. It's like a [inaudible 00:42:24]

transportation company.

Kevin Perko: Right, right. They have SUVs drive around San

Francisco with a ton of sensors, and they're logging in

an incredible number of miles in the city. You can see

how much they struggle at intersections, and it's like a

bike goes by, then they're like suddenly swerving, and

you're just like, technology is not right. People talk

about level five, in like 10 years. I'm like, level five is

just like, we can't even think about that. This is, these

cars are just ... they are not ready. I mean, I get it.

Urban environments are really hard, but the core thing

is you can't learn everything in advance, and I think

that's where we're just kind of pushing the current

limits of what we have with vision and AI, is that we're

trying to.

Kevin Perko: We're trying to have incredible lidar that can respond

super fast, instead of a general intelligence that

understands how to value different objects. These cars

can't do that, so they treat a cat the same as a

bicyclist, the same as a semi truck. It's just an object,

and there's no association or learning with it. Now,

I'm sure that's changing. I think that's kind of the key


problem, is until you do that, then you're going to

react the same to a cat, or a squirrel, that you are

going to react to a semi truck, which is a problem.

Kevin Perko: The other thing is if you just had like a whole network

and it was all autonomous, then you'd be kind of fine.

The machines could do weird things, but you'd figure

out how to solve that. When you're interacting those

with humans, and the machines don't have a way of

relating to the humans, then you get all these new

problems. My favorite one was they had to make the

driving system more aggressive at intersections in

California, because we all do the rolling stop,

especially in San Francisco. The car would just sit

there waiting for its turn to go, and it would never go,

because there was never a point where all four cars

came to a 100% complete stop.
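The four-way-stop anecdote comes down to a rule that is too strict for how people actually drive. A toy sketch of that failure follows. The `may_proceed` function, the speeds, and the tolerance value are all assumptions made for illustration and say nothing about how any real driving system is implemented.

```python
def may_proceed(other_car_speeds_mph, stop_tolerance_mph=0.0):
    """Return True when every other car counts as stopped under the tolerance."""
    return all(speed <= stop_tolerance_mph for speed in other_car_speeds_mph)


# Made-up speeds for three other cars, two of them doing a rolling stop.
rolling_stop = [0.0, 1.5, 2.0]

# Strict rule: wait until everyone is at a complete stop.
strict = may_proceed(rolling_stop)

# Relaxed rule: treat a slow roll as "stopped".
relaxed = may_proceed(rolling_stop, stop_tolerance_mph=3.0)
```

Under the strict rule the car keeps waiting for as long as anyone is rolling. Loosening the tolerance is one simple way to picture what "more aggressive" could mean here.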

Kirill Eremenko: Okay, gotcha. Okay, yeah, okay, because the rules are

kind of different. It's following strict rules, whereas

humans are more flexible with the rules I guess.

Kevin Perko: Right. We think about the spirit of the rule, are we

causing harm, and try to interpret that within the

context of the situation, like is it sunny, or raining, or

am I surrounded by bikers or little kids, whereas

literally the machines, they don't have any of that

context. They're just like, this is the rule. If the speed

limit is this, and it says this, then I do this.

Kirill Eremenko: Yeah, wow. Okay, very, very interesting observation.

Must be pretty scary dodging these cars.

Kevin Perko: It is, it is. Sometimes it concerns me to think that they

are actually going to try to have that ready to go. I


think that they do have some in ... maybe it's in

Arizona, but it's on kind of like a closed track, where

they know exactly what the variables are going to be,

and that works fine. It's just urban environments are

really hard, even for human drivers who have a lot of

experience. They're very challenging. For machines,

they're incredibly difficult, because the number of

things you have to learn each second, it changes every

second.

Kirill Eremenko: Yeah. Well, technology, data, it's interesting to see how

they are coming. Data is becoming more and more

recognized as something that's driving business, and

these two things, technology and data, are coming

closer together. They've always been propelling one

another, but now we're trying to use data everywhere

where we can, and technology as well. Then what I

notice about your background is that it looks like

you've changed careers very consciously I would say,

that you've selected different companies, or different

roles, in data science to work, but they've never been

along the same line. Let's say, developing self driving

cars, or in the case of Scribd, like working with

recommender system, or understanding content.

Kirill Eremenko: It feels like you've moved around the space quite a bit.

Can you comment on that? Why these choices of roles

and careers? Were you searching for something? Did

you consciously decide on what you want to learn next

before progressing further?

Kevin Perko: Right. I think it's easy to look back historically and see

a narrative. I'll say at the time it was really kind of like

an exploration, give it much more like a gradient


descent. I'm taking these steps, some of them good,

some of them not as good, and just learning, and

gathering more information. They're all really valuable

steps, because now I know if I'm walking up the hill, or

down the hill. What they've kind of given me in

aggregate is this really unique view of all of the

different parts of the system, in terms of how

companies actually can use data science, how we

think about this idea of a full stack data scientist, kind

of comes from my past experience of seeing, well okay,

somebody can't ingest this data right, then there's no

data science.

Kevin Perko: If you don't have good data that's clean, then you

spend all your time doing that and so you spend very

little time applying it to a model. These are the kind of

the key systems of like, oh, if you can't deploy your

model, then you're just beholden to another group,

and you're not like a data science business unit.

You're not shipping product, you're really more of a

support function if you're constantly bound by

somebody else, to go put the thing that you made into

a product. That really limits your scope and your

ability.

Kevin Perko: That's kind of what I've seen across my experience

across all these organizations, getting to see how

different organizations treat data science. It's really

kind of a key thing, that you have an organization that

the executives believe in data science. They believe

that you can use experimentation and machine

learning, not just to make their product better, but to

be the product. That's something I very much see has


to come from the top. When it does, it makes your life

much, much easier, and the company is on board, and

you're pushing the edge more than just trying to say,

"This is why we should exist."

Kevin Perko: Kind of having my experience in hindsight has given

me a lot of these really unique perspectives. Going

forward as I build it, I just thought this is a really

interesting opportunity, let's try this, let's try this. I

didn't really see how it was going to connect. Looking

back, I can kind of see that it's been a really nice

connection by working at these different companies,

seeing different approaches, how all this works

together, seeing different organizational structures

where you have it really split up, where data science

doesn't have access to any systems, and how limiting

and suboptimal that is for a data science group.

Kevin Perko: To have those restrictions, whereas if you think about

the other side, of well, what if they have engineers with

data scientists, and they're shipping product. That is

really where you want to be for every data science

team, because then you get really to this true full

stack data science org, that can ship product, that can

support change, that can do whatever it needs to do

within the business, rather than having something

that's very kind of boxed in, into its very specific niche.

Kevin Perko: It does that and maybe it does it really well, and

creates a ton of value for the business, but in my

opinion it's always going to be suboptimal to structure

it that way.


Kirill Eremenko: How would you advise somebody who is looking from

without an organization? From externally, and maybe

looking for a job, or looking to move into that

organization, change career, how would you

recommend for a person like that to determine the

answer to that question? Is data science seen by the

executive as a product or not, because when you're

inside it might be quite obvious, but when you're

outside, and you're trying to understand if this is the

right company for you to work in, it might be difficult

to see.

Kevin Perko: That's a great question. I think that it is always going

to be difficult to judge something like that from the

outside. What you can do are like little ... You have to

look for signals, kind of build your own pattern

recognition system, and ask questions, really simple

things like, do they have a blog, does it get updated, is

the company ... are any executives talking about data

science or machine learning, any public interviews

ever, do they have maybe a chief data officer, or a VP

of data science. If you're able to talk to people, if you're

in the interview process and you're talking to someone

who's maybe director, executive level, what do they

think about data science, how do they think it's

driving the business, and really listen to how they

answer that question, and what they say.

Kevin Perko: Do they have a vision? Have they thought about it at

all? Or is this like, we don't know, we want you to

come in and do it, and we're open. How they answer

those questions will tell you a lot about how the

organization views it. Most people will be pretty honest


there and say, "Okay, we really think that this will

help us increase our lead generation by 3X for our

business, if we're B2B SaaS, and that's more money,

and that's how we see it. That's the end of our data

science at the company."

Kevin Perko: Then you can make your own decision, once you get

that. I think it's really kind of being able to talk to

somebody from a more senior leadership position, and

getting good answers on, have they thought about this

deeply, and they actually believe in it, or they see

everybody, and they just want to hire, because it's

usually pretty clear, when somebody's trying to hire a

data scientist, because they think they should have a

data scientist, and yet they have no idea what the data

scientist will do. They won't actually be able to tell you

what any of the projects are, or any of the vision is for

data science in that job unit, or what have you.

Kevin Perko: I think those are kind of the key signals. You can kind

of start parsing out. You can also just sort of ask

people how the teams are organized, is it in

engineering, which might be really important at a

smaller company, is it in product, is it in marketing, is

it in finance. I've seen all of these structures. They

mean really different things for the data science group.

Is the data science group totally decentralized and

everybody's embedded within a specific team, that's a

really different data science experience rather than

joining a data science team, and then working within

different areas of the business.

Kevin Perko: I think all of these things are areas you can look for,

and questions you can ask to try to assess that out.


Kirill Eremenko: I love what you mentioned about the decentralized

embedded data science team, where you've got data

scientists, or machine learning engineers, in different

functions of the business, versus a stand alone data

science team, something like what you have at Scribd.

What would you say are the advantages,

disadvantages of either of the approaches?

Kevin Perko: Right. At Scribd we would have something

approaching a hybrid model of this. I think that ... The

advantage of having a core data science team is that it

really has to think of itself as a business unit, and go

around, and connect itself to the business, and

understand what the priorities are, and where it can

drive value, and what opportunities exist. Then can

kind of track those out into near term, medium term,

long term initiatives, whereas your long term initiative

is like trying to ship really exciting state of the art

products, and then short term is something very

clearly defined.

Kevin Perko: You're working with GBM you can re-rank something

better, or represent something better with some vision

that's already been solved, using a pre-trained model,

and you know you can ship that in a month, and help

the business in this way. The kind of key thing right

there is you have to align yourself really tightly with

the business. When you're embedded, it's really easy

to say, "Okay, well a product manager brought a road

map, or somebody brought a road map, and we're

executing on it. You told me to build this algorithm,

and so I'm going to go build it."


Kevin Perko: We have a recommendation system, and we're going to

try to make it three percent better, rather than asking

if we even have the right system, and then taking three

to six months to rebuild it, which is what you're going

to get from the business unit approach. Whereas the

embedded approach is much more likely to be

iterative. There's going to be other factors in there, but

that's sort of what I've seen, is that it drives this

iterative approach, which makes it hard to make

bigger gains. It's certainly valuable for the business to

have iterative gains in the near term. However, it kind

of limits your ability longer term to sort of go after

bigger opportunities [crosstalk 00:54:38].

Kirill Eremenko: When you say you have a hybrid model, what do you

mean by that?

Kevin Perko: Right. When we have a hybrid model at Scribd, we

have data scientists that are embedded on product

facing squads, as well as search and recommendations.

They work with those squads really tightly. Those

squads have road maps. They are doing some of the

iterative thing, and what we're doing now is to really

pair that more with like, well, let's drive road map, let's

think how we can kind of reimagine the system instead

of just making an existing system we inherited a little

bit better. Maybe we can actually make it a lot better,

however we're still working within the constraints of

that system, without really deeply questioning if that

system should exist. Which, like I said, is just a

function of being embedded.

Kevin Perko: In Toronto, we're really focusing on the more business

unit type approach. I'm going to bring that approach to


San Francisco as well, so we're really thinking about

how to reimagine the system, in addition to driving

iterative improvements.

Kirill Eremenko: Okay. Very useful information, especially for business

owners, or executives listening to this. Would you say

there is kind of like a threshold, when a company

should maybe for instance as a smaller firm, a smaller

organization, a start up start with embedded

approach, and then at some threshold, switch over to

the core data science team approach, or the stand

alone data science team approach? Would you say

there's a time in the life of any business when that

should happen, or this really depends on the type of

business nature of the industry?

Kevin Perko: It depends on the business. My personal view is that

going to the business unit sooner is always going to be

better. The trade off with that is you don't get the

really embedded focus that that brings. If you're trying

to ship something, say you're a start up, and you need

to raise your next [inaudible 00:56:41], you need to hit

very specific goals in the next six months. The

embedded unit can really help you align everybody,

really clearly, assuming you already know it needs to

happen. When you're in much more of the greenfield

space, the embedded units, it's harder for them to

deliver that kind of work, especially from the data

science side, because data scientists really kind of

shine when they're working with other data scientists

consistently, and bouncing their ideas off, and

thinking about things.


Kevin Perko: They're like outside of the bounds of what people are

envisioning of the next version. That's where it just

becomes, you could have a product manager who

totally gets data science and they can do that in that

model, and it works, and you don't actually need this.

You can get the same gains that you would get. What

I've seen practically is that there's not very many

people like that. Depending on your organization, and

what you're solving for, there is kind of like a point

where you want to think, okay, when ... am I getting

enough out of the data science team. If not, the kind of

the business unit approach and now it's important to

pair that approach with a full stack approach. It's data

scientists, engineers, whoever they need to ship their

product.

Kevin Perko: Maybe in your company it's front end engineers and

designers. They should have it, and they should be

accountable just like a product org, and run it the

same way. It's no longer a support function. It's now a

unit shipping product that's driving your company

forward, and you can't have them sort of constrained

by other parts of the organization, because then you're

not going to really get to see what they can do. I think

that it's simply a trade off for the business, depending

on what you want to achieve. It's not like, oh, you have

to do this, or you have to do this. It really depends on

the goals of the business.

Kevin Perko: I'm always going to say that the business unit is going

to be really more powerful. Longer term it's going to

create more value. I feel very strongly about that. I

think though in the short term, and in the medium


term, that can be very iffy. If those are really where the

business is focused, they can have different ways of

approaching that.

Kirill Eremenko: Okay, gotcha. Thank you for that overview. I got like a

question, a philosophical question for you, where do

you think the field of data science is going, and what

should our listeners prepare for to be ready for the

future that's coming in the next three to five years?

Kevin Perko: I think we talked about this a little bit earlier, where

data science is starting to pervade every part of our

daily lives, and so people are now asking these big

questions about, hey, how does it impact my privacy,

how did the model make this decision. I think privacy

and interpretability are going to become increasingly

important. I think you see this a little bit with Android

and iOS, and you can do some on device training, or

serving, depending on how you set it up, that can

really actually drive user privacy, and machine

learning. Those two things used to be opposed. Now

they can be united. I think privacy is generally

becoming a big worldwide thing as people realize the

value of the data, and the value of their privacy that

they've just kind of given over to corporations and

governments, so they want it back.

Kevin Perko: I don't think that's going away. You have things like

the blockchain, which is high level, sort of a universal

trust in verification system. It's really exciting to think

how can data science intersect with that, can we

actually write contracts with Ethereum that are socially

enforceable, and build models, and have all of these

sort of units served where we have general ledgers of


trust, and where does data science play in that, like

how can we think about what kind of society we want to have,

and what data science can enable within that. These

are really big questions for us to ask, because I think

with the models, it's sort of both: they're already there

in the incredible things they can do, and they're

really far away in the things that get

hyped a lot, like actually having autonomy in self-driving

cars.

Kevin Perko: Computer vision is still very, very early. I think that it's

going to get deployed in a lot more situations where it's

actually making decisions for classifying people, where

it's probably not ready. That's just going to happen.

The best thing that we can do is to really push the

interpretability, so people can say, "Oh, it's kind of

clear this algorithm isn't ready, but we can pair it with

humans." That's what a lot of businesses that use AI,

do. They pair it with huge amounts of people labeling

the data, and evaluating the decisions the model

made, and understanding if it's right. We need to

continue to do that same thing as it gets out into

society in general.

Kevin Perko: Everybody needs to be able to evaluate a model, and

understand if the decision it made based on this

information is reasonable, and have debates about it,

as it comes into society. I think that's real exciting,

because people are now building ... You have

[inaudible 01:01:20] processing units, but this

computer is specifically dedicated for serving and in

some cases training models, and that's real exciting,

because most of the limit, I think, on

machine learning and neural networks, and general AI,

has really been on the compute. This is like pushing it

back to the algorithm. Then you see once that

happens, every kind of six months people are sort of

pushing the state of the art, and that's going to

continue to happen, as long as we don't run into

another compute wall.

Kevin Perko: I think the future can be sort of whatever we make it.

It can be a dystopian, 1984-type situation, where we're

all getting bound by this facial recognition that we

don't know how it works, and the government's using

it, or we can create this real incredible future where we

can be revolutionizing how food is grown, and how

water gets preserved, and how we're tackling climate

change, and data science can move into all of these

fields, and it should, and it can help. We can help

people understand what's actually behind all these

decisions, and make better allocations of our

resources, using data science models, and using a lot

of models that already exist today.

Kevin Perko: It's kind of getting them into government, getting them

into these really large companies that move really

slowly. That's sort of a really big piece, is kind of the

pervasiveness as much as pushing the state of the art

of data science. That's really exciting work that can open

up new implications and new technologies, and new

products. I think that there's also a lot of gains to be

made on just increasing the pervasiveness of data

science among existing industries like schools, and

governments. That can have a very large positive

effect.


Kirill Eremenko: Gotcha. It seems like we've gone full circle here on the

podcast, that we came back to where we started from,

that applied data science is kind of the answer, don't

just learn data science for the sake of learning data

science, but see what impact you can make in the

world, whether it's through various industries, and

exciting projects, or it is through bringing data science

to government, and society, in a very understandable,

secure way, that respects people's privacy.

Kevin Perko: Absolutely. I think that's a great summary, because

you can solve a lot of problems with regressions, better

than they're being solved today, and people can

understand those decisions, and can actually improve

the world doing that, which is really exciting.

Kirill Eremenko: Fantastic. Well, thank you, Kevin. This brings us to

the end of today's episode. Before I let you go, what's

the best way for people to contact you, get in touch,

follow your career, learn more about what you're

doing?

Kevin Perko: People can follow me on Twitter, @croatiankp. We've

got a data science blog, Scribd data science and

engineering blog on Medium. Obviously there's

LinkedIn, feel free to follow me there, although I don't

post very much material on LinkedIn. I think those are

all great places.

Kirill Eremenko: Nice job. Obviously people can apply for positions that

you're looking to fill on the Scribd website, right, you

said?


Kevin Perko: Right, right. You can go to Scribd.com/jobs, and we

have some data science openings, you can apply there

as well.

Kirill Eremenko: Fantastic. Well, we'll share all those links in the show

notes. Make sure, guys, and everybody listening to get

in touch with Kevin, follow Kevin. Kevin, one more

question for you before we finish up, what's a book

that you can recommend to our listeners, that will

help them in their careers, or in life?

Kevin Perko: I recently read Bad Blood, which is about the Theranos

founder, Elizabeth Holmes, and I think it's a really

incredible book, because it sort of shows this

intersection of building a future, and how you can

kind of go over the line with that. You get kind of

caught up in your own potential

too much. Building the future's actually really hard.

When you're dealing with something like health care, if

you get caught up in those things, you can create very

bad outcomes for people. It's kind of a good sort of

message for data scientists, of like, we can take this

incredible tool we have and use it for bad, or we can

kind of say, "How do we leverage this thing," and really

kind of think about how we drive new, amazing

systems, and strengthen the world in a better way,

using it.

Kirill Eremenko: Yeah, I actually watched a documentary about that on

the plane recently, and indeed, extremely interesting

and very educational story for anybody in technology

and data science, that the things that as you said,

could be used for good or for bad, and even trying to

use it for good you can get really caught up in the


promise that it has, that technology. Sometimes we're

not there yet, like with the whole self-driving cars.

Right? We need to navigate our way to get there first.

Kevin Perko: Exactly, exactly.

Kirill Eremenko: Gotcha. Okay, well, Kevin, thanks so much. Looking

forward to seeing you in person at DataScienceGO. All

right.

Kevin Perko: Absolutely. I can't wait either.

Kirill Eremenko: There you have it, ladies and gentlemen. That was

Kevin Perko, Head of Data Science at Scribd. Thank

you so much for joining us for this conversation today.

I hope you enjoyed the chat that we had, and probably

for me, one of the favorite parts was what Kevin

mentioned about the different types of data science

teams that you can have. You can have a decentralized

team where all your data scientists or machine learning

experts are embedded within the different divisions of

your business, or you can have a centralized team of

data scientists, a standalone core data science team.

There are advantages and disadvantages to both, but

it's important to understand that it is a conscious

decision on how a business should do that.

Kirill Eremenko: If you're a business owner, or entrepreneur, that's

something to think about. If you're a data scientist,

that's also something to think about in the sense

of how does your business do it at the moment, or

how does the business that you're applying for do it.

That's a question that you might want to ask at an

interview, to understand better what your role is going

to be about. If you enjoyed this conversation with


Kevin, I am 100% sure you're going to enjoy his

keynote at DataScienceGO 2019. If you haven't

gotten your tickets yet, head on over to

www.datasciencego.com, and join us this September

27th, 28th, 29th, in San Diego. Wonderful city,

wonderful conference.

Kirill Eremenko: Get to network with Kevin, lots of other amazing,

insightful speakers. We have over 30 speakers

attending, and of course we're going to have between

600 and 800 data scientists coming to

DataScienceGO. You don't want to miss this

opportunity to expand your network. We had people fly

all the way from Brazil on 27 hour flights, on 20 plus

hour flights from Europe in the previous years, so

distance is not an excuse. I look forward to seeing you

at DataScienceGO, and networking with you

personally.

Kirill Eremenko: On that note, thank you so much for being here today,

and I'll see you next time. Until then, happy analyzing.

