SDS Podcast Episode 279: Embedding Data Science in …
TRANSCRIPT
Kirill Eremenko: This is episode 279 with Head of Data Science at
Scribd, Kevin Perko.
Kirill Eremenko: Welcome to the SuperDataScience podcast. My name
is Kirill Eremenko, Data Science Coach and Lifestyle
Entrepreneur. Each week we bring you inspiring
people and ideas to help you build your successful
career in data science. Thanks for being here today
and now, let's make the complex, simple.
Kirill Eremenko: This episode is brought to you by our very own data
science conference, DataScienceGO 2019. There are
plenty of data science conferences out there.
DataScienceGO is not your ordinary data science
event. This is a conference dedicated to career
advancement. We have three days of immersive talks,
panels and training sessions designed to teach,
inspire, and guide you. There are three separate career
tracks involved, so whether you're a beginner, a
practitioner or a manager you can find a career track
for you and select the right talks to advance your
career.
Kirill Eremenko: We're expecting 40 speakers, that’s four, zero, 40
speakers to join us for DataScienceGO 2019. And just
to give you a taste of what to expect, here are some of
the speakers that we had in the previous years:
Creator of Makeover Monday Andy Kriebel, AI Thought
Leader Ben Taylor, Data Science Influencer Randy Lao,
Data Science Mentor Kristen Kehrer, Founder of Visual
Cinnamon Nadieh Bremer, Technology Futurist Pablos
Holman, and many, many more.
Kirill Eremenko: This year we will have over 800 attendees from
beginners to data scientists to managers and leaders.
So there will be plenty of networking opportunities
with our attendees and speakers, and you don't want
to miss out on that. That's the best way to grow your
data science network and grow your career. And as a
bonus there will be a track for executives. So if you're
an executive listening to this, check this out. Last year
at DataScienceGO X, which is our special track for
executives, we had key business decision makers from
Ellie Mae, Levi Strauss, Dell, Red Bull, and more.
Kirill Eremenko: So whether you're a beginner, practitioner, manager or
executive, DataScienceGO is for you. DataScienceGO
is happening on the 27th, 28th, 29th of September
2019 in San Diego. Don't miss out. You can get your
tickets at www.datasciencego.com. I would personally
love to see you there, network with you and help
inspire your career or progress your business into the
space of data science. Once again, the website is
www.datasciencego.com, and I'll see you there.
Kirill Eremenko: Welcome back to the SuperDataScience podcast, ladies
and gentlemen. Today I've got a super exciting guest,
another speaker who will be joining us for
DataScienceGO 2019, at the end of September, this
year. If you haven't gotten your tickets yet, check out
www.datasciencego.com. Today we have Kevin Perko.
Kevin is the head of data science at Scribd, and he is
leading a team of approximately 13 data scientists,
between San Francisco, and Toronto. We had a
fantastic chat today, so here are a couple things that
you will take away from this conversation.
Kirill Eremenko: You will learn what it's like to be a data science
manager, or a data science leader, and what it's like to
manage a team, and more so two teams, in two
different locations, and how that is different to actually
doing the technical work. If you're thinking of
progressing as a data scientist to a data science
manager, or to a head of data science, this will be very
valuable for you. Also, you'll learn about the Book
Genome Project, that they're doing at Scribd, which is
a very exciting undertaking. You'll learn what it's like
when a company sees data science as a product, as
opposed to an auxiliary function.
Kirill Eremenko: If you're a business owner or an executive, you'll learn
a very valuable concept of decentralized, or embedded
teams, versus core data science teams. What's the
difference when your data scientists or machine
learning experts are embedded throughout your
organization, versus when they're in one core
centralized team of data scientists, what are the
advantages and disadvantages of each approach, and
what stage of the business should you be doing each
one in, and what should you be aiming for.
Kirill Eremenko: Finally, if you are in Toronto, or San Francisco, and
you are looking for a job or considering a new role in
data science, then stay tuned for this podcast, because
Kevin will announce that they're hiring, and you might
just like this company, and might just want to check
them out. On that note, very exciting podcast coming
up. Can't wait for you to check it out. Let's get straight
into it. Without further ado, I bring to you, Kevin
Perko, Head of Data Science at Scribd.
Kirill Eremenko: Welcome back to the SuperDataScience podcast, ladies
and gentlemen. Super excited to have you on the show
here today, with my lovely guest, Kevin Perko, calling
in from San Francisco. Kevin, how are you doing?
Kevin Perko: Doing great. I'm doing great.
Kirill Eremenko: It was fun chatting just now about like, a book, and
you haven't written one yet. If you were to write a
book, what would it be about?
Kevin Perko: Oh, that's a great question. If I was going to write a
book, I think I would focus on kind of how
interdisciplinary data science is, and how that is really
kind of what makes it come alive. You've got elements
from psychology, you've got these general things
around just being curious, and you've got to really
program, and build models, and sort of represent in
the world, and I think all of those things kind of come
together in this nice sort of like, systems thinking,
complex systems type fields of study, that people don't
usually study who do data science. I also think it's
why people who do study something like physics, which
is literally the building blocks of the universe, tend to
do really well in data science.
Kevin Perko: I think my book would try to capture more of these
elements and kind of interweave them, showing
how these things are building on each other, and why
neural networks are kind of something that's really
interesting, and come out of something from the '70s
really, even before that. It's not like a new thing, but
just to give people a sense of understanding on how
everything is interrelated, and it's all towards
understanding how we model these things, and that
while people like to talk about AI, there isn't really
anything that approaches general intelligence yet. We're
still really mapping these functions to output values.
Kevin Perko: I think understanding the systems in which these
operate is really, really interesting.
Kirill Eremenko: Very, very true. Do you feel that data science kind of
came together as a chain of development ... not even
chain, like a group of developments in different fields.
You know, there's elements of data science that come
from economics, there's elements that come from
physics, as you mentioned, there's elements that come
from neural networks and IT, there's elements that
come from mathematics, even biology. Some of the
statistical apparatus, especially in R, originally came
from A/B testing, and random sampling in biology, or in
medicine.
Kirill Eremenko: Do you have this feeling that data science kind of right
now it's a separate science, there's arguments to
support that, but originally it independently grew in all
these different fields?
Kevin Perko: Right, right, absolutely. I think a good corollary of
this is that if I was going to recommend what
somebody should study, I would rather see them study
computational biology, mathematics, physics, as
opposed to data science itself, because then you're
kind of removing yourself from the actual subject
you're studying, and data science is always applied.
We're never just, at least in industry, thinking about
how to make [inaudible 00:08:06] more
efficient; we think about how to apply it to solve a
problem. If you come up from a computational X area,
that's really what you're going to be doing.
Kevin Perko: I see sometimes people come out of these sort of generic
data science programs saying, "I want to go do NLP," and
it's like, what problems do you want to solve with that?
Why do you care about having this tool? So you can
leverage it for solving a problem, whether it's in health
care, or physics, or business. I think that's where it
gets really exciting, is when you mix those applied
fields together. Somebody, I'm kind of remembering
here, somebody was studying glaciology and they were
actually applying data science methods, and they were
able to map how glaciers are moving, where people
previously hadn't been able to. It's like, that's where
data science really shines.
Kevin Perko: That's where it gets really exciting. Yeah, I think that
that's kind of like my ... My thing is that I almost think
it shouldn't ... It can't be a separate thing. It has to be
in all of these things, because it can help all of these
various fields move forward faster, as opposed to just
itself.
Kirill Eremenko: Wow, very interesting perspective. Applied data
science, great way to get started into the field. I guess
if you combine it with something that you're
passionate about, something that like somebody who's
doing glaciology has to be excited about glaciers, and
there has to be some story behind it, why they're doing
it, I guess if you do it that way, you get the extra boost
of seeing how applying data science to this field that
you're very interested in, can make massive progress
and massive impact in that field.
Kevin Perko: Absolutely, absolutely. I think that's really where
people drive breakthroughs, is when they bring a
couple different fields together. Data science is a great
one that you can bring it to almost any field. It can
help you rather infer, compute, figure out what is the
true structure of all of these different areas, and that's
really powerful. There's not a lot of fields that do that,
but if you're just focusing on where do I run around
and how to apply data science and algorithms to, you
get a lot of interesting things. You see a lot of the voice
to face, or the deepfakes, and all this stuff.
Kevin Perko: There's people that think, well, there's social media, and I
can get a lot of press if I do this thing that's going to
freak people out. Then that's what happens, and we
end up building something that kind of scares people
about AI, and also has a debatable social value, rather
than really pursuing trying to build up
breakthroughs in hard sciences, which is really
exciting and really valuable to the world. That's where
I see the trade off.
Kirill Eremenko: What's your story? How did you get into the space of
data science? What did you study?
Kevin Perko: I actually studied finance. For me, data science really
happened to me. I was always interested in numbers,
and thinking about numbers, and I had picked up
programming when I was younger, sort of on and off.
Then in school, kind of switched over to like, I really
just want to do banking, because I love the stock
market because it had so many numbers associated
with it, and all this future value of money, and all
these kinds of things that are really interesting. I
didn't have ... For me, it didn't click, like oh, I should
do computer science yet. Then I got out and I was like,
I definitely should have done computer science.
Kevin Perko: Just ended up, I was like I just got to work at a tech
company, which I did. Did a variety of roles there. I got
into building an application. It was powered by data
though, so I got to interact with the data. I was
building what we call ETL pipelines now, but nobody
really had a name for it then. [crosstalk 00:11:38]-
Kirill Eremenko: [crosstalk 00:11:38], right?
Kevin Perko: Hmm?
Kirill Eremenko: Extract, transform, load.
Kevin Perko: Exactly, exactly. Nobody really knew what the next
thing was, nobody was like, let's do analysis on top of it.
We did a little, like a very light statistical analysis. I
did a little work with SEO, because we had more of a
long tail application. From there I basically knew that I
wanted to do more of this, but I still didn't really have
a name for it. People were thinking it was FP&A, which
is definitely not what it was, because it was much
more computer science oriented. Roughly around this
time, Facebook started to come out. The term data
scientist got popularized, but it was only for PhDs at
this point, for the most part.
Kevin Perko: They were solving these really, really massive problems
at scale, that didn't previously exist. They also had a
ton of users, and so all these unique problems that
most start ups didn't have, they couldn't really do this.
I kind of went from there to the next company. I again
did something similar, but I was closer to the analytics
this time. That kind of gave me the freedom to do all
these analyses, finally get into building some models,
doing some fraud modeling, some graph analysis, and
that's really where I was like, "Ah, this is incredible."
Kevin Perko: Booting up Gephi for the first time, and loading a graph in
there, and really seeing the representation of these
relationships, and how you could walk down the node,
and see how people are related and how fraud circles
form, fascinating stuff. This kind of hooked me and
then I was like, I again need to do more. I'm going in
the right area, even though I'm not really sure what it
is now. It's finally called data science, really diving in
to learning Python, everything else I need to.
Kevin Perko: From there then I worked for a gaming company. I
was like, all right, it's like a lab. It's like a science lab for
running experiments. Really interesting. It doesn't
necessarily feel that great, but you learn a ton about
how people respond very quickly to incentives, and
game play function, and game play economies, and all
of these really interesting areas. That's kind of in my
path and then from there I've continued on at Scribd.
For me, it was kind of this route that I was sort of on,
and I didn't know it. Then the industry just showed
up, and I was like, "This is exactly what I want to do."
Kirill Eremenko: That's awesome. Right place at the right time.
Kevin Perko: Absolutely.
Kirill Eremenko: Yeah, very interesting story. You've been now at Scribd
for what, like over five years?
Kevin Perko: That's right, five and a half years.
Kirill Eremenko: That's really cool. You started off as a data scientist,
then data science manager, and now you're head of data science.
Tell us what that feels like.
Kevin Perko: It's great, it's great. I mean, it's an exciting feeling to
both grow in a company and watch the company grow while
you're there. It's been a total mindset shift when you're
going in and doing the ground level work versus
having a team of people. We're in San Francisco and
Toronto, in terms of the data science team, and that's
just ... kind of have to ... Most of my career has been
sort of figuring out how to do things while I'm doing
them, and so managing a team is no different. You
really have to sort of change your job every six months
to a year. Nobody tells you that you're supposed to do
that, but you definitely are. Otherwise, you're going to
get stuck. [crosstalk 00:14:44]-
Kirill Eremenko: What do you mean by change a job?
Kevin Perko: What I mean is like, as a data scientist, you're really
thinking about the models, and the business problems
you're solving, and as a manager now you have to think
about how you help people solve those problems, and
what the communication around that looks like, and
how you're setting expectations, and what you're
delivering. Then once you're kind of managing the
whole team, you have to think like, what are we not
even thinking about, what's the culture, how do I kind
of delegate, so I have more people on the team who are
aligned with me and thinking the same way, and I can
be a multiplier effect, because I can't be everywhere
anymore.
Kevin Perko: Most of my day is kind of like sitting in meetings from
10:30 to 3:30, very typical day, and whether I'm doing
interviewing, or meeting with other PMs, or meeting
with other executives, all of those things kind of add
up, plus one on ones for the team, and so the days just
kind of fly by, so I can't really be there providing any
sort of technical leadership. I have to build that out on
the teams so the team has some senior people who can
do that. These sorts of things are like, okay, well
now I had to change my job. Previously I was much
more involved in this. Now I'm not involved at all.
Kevin Perko: Now I'm working with the team in Toronto, really
making sure that they get up and running, and we're
working on newer things, like we're working on
building a machine learning platform internally. Now
we're going to use some tools for this. We're not going
to write the whole thing ourselves. That's like a whole
new area. Okay, okay, now we really have to think
about this, and we really want to focus on getting
everybody more into the full stack data science side.
We've always sort of had the full stack data science
term that we've used internally, for how we think
about going end to end, but now we
want to take that to the next level where we're
working with Scala, and we're really being able to
productionalize anything at any point. Really kind of
pushing the team in that direction, to enable new
opportunities for us.
Kirill Eremenko: Very cool. How big is the team right now?
Kevin Perko: The team including myself is 13 people right now.
Kirill Eremenko: Oh, okay, gotcha. 13 across Toronto was it, and San
Francisco?
Kevin Perko: That's right, Toronto and San Francisco.
Kirill Eremenko: Very cool. I think it would be a good segue or a time to
mention a few things about Scribd, I guess. Tell us a
bit, what is Scribd, and what kind of product services
does that company offer?
Kevin Perko: Right. Scribd is a reading subscription service. It's
$8.99 a month, and you get access to books, audio
books, sheet music, articles, as well as user uploaded
content, which could be really anything: letters of
recommendation, people's physics theses that they've
published, game strategy guides, just a wide collection
of content that people have decided to upload
on the internet, and so Scribd enables you to get
access to that.
Kirill Eremenko: Oh, nice. What kind of data would you be working
with, or does your team work with on a daily basis?
Kevin Perko: We really work with, I think of it like, a couple different
types of data. One is sort of like the application level
data, which is who's paying us, where are they from,
all of that kind of demographic type information, what
devices are they on, etc. Then you have this sort of
user interaction event stream data of what did they do,
what did we show them, and how did they interact
with that, and how does that mix with what we know
about whether or not they're logged in, or logged out,
or a paying subscriber, what they've done in the past,
what other people have done. That's kind of like one
part of it, and then the other part is understanding the
content.
Kevin Perko: All of the books, and audio books, and user generated
content that we have that people have uploaded, really
understanding what that is, what language is that in,
what categories are they. For books, publishers
typically provide us with categories, whereas for
documents, users do not provide us with any
information, so it's up to us to decide, okay, what is
this document actually about, and how should we use
that information when we're building a search index,
to serve search results, or showing
recommendations.
Kirill Eremenko: Okay, very, very diverse. Two diverse areas, user
interaction, and understanding how they use the
platform, and also understanding the content. What
would a typical project look like for your data science
team?
Kevin Perko: That's a great question. I'm actually going to just use a
project that we're doing right now. Somebody identified
our success metric, which is our target for our GBM.
For search, for re-ranking items, once all the candidate
sets are generated, so think rows of books, audio
books, documents, then it kind of goes into this GBM,
and it decides how should I actually rank these items,
within each module. Today a lot of that is, after all the
routing and candidate generation happens, it's all
based on historical data for the most part.
Kevin Perko: Our best approximation is to try to understand how
those interactions correlate with retention. For our
business, that's what we want to optimize for right
now. One of the data scientists said, "You know, this
previous success metric, we did really great work on it,
and I think we can make it better." They kind of
mapped out the project, what should that be. They did
like the whole analysis, they presented to the team,
they got a bunch of feedback, they continued to
improve the success metric, and they're continuing
now to get it into production. Once they do that, then
the next step will be to retrain the GBM, so that we
can actually see, is this better, because obviously it's
easy to say, "Offline looks better." Like we've reduced
our main [inaudible 00:20:24], but that doesn't really
mean anything if we didn't make the user experience
better.
Kevin Perko: That's kind of why I say there's not really a typical
project, but this would be a good representation of
like, okay, there are some clear variables that you
want to optimize for. Maybe somebody is giving you
the project, or you're creating it, and then you need to
kind of go down, break them down, figure out how to
represent them. It's a pretty collaborative environment,
so you're going to go present. You're definitely going to
take some feedback. I think that always sort of
hardens the project, gets you to question your
assumptions, and then you've got to go and write the
code to get it shipped out, so we can actually use it in
the product.
Kirill Eremenko: Okay, got you. GBM is Gradient Boosting Machine. Is
that right?
Kevin Perko: Right.
Kirill Eremenko: Why do you use a GBM in this specific example? Are
there any reasons for that?
Kevin Perko: You know I would say honestly there's not a great
reason. We sort of inherited this model. There was a
previous search team, and the model was in place, we
trained it, it worked the best for doing this. Kind of our
bigger contributions have been to improve the success
metric that it gets trained against.
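
For readers who want to see the shape of this idea in code, here is a minimal sketch of re-ranking candidates with a gradient boosting machine. It is not Scribd's model. The two features, the success metric values, and every name in it are hypothetical, and the boosting itself is a toy version built from one-split trees (stumps) fitted to squared-error residuals.

```python
from dataclasses import dataclass


@dataclass
class Stump:
    """A one-split regression tree."""
    feature: int
    threshold: float
    left: float   # value predicted when x[feature] <= threshold
    right: float  # value predicted when x[feature] > threshold

    def predict(self, x):
        return self.left if x[self.feature] <= self.threshold else self.right


def fit_stump(X, residuals):
    """Pick the single split that minimises squared error on the residuals."""
    best, best_err = None, float("inf")
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[f] <= t]
            right = [r for row, r in zip(X, residuals) if row[f] > t]
            if not left or not right:
                continue  # a split must put rows on both sides
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if err < best_err:
                best, best_err = Stump(f, t, lm, rm), err
    return best


def fit_gbm(X, y, n_rounds=20, learning_rate=0.3):
    """Each round fits a stump to what the current ensemble still gets wrong."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        stumps.append(stump)
        preds = [p + learning_rate * stump.predict(row)
                 for p, row in zip(preds, X)]
    return base, stumps, learning_rate


def score(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(s.predict(x) for s in stumps)


def rerank(model, candidates):
    """Order the candidates in one module by predicted success, best first."""
    return sorted(candidates,
                  key=lambda c: score(model, c["features"]), reverse=True)


# Hypothetical history: [past click rate, past completion rate] per item,
# with a made-up success metric standing in for "correlates with retention".
X = [[0.1, 0.2], [0.2, 0.1], [0.5, 0.5], [0.8, 0.7], [0.9, 0.9]]
y = [0.0, 0.0, 0.4, 0.8, 1.0]
model = fit_gbm(X, y)

candidates = [
    {"id": "a", "features": [0.15, 0.15]},
    {"id": "b", "features": [0.85, 0.80]},
    {"id": "c", "features": [0.50, 0.50]},
]
ranked = rerank(model, candidates)
print([c["id"] for c in ranked])
```

In this framing, changing the success metric means changing `y` and retraining, while the ranking code stays the same. Whether the new ranking is actually better for users still has to be checked online, as described above.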
Kirill Eremenko: Okay, gotcha. With a team of 13 data scientists, including
yourself, do you find that you have multiple projects
going on at the same time? How many projects is the
team involved in approximately?
Kevin Perko: There's so many people, it's hard for me to even pull
that number out of the air. There is a lot going on at
any one time.
Kirill Eremenko: How do you keep track of everything?
Kevin Perko: Yeah, that's a great question. We have a couple
support structures for that. We have squads. People
that are working on product facing squads, they have
somebody that they're working with, like a product
manager, and a technical project manager, who are
working on what's the task flow, what are we shipping,
what are the deadlines on all that kind of stuff. We get
some similar apparatus on the search and
recommendations teams, so that I don't have to be
responsible for all of it, because it's too much for one
person to make sure everything is on track, and all the
deadlines are being met. That really helps a lot.
Kevin Perko: The other thing is to just ... beyond individual projects,
is having higher level goals that you're ... or higher
level targets for the quarter that you want to move
towards. Those are easier to check in on rather than a
specific, okay, did we analyze this test, did we learn
from this test.
Kirill Eremenko: Okay, gotcha. You probably have like managers in the
team as well who take on some of the responsibility
that then report to you?
Kevin Perko: Right, right. We've got a manager out in Toronto.
Kirill Eremenko: Gotcha. Okay, okay. Very interesting. You mentioned
you guys are hiring at this stage, so if anybody
listening is interested, what's the best way for them to
apply?
Kevin Perko: Yes, we're hiring in San Francisco and Toronto, and
the best way to apply is to go to the jobs page, I would
say. Just in the cover letter mention that you listened
to the podcast, and I'll see that. I actually review all of
the applications that come in. I'm very passionate
about hiring the best people I possibly can. I'm
reviewing all the applications that come in. I can kind
of take more risks, and really see if somebody is
showing something that someone else who was looking
for a very specific profile, may not be able to pick up
on.
Kirill Eremenko: Nice. By listening to this podcast obviously people are
already ahead of the game.
Kevin Perko: Exactly.
Kirill Eremenko: Okay, cool. Well, thank you for that, and guys, girls,
everybody, ladies and gentlemen listening, if you're
interested in working in Toronto or San Francisco, make
sure to check out Scribd. Let's shift gears a little bit.
You're coming, which is very exciting, I'm excited to
announce this to our listeners, you're coming to
DataScienceGO this September, 2019, on the 27th, 28th,
29th of September, and you're doing a Keynote. Super
pumped about that. Congrats. I can't wait to hear your
Keynote and to meet you in person over there.
Kevin Perko: Thank you, thank you. I'm definitely looking forward to
that. This will be my first Keynote, so it's a very
exciting experience for me as well.
Kirill Eremenko: That's awesome. Tell us what is this Keynote going to
be about? Can you give us like a quick, I don't know,
maybe preview or some spoilers about what you're
going to be talking about?
Kevin Perko: Well, I can't give any spoilers of course. In terms of a
preview, what I'm going to focus on is kind of two
things. I want to get people generally excited about
what's happening in data science, as well as how
that's intersecting with what we're doing in Scribd. I
think one of the best ways I can do that is to talk
about an initiative we have internally, around learning
how to represent our content better, which we're
calling the Book Genome. That's really obviously
taking from like the Music Genome from Pandora,
from way back, and applying that to books. It's scaled
what some companies have done. I don't know if
anybody's used the term, book genome, but we really
want to think about how we represent our content.
Kevin Perko: I want to talk about how we're doing, how that's going
to enable really amazing things for our users, and for
data science in general, as well as how that intersects
with like a curiosity culture. Eric Colson at Stitch Fix
has written lots of very good articles on this,
and I really am trying to bring that into my team, into
my organization, and intersect these things, because
there's so much opportunity in data science, that
there's no way that top down you can see all the
opportunities and correctly allocate all the resources.
Kevin Perko: You want people on the ground, being curious, asking
questions, saying, "Hey, I actually have a couple extra
hours, and I'm going to see if this variable is correlated
with this variable, or if I can map this out with a
regression, or a neural network, or whatever it
happens to be, and if we can learn something new,
and I really believe that that'll add much more value to
the business than us trying to pick the best projects
every single time."
Kirill Eremenko: All right. I haven't heard of this music genome project.
Can you tell us, what is the end goal of the book
genome? What does it look like?
Kevin Perko: The end goal is for us to really understand books on a
deep level. When you talk about a book, you talk about
books that you enjoy. You say things like, "It moved
really slow," or maybe it was really dense, like very ...
when I say [inaudible 00:26:51] Slavic words, lots of
technical jargon going on. You don't necessarily say,
like, "Okay, well, it was like a front list book." For
anybody who's not familiar with that, that's a book
that's come out in the last year. Publishers, they care
a lot about that. That's where a lot of their money
comes from. We think about a lot of this internally in
all of these things, but readers don't think about that
necessarily.
Kevin Perko: They're thinking about, "I'm reading a book that people
are talking about. I'm reading a book that is relevant
in the media, or that my friends recommended, or
that's a murder mystery and I love murder mysteries.
It has these elements that I like." We want to take
those, when people are saying these kind of
ambiguous and vague words, these elements that I
like, well, what are those elements, is it dystopia. 1984
is definitely a dystopia, so if you read that, what are
you interested in learning. If we can represent dystopia
as an embedding, how can we relate that to other
books, and then understand that you're not just going
to read dystopias. You'll have a very depressed outlook
on the world if you do that.
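
To make "represent dystopia as an embedding" concrete, here is a small sketch that relates a keyword vector to book vectors with cosine similarity. The three-dimensional vectors are toy values invented for illustration, not learned embeddings, and the function and variable names are assumptions rather than anything Scribd has described.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, invented for this example. A real system would learn them
# from text and metadata and use far more dimensions.
keyword_embeddings = {"dystopia": [0.9, 0.1, 0.0]}
book_embeddings = {
    "1984": [0.8, 0.2, 0.1],
    "space_opera": [0.5, 0.8, 0.1],
    "cookbook": [0.0, 0.1, 0.9],
}


def rank_books_by_keyword(keyword, keywords, books):
    """Return (title, similarity) pairs, most similar to the keyword first."""
    query = keywords[keyword]
    scored = [(title, cosine_similarity(query, vec))
              for title, vec in books.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_books_by_keyword("dystopia", keyword_embeddings, book_embeddings)
for title, similarity in ranking:
    print(f"{title}: {similarity:.2f}")
```

A ranking like this only finds the most similar items. The serendipity discussed next would need something on top of it, for example mixing in titles from further down the list.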
Kevin Perko: That's just like a ... not a thing, because lots of
recommender systems, they want to find similar
items, but we need to introduce this serendipity. It's
really going to become like a sequence type model,
because people, even if you read a data science book
and you're getting into data science, you don't only
read data science books, because that again will kind
of drain your brain power there. You have to sort of
recharge with something else, whether it's a biography,
or a science fiction book. When you read those, not
only do they kind of go together in a sequence, but you
have specific elements you like about your science
fiction books.
Kevin Perko: To you it's less about science fiction, and maybe it's
more about dystopia plus science fiction, plus a
futuristic setting. We want to be able to represent that
in words that we can both share with our readers, on
why they were recommended this book, and what we
know about this book, and to help them find other
books. Whereas today you may browse by genre,
perhaps in the future you could browse by something
more stylistic like books set in London, or fast-paced
books, or easy reads for the weekend.
Kirill Eremenko: Gotcha. For instance, like your example with science
fiction, somebody might be interested in like they're
picking up science fiction book after science fiction
book, but really deep down inside what they like might
be a certain type of character, like the lead character
has a certain background, or they are passionate
about certain things, or the manner that they ... how
they are heroic, or things like that. Really, the reader
might not even know this about themselves. They just
happen to be picking up these books, and liking them
based on other people's recommendations. You can't
really express that in words.
Kirill Eremenko: I guess what I was going to ask is, are you going to
look for this information from people? Are you going to
get people to complete a quick survey after they finish
a book, what did they like about it? Or are you going
to have natural language processing, some
AI, or machine learning, that's going to go through the
book, and actually look for these gems, or these
parameters inside, autonomously?
Kevin Perko: Right, right. The current approach that we're thinking
is given that we get a lot of good publisher data, we'll
start to build it. This includes some kind of human
curated keywords, like dystopia, that's associated with
1984. We can start to train on those words, and kind
of build, and understand how that represents across,
we'll call them words, because we want to kind of get it
more into a tree. It's much more of a graph system. We
don't want to think of it as a flat system. Dystopia has
a relationship to the environment, and cooking, so it's
not very related to cooking, but if you just have a flat
group bank of words, it doesn't really mean anything,
but when you start putting them in a graph, and it's a
little bit more directed, then oh, you can see cooking is
way over here, and you've got your werewolf romance
way over here, and those things aren't really related.
Kevin Perko: Actually your dystopia which could kind of go either
way, is maybe much closer to this hypothetical
werewolf romance, for whatever reason. Being able to
understand those things is much more valuable,
because that's how people think about the books.
They're not putting these hard boundaries on them,
like we tend to do when we model them out. We're like,
oh, this is cooking, or that's not that, and so they
would never want that. It's like, okay, well, the world is
a little bit more complicated and subtle than that. By
bringing this out, we'll really be able to get at the heart
of what people want.
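The difference between a flat word bank and a tag graph can be sketched in a few lines. This is only an illustration, not Scribd's system: the tags, the edges in `TAG_GRAPH`, and the `tag_distance` helper are all hypothetical, and the graph is treated as undirected for simplicity even though the discussion hints at a more directed structure.

```python
from collections import deque

# Hypothetical tag graph: an edge means two tags were judged to be related.
TAG_GRAPH = {
    "dystopia": {"science fiction", "dark themes"},
    "science fiction": {"dystopia"},
    "dark themes": {"dystopia", "werewolf romance"},
    "werewolf romance": {"dark themes", "romance"},
    "romance": {"werewolf romance"},
    "cooking": {"food"},
    "food": {"cooking"},
}

def tag_distance(graph, start, goal):
    """Return the number of edges on the shortest path, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    # Breadth-first search visits tags in order of increasing distance.
    while queue:
        tag, dist = queue.popleft()
        for neighbor in graph.get(tag, ()):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

print(tag_distance(TAG_GRAPH, "dystopia", "werewolf romance"))  # 2
print(tag_distance(TAG_GRAPH, "dystopia", "cooking"))           # None
```

In a flat list every tag is equally far from every other tag. Here dystopia sits two hops from the hypothetical werewolf romance and has no path to cooking at all, which is the kind of relationship a recommender could use.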
Kevin Perko: I think you kind of brought it up, it's going to be a two
step process. We're going to be bootstrapping it. We
haven't planned on doing a survey, but that's a great
idea. Honestly, I might steal that.
Kirill Eremenko: Sure.
Kevin Perko: Because like you're saying, we don't necessarily have
the language to represent the things that we want to
today, so we're going to have to go figure out what
that's going to look like. It makes a lot of sense it'll be
a collaboration with the data we get, the data we're
able to acquire, how we're able to learn things
internally as well as what our users tell us.
Kirill Eremenko: Gotcha. Is it going to be similar to the Netflix
recommender system?
Kevin Perko: I would say, "No." At a high level all recommender
systems have this ... they share similarities. Given that
we're in the process of building, and I wouldn't really
be able to say, I think that the bigger goal of extracting
the metadata, and learning how to represent it, that's
very similar to what Netflix did. I think they actually
had like rooms of people watching movies at one point,
like labeling them. We're not there yet to have rooms of
people reading books. It also takes a lot longer time, so
I'm not sure if that's feasible. We're going to continue
to try to increase our sophistication, so yes, I'm sure
we'll be using similar methods that Netflix has
pioneered.
Kirill Eremenko: Okay, very interesting. Yeah. It looks like you're going
to have a lot of algorithms that you're going to be
trying out. What's your view on that? How's your
approach going to be? Which model, which algorithm
is going to be the best? Are you just going to try out a
lot of things, or do you already have some things in
mind?
Kevin Perko: Yeah, that's a great question. I feel like I sort of have
two views. One is that I'm agnostic. If you use TF-IDF,
and that represents the problem, and solves it, then
you should always use the simplest tool for the job. My
second view is that a lot of the things we're seeing with
these kind of next generation language models, that's
coming out with like BERT, and [Inaudible 00:33:34],
and I haven't even had enough time to dig into them,
as much as I'd like, but I can see that their ability to
represent language is incredible, as well as OpenAI's.
Kevin Perko: A model they only released a small version of it, that
was ... I believe it was writing articles, it did too good
of a job of producing fake news basically, so they
didn't want to release the full model, but then they
understood within a certain amount of time, people
would be able to recreate it. They're just sort of buying
some time hopefully, before they unleash this thing on
the world. Which is nice to see somebody having a
thoughtfulness, that hey, this thing could actually be
used for bad things.
Kevin Perko: I think a lot of those models will definitely come in
here, because they will enable us to represent things
in really interesting ways, that we may not think
about. I think the simpler approach is nicer in the
sense that it lets you actually say, "Hey, we extracted
this part of the book, and that means this." That's
really valuable, that interpretability piece. That being
said, neural networks are starting to get that. People
are doing active research. They're starting to say,
"Okay, this is actually what it learned, this is how it
represented it, this is your pixels that it took out and
learned."
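As a rough illustration of the simpler, interpretable approach, here is a minimal TF-IDF sketch, assuming TF-IDF is the baseline meant by the "simplest tool" remark earlier in this answer. The blurbs and the `tf_idf` helper are hypothetical, and the weighting shown (term frequency times the log of inverse document frequency) is one common variant among several.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    tf = term count / document length; idf = log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        # Count each term once per document it appears in.
        doc_freq.update(set(tokens))
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical book blurbs, already lowercased and split into words.
blurbs = [
    "a dystopia ruled by surveillance".split(),
    "a weeknight cooking guide".split(),
    "a werewolf romance in london".split(),
]
scores = tf_idf(blurbs)
for term, weight in sorted(scores[0].items(), key=lambda item: -item[1]):
    print(f"{term}: {weight:.3f}")
```

Every weight traces back to a word that is actually in the text, so it is easy to say which part of a book drove a score. A word such as "a" that appears in every blurb gets a weight of zero.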
Kevin Perko: Then you start to understand, oh, this is why when we
turn a bus on its side, now it may think that it's a
zebra instead of a bus, because it just learned like two
pixels in the image, and so there's a huge risk that
when things change slightly you get very, very wrong
outcomes from these neural network type models.
That's why I like this idea of having a mix: a model we
really deeply understand, not as sophisticated, plus
something that's really pushing the edge, and they
can also act as like a check on each other. You
can sort of see when the bus is on its side, or if a book
is clearly about romance, and this is saying, "It's
science fiction," and we have people look at it and it's
like, oh, this is science fiction. Then we understand
what's going on.
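The idea of two models acting as a check on each other can be sketched as follows. Everything here is a hypothetical stand-in: `keyword_model` plays the well-understood baseline, `opaque_model` stands in for a neural model by returning a stored prediction, and the book records are invented for illustration.

```python
def keyword_model(book):
    # Interpretable baseline: the genre is decided by curated keywords.
    keywords = book["keywords"]
    if "love" in keywords:
        return "romance"
    if "spaceship" in keywords:
        return "science fiction"
    return "unknown"

def opaque_model(book):
    # Stand-in for a neural model; here it just returns a stored prediction.
    return book["neural_prediction"]

def flag_disagreements(books, simple_model, complex_model):
    """Return (title, simple label, complex label) where the models disagree."""
    flagged = []
    for book in books:
        simple_label = simple_model(book)
        complex_label = complex_model(book)
        if simple_label != complex_label:
            flagged.append((book["title"], simple_label, complex_label))
    return flagged

books = [
    {"title": "Book A", "keywords": {"love"}, "neural_prediction": "science fiction"},
    {"title": "Book B", "keywords": {"spaceship"}, "neural_prediction": "science fiction"},
]
review_queue = flag_disagreements(books, keyword_model, opaque_model)
print(review_queue)
```

Only the books where the two models disagree go to a person for review, which keeps the human effort focused on the cases most likely to be wrong.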
Kirill Eremenko: Wow. Very cool. Well, if anybody wants to find out how
this story ends, DataScienceGO 2019, end of
September, in San Diego. That's where you can catch
Kevin. I wanted to ask you, Kevin, you mentioned
neural networks. What's your view in terms of the
work you guys do ... There's a lot of ... especially in the
part of understanding the content, I'm assuming
there's a lot of working with text, and language
processing. What is your view on neural networks
versus machine learning approaches?
Kevin Perko: I think that for the most part, they complement each
other, and that really, neural networks use a lot of
machine learning. They're not these separate worlds of
things. When you're setting up a neural network,
people have kind of said it's much more like
differentiable programming. It's like a config file,
especially if you're working with Keras, you're sort of
setting up, okay, like what are my activation units,
how many layers do I want. You're deciding these
things and it's like, what are you deciding when you're
thinking about this. Okay, well you're thinking about
maybe a linear model, or a logistic model, in terms of
how you want to represent a thing.
Kevin Perko: The difference is that what you're thinking about is
one part of the model. You're not thinking about the
whole model anymore. The neural network kind of
takes all that. It adds its hidden layers, and it does
extra things that aren't really represented here, but
you're kind of guiding it, so you're more of a guide
rather than like, oh, rather than logistic regression, I
learn these features, however I learn them, and I put
them all in and it gives me something very
interpretable. Outputting probabilities, which are very
understandable and that's what the model is, versus
neural networks just trying to kind of map something
really probably non-linear, and understanding that
without ...
Kevin Perko: It's not going to give you that nice interpretability
component yet, but it uses the same I would say
mathematical approaches under the hood. Then it
kind of adds on its own layer. I think that like I was
saying, they really complement each other, and there's
no like, this is better than this. It just depends on the
use case. The truth is in industry most of the time you
don't actually need anything like neural networks. Like I
was saying, it's better to stay on the old stuff that
people have proven out, that works really well, that
you can actually communicate with, because it's really
hard to talk to somebody about neural networks given
their ... It's like, all the machine learning stuff
combined into this other box, and then put that inside
another box, and then you kind of ship that out.
Kevin Perko: Then people ask you, "Well, how did this decision get
made?," and you don't really have a good answer for
them. Whereas if you're using random forest, or
logistic or linear regression, you can say something
much more confident about, "Oh hey, this is how this
model made this policy or this decision, and I really
understand what that means, and what it's trained on.
We can debate if that's right or wrong." This is how we
go there. That doesn't exist with neural networks. That's
why I think they're a balance, when you think about
traditional machine learning techniques.
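To make the interpretability point concrete, here is a minimal logistic regression fitted with plain gradient descent. It is a sketch under stated assumptions: the two features (`finished_previous_book` and `days_since_last_visit_scaled`), the eight data rows, and the learning-rate settings are all invented for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            # Prediction error for this row: predicted probability minus label.
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

def predict(weights, bias, x):
    """Return the probability that the label is 1 for feature vector x."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Hypothetical features: [finished_previous_book, days_since_last_visit_scaled]
rows = [[1, 0.1], [1, 0.3], [1, 0.5], [1, 0.8],
        [0, 0.2], [0, 0.4], [0, 0.6], [0, 0.9]]
# Hypothetical label: 1 if the reader opened the recommended book.
labels = [1, 1, 1, 0, 0, 1, 0, 0]

weights, bias = fit_logistic(rows, labels)
for name, w in zip(["finished_previous_book", "days_since_last_visit_scaled"], weights):
    print(f"{name}: {w:+.2f}")
```

Each learned weight has a sign and a size that can be read and debated: a positive weight pushes the predicted probability up, a negative one pushes it down. A neural network offers no equally direct reading of its parameters.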
Kevin Perko: Same thing with support vector machines. Given it's a
max-margin classifier, you pretty much understand
how it's making these decisions. Whereas with
something like neural networks, you really ... That's
kind of the core thing today, you don't. I think in the
future, people are going to sort of break through that
wall and we will understand these decisions, well
enough anyway that people will get much more
confidence in the models. That's proving to be
increasingly important, as these things get
incorporated, like doing facial recognition for all sorts
of use cases. When a model's impacting sentencing
guidelines, you really want to have a lot of
interpretability behind that model.
Kevin Perko: These are things that I definitely worry about, that
people use these kinds of tools without understanding
like, oh wow, people, there is a lot of ambiguousness
between how this model is working, and there's lots of
opportunity for this to go awry, when you don't have a
good kind of interpretability, and a good transparency
layer. I think that was sort of a big thing for data
science in general is to get much better at that,
especially as data science permeates all parts of
business, and culture. People want to know, "Hey, how
did this happen? If we're going to delegate this to an
algorithm, how did it make the decision?"
Kevin Perko: In the past it was just, if we can make a good decision,
then we'll go do it. In the future it's like, if we can
make a good decision that we can explain, and people
will agree with it, we'll go do it. Sometimes we'll make
a less good, but perhaps a more societally fair decision
that people agree with. We'll have the ability to adjust
the knob and do that, whereas today we may not.
Kirill Eremenko: That's a whole explainable AI. [inaudible 00:39:56]
becoming more of a trend we're seeing that even this
year, more questions are being raised, more companies
or agencies, government agencies including, are asking
the question, "Is this explainable AI? Do we know how
it's making these decisions?," because as you
mentioned, with data science becoming more and more
part of our daily lives, and society, there's so much
that can go wrong in terms of recognition of even facial
recognition, and any kind of associated racism that
can be incorporated in that, or sexism, and when you
can explain how the model works, you can point that.
When you can't explain, then you've got a whole
different can of worms that you're going to open.
Kirill Eremenko: A lot of it also comes, especially in neural networks,
comes from labeled data. Like, the AI might be the
neural network is ... just the architecture is very
neutral, but then the data that it was labeled already
has some kind of bias, so has some sort of
discrimination in it. Then the AI learns that, and try to go
in there and make it unlearn that if you can't get ...
You don't know which neuron responds ... correlates
to which features. It's pretty insane.
Kevin Perko: Exactly. Exactly. That's a great point, that the
algorithms are just representing a bias, and when we
have bias as society, that is represented in the data
sets. The algorithms don't ... they're amoral. They
don't know that that's not the ideal outcome. They
actually think that's the outcome they're supposed to
learn and reinforce.
Kirill Eremenko: Yeah. Then you've got that whole trend. Have you seen
those images when people take like a stop sign, and
they put some stickers on it, and self driving car
doesn't recognize it as a stop sign anymore.
Kevin Perko: I have not, but that does not surprise me at all,
because I see those self driving cars around San
Francisco all the time, and they really struggle.
Kirill Eremenko: Oh wow. Where is it ... I haven't been in San Francisco
for a while. What company is that through, Uber or
self driving Ubers?
Kevin Perko: Typically what I'm seeing are the Cruise vehicles.
Kirill Eremenko: Okay. What do they do?
Kevin Perko: Cruise, I think GM bought Cruise, and-
Kirill Eremenko: Oh okay, gotcha. It's like a [inaudible 00:42:24]
transportation company.
Kevin Perko: Right, right. They have SUVs drive around San
Francisco with a ton of sensors, and they're logging in
an incredible number of miles in the city. You can see
how much they struggle at intersections, and it's like a
bike goes by, then they're like suddenly swerving, and
you're just like, technology is not right. People talk
about level five, in like 10 years. I'm like, level five is
just like, we can't even think about that. This is, these
cars are just ... they are not ready. I mean, I get it.
Urban environments are really hard, but the core thing
is you can't learn everything in advance, and I think
that's where we're just kind of pushing the current
limits of what we have with vision and AI, is that we're
trying to.
Kevin Perko: We're trying to have incredible lidar that can respond
super fast, instead of a general intelligence that
understands how to value different objects. These cars
can't do that, so they treat a cat the same as a
bicyclist, the same as a semi truck. It's just an object,
and there's no association or learning with it. Now,
I'm sure that's changing. I think that's kind of the key
problem, is until you do that, then you're going to
react the same to a cat, or a squirrel, that you are
going to react to a semi truck, which is a problem.
Kevin Perko: The other thing is if you just had like a whole network
and it was all autonomous, then you'd be kind of fine.
The machines could do weird things, but you'd figure
out how to solve that. When you're interacting those
with humans, and the machines don't have a way of
relating to the humans, then you get all these new
problems. My favorite one was they had to make the
driving system more aggressive at intersections in
California, because we all do the rolling stop,
especially in San Francisco. The car would just sit
there waiting for its turn to go, and it would never go,
because there was never a point where all four cars
came to a 100% complete stop.
Kirill Eremenko: Okay, gotcha. Okay, yeah, okay, because the rules are
kind of different. It's following strict rules, whereas
humans are more flexible with the rules I guess.
Kevin Perko: Right. We think about the spirit of the rule, are we
causing harm, and try to interpret that within the
context of the situation, like is it sunny, or raining, or
am I surrounded by bikers or little kids, whereas
literally the machines, they don't have any of that
context. They're just like, this is the rule. If the speed
limit is this, and it says this, then I do this.
Kirill Eremenko: Yeah, wow. Okay, very, very interesting observation.
Must be pretty scary dodging these cars.
Kevin Perko: It is, it is. Sometimes it concerns me to think that they
are actually going to try to have that ready to go. I
think that they do have some in ... maybe it's in
Arizona, but it's on kind of like a closed track, where
they know exactly what the variables are going to be,
and that works fine. It's just urban environments are
really hard, even for human drivers who have a lot of
experience. They're very challenging. For machines,
they're incredibly difficult, because the number of
things you have to learn each second, it changes every
second.
Kirill Eremenko: Yeah. Well, technology, data, it's interesting to see how
they are coming. Data is becoming more and more
recognized as something that's driving business, and
these two things, technology and data, are coming
closer together. They've always been propelling one
another, but now we're trying to use data everywhere
where we can, and technology as well. Then what I
notice about your background is that it looks like
you've changed careers very consciously I would say,
that you've selected different companies, or different
roles, in data science to work, but they've never been
along the same line. Let's say, developing self driving
cars, or in the case of Scribd, like working with
recommender system, or understanding content.
Kirill Eremenko: It feels like you've moved around the space quite a bit.
Can you comment on that? Why these choices of roles
and careers? Were you searching for something? Did
you consciously decide on what you want to learn next
before progressing further?
Kevin Perko: Right. I think it's easy to look back historically and see
a narrative. I'll say at the time it was really kind of like
an exploration, give it much more like a gradient
descent. I'm taking these steps, some of them good,
some of them not as good, and just learning, and
gathering more information. They're all really valuable
steps, because now I know if I'm walking up the hill, or
down the hill. What they've kind of given me in
aggregate is this really unique view of all of the
different parts of the system, in terms of how
companies actually can use data science, how we
think about this idea of a full stack data scientist, kind
of comes from my past experience of seeing, well okay,
somebody can't ingest this data right, then there's no
data science.
Kevin Perko: If you don't have good data that's clean, then you
spend all your time doing that and so you spend very
little time applying it to a model. These are the kind of
the key systems of like, oh, if you can't deploy your
model, then you're just beholden to another group,
and you're not like a data science business unit.
You're not shipping product, you're really more of a
support function if you're constantly bound by
somebody else, to go put the thing that you made into
a product. That really limits your scope and your
ability.
Kevin Perko: That's kind of what I've seen across my experience
across all these organizations, getting to see how
different organizations treat data science. It's really
kind of a key thing, that you have an organization that
the executives believe in data science. They believe
that you can use experimentation and machine
learning, not just to make their product better, but to
be the product. That's something I very much see has
to come from the top. When it does, it makes your life
much, much easier, and the company is on board, and
you're pushing the edge more than just trying to say,
"This is why we should exist."
Kevin Perko: Kind of having my experience in hindsight has given
me a lot of these really unique perspectives. Going
forward as I build it, I just thought this is a really
interesting opportunity, let's try this, let's try this. I
didn't really see how it was going to connect. Looking
back, I can kind of see that it's been a really nice
connection by working these different companies,
seeing different approaches, how all this works
together, seeing different organizational structures
where you have it really split up, where data science
doesn't have access to any systems, and how limiting
and suboptimal that is, for a data science group.
Kevin Perko: To have those restrictions, whereas if you think about
the other side, of well, what if they have engineers with
data scientists, and they're shipping product. That is
really where you want to be for every data science
team, because then you get really to this true full
stack data science org, that can ship product, that can
support change, that can do whatever it needs to do
within the business, rather than having something
that's very kind of boxed in, into its very specific niche.
Kevin Perko: It does that and maybe it does it really well, and
creates a ton of value for the business, but in my
opinion it's always going to be suboptimal to structure
it that way.
Kirill Eremenko: How would you advise somebody who is looking from
without an organization? From externally, and maybe
looking for a job, or looking to move into that
organization, change career, how would you
recommend for a person like that to determine the
answer to that question? Is data science seen by the
executive as a product or not, because when you're
inside it might be quite obvious, but when you're
outside, and you're trying to understand if this is the
right company for you to work in, it might be difficult
to see.
Kevin Perko: That's a great question. I think that it is always going
to be difficult to judge something like that from the
outside. What you can do are like little ... You have to
look for signals, kind of build your own pattern
recognition system, and ask questions, really simple
things like, do they have a blog, does it get updated, is
the company ... are any executives talking about data
science or machine learning, any public interviews
ever, do they have maybe a chief data officer, or a VP
of data science. If you're able to talk to people, if you're
in the interview process and you're talking to someone
who's maybe director, executive level, what do they
think about data science, how do they think it's
driving the business, and really listen to how they
answer that question, and what they say.
Kevin Perko: Do they have a vision? Have they thought about it at
all? Or is this like, we don't know, we want you to
come in and do it, and we're open. How they answer
those questions will tell you a lot about how the
organization views it. Most people will be pretty honest
there and say, "Okay, we really think that this will
help us increase our lead generation by 3X for our
business, if we're B2B SAAS, and that's more money,
and that's how we see it. That's the end of our data
science at the company."
Kevin Perko: Then you can make your own decision, once you get
that. I think it's really kind of being able to talk to
somebody from a more senior leadership position, and
getting good answers on, have they thought about this
deeply, and they actually believe in it, or they see
everybody, and they just want to hire, because it's
usually pretty clear, when somebody's trying to hire a
data scientist, because they think they should have a
data scientist, and yet they have no idea what the data
scientist will do. They won't actually be able to tell you
what any of the projects are, or any of the vision is for
data science in that job unit, or what have you.
Kevin Perko: I think those are kind of the key signals. You can kind
of start parsing out. You can also just sort of ask
people how the teams are organized, is it in
engineering, which might be really important at a
smaller company, is it in product, is it in marketing, is
it in finance. I've seen all of these structures. They
mean really different things for the data science group.
Is the data science group totally decentralized and
everybody's embedded within a specific team, that's a
really different data science experience rather than
joining a data science team, and then working within
different areas of the business.
Kevin Perko: I think all of these things are areas you can look for,
and questions you can ask to try to suss that out.
Kirill Eremenko: I love what you mentioned about the decentralized
embedded data science team, where you've got data
scientists, or machine learning engineers, in different
functions of the business, versus a stand alone data
science team, something like what you have at Scribd.
What would you say are the advantages,
disadvantages of either of the approaches?
Kevin Perko: Right. At Scribd we would have something
approaching a hybrid model of this. I think that ... The
advantage of having a core data science team is that it
really has to think of itself as a business unit, and go
around, and connect itself to the business, and
understand what the priorities are, and where it can
drive value, and what opportunities exist. Then can
kind of track those out into near term, medium term,
long term initiatives, whereas your long term initiative
is like trying to ship really exciting state of the art
products, and then short term is something very
clearly defined.
Kevin Perko: You're working with GBM you can re-rank something
better, or represent something better with some vision
that's already been solved, using a pre-trained model,
and you know you can ship that in a month, and help
the business in this way. The kind of key thing right
there is you have to align yourself really tightly with
the business. When you're embedded, it's really easy
to say, "Okay, well a product manager brought a road
map, or somebody brought a road map, and we're
executing on it. You told me to build this algorithm,
and so I'm going to go build it."
Kevin Perko: We have a recommendation system, and we're going to
try to make it three percent better, rather than asking
if we even have the right system, and then taking three
to six months to rebuild it, which is what you're going
to get from the business unit approach. Whereas the
embedded approach is much more likely to be
iterative. There's going to be other factors in there, but
that's sort of what I've seen, is that it drives this
iterative approach, which makes it hard to make
bigger gains. It's certainly valuable for the business to
have iterative gains in the near term. However, it kind
of limits your ability longer term to sort of go after
bigger opportunities [crosstalk 00:54:38].
Kirill Eremenko: When you say you have a hybrid model, what do you
mean by that?
Kevin Perko: Right. When we have a hybrid model at Scribd, we
have data scientists that are embedded on product
facing squads, as well as search and recommendations.
They work with those squads really tightly. Those
squads have road maps. They are doing some of the
iterative thing, and what we're doing now is to really
pair that more with like, well, let's drive road map, let's
think how we can kind of reimagine the system instead
of just making an existing system we inherited a little
bit better. Maybe we can actually make it a lot better,
however we're still working within the constraints of
that system, without really deeply questioning if that
system should exist. Which, like I said, is just a
function of being embedded.
Kevin Perko: In Toronto, we're really focusing on the more business
unit type approach. I'm going to bring that approach to
San Francisco as well, so we're really thinking about
how to reimagine the system, in addition to driving
iterative improvements.
Kirill Eremenko: Okay. Very useful information, especially for business
owners, or executives listening to this. Would you say
there is kind of like a threshold, when a company
should maybe for instance as a smaller firm, a smaller
organization, a start up start with embedded
approach, and then at some threshold, switch over to
the core data science team approach, or the stand
alone data science team approach? Would you say
there's a time in the life of any business when that
should happen, or this really depends on the type of
business nature of the industry?
Kevin Perko: It depends on the business. My personal view is that
going to the business unit sooner is always going to be
better. The trade off with that is you don't get the
really embedded focus that that brings. If you're trying
to ship something, say you're a start up, and you need
to raise your next [inaudible 00:56:41], you need to hit
very specific goals in the next six months. The
embedded unit can really help you align everybody,
really clearly, assuming you already know it needs to
happen. When you're in much more of the greenfield
space, the embedded units, it's harder for them to
deliver that kind of work, especially from the data
science side, because data scientists really kind of
shine when they're working with other data scientists
consistently, and bouncing their ideas off, and
thinking about things.
Kevin Perko: They're like outside of the bounds of what people are
envisioning of the next version. That's where it just
becomes, you could have a product manager who
totally gets data science and they can do that in that
model, and it works, and you don't actually need this.
You can get the same gains that you would get. What
I've seen practically is that there's not very many
people like that. Depending on your organization, and
what you're solving for, there is kind of like a point
where you want to think, okay, when ... am I getting
enough out of the data science team. If not, the kind of
the business unit approach and now it's important to
pair that approach with a full stack approach. It's data
scientists, engineers, whoever they need to ship their
product.
Kevin Perko: Maybe in your company it's front end engineers and
designers. They should have it, and they should be
accountable just like a product org, and run it the
same way. It's no longer a support function. It's now a
unit shipping product that's driving your company
forward, and you can't have them sort of constrained
by other parts of the organization, because then you're
not going to really get to see what they can do. I think
that it's simply a trade off for the business, depending
on what you want to achieve. It's not like, oh, you have
to do this, or you have to do this. It really depends on
the goals of the business.
Kevin Perko: I'm always going to say that the business unit is going
to be really more powerful. Longer term it's going to
create more value. I feel very strongly about that. I
think though in the short term, and in the medium
term, that can be very iffy. If those are really where the
business is focused, they can have different ways of
approaching that.
Kirill Eremenko: Okay, gotcha. Thank you for that overview. I got like a
question, a philosophical question for you, where do
you think the field of data science is going, and what
should our listeners prepare for to be ready for the
future that's coming in the next three to five years?
Kevin Perko: I think we talked about this a little bit earlier, where
data science is starting to pervade every part of our
daily lives, and so people are now asking these big
questions about, hey, how does it impact my privacy,
how did the model make this decision. I think privacy
and interpretability are going to become increasingly
important. I think you see this a little bit with Android
and iOS, and you can do some on device training, or
serving, depending on how you set it up, that can
really actually drive user privacy, and machine
learning. Those two things used to be opposed. Now
they can be united. I think privacy is generally
becoming a big worldwide thing as people realize the
value of the data, and the value of their privacy that
they've just kind of given over to corporations and
governments, so they want it back.
Kevin Perko: I don't think that's going away. You have things like
the blockchain, which is high level, sort of a universal
trust and verification system. It's really exciting to think
how can data science intersect with that, can we
actually write contracts with Ethereum, that are socially
enforceable, and build models, and have all of these
sort of units served where we have general ledgers of
trust, and where does data science play in that, like
how can we think about what kind of society we want to have,
and what data science can enable within that. These
are really big questions for us to ask, because I think
the models, it's sort of the both, they're already there
and the incredible things they can do, and they're
really far away in the things that we think that get
hyped a lot, like actually having autonomy in self
driving cars.
Kevin Perko: Computer vision is still very, very early. I think that it's
going to get deployed in a lot more situations where it's
actually making decisions for classifying people, where
it's probably not ready. That's just going to happen.
The best thing that we can do is to really push the
interpretability, so people can say, "Oh, it's kind of
clear this algorithm isn't ready, but we can pair it with
humans." That's what a lot of businesses that use AI,
do. They pair it with huge amounts of people labeling
the data, and evaluating the decisions the model
made, and understanding if it's right. We need to
continue to do that same thing as it gets out into
society in general.
Kevin Perko: Everybody needs to be able to evaluate a model, and
understand if the decision it made based on this
information is reasonable, and have debates about it,
as it comes into society. I think that's real exciting,
because people are now building ... You have
[inaudible 01:01:20] processing units, but this
computer is specifically dedicated for serving and in
some cases training models, and that's real exciting,
because most of the limit, I think, on machine
learning and neural networks, and general AI,
has really been on the compute; this is like pushing it
back to the algorithm. Then you see once that
happens, every kind of six months people are sort of
pushing the state of the art, and that's going to
continue to happen, as long as we don't run into
another compute wall.
Kevin Perko: I think the future can be sort of whatever we make it.
It can be a dystopian, 1984-type situation, where we're
all getting bound by this facial recognition that we
don't know how it works, and the government's using
it, or we can create this real incredible future where we
can be revolutionizing how food is grown, and how
water gets preserved, and how we're tackling climate
change, and data science can move into all of these
fields, and it should, and it can help. We can help
people understand what's actually behind all these
decisions, and make better allocations of our
resources, using data science models, and using a lot
of models that already exist today.
Kevin Perko: It's kind of getting them into government, getting them
into these really large companies that move really
slowly. That's sort of a really big piece, is kind of the
pervasiveness as much as pushing the state of the art
of data science. That's really exciting work, can open
up new implications and new technologies, and new
products. I think that there's also a lot of gains to be
made on just increasing the pervasiveness of data
science among existing industries like schools, and
governments. That can have a very large positive
effect.
Kirill Eremenko: Gotcha. It seems like we've gone full circle here on the
podcast, that we came back to where we started from,
that applied data science is kind of the answer, don't
just learn data science for the sake of learning data
science, but see what impact you can make in the
world, whether it's through various industries, and
exciting projects, or it is through bringing data science
to government, and society, in a very understandable,
secure way, that respects people's privacy.
Kevin Perko: Absolutely. I think that's a great summary, because
you can solve a lot of problems with regressions, better
than they're being solved today, and people can
understand those decisions, and can actually improve
the world doing that, which is really exciting.
Kirill Eremenko: Fantastic. Well, thank you, Kevin. This brings us to
the end of today's episode. Before I let you go, what's
the best way for people to contact you, get in touch,
follow your career, learn more about what you're
doing?
Kevin Perko: People can follow me on Twitter, at croatiankp. We've
got a data science blog, Scribd data science and
engineering blog on Medium. Obviously there's
LinkedIn, feel free to follow me there, although I don't
post very much material on LinkedIn. I think those are
all great places.
Kirill Eremenko: Nice job. Obviously people can apply for positions that
you're looking to fill on the Scribd website, right, you
said?
Kevin Perko: Right, right. You can go to Scribd.com/jobs, and we
have some data science openings, you can apply there
as well.
Kirill Eremenko: Fantastic. Well, we'll share all those links in the show
notes. Make sure, guys, and everybody listening to get
in touch with Kevin, follow Kevin. Kevin, one more
question for you before we finish up, what's a book
that you can recommend to our listeners, that will
help them in their careers, or in life?
Kevin Perko: I recently read Bad Blood, which is about the Theranos
founder, Elizabeth Holmes, and I think it's a really
incredible book, because it sort of shows this
intersection of building a future, and how you can
kind of go over the line with that. You get kind of
caught up in your own, you go in your own potential
too much, building the future's actually really hard.
When you're dealing with something like health care, if
you get caught up in those things, you can create very
bad outcomes for people. It's kind of a good sort of
message for data scientists, of like, we can take this
incredible tool we have and use it for bad, or we can
kind of say, "How do we leverage this thing," and really
kind of think about how we drive new, amazing
systems, and strengthen the world in a better way,
using it.
Kirill Eremenko: Yeah, I actually watched a documentary about that on
the plane recently, and indeed, extremely interesting
and very educational story for anybody in technology
and data science, that the things that as you said,
could be used for good or for bad, and even trying to
use it for good you can get really caught up in the
promise that it has, that technology. Sometimes we're
not there yet, like with the whole self-driving cars.
Right? We need to navigate our way to get there first.
Kevin Perko: Exactly, exactly.
Kirill Eremenko: Gotcha. Okay, well, Kevin, thanks so much. Looking
forward to seeing you in person at DataScienceGO. All
right.
Kevin Perko: Absolutely. I can't wait either.
Kirill Eremenko: There you have it, ladies and gentlemen. That was
Kevin Perko, Head of Data Science at Scribd. Thank
you so much for joining us for this conversation today.
I hope you enjoyed the chat that we had, and probably
for me, one of the favorite parts was what Kevin
mentioned about the different types of data science
teams that you can have. You can have a decentralized
team where all your data scientists or machine learning
experts are embedded within the different divisions of
your business, or you can have a centralized team of
data scientists, a standalone core data science team.
There are advantages and disadvantages to both, but
it's important to understand that it is a conscious
decision on how a business should do that.
Kirill Eremenko: If you're a business owner, or entrepreneur, that's
something to think about. If you're a data scientist,
that's also something to think about in the sense
like, how does your business do it at the moment, or
how does the business that you're applying for do it.
That's a question that you might want to ask at an
interview, to understand better what your role is going
to be about. If you enjoyed this conversation with
Kevin, I am 100% sure you're going to enjoy his
Keynote at DataScienceGO 2019. If you haven't
gotten your tickets yet, head on over to
www.datasciencego.com, and join us this September
27th, 28th, 29th, in San Diego. Wonderful city,
wonderful conference.
Kirill Eremenko: Get to network with Kevin, lots of other amazing,
insightful speakers. We have over 30 speakers
attending, and of course we're going to have between
600 and 800 data scientists coming to
DataScienceGO. You don't want to miss this
opportunity to expand your network. We had people fly
all the way from Brazil on 27 hour flights, on 20 plus
hour flights from Europe in the previous years, so
distance is not an excuse. I look forward to seeing you
at DataScienceGO, and networking with you
personally.
Kirill Eremenko: On that note, thank you so much for being here today,
and I'll see you next time. Until then, happy analyzing.