Copyright © 2016 Splunk Inc.
A Mathematical Perspective on Data Science
Dr. Tom LaGatta, Staff Sales Engineer (previously Staff Data Scientist)
About Me
• Math PhD from University of Arizona
– "Geodesics of Random Riemannian Metrics" w/ Janek Wehr
– Probability + Differential Geometry + Functional Analysis
• Postdoc at Courant Institute @ NYU
– Finished Geodesics paper, published in Communications in Math. Physics
– Collaborated with Political Scientists on heterogeneous voting behavior
• Was: Staff Data Scientist at Splunk
– Helped customers with advanced use cases in Business Analytics, Internet of Things, Machine Learning, Data Science
• Now: Staff Sales Engineer at Splunk
– Helping big big customers solve big big business problems
Abstract
● As with all things, the process of analyzing data admits a mathematical description. As a mathematician-turned-data-scientist, I will describe my approach to problem solving, and attempt to loosely formalize "stakeholders", "use cases", "data" and "deliverables" in mathematical language for the enjoyment of this mostly academic audience. In particular, I will describe how query languages are inherently functional, acting as functional transformations of Data into Data, which obey the usual functional composition law. The process of analyzing data results in an iterative sequence of queries, converging to a final query which is satisfactory to the use case. These queries are then organized into deliverables, which can be "dashboards" (web pages with visualizations) or "data products" (with scheduled jobs & analyses running in the background). When this process is done right, it results in the extraction of "value" for stakeholders, which can be measured tangibly in terms of revenue, costs or risk metrics. Sometimes this has a fancy name like "data science", but more often than not, it is just the normal operational work of a good data-savvy IT, Security, Tech or Business department in modern enterprises and governmental agencies. There will be no proofs, but I would be very interested to discuss rigorous approaches to social organization & problem solving after the talk.
Agenda
• Basic Definitions:
– Data, Stakeholders, Use Cases, Deliverables
• Query languages as functional programming
– Every query is a map f: Data -> Data
– Example problem solving
• Putting It All Together: Doing Data Science
– Emphasize actionable insights
– Tie it together to deliver "value" to stakeholders
Basic Definitions
What is Data?
• "Data" is any informational artifact of real-world phenomena
• A "metric" or "KPI" is any aggregate function of low-level data
• Examples:
– Semi-structured timestamped events/metrics
– Structured relational data (rows & columns)
– Graph data (nodes & edges)
– "Unstructured" data (images, video, text)
• How to model data:
– Events: marked point processes & timeseries (Skorokhod space)
– Relational schema: categories (see David Spivak's work)
– Other data: depends on the use case, might need new data structures to represent it (incl. vectors, graphs, etc.)
What is a Stakeholder?
• A "stakeholder" is a person, group or organization who is invested in the outcome of an initiative.
• Example: A company buys software
– Stakeholder orgs are IT & the Business
– Individual stakeholders include Individual Contributors, Managers, Directors & Executives.
• IT stakeholders have performance metrics (num. outages, mean time to resolution, etc.)
• Business stakeholders have different metrics (revenue, cost, risk)
• Customer stakeholders downstream also have value/impact metrics
What is a Stakeholder? (cont.)
• How to model stakeholders?
– Follow Game Theory for inspiration (but don't worry about "equilibrium")
– Create an index set $I$ with all stakeholders. Various actions, outcomes & objectives will have subscripts $i$ based on stakeholders
– E.g., person $i$ chooses action $a_{i,t}$ at time $t$
– $I$ can be hierarchical (Person $i$ contained in org $A$, so $\mathrm{parent}(i) = A$)
– "Value" modeled by objective functions ($U_{i,n}$ = objective #$n$ for person $i$)
– Is the action "pivotal" for the outcome? (i.e., $\mathbb{E}[U \mid \mathrm{do}(a)] > \mathbb{E}[U \mid \mathrm{do}(\text{not } a)]$?)
• Keep track of stakeholder data:
– Might be high-level (email, PowerPoints); context is key
– Might be in databases (transactions, customer records, tickets data)
– Might be granular events data (web clickstream, logs, mobile, wire data)
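The stakeholder model above can be made concrete with a small sketch. It is illustrative only: the stakeholder names, the `parent` mapping and the utility samples are hypothetical, and the sample means stand in for the interventional expectations only under the assumption that the action was assigned at random (for example, in an A/B test).

```python
from statistics import mean

# Hypothetical index set I, with a parent map giving the hierarchy parent(i) = A.
stakeholders = ["alice", "bob", "carol"]
parent = {"alice": "IT", "bob": "IT", "carol": "Business"}

def is_pivotal(utility_with_action, utility_without_action):
    """Check whether the estimated E[U | do(a)] exceeds E[U | do(not a)].

    Assumption: both samples come from randomized assignment of the action,
    so each sample mean estimates the corresponding interventional expectation.
    """
    return mean(utility_with_action) > mean(utility_without_action)

# Hypothetical observed utilities U for one objective, with and without the action.
with_action = [12.0, 9.5, 11.0, 10.5]
without_action = [8.0, 9.0, 7.5, 8.5]

lift = mean(with_action) - mean(without_action)
print(is_pivotal(with_action, without_action), lift)
```

With observational rather than randomized data, a plain difference of means would not justify the do() notation, and confounders would have to be adjusted for.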
What is a Use Case?
• A "use case" consists of a business problem, a strategy to alleviate the problem, metrics to evaluate the outcome, data to power a solution, and stakeholders who are involved in its development.
• Use Case: Problem Forecasting.
– Company has costly network/system outages
– IT hires a Data Scientist to help build solution.
– Data includes Infrastructure (CPU, Memory), Operations (Outage Reports), App logs, etc.
– Metric: cost of outages ≈ num outages * time to resolution * cost of labor
– Stakeholders include IT & Business, and impacted customers
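The outage-cost metric above is a plain product of three factors, so it is easy to sketch. The function name, the units (hours, cost per hour) and all the numbers below are assumptions chosen for illustration, not figures from the talk.

```python
def outage_cost(num_outages, hours_to_resolve, labor_cost_per_hour):
    """Approximate cost of outages: num outages * time to resolution * cost of labor."""
    return num_outages * hours_to_resolve * labor_cost_per_hour

# Hypothetical figures: 12 outages per quarter, 150.0 per hour of labor.
baseline = outage_cost(12, 3.0, 150.0)   # mean time to resolution of 3 hours
improved = outage_cost(12, 2.0, 150.0)   # mean time to resolution cut to 2 hours

savings = baseline - improved
print(baseline, improved, savings)
```

Because the metric is multiplicative, shortening time to resolution by a third cuts the estimated cost by a third, which gives IT and Business stakeholders a shared number to track.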
What is a Deliverable?
• A "deliverable" is a thing produced to solve a use case
– Can include "dashboards": informational web pages built from data queries
– Or "workflows": notable events delivered to operations analysts
– Or full-fledged "data product": application which *does* stuff automatically
• Deliverable: Problem Forecasting System
– Goal: forecast problems before they start, deliver "proximate root cause" to IT to investigate
– Data: CPU, Memory, Latency, Service Tickets
– Build machine learning model to correlate Infrastructure data with Service impact
– Apply model to incoming events, create "predicted(Risk_Score)"
– Surface high-risk events to IT Operations
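The last two steps of the forecasting deliverable (apply a model to incoming events, then surface the high-risk ones) might look roughly like the sketch below. Everything specific in it is an assumption: the hand-set logistic weights stand in for a model that would really be trained on infrastructure and service-ticket data, and the field names, hosts and threshold are hypothetical.

```python
import math

# Hypothetical coefficients standing in for a trained model (not from the talk).
WEIGHTS = {"cpu_pct": 0.04, "memory_pct": 0.03, "latency_ms": 0.005}
BIAS = -6.0

def predicted_risk_score(event):
    """Logistic score in (0, 1) computed from infrastructure metrics."""
    z = BIAS + sum(weight * event[name] for name, weight in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))

def surface_high_risk(events, threshold=0.5):
    """Attach a risk score to each event and return those at or above the threshold, riskiest first."""
    scored = [dict(event, risk_score=predicted_risk_score(event)) for event in events]
    risky = [event for event in scored if event["risk_score"] >= threshold]
    return sorted(risky, key=lambda event: event["risk_score"], reverse=True)

# Hypothetical incoming events.
incoming = [
    {"host": "web-01", "cpu_pct": 35, "memory_pct": 40, "latency_ms": 120},
    {"host": "db-01", "cpu_pct": 95, "memory_pct": 90, "latency_ms": 800},
]

high_risk = surface_high_risk(incoming)
print([event["host"] for event in high_risk])
```

Sorting by score is one way to put the likeliest "proximate root cause" at the top of the list that IT Operations sees.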
Queries as Functional Programming
Query Languages
• Query languages provide a formulaic approach to working with data
• A "query" is a string which tells where to get the data, what to do with the data, and where to put the data (incl. visualization or DB)
• Mathematically, every query is a FUNCTION f: Data -> Data
• Queries can be composed (with | symbol), and analysis is iterative
• Example 1: Successful Purchase Actions from Web Logs
sourcetype=access_combined action=purchase status=200
• Example 2: Plot Purchase Value as Metric Timeseries
sourcetype=access_combined action=purchase status=200 | timechart partial=f span=5m sum(price) as value
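The claim that every query is a function f: Data -> Data, composed with |, can be mimicked in a few lines of Python. This is an analogy under stated assumptions, not how the search language is actually evaluated: `pipe`, `search` and `sum_by_span` are hypothetical helpers, and the events are made-up dictionaries with a numeric `_time` in seconds.

```python
from functools import reduce

def pipe(*stages):
    """Compose Data -> Data stages left to right, like the | symbol."""
    return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)

def search(**criteria):
    """Stage that keeps events whose fields equal the given values."""
    def stage(events):
        return [e for e in events if all(e.get(k) == v for k, v in criteria.items())]
    return stage

def sum_by_span(field, span_seconds, out_name):
    """Stage that sums a field per time bucket, loosely like timechart span=... sum(...)."""
    def stage(events):
        buckets = {}
        for e in events:
            start = e["_time"] - e["_time"] % span_seconds
            buckets[start] = buckets.get(start, 0) + e[field]
        return [{"_time": t, out_name: v} for t, v in sorted(buckets.items())]
    return stage

# Hypothetical web-log events.
events = [
    {"_time": 0, "action": "purchase", "status": 200, "price": 10.0},
    {"_time": 100, "action": "purchase", "status": 200, "price": 5.0},
    {"_time": 200, "action": "purchase", "status": 503, "price": 7.0},
    {"_time": 400, "action": "view", "status": 200, "price": 0.0},
    {"_time": 450, "action": "purchase", "status": 200, "price": 20.0},
]

purchases = search(action="purchase", status=200)       # analogue of Example 1
value_series = sum_by_span("price", 300, "value")       # analogue of the timechart stage
query = pipe(purchases, value_series)                   # analogue of Example 2

result = query(events)
print(result)
```

Since each stage takes Data and returns Data, any prefix of the pipeline is itself a valid query, which is what makes the iterative refinement described above possible.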
1: Successful Purchase Actions from Web Logs
2: Plot Purchase Value as Metric Timeseries
Wait! Problem Identified! Why are purchases decreasing?
3: Investigate Database Errors
4: Plot Database Errors
5: Correlate DB problems with purchase value
6: Save as Deliverable, Send to Stakeholders
7: Move toward proactive monitoring stance
Putting It All Together: Doing Data Science
Emphasize Actionable Insights
• Avoid eye-candy visualizations
– "Laser beam" threat dashboards look cool but are useless
• "How does this help me solve my problem?"
• Guide the viewer to drill down & act quickly
Confusing viz: not actionable
Good viz: actionable
Doing Data Science
• Data Scientists resist easy characterization. A bit of:
– Statistician
– Software Engineer
– Business Analyst
– Spock on the bridge
• Many scales of action:
– Get your hands dirty when needed
– But step back and see the big picture
– Emphasize "actionable insights" throughout the whole org
– Also have political credibility to say, "This is a bad decision, don't do it"
• Data Scientists guide organizations through building data products to solve big problems and deliver value