
Michael Carey/Padhraic Smyth, UC Irvine: Stats 170A/B, Winter 2018 0

Announcements

• Remember to track the course wiki page:
  • https://grape.ics.uci.edu/wiki/asterix/wiki/stats170ab-2018
• And don’t forget about the Piazza page:
  • http://piazza.com/uci/winter2018/stats170a/home
• The first HW assignment is due NOW:
  • https://grape.ics.uci.edu/wiki/asterix/attachment/wiki/stats170ab-2018/HW1.pdf
• Today: Principles of Data Wrangling (from the O’Reilly book by Rattenbury et al.)


Data Stages in Data Wrangling

• Raw Data
  • Ingest data
  • Data discovery and metadata creation
• Refined Data
  • Create canonical data for widespread consumption
  • Conduct analyses, modeling, and forecasting
• Production Data
  • Create production-quality data
  • Build regular reporting and automated data products/services


Funnel of Wrangling Effort


Data Product Workflow Framework

(Note: In reality, there will be loop-backs and iteration…)

Data Product Workflow Framework


Ingesting Known & Unknown Data

• Relational enterprise data warehouse world
  • “Schema on write” (eager)
  • Transform incoming data into warehouse schema form
    • ETL (extract/transform/load)
  • Can be append-only or may also involve updates
• Today’s more flexible world
  • NoSQL databases (Mongo, Cassandra, AsterixDB, …) offer schema flexibility
  • Distributed file systems like HDFS or S3 allow data deposits to be files for later processing
  • “Schema on read” (lazy)
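The contrast between the two ingestion styles can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `WAREHOUSE_SCHEMA` mapping, the function names, and the sample record are hypothetical and do not come from the slides or from any particular system.

```python
import json

# Hypothetical warehouse schema: field name -> type used to coerce incoming values.
WAREHOUSE_SCHEMA = {"id": int, "city": str, "total": float}

def ingest_schema_on_write(raw_line):
    """Eager: parse and coerce into the fixed schema at load time.
    Fields outside the schema are dropped; a missing field raises KeyError."""
    record = json.loads(raw_line)
    return {name: cast(record[name]) for name, cast in WAREHOUSE_SCHEMA.items()}

def read_with_schema(raw_line, wanted):
    """Lazy: the raw line is stored untouched; structure is applied only
    when a reader asks for specific fields."""
    record = json.loads(raw_line)
    return {name: record.get(name) for name in wanted}

raw = '{"id": "123", "city": "LA", "total": "25.97", "gift": true}'
eager = ingest_schema_on_write(raw)           # coerced, "gift" is lost
lazy = read_with_schema(raw, ["city", "gift"])  # "gift" is still available
```

The eager path pays the transformation cost once and rejects bad input early; the lazy path keeps everything but pushes interpretation (and error handling) to each reader.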


Creating Metadata

• Datasets are composed of records with fields
  • “Records often represent or correspond to people, objects, relationships, or events”
  • “The fields within a record represent or correspond to measurable aspects of the person, object, relationship, or event”
  • Q: Sound at all familiar (from last lecture)?
• Key dimensions to understand (and document)
  • Structure
  • Granularity
  • Accuracy
  • Temporality
  • Scope


Structure

• Format of a dataset’s records and fields
  • Highly regular (or “rectangular”) end of the spectrum
    • Table with fixed rows (records) and columns (fields)
  • Records with variant (or “jagged”) structure
    • XML or JSON formats are popular examples here
  • Heterogeneous collections of records
    • Mixes of information about multiple entities
• Data encoding
  • Field details (e.g., measurement units, time zone, …)
  • Low-level field value encoding
    • Plain text, binary, zip compressed, …


Structure Questions

• Do all records contain the same fields?
• How are fields accessed? (By position? By name?)
• How are records delimited/separated? Is parsing needed?
• How are record fields delimited? Is parsing needed?
• How are record fields encoded? Strings? Binary? Enumerated codes? Compressed?
• How complex is the encoding? (Primitives vs. hash maps or arrays)
• What are the semantics, and are checks needed?
• What are the “relationship types” between records and fields (atomic vs. nested sets/arrays)?


Granularity

• Kinds of entities that each data record represents
  • Fine granularity: e.g., a record represents a single sales transaction by a single customer at a particular store
  • Coarse granularity: e.g., a record represents the total sales in a store for an entire day
  • Subtleties: e.g., contacts vs. actually paying customers
• Granularity Questions:
  • What kind of thing do the records represent?
  • Do the records represent the same kinds of things?
  • What alternative interpretations of the records are there?


Accuracy

• Quality of the dataset
  • A wide variety of type-specific issues are possible
  • Process issues (in data production) are also possible
  • Inaccuracies of various kinds can arise
    • Misspellings (e.g., names or categorical attributes)
    • Lack of appropriate categories (e.g., ethnicity labels)
    • Missing field components (e.g., AM/PM)
  • Frequency outliers can indicate data problems
• Remember the phrase “garbage in, garbage out”…!


Accuracy Questions

• Type-specific issues
  • Time format(s), time zones, possible ambiguities, …?
  • Are address components complete and consistent?
  • Are digits/components of phone #s and UPC codes missing?
  • Are there misspellings or missing name fields?
  • Are e-mail domains valid?
  • Are currency amounts in the same currency and sensible?
• Process-related issues
  • Sensor drift
  • People’s (mis)spellings and abbreviations
• Inaccuracy distribution(s)
  • What is the measurable distribution of inaccuracies?
  • Are many records affected?
  • Are there concentrations of inaccuracies?


Temporality

• Records represent an entity at a point in time, so…
• Temporality Questions:
  • When was the dataset collected?
  • Were all records and their fields collected/measured at the same time?
  • Are the timestamps of the data known or available, either in the data or as associated metadata?
  • Have any of the records/fields been modified after their creation time? Are the modification timestamps available?
  • Can the “staleness” of the data be determined, if applicable? And if so, how? (Ex: purchases and returns)


Scope

• Two dimensions of scope of a dataset
  • Number of distinct attributes (breadth/detail)
  • Population coverage (intentional/unintentional, sample, …)
• Scope Questions:
  • What entity characteristics are captured? Not captured?
  • Are the record fields consistent? (E.g., age vs. DOB, items vs. total?)
  • Can you infer what you need from the data available?
  • Are the same fields available for all the records?
  • Do the records represent the entire population of things?
  • Are there multiple records per thing? (→ de-duplication)
  • Is the dataset heterogeneous, and if so, how?


Data Product Workflow Framework


Designing Refined Data

• Address structural issues
  • Tabularize, convert (e.g., categories → indicators), …
• Address granularity issues
  • May want to store multiple versions/levels of a dataset
• Address accuracy issues
  • May choose to
    • Remove records with inaccurate values (if detectable)
    • Retain them but mark them as being inaccurate
    • Imputation: replace inaccurate values (defaults/estimates)
  • In some cases time can help (e.g., multiple addresses)
• Address scope issues
  • Critical to understand population coverage, possible biases


Refined Stage Analytical Functions

• Reporting analyses
  • Historical data → answer questions about past/present
  • Ex: Use of business intelligence (BI) tools and dashboards
  • Simple questions: How many customers bought Amazon Echos last week, or what were the top three most popular in-home “listener” devices?
  • Complex questions: What were the key factors driving the popularity of in-home “listeners” (e.g., Amazon Echo, Google Home) the last two years?
• Modeling and forecasting analyses
  • Historical trends → future trends
  • Ex: Prediction of customer retention (e.g., license renewal)
  • May want a prediction, or may want the model itself
  • Causal analyses will require carefully designed experimentation


Data Product Workflow Framework


Production Data and Automation

• Creating optimized data
  • Putting data in ideal form for downstream consumption
  • Constraints: available processing power and storage
• Designing regular and automated reports
  • Monitor data to ensure ongoing constraint satisfaction
  • Handle (acceptable) variations with generalized logic
  • Evolution over time w.r.t. schema or data availability


Data Product Workflow Summary

• Data wrangling = process involved in transforming or preparing data for analysis
  • Occurs between the stages (to move to the next one)

Dynamics of Data Wrangling

• Accessing the data
  • Permissions, infrastructure, crawling, replicating, …
• Transforming the data
  • Manipulating structure, granularity, accuracy, temporality, and scope of data to align with the analysis goals
  • Iteration between transforming and profiling the data
• Publishing the datasets (or transformation logic) and profiling metadata about the datasets


Wrangling Dynamics (cont.)

• Additional aspects for wrangling real data
  • Subsetting
  • Sampling
• Subsetting
  • Heterogeneous datasets will require creation of homogeneous subsets for efficient/effective wrangling
  • Can merge again at end, if needed
• Sampling
  • Needed to deal with very large datasets, either due to human limitations or time limitations
  • Not simple: samples need to include extreme values, distributional trends, value varieties (e.g., currencies), …
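One way to see why sampling is "not simple" is to build a sample that deliberately keeps every value variety and the extremes. The sketch below is an assumption-laden illustration, not a method from the slides: the helper names `stratified_sample` and `add_extremes`, the field names, and the synthetic records are all hypothetical.

```python
import random

def stratified_sample(records, key, per_group, seed=0):
    """Draw up to per_group records from every distinct value of `key`,
    so rare varieties (e.g., a rarely used currency) are not lost."""
    rng = random.Random(seed)
    groups = {}
    for rec in records:
        groups.setdefault(rec[key], []).append(rec)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample

def add_extremes(sample, records, field):
    """Make sure the smallest and largest value of `field` are in the sample."""
    lowest = min(records, key=lambda r: r[field])
    highest = max(records, key=lambda r: r[field])
    for rec in (lowest, highest):
        if rec not in sample:
            sample.append(rec)
    return sample

# Synthetic data: 100 USD records plus one EUR and one JPY record.
records = [{"currency": "USD", "amount": float(i)} for i in range(100)]
records += [{"currency": "EUR", "amount": 5000.0}, {"currency": "JPY", "amount": 0.5}]

sample = add_extremes(
    stratified_sample(records, "currency", per_group=3), records, "amount"
)
```

A plain uniform sample of five or six records would very likely miss both the EUR and JPY records; the stratified version cannot.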


Transformation Actions

• Structuring transformations
  • Reordering fields
  • Breaking down or combining fields (e.g., addresses)
  • Aggregating subsets of records
  • Pivoting (records → fields or vice versa)
• Enriching transformations
  • Joins to combine datasets (e.g., to attach information)
  • (Outer) Unions to combine records across datasets
  • Metadata insertion into the data (e.g., editing info)
  • New value computation (e.g., geo-coding, sentiment, …)
• Cleaning transformations
  • Manipulating individual field values (e.g., missing values)


Profiling

• Individual value profiling
  • Syntactic constraints
    • Ex: MM-DD-YYYY, (XXX) XXX-XXXX
  • Semantic constraints
    • Ex: no sales transactions on holidays
• Set-based profiling
  • Checking the shape/extent of the distribution of values for a given field
  • Ex: Expected distribution of sales across months


Actions in the Workflow Framework


Syntactic Value Profiling

• Based on constraints on allowable field values, e.g.:
  • Boolean: {0, 1} (or {true, false}, {T, F}, …)
  • Gender: {male, female}
  • ATM transaction count: [0..50000] where 50000 is based on (bank age * 365 * withdrawals/day limit)
• Checking is like evaluating a CHECK CONSTRAINT in a SQL DBMS
• While exploring data, one may need to look at positive and negative examples to determine what the final constraint should really be
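Syntactic checks of this kind can be written as small per-field predicates, much like CHECK constraints. The following is a minimal sketch, assuming hypothetical field names (`is_member`, `atm_count`, `signup_date`, `phone`) and reusing the MM-DD-YYYY and (XXX) XXX-XXXX patterns and the [0..50000] range mentioned in the slides.

```python
import re

# Patterns for MM-DD-YYYY dates and (XXX) XXX-XXXX phone numbers.
DATE_RE = re.compile(r"^(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])-\d{4}$")
PHONE_RE = re.compile(r"^\(\d{3}\) ?\d{3}-\d{4}$")

# One predicate per field; the field names are illustrative assumptions.
CHECKS = {
    "is_member": lambda v: v in {0, 1},
    "atm_count": lambda v: isinstance(v, int) and 0 <= v <= 50000,
    "signup_date": lambda v: isinstance(v, str) and DATE_RE.match(v) is not None,
    "phone": lambda v: isinstance(v, str) and PHONE_RE.match(v) is not None,
}

def violations(record):
    """Return the sorted names of fields whose values break their constraint."""
    return sorted(
        name for name, ok in CHECKS.items()
        if name in record and not ok(record[name])
    )

good = {"is_member": 1, "atm_count": 42,
        "signup_date": "01-17-2018", "phone": "(212) 123-4567"}
bad = {"is_member": 2, "atm_count": 50001,
       "signup_date": "17-01-2018", "phone": "212-123-4567"}
```

Looking at which records land in `violations` (and which ones pass but look wrong) is one way to refine the constraints while exploring.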


Semantic Value Profiling

• Based on values’ meanings/interpretations, e.g.:
  • An age field may hold -1 when age isn’t reported; one may want to derive a new Boolean field (reported_age) to analyze the willingness of customers to disclose their age
  • An address field may have resolvable differences such as “San Jose, CA”, “Sn Jose, CA”, and “San Jose, CA, USA”. Other cases (e.g., is “Moscow” in Russia or Idaho?) may not be easily resolved and are therefore semantically invalid
  • Some cases may involve conversions, e.g., from an age in years to a life stage (e.g., teen, adult, senior)
• Profiling such cases often involves deriving a new field that encodes the semantic interpretation of a source field (that can then be syntactically checked)
• How to convert a source field to its interpreted value?
  • Common case: deterministic rules
  • More difficult case(s): probabilistic mappings
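The deterministic-rule case can be illustrated with the age example: derive `reported_age` and a life stage from the raw age field. This is a sketch only; the function name `derive_age_semantics`, the `life_stage` field, and the age cut-offs are assumptions chosen for illustration, not values given in the slides.

```python
def derive_age_semantics(record):
    """Derive interpreted fields from a raw age, where -1 means 'not reported'.
    The life-stage thresholds below are illustrative assumptions."""
    age = record["age"]
    reported = age != -1
    if not reported:
        stage = None
    elif age < 13:
        stage = "child"
    elif age < 20:
        stage = "teen"
    elif age < 65:
        stage = "adult"
    else:
        stage = "senior"
    # Return a new record so the source field is left untouched.
    return {**record, "reported_age": reported, "life_stage": stage}

customers = [{"age": 15}, {"age": -1}, {"age": 70}]
derived = [derive_age_semantics(c) for c in customers]
```

The derived `life_stage` field has a small, closed set of values, so it can then be checked syntactically like any other enumerated field.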


Set-Based Profiling

• Focus is shape/extent of the distribution of field values across records, or the range of relationships between multiple record fields
• Numeric fields
  • Build histogram and compare to a known distribution
  • Also examine min, max, mean, sum to understand the distribution and spot problems, outliers, etc.
• Categorical fields
  • Count unique values and/or value clusters (GROUP BY)
• Other specific field types
  • Geospatial (zip code, lat/long): examine a plot on a map
  • Temporal: examine data on/in different scales/buckets (e.g., day of week, month of year, …)
• Can also do scatter plots of values from several fields
• Election example in book: “Candidate Master File” (2015-16)
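For numeric and categorical fields, a first pass at set-based profiling needs little more than summary statistics and value counts. The sketch below uses only the Python standard library; the helper names and the sample values (including the suspicious 1000.0 and the misspelled "C A") are made up for illustration.

```python
from collections import Counter
from statistics import mean

def numeric_profile(values):
    """Basic summary of a numeric field: min, max, mean, and sum."""
    return {"min": min(values), "max": max(values),
            "mean": mean(values), "sum": sum(values)}

def category_counts(values):
    """Count occurrences of each distinct value (a GROUP BY ... COUNT(*))."""
    return Counter(values)

def singletons(values):
    """Values that occur exactly once; often typos or rare categories."""
    return sorted(v for v, n in category_counts(values).items() if n == 1)

sales = [10.0, 12.0, 11.0, 9.0, 1000.0]    # 1000.0 looks like an outlier
states = ["CA", "CA", "NY", "CA", "C A"]   # "C A" looks like a typo

profile = numeric_profile(sales)
counts = category_counts(states)
rare = singletons(states)
```

Here the gap between the maximum and the mean flags the outlier, and the singleton list surfaces the malformed category for a closer look.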


Transformation: Structuring

• Intrarecord structuring
  • Reordering record fields (moving columns)
  • Creating new record fields via value extraction
  • Creating new record fields by combining fields
• Interrecord structuring
  • Filtering documents by removing sets of records
  • Shifting granularity through aggregations and pivots


Extracting Values (Intra)

• Positional extraction
  • Date example: 17012018
  • Name example: WAYNE, BRUCE
• Pattern-based extraction
  • Money example: BRIBE ($999.00 MONTHLY)
• Complex structure extraction
  • JSON example:

{"id": "123", "Customer": {"name": "Fred", "city": "LA"}, "total": 25.97, "gift": true, "shipping": "UPS Ground", "Items": [{"sku": 401, "qty": 2, "price": 9.99}, …]}
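The three extraction styles can be sketched with string slicing, a regular expression, and a JSON parser. The code below makes a few assumptions that the slides do not state: the date string is read as DDMMYYYY, the name is split at the comma, and the JSON document is shortened to a single item because the original elides the rest. The helper names are hypothetical.

```python
import json
import re
from datetime import date

def extract_date(text):
    """Positional extraction, assuming a DDMMYYYY layout."""
    return date(int(text[4:8]), int(text[2:4]), int(text[0:2]))

def extract_name(text):
    """Split 'LAST, FIRST' at the comma into (first, last)."""
    last, first = (part.strip() for part in text.split(",", 1))
    return first, last

def extract_amount(text):
    """Pattern-based extraction of the first dollar amount in the text."""
    match = re.search(r"\$(\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None

# Shortened version of the order document, with only one item.
order = json.loads(
    '{"id": "123", "Customer": {"name": "Fred", "city": "LA"}, '
    '"total": 25.97, "gift": true, "shipping": "UPS Ground", '
    '"Items": [{"sku": 401, "qty": 2, "price": 9.99}]}'
)

when = extract_date("17012018")
first, last = extract_name("WAYNE, BRUCE")
amount = extract_amount("BRIBE ($999.00 MONTHLY)")
city = order["Customer"]["city"]             # nested object access
skus = [item["sku"] for item in order["Items"]]  # nested array access
```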


Combining Fields (Intra)

• Name example (in reverse):

FirstName   MiddleName   LastName
Bruce                    Wayne
Anthony     Edward       Stark

Name
Wayne, Bruce
Stark, Anthony Edward
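Combining the three name fields into one can be sketched as follows; the function name `combine_name` is a hypothetical helper, and an empty string stands in for a missing middle name.

```python
def combine_name(first, middle, last):
    """Build 'Last, First Middle', skipping an empty middle name."""
    given = " ".join(part for part in (first, middle) if part)
    return f"{last}, {given}"

people = [("Bruce", "", "Wayne"), ("Anthony", "Edward", "Stark")]
names = [combine_name(*person) for person in people]
```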

Filtering Records and Fields (Inter)

• Removing records or fields from a dataset
• Example (both record-based and field-based):

Name                       SuperHero   AlmaMater
Wayne, Bruce               Batman      Gotham City College
Stark, Anthony Edward      Iron Man    MIT
Smoak, Felicity                        MIT
Allen, Bartholomew Henry   Flash       Central City U

Name                    AlmaMater
Stark, Anthony Edward   MIT
Smoak, Felicity         MIT
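The example combines a record filter (keep only MIT graduates) with a field filter (drop the SuperHero column). A minimal sketch over a list of dictionaries, with `None` standing in for the blank SuperHero cell:

```python
heroes = [
    {"Name": "Wayne, Bruce", "SuperHero": "Batman", "AlmaMater": "Gotham City College"},
    {"Name": "Stark, Anthony Edward", "SuperHero": "Iron Man", "AlmaMater": "MIT"},
    {"Name": "Smoak, Felicity", "SuperHero": None, "AlmaMater": "MIT"},
    {"Name": "Allen, Bartholomew Henry", "SuperHero": "Flash", "AlmaMater": "Central City U"},
]

def filter_records_and_fields(rows, keep_row, keep_fields):
    """Keep rows for which keep_row(row) is true, projected onto keep_fields."""
    return [{field: row[field] for field in keep_fields}
            for row in rows if keep_row(row)]

mit_grads = filter_records_and_fields(
    heroes,
    keep_row=lambda row: row["AlmaMater"] == "MIT",
    keep_fields=["Name", "AlmaMater"],
)
```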

Aggregations (Inter)

• Shifting the granularity of a dataset
• Simple aggregation example:

Name                       SuperHero   AlmaMater             NumYears
Wayne, Bruce               Batman      Gotham City College   2
Stark, Anthony Edward      Iron Man    MIT                   4
Smoak, Felicity                        MIT                   5
Allen, Bartholomew Henry   Flash       Central City U        4

AlmaMater             NumGrads   AvgYears
Central City U        1          4.0
Gotham City College   1          2.0
MIT                   2          4.5
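The aggregation above moves from one record per person to one record per school. A small sketch, equivalent in spirit to a SQL GROUP BY with COUNT and AVG (the function name is a hypothetical helper):

```python
students = [
    {"Name": "Wayne, Bruce", "AlmaMater": "Gotham City College", "NumYears": 2},
    {"Name": "Stark, Anthony Edward", "AlmaMater": "MIT", "NumYears": 4},
    {"Name": "Smoak, Felicity", "AlmaMater": "MIT", "NumYears": 5},
    {"Name": "Allen, Bartholomew Henry", "AlmaMater": "Central City U", "NumYears": 4},
]

def aggregate_by_school(rows):
    """Group rows by AlmaMater, then count graduates and average NumYears."""
    years_by_school = {}
    for row in rows:
        years_by_school.setdefault(row["AlmaMater"], []).append(row["NumYears"])
    return [
        {"AlmaMater": school, "NumGrads": len(years),
         "AvgYears": sum(years) / len(years)}
        for school, years in sorted(years_by_school.items())
    ]

by_school = aggregate_by_school(students)
```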

Column → Row Pivots (Inter)

• Shifting the granularity of a dataset
• Simple “unpivoting” example:

Customer      CellPhone        HomePhone        OfficePhone
John Smith    (212) 123-4567   (212) 111-2233
Sally Forth                    (949) 124-8163   (949) 987-6543

Customer      PhoneLoc   PhoneNum
John Smith    Cell       (212) 123-4567
John Smith    Home       (212) 111-2233
Sally Forth   Home       (949) 124-8163
Sally Forth   Office     (949) 987-6543
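Unpivoting turns each non-empty phone column into its own row. A minimal sketch, with `None` standing in for blank cells and a hypothetical helper name:

```python
wide = [
    {"Customer": "John Smith", "CellPhone": "(212) 123-4567",
     "HomePhone": "(212) 111-2233", "OfficePhone": None},
    {"Customer": "Sally Forth", "CellPhone": None,
     "HomePhone": "(949) 124-8163", "OfficePhone": "(949) 987-6543"},
]

def unpivot_phones(rows, columns=("CellPhone", "HomePhone", "OfficePhone")):
    """Emit one (Customer, PhoneLoc, PhoneNum) row per non-empty phone column."""
    long_rows = []
    for row in rows:
        for column in columns:
            if row[column] is not None:
                long_rows.append({
                    "Customer": row["Customer"],
                    "PhoneLoc": column[: -len("Phone")],  # "CellPhone" -> "Cell"
                    "PhoneNum": row[column],
                })
    return long_rows

phones = unpivot_phones(wide)
```

The result has one record per phone number rather than one per customer, which is the finer granularity shown in the second table.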

Row → Column Pivots (Inter)

• Again, shifting the granularity of a dataset
• A “pivoting” example:

Donor         RedCrossSum   KJAZSum   UnitedWaySum
John Smith    300.00        500.00    0.00
Sally Forth   0.00          0.00      7500.00

Donor         Charity      Gift
John Smith    Red Cross    100.00
John Smith    KJAZ         500.00
John Smith    Red Cross    200.00
Sally Forth   United Way   7500.00
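Pivoting goes the other way: individual gift records are summed into one column per charity, giving one row per donor. The sketch below is one possible implementation; the `pivot_sums` helper and its column-naming rule (charity name without spaces plus "Sum") are assumptions made to match the table above.

```python
from collections import defaultdict

gifts = [
    {"Donor": "John Smith", "Charity": "Red Cross", "Gift": 100.00},
    {"Donor": "John Smith", "Charity": "KJAZ", "Gift": 500.00},
    {"Donor": "John Smith", "Charity": "Red Cross", "Gift": 200.00},
    {"Donor": "Sally Forth", "Charity": "United Way", "Gift": 7500.00},
]

def pivot_sums(rows, key, column, value):
    """One output row per `key`, with one summed column per distinct `column` value."""
    columns = []
    for row in rows:                      # remember columns in first-seen order
        if row[column] not in columns:
            columns.append(row[column])
    totals = defaultdict(lambda: defaultdict(float))
    for row in rows:
        totals[row[key]][row[column]] += row[value]
    result = []
    for k, sums in totals.items():
        wide = {key: k}
        for c in columns:
            wide[c.replace(" ", "") + "Sum"] = sums[c]  # missing -> 0.0
        result.append(wide)
    return result

by_donor = pivot_sums(gifts, "Donor", "Charity", "Gift")
```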

Enrichment: Transformations

• Union
  • Ex: YearSales = Q1Sales ∪ Q2Sales ∪ Q3Sales ∪ Q4Sales
  • Simple case when Qi’s are union-compatible (à la SQL)
  • May need “outer union” if slightly different field sets
    • Sales1(region, amount, listpricesale)
    • Sales2(region, amount, channel)
    ☛ AllSales(region, amount, listpricesale, channel)
• Join
  • Ex: CustomerPhone INNER JOIN Donation ON CustomerPhone.Customer = Donation.Donor
    (Then we can start making those annoying calls :-)
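Both enrichment operations can be sketched over lists of dictionaries. In this illustration the outer union pads missing fields with `None`, and the inner join pairs rows whose key values match. The helper names, the sample values, and the extra customer "Jane Doe" are hypothetical.

```python
def outer_union(*datasets):
    """Union datasets with differing field sets, padding missing fields with None."""
    fields = []
    for rows in datasets:
        for row in rows:
            for name in row:
                if name not in fields:
                    fields.append(name)
    return [{name: row.get(name) for name in fields}
            for rows in datasets for row in rows]

def inner_join(left, right, left_key, right_key):
    """Pair each left row with every right row that has a matching key value."""
    index = {}
    for r in right:
        index.setdefault(r[right_key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[left_key], [])]

sales1 = [{"region": "West", "amount": 100.0, "listpricesale": True}]
sales2 = [{"region": "East", "amount": 50.0, "channel": "web"}]
all_sales = outer_union(sales1, sales2)

customer_phone = [
    {"Customer": "John Smith", "PhoneLoc": "Cell", "PhoneNum": "(212) 123-4567"},
    {"Customer": "Jane Doe", "PhoneLoc": "Home", "PhoneNum": "(555) 000-1111"},
]
donation = [{"Donor": "John Smith", "Charity": "KJAZ", "Gift": 500.0}]
call_list = inner_join(customer_phone, donation, "Customer", "Donor")
```

Because it is an inner join, the customer with no matching donation does not appear in `call_list`.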


Enrichment: Metadata

• Examples:
  • Filenames of source data (basic provenance)
  • Byte offsets and/or record numbers (location)
  • Current date and/or time
  • Creation/update/access timestamps
  • Record and/or record field lineage (provenance)
• Q: Why might one want to do this…?
  • Go back to the source(s) in the event of errors
  • Credibility/authority of data underlying a given data product or analysis


Enrichment: Value Derivation

• Generic derivations
  • Common examples
    • Derive day of week or season from date
    • Convert address into zip code, lat/long, or region
    • Analyze text for sentiment or for entity references (people, places, things)
    • Compute sale as list price times discount plus tax
  • May involve domain-specific aspects
    • Region definitions for government vs. businesses
    • Special terminology or entity types
  • May be driven by law (e.g., field redaction)
• Proprietary derivations
  • Individual organizations’ customized models
• Commonly used DB/Big Data mechanism: UDFs
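Generic derivations are typically small functions of the kind one would register as UDFs. The sketch below shows three, under explicit assumptions: seasons follow a northern-hemisphere, month-based convention, and the sale formula reads "list price times discount plus tax" as applying a fractional discount and then a tax rate. The slides do not spell out either rule, and the function names are hypothetical.

```python
from datetime import date

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")

def day_of_week(d):
    """Name of the weekday for a date (date.weekday() is 0 for Monday)."""
    return DAY_NAMES[d.weekday()]

def season(d):
    """Month-based season; assumes a northern-hemisphere convention."""
    if d.month in (12, 1, 2):
        return "winter"
    if d.month in (3, 4, 5):
        return "spring"
    if d.month in (6, 7, 8):
        return "summer"
    return "fall"

def sale_amount(list_price, discount, tax_rate):
    """Assumed formula: discounted list price plus tax, rounded to cents."""
    return round(list_price * (1 - discount) * (1 + tax_rate), 2)

d = date(2018, 1, 17)
derived = {
    "day": day_of_week(d),
    "season": season(d),
    "sale": sale_amount(100.0, 0.10, 0.08),
}
```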


Data Cleaning Transformations

• Missing or NULL values
  • Can filter out records with such fields
  • Can replace such values (imputation)
    • Average or median value
    • Generate values from similar records
    • Use last valid value (or interpolate) in sequence data
• Invalid values
  • Some common symptoms
    • Inconsistent with other fields (e.g., age vs. DOB)
    • Ambiguous (e.g., two-digit years, abbreviations, …)
  • Some potential cures
    • Calculate the correct or consistent value
    • Mark the value as invalid and analyze the data with/without
    • Data standardization (based on fixed library of valid values), using edit distance or domain knowledge as a tool
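Three of these cures can be sketched with the standard library: median imputation, carrying the last valid value forward, and standardizing against a fixed list of valid values. Note one substitution: `difflib.get_close_matches` ranks candidates by a similarity ratio rather than a true edit distance, so it stands in for the edit-distance tool mentioned above. The helper names, the cutoff, and the sample values are illustrative assumptions.

```python
import difflib
from statistics import median

def impute_median(values):
    """Replace None with the median of the known values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def forward_fill(values):
    """Replace None with the last valid value seen earlier in the sequence."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def standardize(value, valid_values, cutoff=0.8):
    """Map a value to its closest valid form, or None if nothing is close enough."""
    matches = difflib.get_close_matches(value, valid_values, n=1, cutoff=cutoff)
    return matches[0] if matches else None

VALID_CITIES = ["San Jose, CA", "San Diego, CA"]

imputed = impute_median([1.0, None, 3.0])
filled = forward_fill([5, None, None, 7])
fixed = standardize("Sn Jose, CA", VALID_CITIES)
unknown = standardize("Gotham", VALID_CITIES)
```

Returning `None` for values with no close match keeps the "mark as invalid" option open instead of forcing a guess.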


Data Wrangling Cast of Characters

Data Engineer’s Responsibilities

Data Architect’s Responsibilities

Data Scientist’s Responsibilities

Data Analyst’s Responsibilities


Action List Revisited

1. Ingesting data
2. Describing data
3. Assessing data utility
4. Designing and building refined data
5. Ad hoc reporting
6. Exploratory modeling and forecasting
7. Designing and building optimized data
8. Regular reporting
9. Building products and services


Organizational Best Practices

• Provide wide access to data
• Implement mechanisms to track data usage
• Use a common data manipulation language that spans business units and user roles (e.g., Excel, SQL, Python, …)
• Maintain a system that allows you to easily transition from development to production
• Consider a rotation program across roles to enable a cleaner hand-off and increase cross-functional trust


Data Wrangling Tools

Tool       Data Scale   Usual Platform   Data Structures   Transformation Paradigm
Excel      MB to GB     Desktop          Grid              UI: Script and wizards; Scope: Single values (formulas)
SQL        GB to TB     Server           Tables            UI: “Script” only (SQL); Scope: Programmatic (scripts over multiple records)
Trifacta   Unlimited    Cluster          Various           UI: Script, “builder”, machine-guided; Scope: Programmatic (scripts over multiple records)

Note: Other potential options include NoSQL databases, Hadoop/Spark based platforms (e.g., Hive, SparkSQL), …

Questions, Comments, Etc.?
