DATA PREPARATION FOR AUTOMATED MACHINE LEARNING
BY JEN UNDERWOOD
WHITEPAPER
© 2017 Impact Analytix, LLC - All rights reserved.
TABLE OF CONTENTS
Introduction
    Art and Science of Data Preparation
    Automation for Faster Data Preparation
    DataRobot for Machine Learning
Where to Start
    Machine Learning Lifecycle
    Plan for Data Collection
    Avoid Overfitting and Underfitting
Collect and Structure Data
    Gather Data
    Beware of Bias
Explore and Profile
    Understand Your Data
    Detect Leakage
    Find and Reduce Errors
Improve Data Quality
Engineer Features
Conclusion
    Recommended Next Steps
JANUARY 2018 – WHITEPAPER COMMISSIONED BY DATAROBOT
INTRODUCTION
The beauty of the human mind in combination with automated machine learning empowers amazing predictive insights that might never be found using manual techniques. Since the quality of predictive output relies on the quality of input, proper data preparation is a critical success factor for achieving optimal machine learning results.
THE ART AND SCIENCE OF DATA PREPARATION
The iterative process of preparing data for automated machine learning is both an art and a science. The art of data preparation requires knowledge of the business (or “domain expertise”) to select the right problems to solve, identify crucial input data, and carefully transform and engineer informative features to maximize predictive model accuracy. The science of data preparation involves cleansing and normalizing collected data, selecting influencer features, and generating training, testing, and validation data sets for automated machine learning.
Although automated machine learning solutions may provide safeguards to prevent common mistakes, you’ll still want to learn how to correctly prepare, shape, and format your data to create great models. Providing the wrong data, irrelevant data, or improperly prepared data undermines the model’s ability to generalize. Essentially, you will want to design an input data set with feature independence, explainable variance, and maximum information gain to find signals in the noise.
AUTOMATION FOR FASTER DATA PREPARATION
To identify relationships in data (“the signals”) and isolate distracting, irrelevant data (“the noise”), you can expedite the highly iterative predictive data preparation process with automated machine learning. In minutes, you can find the most relevant features and pinpoint specific areas in your data set where prediction errors occur to help you focus efforts on the right data and reduce experimentation time.
After running basic input data and evaluating the results, you will enhance input data, add features, build another model, and review performance once again. You’ll continue this process until your model meets performance objectives.
Ideally, subject matter experts who understand the business process and data source nuances will assist in the data preparation process. Depending on your project, data preparation might be a one-time activity or a periodic one. As new insights are revealed, it is common to experiment further.
DATAROBOT FOR MACHINE LEARNING
DataRobot is the world’s most advanced automated machine learning platform. DataRobot automates the machine learning process from data ingestion to deployment. It delivers immediate value and unmatched ease-of-use, and no complicated math or scripting is required.
DataRobot includes an array of data preparation features, automating feature engineering to find key insights and hidden patterns. This invaluable technology expedites analytical investigation across millions of variable combinations that would be far too time-consuming for manual human exploration.
For optimal machine learning model performance, domain knowledge and best practices used by the world’s leading data scientists have been uniquely baked into DataRobot blueprints. Users of all skill levels can safely apply machine learning with its built-in optimizations and safeguards.
DataRobot supports popular advanced machine learning techniques and open source tools such as Apache Spark, H2O, Scala, Python, R, and TensorFlow. Using drag-and-drop, point-and-click guided menu options, DataRobot users can simply and quickly create predictive models with automated machine learning. The process is simple:
o Ingest data sources
o Select a target variable to predict
o Automatically generate features, extract balanced samples, and build and iterate through hundreds of machine learning models
o Visually explore top performing models and key findings
o Easily deploy and operationalize models
Machine learning development steps that used to take weeks or months of effort can now be completed in hours. By embedding DataRobot automated machine learning model intelligence into your reporting or business processes, you can quickly close the loop between insight and action.
Since each data set and business objective can be unique with varied challenges, we have provided the following guidelines to help get you started. We also share essential tips and additional resources for further study.
WHERE TO START
The machine learning process begins with Business Understanding. This initial step focuses on defining the right problem to solve and recognizing the business objectives and requirements. After selecting a problem, you will collect and assess data. During the Data Understanding step, you will get familiar with available data sources, identify data quality problems, and perform exploratory analysis. Then, in the Data Preparation step, you will cleanse the data, shaping and transforming it into a flattened format for loading into the automated machine learning platform.
MACHINE LEARNING LIFECYCLE
Figure 1: Overview of the Machine Learning Process
For the purposes of this white paper, we will concentrate on collecting data and preparing it properly. We will not cover the entire machine learning lifecycle.
Before you begin the data collection and data preparation process, it is assumed that you already have selected, defined, and isolated a business problem to solve that is a viable candidate for machine learning. You should also have chosen at least one metric that you want to better understand. If you need more information on those steps in the machine learning process, please refer to our previous white paper, Moving from BI to Machine Learning.
PLAN FOR DATA COLLECTION
As you develop requirements for machine learning model data collection, contemplate the business process and review it from different perspectives. Consider what happens at each step, what data is captured, where it is stored, whether data history or changes are retained, and whether that data truly reflects the real world for resulting predictions. Often data is collected in line-of-business applications or data warehouses for other reasons. Existing data may be missing situational context such as location, environmental conditions, and other relevant variables for predicting an outcome. Document known issues and preferred data that could be added in the future.
TO IMPROVE INPUT DATA COLLECTION, DIAGRAM THE BUSINESS PROCESS FLOW. IDENTIFY THE STEPS, ENVIRONMENTAL CONDITIONS, SCENARIOS, PEOPLE, SYSTEMS, AVAILABLE DATA, AND MISSING DATA.
As you continue planning to gather data for your machine learning modeling project, you’ll need to confirm decision-level metric granularity. Granularity refers to a unit of analysis. A unit might be an opportunity, customer, or transaction. Granularity is determined by the business objectives and how your model will be used operationally. Ask stakeholders how decisions will be made from the predictive models. Are they based on a single customer, transaction, or event, or are they based on aggregate data over time?
To illustrate these concepts, we will be referring to a publicly available Bank Marketing Data Set¹ from UCI’s machine learning repository. The sample data set contains partially prepared data, collected during a bank’s telemarketing campaigns, for predicting client term deposits.
In the Bank Marketing Data Set, the desired outcome to predict is client term deposits. This is a binary yes or no outcome in the sample, but it could alternatively have been a total amount figure to maximize deposits. Don’t always limit yourself to collecting one outcome variable while assembling data. Think about other questions that might be asked and data that would make sense to include.
Potential influencer features for the example client term deposit outcome include client demographics such as age, job, marital status, and education. Past credit and loan repayment information is also important to know. Other features chosen include campaign contacts, previous marketing campaign outcomes, and several external social and economic environmental attributes such as employment rate.
NOTE: WHEN SELECTING FEATURES, YOU WILL WANT TO EXTRACT THE MAXIMUM INFORMATION FROM THE MINIMUM NUMBER OF INPUT VARIABLES.
¹ UCI machine learning repository data set https://archive.ics.uci.edu/ml/datasets/bank+marketing
AVOID OVERFITTING AND UNDERFITTING
Overfitting and underfitting are common mistakes for beginners who are preparing machine learning modeling data. Overfitting captures the noise in your data with an overly complex, unreliable predictive model. Essentially, the model memorizes unnecessary details. When new data comes in, the model fails.
If you don’t have enough features, your model might be oversimplified and suffer from underfitting. Underfitting occurs when a statistical algorithm cannot capture the underlying patterns in the data.
Why are overfitting and underfitting problematic? In both cases, the machine learning model will make too many prediction errors to be useful for decision-making.
Figure 2: Common Data Preparation Issues
For categorical data, overfitting can occur if a high number of categories are observed with a small number of observations per category. These types of variables hold less information for predictive value. For time series data, overly complex mathematical functions that describe the relationship between the input variable and the target variable can also lead to overfitting. In the most extreme form of overfitting, individual identifiers are inadvertently used as machine learning inputs. Individual identifiers can perfectly model existing data, but they would predict outcomes for other data reliably only by chance.
Thus, there is a delicate balance between being too specific with too many features and too vague with not enough features. Designing machine learning model features with just the right amount of predictive information gain and precision is a key skill in the art of data preparation.
COLLECT AND STRUCTURE DATA
After reviewing the business process and planning the machine learning input data requirements, you’ll delve into the data collection and shaping process.
GATHER DATA
Machine learning algorithms assume that each record is independent of, and unrelated to, other records. If relationships exist between records, you will want to create a new variable, called a feature, in a column within the row of data to capture that behavior. Unlike third-normal form transactional or dimensional patterns used in business intelligence, machine learning requires data to be input as a “flattened” table, view, or comma-separated (.csv) flat file of rows and columns. Your view will need to contain an outcome metric (the target variable), along with input predictor variables. This data representation for machine learning is called the feature matrix.
Figure 3: Example Prepared Data Set
If you have data stored in several tables in a data warehouse or relational database format, you will need to use record identifiers to join fields from multiple tables to create a single unified, flattened “view.” For many target variables, input data is captured at various business process steps in multiple data sources. A sales process might have data in a CRM, email marketing program, Excel spreadsheet, and/or accounting system. If that is the case, you will want to identify the fields in those systems that can relate, join, or blend the different data sources together.
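As a rough illustration of flattening, the sketch below left-joins two small keyed tables into one row per record identifier. It uses only the Python standard library so it runs anywhere. The table names, the `customer_id` key, and all values are hypothetical and are not taken from the Bank Marketing Data Set.

```python
# Hypothetical source tables, each keyed by a shared record identifier.
crm = {
    101: {"age": 34, "job": "technician"},
    102: {"age": 51, "job": "retired"},
    103: {"age": 27, "job": "student"},
}
campaign = {
    101: {"contacts": 2, "subscribed": "no"},
    102: {"contacts": 1, "subscribed": "yes"},
}


def flatten(left, right, key_name="customer_id"):
    """Left-join two keyed tables into one flat row per record identifier."""
    right_columns = sorted({col for row in right.values() for col in row})
    rows = []
    for key, left_row in sorted(left.items()):
        row = {key_name: key, **left_row}
        match = right.get(key, {})
        for col in right_columns:
            # None marks a value that the second source does not hold.
            row[col] = match.get(col)
        rows.append(row)
    return rows


feature_matrix = flatten(crm, campaign)
for row in feature_matrix:
    print(row)
```

Customer 103 has no campaign record, so its campaign columns come back as `None` instead of the row being dropped. Keeping unmatched records visible makes gaps between source systems easier to spot during data preparation.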
Prepared data should be collected at a level of analytical granularity upon which you can make decisions. Choose a granularity that is actionable, understandable, and useful in the event you incorporate the results into your existing business process or application. For example, if you want to make daily sales forecasts, you need to input data at a day level rather than week, month, or year.
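The daily forecasting example can be sketched as a simple roll-up from transaction rows to one row per calendar day. The timestamps and amounts below are made up for illustration, and the function name `daily_totals` is hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical transaction-level records: (timestamp, sale amount).
transactions = [
    ("2017-11-01 09:15", 120.0),
    ("2017-11-01 16:40", 80.0),
    ("2017-11-02 11:05", 200.0),
]


def daily_totals(records):
    """Aggregate transaction rows to one total per calendar day."""
    totals = defaultdict(float)
    for stamp, amount in records:
        day = datetime.strptime(stamp, "%Y-%m-%d %H:%M").date()
        totals[day.isoformat()] += amount
    return dict(sorted(totals.items()))


daily = daily_totals(transactions)
print(daily)
```

Aggregating up from finer-grained data is usually possible, while recovering daily detail from weekly or monthly totals is not, which is one reason to collect data at or below the decision granularity.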
If you are trying to capture changes in data over a certain time period, check if your data source is only keeping the current state values of a record. Most data warehouses are designed to save different values of a record over time and do not overwrite historical data values with current data values. Transactional application data sources such as Salesforce typically contain only the current state value for a record. If you want to get a prior value, you need to have a snapshot of the historical data stored or keep the prior value data in custom fields on the current record.
While structuring input data, ensure that it is clean and consistent. The order and meaning of input variables should remain the same from record to record. Inconsistent data formats, “dirty data,” and outliers can undermine the quality of analytical findings.
HOW MUCH DATA TO COLLECT
The number of records you need is not always easy to determine and depends on patterns in your data. If you have more noise in your data, you will need more data to overcome it. Noise in this context means unobserved relationships in the data that are not captured by the input predictor variables.
Figure 4: Sample Size Estimation
To determine minimum data set sizes, consider the dimensionality of your data and pattern complexity.²
o For small models with a few variables, 10 to 20 records per variable value may be sufficient.
o For more complex models, ~100 records per variable value may be needed to capture patterns.
o For complex models with ~100 input variables, you will need a minimum of 10,000 records in the data for each subset (training, testing, and validation).
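The rules of thumb above reduce to simple multiplication. The sketch below assumes that "records per variable value" can be read as records per input variable, and that the same minimum applies to each of three subsets; both readings are assumptions made for illustration, and the function name is hypothetical.

```python
def minimum_records(n_variables, records_per_variable, n_subsets=3):
    """Return (records per subset, total records) for a rule-of-thumb estimate."""
    per_subset = n_variables * records_per_variable
    return per_subset, per_subset * n_subsets


# Small model: 5 input variables at 20 records each.
small = minimum_records(5, 20)
# Complex model: about 100 input variables at about 100 records each.
complex_model = minimum_records(100, 100)
print(small, complex_model)
```

The complex case gives 10,000 records per subset, which matches the figure quoted in the last bullet. Treat the result as a floor to start from, since noisier data needs more records.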
TRAINING, TESTING, AND VALIDATION DATA SETS
The most common strategy is to split data into training, testing, and validation data sets.
² Applied Predictive Analytics: Principles and Techniques for the Professional Data Analyst by Dean Abbott
“Shaping data involves subject matter expert thought to creatively select, create, and transform variables for maximum influence.”
These data sets are usually created through a random selection of records without replacement, meaning each record belongs to one and only one subset. All three data sets should reflect your real-world scenario.
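A minimal sketch of random selection without replacement follows: shuffle once, then slice, so every record lands in exactly one subset. The 60/20/20 proportions and the fixed seed are illustrative assumptions, not recommendations from this paper.

```python
import random


def split_records(records, train=0.6, test=0.2, seed=42):
    """Shuffle once, then slice so each record lands in exactly one subset."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_test = int(len(shuffled) * test)
    training = shuffled[:n_train]
    testing = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return training, testing, validation


records = list(range(100))  # stand-in for 100 prepared rows
training, testing, validation = split_records(records)
print(len(training), len(testing), len(validation))
```

Because the subsets are slices of a single shuffled list, they cannot overlap, and together they cover every record.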
Cross Validation
Keep in mind that DataRobot includes industry standard k-fold cross validation features that divide the data into k subsets with the holdout repeated k times. Each time, one of the k subsets is used as the test set and the other k-1 subsets are combined to form a training set. DataRobot’s k-fold feature enables you to independently choose how much data you want to use in testing.
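The general k-fold idea can be sketched in a few lines. This is a generic illustration of the technique, not DataRobot's implementation, and it skips the shuffling step that is normally applied before the folds are cut.

```python
def k_fold_indices(n_records, k):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    indices = list(range(n_records))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_records // k + (1 if i < n_records % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("test:", test, "train:", train)
```

Each record appears in a test set exactly once and in the training set for the other k-1 folds.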
Time Series Considerations
Data that changes over time should be reflected in your input data set. When time sequences (Contact Made > Quote Provided > Deal Closed) are important in predictions, proportionally collecting data from those different time periods is important as well. The key principle is to provide data that reflects what actually happens at the right level of outcome metric granularity.
When collecting data, think about the balance of values in your raw data. In our example, how many prospects do you have at different ages and stages of life? Do they have loans? What are the balances on outstanding loans? How many defaulted on a loan? How recent was the default? What are the prospect’s income history, credit ratings, debt-to-income ratio, and so on?
Think Proportionally
When extracting a subset of data, be sure to include approximately the same proportion of variable values in your DataRobot input data set that you see in the real-world data. If you provide more records of one variable value, you can accidentally bias the machine learning model’s predictions, thereby diminishing performance. If you have data sets with millions or billions of rows, it is much less likely that you will encounter accidental bias in data preparation.
Figure 5: Sampling Data
SAMPLING
The choice of the optimal sampling method³ for a given problem depends especially on the character of the data set and the desired proportion of the subsets. Each method has its advantages but also its limitations.
For simple, nearly uniformly distributed data sets, simple random sampling may be sufficient. For naturally well-ordered time series data, highly efficient deterministic approaches such as convenience and systematic sampling can achieve reliable results. When dealing with complex, high-dimensional data, more sophisticated techniques such as stratified sampling can reduce the bias and variance of model error.
Unbalanced Two-Class Problems
Data sets seldom come with evenly distributed samples. Unbalanced data is a common issue to remediate. Fraud or failure rate data are examples of unbalanced two-class problems. Analyzing unbalanced data without correction can produce results that look accurate overall yet have exceptionally high error rates on the minority class.
To build predictive models on unbalanced data, you need to apply sampling techniques that increase the minority class proportion, either by downsampling the majority class or by upsampling the minority class, to create a balanced data set. After training a machine learning model with a balanced training set, you will validate performance of the model with the real-world, unbalanced, unseen test set.
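Upsampling can be sketched as resampling the smaller class with replacement until the class counts match. The fraud rows below are fabricated for illustration, the function name is hypothetical, and the balancing is applied to the training rows only so that the test set keeps its real-world proportions.

```python
import random


def upsample_minority(rows, label_key, seed=0):
    """Resample smaller classes with replacement until all classes match the largest."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw the shortfall (zero for the largest class) from the existing rows.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Hypothetical training rows: 95 legitimate transactions and 5 fraudulent ones.
training_rows = [{"amount": i, "fraud": "no"} for i in range(95)]
training_rows += [{"amount": 1000 + i, "fraud": "yes"} for i in range(5)]

balanced_rows = upsample_minority(training_rows, "fraud")
counts = {label: sum(r["fraud"] == label for r in balanced_rows) for label in ("no", "yes")}
print(counts)
```

Downsampling would instead discard majority-class rows until the counts match, which loses data but avoids repeating the same few minority rows many times.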
BEWARE OF BIAS
While you accumulate data, consider potential biases.⁴ Human nature is consciously and unconsciously biased. Cognitive biases are tendencies to think in certain ways that can lead to irrational judgment. Outcome, omission, and many other bias types can easily creep into the data collection process.
If unknown bias exists, it is basically an unjustified assumption that your input data reflects reality. Any model built on such assumptions reflects only the distorted reality and will perform poorly. To reduce potential bias, test hypotheses, poke holes in your own ideas, welcome challenges, and conduct peer reviews of your data collection and sampling thought processes. Machine learning projects should be group projects and not done in isolation.
EXPLORE AND PROFILE
Now you will assess the condition of your source data. DataRobot automates several aspects of initial data examination. DataRobot also automates sampling to avoid conventional sampling and overfitting issues. During this step, you’ll visually look for trends, extreme values, outliers, exceptions, skewed data, incorrect values, and inconsistent and missing data.
³ Common Types of Data Sampling Methods https://en.wikipedia.org/wiki/Sampling
⁴ The Cognitive Bias Codex https://upload.wikimedia.org/wikipedia/commons/a/a4/The_Cognitive_Bias_Codex_-_180%2B_biases%2C_designed_by_John_Manoogian_III_%28jm3%29.png
UNDERSTAND YOUR DATA
As you begin to explore and understand your data, DataRobot provides data profiles for every feature that include how many values are unique or missing, as well as the statistical mean, standard deviation, median, minimum, and maximum value. You can also review the distributions of each feature using a histogram with optional bin settings and apply transformations to normalize your data.
Informative Features immediately ranks variables that provide the most information gain for building optimal machine learning models. Knowing which areas of your data most influence the outcome is invaluable on its own. This information can guide the business to focus limited time and resources on the activities that matter most.
Another one of DataRobot’s data preparation strengths is the array of different visualization tools that identify influential features, rank them, and uncover errors. The Feature Ranking report measures how much each feature by itself contributes to the accuracy of a machine learning model.
Figure 6: Feature Impact
DETECT LEAKAGE
Leakage is the accidental inclusion of outcome information that would not be legitimately available for predictions. If you have inadvertent leakage, you’ll likely notice it in the feature ranking report when a feature has an exceptionally high impact.
“By looking at these findings, the business can immediately appreciate the variables that truly influence outcomes to get immediate value.”
Figure 7: Leakage Example
In our Bank Marketing Data Set example, duration was identified in the DataRobot Feature Impact report as a leaked feature with 100% impact. The next best performing feature carries a little more than 10% impact. Call duration is known only after a call ends, by which point the outcome is also known. If you see an exceptionally high impact, verify process flow timing for that feature.
TO AVOID LEAKAGE, YOU NEED TO CONSIDER THE TIMING AND ORDER OF EVENTS TO ENSURE THAT OUTCOME DATA IS NOT USED AS INPUT DATA.
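One simple way to act on this observation is to flag a top-ranked feature whose impact dwarfs the runner-up, then check its timing by hand. In the sketch below the 5x ratio threshold, the function name, and every impact score other than the 100% for duration are hypothetical.

```python
def dominant_feature(impact, ratio=5.0):
    """Return the top-ranked feature if its impact dwarfs the runner-up, else None."""
    ranked = sorted(impact.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) < 2:
        return None
    (top_name, top_score), (_, runner_up_score) = ranked[0], ranked[1]
    if runner_up_score <= 0 or top_score / runner_up_score >= ratio:
        return top_name
    return None


# Normalized impact scores; only the 1.00 for duration comes from the example above.
impact = {"duration": 1.00, "runner_up_feature": 0.11, "age": 0.04}
suspect = dominant_feature(impact)
print(suspect)
```

A flag from a check like this is only a prompt for review. A legitimately strong predictor can also dominate, so the deciding question remains whether the value is available before the outcome occurs.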
FIND AND REDUCE ERRORS
DataRobot Model X-Ray and Reason Codes capabilities can also provide deep insights for enhancing data preparation. Model X-Ray enables interactive, visual exploration of machine learning model performance. You can easily see where a model makes mistakes by selecting input features. It shines a light on issues that might not get detected using other tools. Model X-Ray allows machine learning model designers to concentrate on where the most model performance improvements can be made in the data preparation process.
Figure 8: Model X-Ray
Reason Codes unveil what values within a feature drive the model’s results. This tool is crucial for machine learning model development processes that are subject to regulatory compliance or legal scrutiny. With Reason Codes, you can discover which combinations of feature values trigger a specific machine learning outcome. This information can also be useful throughout the iterative data preparation process to incrementally improve results.
Figure 9: Reason Codes
IMPROVE DATA QUALITY
We recommend that you address data quality issues as early as possible. Here are a few tips for handling common data issues in the data preparation process.
Correcting Incorrect Values
Machine learning models assume the input data is correct. If you are seeing errors from source applications that should get fixed, a best practice is to resolve the issue in the source system rather than in a data preparation process.
Treat incorrect values as missing if there are only a few and you can’t easily determine correct values. If there are a lot of inaccurate values, try to determine what happened in order to repair them. If you do make changes to data, document your reasoning. Also capture the initial context and the changed values, with a flag to identify changes. The pattern in your data might be hidden in those incorrect values.
Skewed Variables
For continuous variables, review the distributions, central tendency, and variable spread in DataRobot. These are measured using various statistical metrics and visualization methods such as histograms. Many algorithms perform best when continuous variables are approximately normally distributed. If they are not, reduce skewness with transformations or by experimenting with bin sizes for optimal prediction.
When a skewed distribution needs to be corrected, the variable is transformed by a function that has a disproportionate effect on the tails of the distribution. Log transforms such as log(x), log_n(x), and log_10(x); the multiplicative inverse (1/x); the square root transform sqrt(x); and powers (x^n) are the most frequently used corrective functions.
Figure 10 shows several issues and formulas to minimize skew. The before and after charts illustrate how different skewed variable transformations can be used to normalize feature distributions.
Figure 10: Transformations for Skewness
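To see the effect of a log transform on a right-skewed variable, the sketch below computes population skewness before and after applying log base 10. The balance values are invented for illustration, and skewness is computed here as the mean cubed z-score, one of several common definitions.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


# Hypothetical account balances with a long right tail.
balances = [10, 20, 30, 50, 80, 150, 400, 1200, 5000, 20000]
log_balances = [math.log10(v) for v in balances]

print(round(skewness(balances), 2), round(skewness(log_balances), 2))
```

The transformed values are far closer to symmetric. Note that a log transform requires strictly positive inputs, so variables containing zeros or negative values need a shift or a different transform.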
High-Cardinality
High-cardinality fields are categorical attributes that contain a very large number of distinct values. Examples include names, ZIP codes, and account numbers. Although these variables can be highly informative, high-cardinality attributes are rarely used in predictive modeling. The main reason is that including them will vastly increase the dimensionality of the data set, making it difficult or even impossible for most algorithms to build accurate predictive models.
Redundant, Highly Correlated Variables
Duplicate, redundant, or other highly correlated variables that carry the same information should be minimized. DataRobot algorithms will perform better without collinear variables. Collinearity occurs when two or more predictor variables are highly correlated, meaning that one can be linearly predicted from the others with a substantial degree of accuracy.
To identify high correlation between two continuous variables, review scatter plots. The pattern of a scatter plot indicates the relationship between variables. The relationship can be linear or non-linear. To find the strength of the relationship, compute correlation. Correlation varies between -1 and +1.
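Computing correlation can be sketched with the Pearson coefficient, which measures the strength of a linear relationship. The loan figures below are hypothetical, and a non-linear relationship can still exist when this coefficient is near zero, which is why the scatter plot review remains useful.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical columns: payment tracks loan amount closely, age does not.
loan_amount = [1000, 2000, 3000, 4000, 5000]
monthly_payment = [52, 101, 149, 205, 248]
age = [25, 61, 33, 47, 58]

print(round(pearson(loan_amount, monthly_payment), 3))
print(round(pearson(loan_amount, age), 3))
```

A coefficient very close to +1 or -1, as with the loan amount and payment columns here, suggests the two variables carry nearly the same information and are candidates for consolidation.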
If you have two variables that are almost identical and you do want to retain the difference between them, consider creating a ratio variable as a feature. Another approach is to use Principal Component Analysis (PCA) output as input variables.
To avoid the collinearity issue, do not include multiple variables that are highly correlated or data that is from the same reporting hierarchy. Often those fields provide obvious insights. For example, customers who live in the city of Tampa also happen to live in the state of Florida.
Figure 11: Scatter plots for correlation detection
Missing Values
The most common repair for missing values is imputing a likely or expected value using a mean or a computed value from a distribution. If you use the mean, you may be reducing your standard deviation, so the distribution imputation approach is more reliable.
Another approach is to remove any record with missing values. Don’t get too ambitious with filtering out missing values. If you delete too many records, you will undermine the real-world aspects in your analysis. As you address missing values, do not lose the initial missing value context. A common data preparation approach is to add a column to the row that flags whether the data was missing, coded as 1 or 0.
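Imputation with a missing-value flag can be sketched as follows. Mean imputation is used here only because it is the shortest to show, even though it narrows the spread of the variable as noted above. The column name, the `_missing` suffix, and the values are hypothetical.

```python
import statistics


def impute_with_flag(rows, column):
    """Fill missing values with the column mean and flag which rows were filled."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill_value = statistics.mean(observed)
    result = []
    for row in rows:
        was_missing = row[column] is None
        new_row = dict(row)  # copy so the original rows are left untouched
        new_row[column] = fill_value if was_missing else row[column]
        new_row[column + "_missing"] = 1 if was_missing else 0
        result.append(new_row)
    return result


rows = [{"income": 40000}, {"income": None}, {"income": 60000}]
prepared = impute_with_flag(rows, "income")
for row in prepared:
    print(row)
```

The flag column keeps the fact that a value was absent available to the model, since missingness can itself be predictive.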
Extreme Values and Outliers
A common rule of thumb treats values more than three standard deviations from the mean as outliers. Many machine learning algorithms are sensitive to outliers since those values affect averages (means) and standard deviations in statistical significance calculations. If you come across unusual values or outliers, confirm that these data points are relevant and real. Often, odd values are errors.
If the extreme data points are accurate, predictable, and something you can count on happening again, do not remove them unless those points are unimportant. You can reduce outlier influence by using log transformations or converting the numeric variable to a categorical value with binning.
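The three-standard-deviation rule of thumb can be sketched as a simple flagging pass. The call durations below are invented, and the function only marks candidates for review; it does not remove anything.

```python
import statistics


def flag_outliers(values, n_std=3.0):
    """Mark values lying more than n_std standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [abs(v - mean) > n_std * std for v in values]


# Hypothetical call durations in seconds, with one extreme value at the end.
call_durations = [180, 200, 210, 190, 205, 195, 185, 215, 198, 202,
                  188, 207, 193, 199, 204, 191, 209, 196, 201, 3600]

flags = flag_outliers(call_durations)
suspects = [v for v, is_outlier in zip(call_durations, flags) if is_outlier]
print(suspects)
```

Because the extreme value inflates both the mean and the standard deviation it is compared against, this rule works poorly on very small samples, where no single point can reach three standard deviations.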
ENGINEER FEATURES
Feature creation is the art of extracting more information from existing data to improve the predictive power of machine learning algorithms. You are making the data you already have more useful. Strong features that precisely describe the process being predicted can improve pattern detection and enable more actionable insights to be found.
Creating features from several combined variables and ratios usually provides higher model accuracy than any single-variable transformation because of the information gain associated with data interactions. If you ever saw the movie or read the book “Moneyball: The Art of Winning an Unfair Game” by Michael Lewis, you’ll know how baseball analysis was revolutionized with new performance metrics like On-Base Percentage (OBP) and Slugging Percentage (SLG). With feature engineering, you will be using a fundamentally similar approach.
HUMAN INGENUITY AND CREATIVITY REQUIRED
Feature engineering is challenging because it depends on leveraging human intuition to interpret implicit signals in data sets that machine learning algorithms use. Consequently, feature engineering is often the determining factor in whether a machine learning modeling project is successful or not. This step in the process is experimental and usually the bottleneck in automated machine learning processes.
Although collected raw data fields can be used as-is in DataRobot without transformations, supplemental data, or calculations to train machine learning models, you’ll almost always want to add more perspective to your data set by designing features. Engineered features provide better context to differentiate patterns in the data.
Aggregations
Some commonly computed aggregate features include the mean (average), most recent value, minimum, maximum, sum, the product of two variables, and ratios made by dividing one variable by another. Note that DataRobot automatically generates date and time aggregation features.
Ratios
Ratios can be excellent feature variables. Ratios can communicate more complex concepts such as price-to-earnings ratio, where neither price nor earnings alone can deliver this insight.
Transformations
Transformation refers to the replacement of a variable by a function. For instance, replacing a variable x by its square root, cube root, or logarithm is a transformation. You transform variables when you want to change the scale of a variable or standardize the values of a variable for better understanding. Variable transformation can also be done using categories or bins to create new variables. An example transformation might be binning continuous Lead Age into Lead Age Groups or Loan Amount into Loan Amount Categories.
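Binning a continuous variable into groups can be sketched with the standard library's `bisect` module. The bin edges and labels below are arbitrary choices made for illustration; in practice they would come from domain knowledge or from experimenting with bin sizes.

```python
import bisect

# Hypothetical bin edges: each edge is the lowest value of the next group.
AGE_EDGES = [30, 45, 60]
AGE_LABELS = ["under 30", "30-44", "45-59", "60+"]


def bin_value(value, edges, labels):
    """Map a continuous value to the label of the bin it falls into."""
    return labels[bisect.bisect_right(edges, value)]


lead_ages = [22, 30, 44, 45, 59, 73]
lead_age_groups = [bin_value(age, AGE_EDGES, AGE_LABELS) for age in lead_ages]
print(lead_age_groups)
```

Using `bisect_right` places a value equal to an edge in the higher bin, so an age of exactly 30 falls in "30-44". Whichever boundary rule is chosen should be applied consistently to training data and to new data at prediction time.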
FEATURE ENGINEERING TECHNIQUES
Here are a few popular feature engineering ideas shared on Data Science Central⁵ that can be used to extract more information from your input data:
1. Single variable transformations
2. Ratio or frequency of categorical variables
3. Combine important variables
4. Compute variable interactions
5. Change data types
6. Compute relative differences
7. Cartesian transformations
8. Bin transformations
9. Window time series data
10. Reframe continuous variables
11. One-hot encoding for sequence problems
12. Sparse value coding
The possibilities for feature engineering are limited only by your own human ingenuity and creativity. Feature engineering truly is the human art of data preparation for automated machine learning.
⁵ Data Science Central https://www.datasciencecentral.com/profiles/blogs/feature-engineering-data-scientist-s-secret-sauce-1
CONCLUSION
In this white paper, we briefly introduced basic concepts of data preparation for machine learning. We discussed how to plan your project and collect, organize, structure, and shape data in a machine learning-friendly format. We also shared vital tips for feature engineering to help you master the art of data preparation for automated machine learning models.
As you progress in your automated machine learning model journey, each area of the data preparation process should be further researched. For additional reading, the following books are highly recommended:
o Data Preparation for Data Mining by Dorian Pyle
o Data Preprocessing in Data Mining by Salvador García and Julián Luengo
o Applied Predictive Analytics: Principles and Techniques for the Professional Data Analyst by Dean Abbott
o Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists by Alice Zheng and Amanda Casari
Note that DataRobot also provides classes that cover these and other related topics.
RECOMMENDED NEXT STEPS
For additional information on automated machine learning, please contact an expert at DataRobot. It’s easy to get started.
o DataRobot www.datarobot.com
o Data Preparation Essentials for Automated Machine Learning webinar recording www.datarobot.com/webinar/data-prep/
o DataRobot AI Acceleration Packages www.datarobot.com/product/enterprise-ai/
o DataRobot Courses www.datarobot.com/education/all-courses/
About DataRobot
DataRobot offers an enterprise machine learning platform that empowers users of all skill levels to make better predictions faster. Incorporating a library of hundreds of the most powerful open source machine learning algorithms, the DataRobot platform automates, trains and evaluates predictive models in parallel, delivering more accurate predictions at scale. DataRobot provides the fastest path to data science success for organizations of all sizes. For more information, visit www.datarobot.com.
About the Author
Jen Underwood, founder of Impact Analytix, LLC, is an analytics industry expert with a unique blend of product management, design and over 20 years of “hands-on” advanced analytics development. In addition to keeping a pulse on industry trends, she enjoys digging into oceans of data. Jen is honored to be an IBM Analytics Insider, SAS contributor, and former Tableau Zen Master. She also writes for InformationWeek, O’Reilly Media, and other tech industry publications.
Jen has a Bachelor of Business Administration – Marketing, Cum Laude from the University of Wisconsin, Milwaukee and a post-graduate certificate in Computer Science – Data Mining from the University of California, San Diego. For more information, visit www.jenunderwood.com.