data preparation for automated machine...

DATAPREPARATIONFORAUTOMATEDMACHINELEARNING

©2017ImpactAnalytix,LLC-Allrightsreserved. 1

DATAPREPARATIONFORAUTOMATEDMACHINELEARNINGBYJENUNDERWOOD

WHITEPAPER



TABLEOFCONTENTSIntroduction........................................................................................................................3

ArtandScienceofDataPreparation......................................................................3

AutomationforFasterDataPreparation................................................................3

DataRobotforMachineLearning...........................................................................4

WheretoStart.....................................................................................................................5

MachineLearningLifecycle....................................................................................5

PlanforDataCollection..........................................................................................5

AvoidOverfittingandUnderfitting.........................................................................7

CollectandStructureData..................................................................................................8

GatherData............................................................................................................8

BewareofBias......................................................................................................11

ExploreandProfile............................................................................................................11

UnderstandYourData..........................................................................................12

DetectLeakage.....................................................................................................12

FindandReduceErrors........................................................................................13

ImproveDataQuality........................................................................................................15

EngineerFeatures.............................................................................................................18

Conclusion.........................................................................................................................20

RecommendedNextSteps...................................................................................20

JANUARY2018–WHITEPAPERCOMMISSIONEDBYDATAROBOT



INTRODUCTIONThe beauty of the human mind in combination with automated machine learningempowers amazing predictive insights that might never be found using manualtechniques.Sincethequalityofpredictiveoutputreliesonthequalityof input,properdatapreparationisacriticalsuccessfactorforachievingoptimalmachinelearningresults.

THEARTANDSCIENCEOFDATAPREPARATIONTheiterativeprocessofpreparingdataforautomatedmachinelearningisbothanartandascience.Theartofdatapreparationrequiresknowledgeofthebusiness–or“domainexpertise”–toselecttherightproblemstosolve,identifycrucialinputdata,andcarefullytransformandengineerinformativefeaturestomaximizepredictivemodelaccuracy.Thescienceofdatapreparationinvolvescleansingandnormalizingcollecteddata,selectinginfluencer features, and generating training, testing, and validation data sets forautomatedmachinelearning.

Although automated machine learning solutions may provide safeguards to preventcommonmistakes,you’llstillwanttolearnhowtocorrectlyprepare,shape,andformat

your data to create greatmodels. Providing the wrong data, irrelevant data, orimproperly prepared data undermines model performance for generalization.Essentiallyyouwillwanttodesignaninputdatasetwithfeatureindependence,explainable variance, and maximum information gain to find signals in thenoise.

AUTOMATIONFORFASTERDATAPREPARATIONTo identify relationships in data — “the signals”— and isolate distracting,irrelevantdata—“thenoise”—youcanexpeditethehighlyiterativepredictive

datapreparationprocesswithautomatedmachinelearning.Inminutes,youcanfindthemostrelevantfeaturesandpinpointspecificareas inyourdatasetwhere

prediction errors occur to help you focus efforts on the right data and reduceexperimentationtime.

Afterrunningbasicinputdataandevaluatingtheresults,youwillenhanceinputdata,addfeatures,buildanothermodel,andreviewperformanceonceagain.You’llcontinuethisprocessuntilyourmodelmeetsperformanceobjectives.

Ideally subject matter experts that understand the business process and data sourcenuances will assist in the data preparation process. Depending on your project, datapreparationmightbeaone-timeactivityoraperiodicone.Asnewinsightsarerevealed,itiscommontoexperimentfurther.

“Youcanexpeditethehighlyiterativepredictivedatapreparationprocesswithautomatedmachinelearning.”



DATAROBOTFORMACHINELEARNINGDataRobot is the world’s most advanced automated machine learningplatform.DataRobotautomatesthemachinelearningprocessfromdata ingestion to deployment. It delivers immediate value andunmatchedease-of-use,andnocomplicatedmathorscriptingisrequired.

DataRobot includes an array of data preparation features,automatingfeatureengineeringtofindkeyinsightsandhiddenpatterns. This invaluable technology expedites analyticalinvestigation across millions of variable combinations thatwould be far too time-consuming for manual humanexploration.

For optimal machine learning model performance, domainknowledge and best practices used by the world’s leading datascientistshavebeenuniquelybakedintoDataRobotblueprints.Usersofallskill levels can safely apply machine learning with its built-in optimizations andsafeguards.

DataRobot supports popular advancedmachine learning techniques and open sourcetoolssuchasApacheSpark,H2O,Scala,Python,R,andTensorFlow.Usingdrag-and-drop,point-and-click guided menu options, DataRobot users can simply and quickly createpredictivemodelswithautomatedmachinelearning.Theprocessissimple:

o Ingestdatasources

o Selectatargetvariabletopredict

o Automatically generate features, extract balanced samples, build and iteratethrough100sofmachinelearningmodels

o Visuallyexploretopperformingmodelsandkeyfindings

o Easilydeployandoperationalizemodels

Machinelearningdevelopmentstepsthatusedtotakeweeksormonthsofeffortcannowbe completed in hours. By embeddingDataRobot automatedmachine learningmodelintelligence into your reporting or business processes, you can quickly close the loopbetweeninsightandaction.

Sinceeachdatasetandbusinessobjectivecanbeuniquewithvariedchallenges,wehaveprovidedthefollowingguidelinestohelpgetyoustarted.Wealsoshareessentialtipsandadditionalresourcesforfurtherstudy.

“Domainknowledgeandbestpracticesusedbytheworld’sleadingdatascientistshavebeenuniquelybakedintoDataRobotblueprints.”



WHERETOSTARTThe machine learning process begins with Business Understanding. This initial stepfocusesondefiningtherightproblemtosolveandrecognizingthebusinessobjectivesandrequirements.Afterselectingaproblem,youwillcollectandassessdata.DuringtheDataUnderstandingstep,youwillgetfamiliarwithavailabledatasources,identifydataqualityproblems,andperformexploratoryanalysis.Then,intheDataPreparationstep,youwillcleansethedata,shapingandtransformingitintoaflattenedformatforloadingintotheautomatedmachinelearningplatform.

MACHINELEARNINGLIFECYCLE

Figure1:OverviewoftheMachineLearningProcess

Forthepurposesofthiswhitepaper,wewillconcentrateoncollectingdataandpreparingitproperly.Wewillnotcovertheentiremachinelearninglifecycle.

Beforeyoubeginthedatacollectionanddatapreparationprocess,itisassumedthatyoualreadyhaveselected,defined,andisolatedabusinessproblemtosolvethatisaviablecandidateformachinelearning.Youshouldalsohavechosenatleastonemetricthatyouwanttobetterunderstand.Ifyouneedmoreinformationonthosestepsinthemachinelearningprocess,pleaserefertoourpreviouswhitepaper,MovingfromBItoMachineLearning.

PLANFORDATACOLLECTIONAsyoudeveloprequirementsformachinelearningmodeldatacollection,contemplatethebusinessprocessandreviewitfromdifferentperspectives.Considerwhathappensateach step, what data is captured, where it is stored, if data history or changes areretained,andifthatdatatrulyreflectstherealworldforresultingpredictions.Oftendataiscollectedinline-of-businessapplicationsordatawarehousesforotherreasons.Existing



datamaybemissingsituationalcontextsuchaslocation,environmentalconditions,andother relevant variables for predicting an outcome. Document known issues andpreferreddatathatcouldbeaddedinthefuture.

TOIMPROVEINPUTDATACOLLECTION,DIAGRAMTHEBUSINESSPROCESSFLOW.IDENTIFYTHESTEPS,ENVIRONMENTALCONDITIONS,SCENARIOS,PEOPLE,SYSTEMS,AVAILABLEDATA,ANDMISSINGDATA.

As you continue planning to gather data for yourmachine learningmodeling project,you’ll need to confirmdecision-levelmetric granularity.Granularity refers to aunitofanalysis. A unit might be an opportunity, customer, or transaction. Granularity is

determined by the business objectives and how your model will be usedoperationally.Askstakeholdershowdecisionswillbemadefromthepredictivemodels.Aretheybasedonasinglecustomer,transaction,orevent,oraretheybasedonaggregatedataovertime?

To illustratetheseconcepts,wewillbereferringtoapubliclyavailableBankMarketingDataSet1fromUCI’smachinelearningrepository.Thesampledatasetcontainspartiallyprepareddata topredictclient termdepositscollectedduringthebank’stelemarketingcampaigns.

In the BankMarketing Data Set, the desired outcome to predict is client termdeposits. This is a binary yes or no outcome in the sample, but it could have

alternativelybeenatotalamountfiguretomaximizedeposits.Don’talwayslimityourselftocollectingoneoutcomevariablewhileassemblingdata.Thinkaboutotherquestionsthatmightbeaskedanddatathatwouldmakesensetoinclude.

Potentialinfluencerfeaturesfortheexampleclienttermdepositoutcomeincludeclientdemographics such as age, job, marital status, and education. Past credit and loanrepayment information is also important to know. Other features chosen includedcampaigncontacts,previousmarketingcampaignoutcomes,andseveralexternalsocialandeconomicenvironmentalattributessuchasemploymentrate.

NOTE:WHENSELECTINGFEATURES,YOUWILLWANTTOEXTRACTTHEMAXIMUMINFORMATIONFROMTHEMINIMUMNUMBEROFINPUTVARIABLES.

1UCImachinelearningrepositorydatasethttps://archive.ics.uci.edu/ml/datasets/bank+marketing

“Toillustratetheseconcepts,wewillbereferringtoapubliclyavailableBankMarketingDataSet1fromUCI’smachinelearningrepository.”



AVOIDOVERFITTINGANDUNDERFITTINGOverfitting and underfitting are common mistakes for beginners who are preparingmachine learningmodeling data. Overfitting captures the noise in your data with anoverly complex, unreliable predictive model. Essentially what happens is the modelmemorizesunnecessarydetails.Whennewdatacomesin,themodelfails.

Ifyoudon’thaveenoughfeatures,yourmodelmightbeoversimplifiedandsufferfromunderfittingissues.Underfittingoccurswhenastatisticalalgorithmcannotcapturetheunderlyingpatternsinthedata.

Whyareoverfittingandunderfittingproblematic?Themachinelearningmodelwillhavetoomanypredictionerrorstobeusefulfordecision-making.

Figure2:CommonDataPreparationIssues

Forcategoricaldata,overfittingcanoccur ifahighnumberofcategoriesareobservedwith a small number of observations per category. These types of variables hold lessinformation for predictive value. For time series data, overly complex mathematicalfunctions that describe the relationship between the input variable and the targetvariablecanalsoleadtooverfitting.Inthemostextremeformofoverfitting,individualidentifiersare inadvertentlyusedasmachine learning inputs. Individual identifierscanperfectly model existing data, but would only by chance reliably model and predictoutcomesforotherdata.

Thus,thereisadelicatebalancebetweenbeingtoospecificwithtoomanyfeaturesandtoovaguewithnotenoughfeatures.Designingmachinelearningmodelfeatureswithjusttherightamountofpredictiveinformationgainandprecisionisakeyskill intheartofdatapreparation.



COLLECTANDSTRUCTUREDATAAfter reviewing the business process and planning the machine learning input datarequirements,you’lldelveintothedatacollectionandshapingprocess.

GATHERDATAMachinelearningalgorithmsassumethateachrecordisindependentandarenotrelatedtootherrecords. If relationshipsexistbetweenrecords,youwillwanttocreateanewvariablecalleda feature ina columnwithin the rowofdata tocapture thatbehavior.Unlike third-normal form transactional or dimensional patterns used in businessintelligence,machinelearningrequiresdatatobeinputasa“flattened”table,view,orcommaseparated(.csv)flatfileofrowsandcolumns.Yourviewwillneedtocontainanoutcome metric and target variable, along with input predictor variables. This datarepresentationformachinelearningiscalledthefeaturematrix.

Figure3:ExamplePreparedDataSet

If you have data stored in several tables in a data warehouse or relational databaseformat,youwillneedtouserecordidentifierstojoinfieldsfrommultipletablestocreatea singleunified, flattened “view.” Formany target variables, inputdata is capturedatvariousbusinessprocessstepsinmultipledatasources.AsalesprocessmighthavedatainaCRM,emailmarketingprogram,Excelspreadsheet,and/oraccountingsystem.Ifthatisthecase,youwillwanttoidentifythefieldsinthosesystemsthatcanrelate,join,orblendthedifferentdatasourcestogether.

Prepareddatashouldbecollectedatalevelofanalyticalgranularityuponwhichyoucanmakedecisions.Chooseagranularitythatisactionable,understandable,andusefulinthe



eventyouincorporatetheresultsintoyourexistingbusinessprocessorapplication.Forexample,ifyouwanttomakedailysalesforecasts,youneedtoinputdataatadaylevel

ratherthanweek,month,oryear.

Ifyouaretryingtocapturechangesindataoveracertaintimeperiod,checkifyourdatasourceisonlykeepingthecurrentstatevaluesofarecord.Mostdatawarehousesaredesignedtosavedifferentvaluesofarecordovertimeanddonot overwrite historical data values with current data values. TransactionalapplicationdatasourcessuchasSalesforceonlycontainthecurrentstatevalueforarecord.Ifyouwanttogetapriorvalue,youneedtohaveasnapshotofthehistoricaldatastoredorkeepthepriorvaluedataincustomfieldsonthe

currentrecord.

Whilestructuringinputdata,ensurethat it iscleanandconsistent.Theorderandmeaningofinputvariablesshouldremainthesamefromrecordtorecord.Inconsistentdataformats,“dirtydata,”andoutlierscanunderminethequalityofanalyticalfindings.

HOWMUCHDATATOCOLLECTTheactualnumberofrecordsisnotalwayseasytodetermineanddependsonpatternsinyourdata.Ifyouhavemorenoiseinyourdata,youwillneedmoredatatoovercomeit.Noiseinthiscontextmeansunobservedrelationshipsinthedatathatarenotcapturedbytheinputpredictorvariables.

Figure4:SampleSizeEstimation

To determine minimum data set sizes, consider the dimensionality of your data andpatterncomplexity.2

o Forsmallmodelswithafewvariables,10to20recordspervariablevaluemaybesufficient.

o Formorecomplexmodels,~100 recordspervariablevaluemaybeneeded tocapturepatterns.

o For complex models with ~100 input variables, you will need a minimum of10,000recordsinthedataforeachsubset(training,testing,andvalidation).

TRAINING,TESTING,ANDVALIDATIONDATASETSThemostcommonstrategyistosplitdataintotraining,testing,andvalidationdatasets.

2AppliedPredictiveAnalytics:PrinciplesandTechniquesfortheProfessionalDataAnalystbyDeanAbbott

“Shapingdatainvolvessubjectmatterexpertthoughttocreativelyselect,create,andtransformvariablesformaximuminfluence.”



These data sets are usually created through a random selection of records withoutreplacement,meaningeachrecordbelongstooneandonlyonesubset.All threedatasetsshouldreflectyourreal-worldscenario.

CrossValidation

KeepinmindthatDataRobotincludesindustrystandardk-foldcrossvalidationfeaturesthatdividethedataintoksubsetswiththeholdoutrepeatedktimes.Eachtime,oneoftheksubsetsisusedasthetestsetandtheotherk-1subsetsarecombinedtoformatrainingset.DataRobot’sk-foldfeatureenablesyoutoindependentlychoosehowmuchdatayouwanttouseintesting.

TimeSeriesConsiderations

Data that changes over time should be reflected in your input data set. When timesequences(ContactMade>QuoteProvided>DealClosed)areimportantinpredictions,proportionallycollectingdatafromthosedifferenttimeperiodsisimportantaswell.Thekeyprincipleistoprovidedatathatreflectswhatactuallyhappensattherightlevelofoutcomemetricgranularity.

Whencollectingdata, thinkaboutthebalanceofyourvalues inyourraw data. In our example, howmany prospects do you have atdifferent ages and stages of life?Dotheyhaveloans?Whatarethebalances on outstanding loans?How many defaulted on a loan?Howrecentwasthedefault,?Whatis the prospect’s income history,credit ratings, debt-to-incomeratio,andsoon?

ThinkProportionally

Whenextractingasubsetofdata,besuretoincludeapproximatelythesameproportionofvariablesinyourDataRobotinputdatasetthatyouseeinthereal-worlddata.Ifyouprovide more records of one variable value, you can accidentally bias the machinelearningmodel’spredictions,therebydiminishingperformance.Ifyouhavedatasetswithmillionsorbillionsofrows,itismuchlesslikelythatyouwillencounteraccidentalbiasindatapreparation.

Figure5:SamplingData



SAMPLINGThechoiceoftheoptimalsamplingmethod3foragivenproblemespeciallydependsonthecharacterofthedatasetandthedesiredproportionofthesubsets.Eachmethodhasitsadvantagesbutalsoitslimitations.

Forsimple,nearlyuniformlydistributeddatasets,themethodofsimplerandomsamplingmay be sufficient. For naturally well-ordered time series data, highly efficientdeterministic approaches such as convenience and systematic sampling can achievereliableresults.Whendealingwithcomplex,high-dimensionaldata,moresophisticatedandstratifiedsamplingtechniquescanreducethebiasandvarianceofmodelerror.

UnbalancedTwo-ClassProblemsDatasetsseldomcomewithevenlydistributedsamples.Unbalanceddataisacommonissue to remediate. Fraud or failure rate data are examples of unbalanced two-classproblems.Analyzingunbalanceddatacreatesuselessresultswithexceptionallyhigherrorrates.

Tobuildpredictivemodelsonunbalanceddata,youneedtoapplysamplingtechniquesthatincreasetheminorityclassproportionwithdownsamplingorupsamplingtocreateabalanceddataset.Aftertrainingamachinelearningmodelwithabalancedtrainingset,youwillvalidateperformanceofthemodelwiththereal-world,unbalanced,unseentestset.

BEWAREOFBIASWhileyouaccumulatedata,considerpotentialbiases.4Humannatureisconsciouslyandunconsciouslybiased.Cognitivebiasesaretendenciestothinkincertainwaysthatcanlead to irrational judgment.Outcome,omission, andmanyotherbias types caneasilycreepintothedatacollectionprocess.

If unknown bias exists, it is basically an unjustified assumption that your input datareflectsreality.Anymodelbuiltonsuchassumptionsreflectsonlythedistortedrealityandwillperformpoorly.Toreducepotentialbias, testhypotheses,pokeholes inyourown ideas,welcomechallenges,andconductpeer reviewsofyourdatacollectionandsamplingthoughtprocesses.Machinelearningprojectsshouldbegroupprojectsandnotdoneinisolation.

EXPLOREANDPROFILENow youwill assess the condition of your source data. DataRobot automates severalaspects of initial data examination. DataRobot also automates sampling to avoidconventional sampling and overfitting issues. During this step, you’ll visually look for

3CommonTypesofDataSamplingMethodshttps://en.wikipedia.org/wiki/Sampling4TheCognitiveBiasCodexhttps://upload.wikimedia.org/wikipedia/commons/a/a4/The_Cognitive_Bias_Codex_-_180%2B_biases%2C_designed_by_John_Manoogian_III_%28jm3%29.png



trends, extreme values, outliers, exceptions, skewed data, incorrect values, andinconsistentandmissingdata.

UNDERSTANDYOURDATAAsyoubegintoexploreandunderstandyourdata,DataRobotprovidesdataprofilesfor

everyfeaturethatincludehowmanyvaluesareuniqueormissing,aswellasthestatisticalmean,standarddeviation,median,minimum,andmaximumvalue.Youcanalsoreviewthedistributionsofeachfeatureusingahistogramwithoptionalbinsettingsandapplytransformationstonormalizeyourdata.

Informative Features immediately ranks variables that provide the mostinformationgainforbuildingoptimalmachinelearningmodels.Knowingwhichareasofyourdatamostinfluencetheoutcomeisinvaluableonitsown.Thisinformationcanguidethebusinesstofocuslimitedtimeandresourcesonthe

activitiesthatmattermost.

Another one of DataRobot’s data preparation strengths is the array of differentvisualizationtoolsthatidentifyinfluentialfeatures,rankthem,anduncovererrors.TheFeature Ranking reportmeasures howmuch each feature by itself contributes to theaccuracyofamachinelearningmodel.

Figure6:FeatureImpact

DETECTLEAKAGELeakageistheaccidentalinclusionofoutcomeinformationthatwouldnotbelegitimatelyavailable for predictions. If you have inadvertent leakage, you’ll likely notice it in thefeaturerankingreportwhenafeaturehasanexceptionallyhighimpact.

“Bylookingatthesefindings,thebusinesscanimmediatelyappreciatethevariablesthattrulyinfluenceoutcomestogetimmediatevalue.”



Figure7:LeakageExample

In our Bank Marketing Data Set example, duration was identified in the DataRobotFeatureImpactreportasaleakedfeaturewith100%impact.Thenextbestperformingfeaturecarriesa littlemore than10% impact. Ifyouseeanexceptionallyhigh impact,verifyprocessflowtimingforthatfeature.

TOAVOIDLEAKAGE,YOUNEEDTOCONSIDERTHETIMINGANDORDEROFEVENTSTOENSURETHATOUTCOMEDATAISNOTUSEDASINPUTDATA.

FINDANDREDUCEERRORSDataRobotModelX-RayandReasonCodescapabilitiescanalsoprovidedeepinsightsforenhancing data preparation. Model X-Ray enables interactive, visual exploration ofmachinelearningmodelperformance.Youcaneasilyseewhereamodelmakesmistakesbyselectinginputfeatures.Itshinesalightonissuesthatmightnotgetdetectedusingother tools.Model X-Ray allowsmachine learningmodel designers to concentrate onwherethemostmodelperformanceimprovementscanbemadeinthedatapreparationprocess.



Figure8:ModelX-Ray

ReasonCodesunveilwhatvalueswithinafeaturedrivethemodel’sresults.Thistooliscrucialformachinelearningmodeldevelopmentprocessesthataresubjecttoregulatorycomplianceorlegalscrutiny.WithReasonCodes,youcandiscoverwhichcombinationsoffeaturevaluestriggeraspecificmachinelearningoutcome.Thisinformationcanalsobe useful throughout the iterative data preparation process to incrementally improveresults.

Figure9ReasonCodes



IMPROVEDATAQUALITYWerecommendthatyouaddressdataqualityissuesasearlyaspossible.Hereareafewtipsforhandlingcommondataissuesinthedatapreparationprocess.

CorrectingIncorrectValues

Machinelearningmodelsassumetheinputdataiscorrect.Ifyouareseeingerrorsfromsourceapplicationsthatshouldgetfixed,abestpracticeistotryandresolvetheissueatthesourcesystemversusinadatapreparationprocess.

Treat incorrect values as missing if there is a minimal amount and you can’t easilydeterminecorrectvalues.Iftherearealotofinaccuratevalues,trytodeterminewhathappenedtorepairthem.Ifyoudomakechangestodata,documentyourreasoning.Alsocaptureinitialcontextandchangedvalueswithaflagtoidentifychanges.Thepatterninyourdatamightbehiddeninthoseincorrectvalues.

SkewedVariables

Forcontinuousvariables,reviewthedistributions,centraltendency,andvariablespreadin DataRobot. These are measured using various statistical metrics and visualizationmethodssuchashistograms.Continuousvariablesshouldbenormallydistributed.Ifnot,reduce skewnesswith transformations or by experimentingwith bin sizes for optimalprediction.

When a skewed distribution needs to be corrected, the variable is transformed by afunctionthathasadisproportionateeffectonthetailsofthedistribution.Logtransformslikelog(x),logn(x),log10(x);themultiplicativeinverse(1/x);squareroottransformsqrt(x);orpower(xn)arethemostfrequentlyusedcorrectivefunctions.

In the table to the left,several issues and formulastominimizeskewareshown.The before and after chartsillustrate how differentskewed variabletransformationscanbeusedto normalize featuredistributions.

Figure10:TransformationsforSkewness



High-Cardinality

High-cardinality fields are categorical attributes that contain a very large number ofdistinctvalues.Examplesincludenames,ZIPcodes,andaccountnumbers.Althoughthesevariables can be highly informative, high-cardinality attributes are rarely used inpredictive modeling. The main reason is that including them will vastly increase thedimensionalityofthedataset,makingitdifficultorevenimpossibleformostalgorithmstobuildaccuratepredictivemodels.

Redundant,HighlyCorrelatedVariables

Duplicate,redundant,orotherhighlycorrelatedvariablesthatcarrythesameinformationshould be minimized. DataRobot algorithms will perform better without collinearvariables.Collinearityoccurswhentwoormorepredictorvariablesarehighlycorrelated,meaningthatonecanbelinearlypredictedfromtheotherswithasubstantialdegreeofaccuracy.

Toidentifyhighcorrelationbetweentwocontinuousvariables,reviewscatterplots.Thepatternofascatterplot indicates therelationshipbetweenvariables.Therelationshipcanbelinearornon-linear.Tofindthestrengthoftherelationship,computecorrelation.Correlationvariesbetween-1and+1.

Ifyouhavetwovariablesthatare almost identical and youdo want to retain thedifference between them,consider creating a ratiovariableasafeature.Anotherapproach is to use PrincipalComponent Analysis (PCA)outputasinputvariables.

Toavoidthecollinearityissue,do not include multiplevariables that are highlycorrelatedordatathatisfromthesamereportinghierarchy.Often those fields provideobviousinsights.Forexample,customerswholiveinthecityofTampaalsohappentoliveinthestateofFlorida.

Figure11:Scatterplotsforcorrelationdetection



Missingvalues

Themostcommonrepairformissingvaluesisimputingalikelyorexpectedvalueusingameanorcomputedvaluefromadistribution.Ifyouusethemean,youmaybereducingyourstandarddeviationthusthedistributionimputationapproachismorereliable.

Anotherapproachistoremoveanyrecordwithmissingvalues.Don’tgettooambitiouswithfilteringoutmissingvalues.Ifyoudeletetoomanyrecords,youwillunderminethereal-worldaspectsinyouranalysis.Asyouaddressmissingvalues,donotlosetheinitialmissingvaluecontext.Acommondatapreparationapproachistoaddacolumntotherowtoflagdatawasmissingcodedwitha1or0.

ExtremeValuesandOutliers

Outliersarevaluesthatexceedthreestandarddeviationsfromthemean.Manymachinelearningalgorithmsaresensitivetooutlierssincethosevaluesaffectaverages(means)andstandarddeviationsinstatisticalsignificancecalculations.Ifyoucomeacrossunusualvaluesoroutliers,confirmthatthesedatapointsarerelevantandreal.Often,oddvaluesareerrors.

If theextremedatapointsareaccurate,predictable,andsomethingyoucancountonhappening again, do not remove them unless those points are unimportant. You canreduceoutlierinfluencebyusinglogtransformationsorconvertingthenumericvariabletoacategoricalvaluewithbinning.



ENGINEERFEATURESFeaturecreationistheartofextractingmoreinformationfromexistingdatatoimprovethepredictivepowerofmachinelearningalgorithms.Youaremakingthedatayoualreadyhavemoreuseful.Strongfeaturesthatpreciselydescribetheprocessbeingpredictedcanimprovepatterndetectionandenablemoreactionableinsightstobefound.

Creating features from several combined variables and ratios usually provideshighermodelaccuracy thananysingle-variable transformationbecauseof theinformationgainassociatedwithdatainteractions.Ifyoueversawthemovieorread the book “Moneyball: TheArt ofWinning anUnfairGame”byMichaelLewis, you’ll know how baseball analysis was revolutionized with newperformancemetrics likeOn-BasePercentage(OBP)andSluggingPercentage(SLG). With feature engineering, you will be using a fundamentally similarapproach.

HUMANINGENUITYANDCREATIVITYREQUIREDFeatureengineeringischallengingbecauseitdependsonleveraginghumanintuition

to interpret implicit signals in data sets that machine learning algorithms use.Consequently,featureengineeringisoftenthedeterminingfactorinwhetheramachinelearningmodelingprojectissuccessfulornot.Thisstepintheprocessisexperimentalandusuallythebottleneckinautomatedmachinelearningprocesses.Although collected raw data fields can be used as-is in DataRobot withouttransformations, supplemental data, or calculations to trainmachine learningmodels,you’llalmostalwayswanttoaddmoreperspectivetoyourdatasetbydesigningfeatures.Engineeredfeaturesprovidebettercontexttodifferentiatepatternsinthedata.

Aggregations

Some commonly computed aggregate features including the mean (average), mostrecent,minimum,maximum,sum,multiplyingtwovariablestogether,andratiosmadebydividingonevariablebyanother.NoteDataRobotautomaticallygeneratesdateandtimeaggregationfeatures.

Ratios

Ratios can be excellent feature variables. Ratios can communicate more complexconcepts such as price-to-earnings ratio, where neither price nor earnings alone candeliverthisinsight.

Transformations

Transformation refers to the replacement of a variable by a function. For instance,replacingavariablexbythesquareorcuberootorlogarithmxisatransformation.Youtransformvariableswhenyouwanttochangethescaleofavariableorstandardizethevaluesofavariableforbetterunderstanding.Variabletransformationcanalsobedone

“Featureengineeringisoftenthedeterminingfactorinwhetheramachinelearningmodelingprojectissuccessfulornot.”



usingcategoriesorbins tocreatenewvariables.Anexample transformationmightbebinningcontinuousLeadAgeintoLeadAgeGroupsorLoanAmount intoLoanAmountCategories.

FEATUREENGINEERINGTECHNIQUESHereareafewpopularfeatureengineering ideassharedonDataScienceCentral5thatcanbeusedtoextractmoreinformationfromyourinputdata:

1. Singlevariabletransformations

2. Ratioorfrequencyofcategoricalvariables

3. Combineimportantvariables

4. Computevariableinteractions

5. Changedatatypes

6. Computerelativedifferences

7. Cartesiantransformations

8. Bintransformations

9. Windowtimeseriesdata

10. Reframecontinuousvariables

11. Onehotencodingforsequenceproblems

12. Sparsevaluecoding

Thepossibilitiesforfeatureengineeringarelimitedonlybyyourownhumaningenuityandcreativity.Featureengineeringtrulyisthehumanartofdatapreparationforautomatedmachinelearning.

5DataScienceCentralhttps://www.datasciencecentral.com/profiles/blogs/feature-engineering-data-scientist-s-secret-sauce-1



CONCLUSIONIn thiswhitepaper,webriefly introducedbasicdatapreparation formachine learningconcepts.Wediscussedhow toplanyourprojectandcollect,organize, structure,andshapedatainamachinelearning-friendlyformat.Wealsobestowedvitaltipsforfeatureengineering to help you master the art of data preparation for automated machinelearningmodels.

Asyouprogress inyourautomatedmachine learningmodel journey,eachareaof thedata preparation process should be further researched. For additional reading, thefollowingbooksarehighlyrecommended:

o DataPreparationforDataMiningbyDorianPyle

o DataPreprocessinginDataMiningbySalvadorGarcíaandJuliánLuengo

o AppliedPredictiveAnalytics:PrinciplesandTechniquesfortheProfessionalDataAnalystbyDeanAbbott

o FeatureEngineeringforMachineLearningModels:PrinciplesandTechniquesforDataScientistsbyAliceZhengandAmandaCasari

NotethatDataRobotalsoprovidesclassesthatcovertheseandotherrelatedtopics.

RECOMMENDEDNEXTSTEPSForadditionalinformationonautomatedmachinelearning,pleasecontactanexpertatDataRobot.It’seasytogetstarted.

o DataRobotwww.datarobot.com

o DataPreparationEssentialsforAutomatedMachineLearningwebinarrecordingwww.datarobot.com/webinar/data-prep/

o DataRobotAIAccelerationPackageswww.datarobot.com/product/enterprise-ai/

o DataRobotCourseswww.datarobot.com/education/all-courses/



AboutDataRobotDataRobotoffersanenterprisemachine learningplatform that empowersusersofall skill levels tomakebetterpredictionsfaster.Incorporatingalibraryofhundredsofthemostpowerfulopensourcemachinelearningalgorithms,theDataRobot platformautomates, trains and evaluates predictivemodels in parallel, deliveringmore accuratepredictionsatscale.DataRobotprovidesthefastestpathtodatasciencesuccessfororganizationsofallsizes.Formoreinformation,visitwww.datarobot.com.

AbouttheAuthorJenUnderwood, founder of Impact Analytix, LLC, is an analytics industry expertwith a unique blend of productmanagement,designandover20yearsof“hands-on”advancedanalyticsdevelopment. Inadditiontokeepingapulseonindustrytrends,sheenjoysdiggingintooceansofdata.JenishonoredtobeanIBMAnalyticsInsider,SAScontributor,andformerTableauZenMaster.ShealsowritesforInformationWeek,O’ReillyMedia,andothertechindustrypublications.

JenhasaBachelorofBusinessAdministration–Marketing,CumLaudefromtheUniversityofWisconsin,Milwaukeeandapost-graduatecertificateinComputerScience–DataMiningfromtheUniversityofCalifornia,SanDiego.Formoreinformation,visitwww.jenunderwood.com.

data preparation for automated machine...

Documents