
(212) 405.1010 | [email protected] | Follow: @1010data | www.1010data.com

Machine Learning Function Examples


© 2017 1010data, Inc. All rights reserved.

Contents

Logistic Regression ... 3
    Chart cumulative gains and calculate the AUC ... 10
    Extract logistic regression fit statistics ... 12
        Block code: logreg_stats ... 13
    Bank Marketing Data Set ... 14

Principal Component Analysis ... 20
    Chart cumulative gains and calculate the AUC ... 30
    Bank Marketing Data Set ... 32

Clustering ... 38
    Bank Marketing Data Set ... 44

Least Squares Regression ... 49
    Istanbul Stock Exchange Data Set ... 58

Weighted Least Squares Regression ... 60
    Census Income Data Set ... 67


Logistic Regression

In this example, a logistic regression is performed on a data set containing bank marketing information to predict whether or not a customer subscribed for a term deposit.

The logistic regression, using the 1010data function g_logreg(G;S;Y;XX;Z), is applied to the Bank Marketing Data Set on page 14, which contains information related to a campaign by a Portuguese banking institution to get its customers to subscribe for a term deposit.

The logistic regression uses the following 10 variables in that data set as predictors:

• age
• duration
• previous
• empvarrate
• housing
• default
• loan
• poutcome
• job
• marital

As a response, the column y is used, which is yes if a customer has subscribed for a term deposit.

This analysis follows these steps:

• Prepare the data by creating dummy variables for each of the categorial columns (since we cannot use textual data to build our model).
• Divide the data into a training set and a test set.
• Run the logistic regression on the training data set based on the continuous variables in the original data set and the dummy variables that we created.
• Obtain the predicted probability that a customer has subscribed for a term deposit.
• Create a cumulative gains chart and calculate the area under the curve (AUC) for the test data.
• Obtain the model coefficients.
• Chart the logistic curve for both the training and test data.

1. Open the Bank Marketing data set (pub.demo.mleg.uci.bankmarketing).



2. Since we cannot use textual data in our analysis, we first create dummy variables for each of the categorial columns.

<willbe name="yy" value="y='yes'"/>
<willbe name="hsng" value="housing='yes'"/>
<willbe name="h_unk" value="housing='unknown'"/>
<willbe name="def" value="default='yes'"/>
<willbe name="d_unk" value="default='unknown'"/>
<willbe name="loans" value="loan='yes'"/>
<willbe name="l_unk" value="loan='unknown'"/>
<willbe name="nonxst" value="poutcome='nonexistent'"/>
<willbe name="succ" value="poutcome='success'"/>
<willbe name="blue" value="job='blue-collar'"/>
<willbe name="tech" value="job='technician'"/>
<willbe name="j_unk" value="job='unknown'"/>
<willbe name="svcs" value="job='services'"/>
<willbe name="mgmt" value="job='management'"/>
<willbe name="ret" value="job='retired'"/>
<willbe name="entr" value="job='entrepreneur'"/>
<willbe name="self" value="job='self-employed'"/>
<willbe name="maid" value="job='housemaid'"/>
<willbe name="unemp" value="job='unemployed'"/>
<willbe name="stud" value="job='student'"/>
<willbe name="marr" value="marital='married'"/>
<willbe name="sgl" value="marital='single'"/>
<willbe name="m_unk" value="marital='unknown'"/>

These <willbe> operations create a computed column for each of the categories, where a 1 in the column indicates that the category is true for that row. For instance, in the following screenshot, the rows where hsng=1 indicate that the client had a housing loan (i.e., housing='yes' in the original table), and the rows where h_unk=1 indicate that it is unknown if the client had a housing loan (i.e., housing='unknown').



See Dummy Variables on page 17 for a list of the dummy variables used here and their meanings.
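As a rough illustration of what the dummy-variable step does, the following Python sketch applies the same encoding to three made-up rows. The rows and the helper name dummy are hypothetical and are not part of 1010data; only the column names and category values come from the data set.

```python
# Hypothetical sample rows; only the column names and category values
# are taken from the Bank Marketing table.
rows = [
    {"housing": "yes", "y": "no"},
    {"housing": "unknown", "y": "yes"},
    {"housing": "no", "y": "no"},
]


def dummy(row, column, category):
    """Return 1 if row[column] equals category, otherwise 0."""
    return int(row[column] == category)


# Mirrors value="y='yes'", value="housing='yes'", value="housing='unknown'".
encoded = [
    {
        "yy": dummy(r, "y", "yes"),
        "hsng": dummy(r, "housing", "yes"),
        "h_unk": dummy(r, "housing", "unknown"),
    }
    for r in rows
]

for e in encoded:
    print(e)
```

Note that the third row has both hsng=0 and h_unk=0: no dummy variable is created for housing='no', so that category acts as the baseline. The same pattern holds for the other categorial columns in step 2, each of which has one category without a dummy variable.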

3. Next, we want to create a column that we will use to separate training data and test data. We want to use 90% of our data as training data.

<note>SELECT TRAINING DATA</note>
<willbe name="train" value="draw_(41185;0)<0.9"/>
<willbe name="test" value="train<>1"/>
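The idea behind the train and test columns can be sketched in Python as follows. This assumes that the draw yields one uniform random number between 0 and 1 per row, which is what the comparison with 0.9 suggests; the function name split_flags and the seed are hypothetical.

```python
import random


def split_flags(n_rows, train_fraction=0.9, seed=0):
    """Return (train, test) lists of 0/1 flags, one pair per row."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    # A row is a training row when its uniform draw falls below the fraction.
    train = [int(rng.random() < train_fraction) for _ in range(n_rows)]
    # The test flag is the complement of the train flag.
    test = [1 - t for t in train]
    return train, test


train, test = split_flags(41188)
print(sum(train) / len(train))  # close to 0.9, but not exactly 0.9
```

Because each row is assigned independently, the training share is only approximately 90%; the exact proportion varies from draw to draw.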

4. Now we run the logistic regression based on the continuous variables in the original data set and the dummy variables that we created. We use the train column from the previous step as the second parameter of the g_logreg(G;S;Y;XX;Z) function. The train column acts as a selector, so that the model is trained only on the rows where train is 1 (90% of the data). We also specify options for the Z parameter that control the convergence criteria.

<willbe name="model" value="g_logreg(;train;yy;1 age duration previous empvarrate hsng h_unk def d_unk loans l_unk nonxst succ blue tech j_unk svcs mgmt ret entr self maid unemp stud marr sgl m_unk;'cgdeveps' 0.0000001 'lreps' 0.000000001)"/>

Note: The first element of XX must be the special value 1 for the constant (intercept) term in the linear model.

This creates a column named model that contains the results of the logistic regression:

Clicking on the > opens a window containing a summary of the model output:

5. We can then use the score(XX;M;Z) function to obtain the predicted probability (prob_score) returned by the logistic model, which in our example represents the probability a person subscribed for a term deposit.

<note>OBTAIN MODEL SCORE WITH SCORE FUNCTION</note>
<willbe name="prob_score" value="score(1 age duration previous empvarrate hsng h_unk def d_unk loans l_unk nonxst succ blue tech j_unk svcs mgmt ret entr self maid unemp stud marr sgl m_unk;model;)" format="dec:7"/>

Note: We specify format="dec:7" so that our results show with 7 decimal places.
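For intuition, the predicted probability of a logistic model is the logistic function applied to the linear combination of the coefficients and the predictors. The Python sketch below shows that calculation under the assumption that score behaves this way for a g_logreg model; the coefficient values and the function name logistic_score are made up for illustration.

```python
import math


def logistic_score(x, b):
    """Predicted probability for one row: 1 / (1 + exp(-(b . x)))."""
    z = sum(bi * xi for bi, xi in zip(b, x))  # linear predictor
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for the intercept, age, and duration.
b = [-2.0, 0.01, 0.004]
# The leading 1 pairs with the intercept, as in the XX argument above.
x = [1, 40, 300]

p = logistic_score(x, b)
print(p)
```

The result always lies strictly between 0 and 1, which is why it can be read as a probability.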

6. To create the cumulative gains chart and calculate the area under the curve, clone the current tab and follow the steps in Chart cumulative gains and calculate the AUC on page 10 in the cloned tab.

You should see results similar to the following:



7. To obtain the model coefficients, we can use the param(M;P;I) function. For our example, we will only obtain the parameters for the intercept (b0) and the first three variables (b1, b2, and b3).

Note: Perform the remaining steps in the original tab, not the cloned tab.

<willbe name="b0" value="param(model;'b';1)" format="dec:7"/>
<willbe name="b1" value="param(model;'b';2)" format="dec:7"/>
<willbe name="b2" value="param(model;'b';3)" format="dec:7"/>
<willbe name="b3" value="param(model;'b';4)" format="dec:7"/>

8. One might also want to obtain coefficients in one column, which can be achieved with the following:

<note>CALCULATE COEFFICIENTS IN ONE COLUMN</note>
<willbe name="var_names" value="'intercept,age,duration,previous,empvarrate,hsng,h_unk,def,d_unk,loans,l_unk,nonxst,succ,blue,tech,j_unk,svcs,mgmt,ret,entr,self,maid,unemp,stud,marr,sgl,m_unk'"/>
<willbe name="temp_i" value="mod(i_(1);27)"/>
<willbe name="i" value="if(temp_i=0;27;temp_i)"/>
<willbe name="b" value="param(model;'b';i)" format="dec:7"/>
<willbe name="var_name" value="csl_pick(var_names;i)"/>



Note: The number of coefficients we obtain from the model corresponds to the number of variables in our analysis. So, for our example, we obtain 27 coefficients (the intercept and the 26 predictors).
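The temp_i and i columns in step 8 turn a row number into a coefficient index that cycles from 1 to 27. The Python sketch below reproduces that arithmetic, assuming that i_(1) yields the 1-based row number; the function name coefficient_index is hypothetical.

```python
def coefficient_index(row_number, n_coefficients=27):
    """Map a 1-based row number to an index in 1..n_coefficients."""
    remainder = row_number % n_coefficients
    # A remainder of 0 corresponds to the last coefficient.
    return n_coefficients if remainder == 0 else remainder


indices = [coefficient_index(n) for n in range(1, 56)]
print(indices[:3], indices[26], indices[27])
```

Rows 1 through 27 therefore receive the 27 coefficients in order, and the sequence repeats from row 28 onward.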

9. To extract logistic regression fit statistics (e.g., deviance, AIC, p-values, z-values, and standard errors), clone the current tab and follow the steps in Extract logistic regression fit statistics on page 12 in the cloned tab.

A number of new columns containing the various fit statistics (as well as some intermediary columns that are used in the calculation of the fit statistics) will be added to the table.

For example, the following columns show deviance and AIC:

The columns below show the standard error, z-values, and p-values for both the intercept (indicated by const in the column names) and the age variable:

10. We can also calculate the logit of the predicted probability (prob_score), which we can use for purposes of further analysis such as visualization.


<willbe name="z_estimate" value="loge(prob_score / (1-prob_score))" format="dec:7"/>
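The logit is the inverse of the logistic function, so z_estimate recovers the linear predictor that produced prob_score. A short Python sketch of that relationship (the function names are illustrative only):

```python
import math


def sigmoid(z):
    """Logistic function: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log-odds of a probability; mirrors loge(p / (1 - p))."""
    return math.log(p / (1.0 - p))


z = 1.5
print(logit(sigmoid(z)))  # recovers z up to floating-point rounding
```

Plotting the probability against its logit therefore traces the S-shaped logistic curve, which is what the charts in the next step show.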



11. We can chart the logistic curve for both the training and test data using the 1010data Chart Builder.

Let's chart the results for our training data:

a) Create computed columns that contain the prob_score and z_estimate for just the training data.

<willbe name="prob_score_train" value="prob_score*train"/>
<willbe name="z_estimate_train" value="z_estimate*train"/>

b) Click Chart > Scatter.
c) Drag the z_estimate_train column to the DATA (X-AXIS) area.
d) Drag the prob_score_train and yy columns to the DATA (Y-AXIS) area.
e) Change X-Range (max) to 20.
f) Click Update.

Let's chart the results for our test data:

a) Create computed columns that contain the prob_score and z_estimate for just the test data.

<willbe name="prob_score_test" value="prob_score*test"/>
<willbe name="z_estimate_test" value="z_estimate*test"/>

b) Click Chart > Scatter.
c) Drag the z_estimate_test column to the DATA (X-AXIS) area.
d) Drag the prob_score_test and yy columns to the DATA (Y-AXIS) area.
e) Change X-Range (max) to 20.
f) Click Update.

The results should look similar to the following charts:


Chart cumulative gains and calculate the AUC

Given a model score and target variable, you can produce a cumulative gains chart and calculate the Area Under the Curve (AUC).

You must have already generated a model using g_logreg(G;S;Y;XX;Z) and obtained the predicted probability using score(XX;M;Z). This example also assumes that the query has defined a set of testing data denoted by the column test.

To chart the cumulative gains and calculate the AUC:

1. Add the following <library> to your query.





Note: You can insert the following Macro Language code anywhere within your query, though it is best practice to include libraries at the top of queries. Alternatively, you can save the library to an external file and then use the <import> operation to import the library into your current query. See the section on Macro Language: Blocks in the 1010data Reference Manual for more information about libraries and blocks.

<library name="cum_gains">
  <block name="cum_gains" score="" target="">
    <note>*****************************************************************************************</note>
    <note>*** Given a model score and target variable, this block will produce the data for a ****</note>
    <note>*** cumulative gains chart and calculate the Area Under the Curve (AUC). ****</note>
    <note>*** ****</note>
    <note>*** In this implementation, AUC is defined to be between -1 and 1, where 0 indicates ****</note>
    <note>*** the model performs the same as a `model` which randomly assigns the probability ****</note>
    <note>*** of observing a target event. An AUC of 1 indicates perfect performance in the ****</note>
    <note>*** sense that ranking by the model score perfectly separates the `1` target events ****</note>
    <note>*** from `0` target events. A negative AUC indicates that the model is ****</note>
    <note>*** `anti-predictive` in the sense that `0` events are assigned a higher score than ****</note>
    <note>*** `1` events. ****</note>
    <note>*** ****</note>
    <note>*** Specifically, here the AUC is defined by integrating the area under the ****</note>
    <note>*** cumulative gains chart and normalizing by subtracting the area under the ****</note>
    <note>*** diagonal (which is the area of a random model) and dividing by the area that ****</note>
    <note>*** would be found for a model that perfectly separates `1`s and `0`s in the target ****</note>
    <note>*****************************************************************************************</note>
    <sel value="{@score}<>na"/>
    <willbe name="score_population" value="g_cnt({@score};)"/>
    <willbe name="score_num_true" value="g_sum({@score};;{@target})"/>
    <willbe name="tot_population" value="g_cnt(;)"/>
    <willbe name="tot_num_true" value="g_sum(;;{@target})"/>
    <sel value="g_first1({@score};;)"/>
    <willbe name="score_rank" value="g_rank(;;;{@score})"/>
    <willbe name="cum_pop" value="g_cumsum(;;score_rank;score_population)"/>
    <willbe name="cum_true" value="g_cumsum(;;score_rank;score_num_true)"/>
    <willbe name="cum_pop_pct" value="100*(cum_pop/tot_population)" format="dec:5" label="% of Population"/>
    <willbe name="cum_true_pct" value="100*(cum_true/tot_num_true)" format="dec:5" label="% of Target"/>
    <note>**** AUC by integration ****</note>
    <willbe name="true_pct_of_pop" value="100*(tot_num_true/tot_population)" format="dec:3"/>
    <willbe name="perfect_auc" value="0.5*(true_pct_of_pop*100)+100*(100-true_pct_of_pop)-0.5*(100^2)"/>
    <willbe name="prev_cum_pop_pct" value="ifnull(g_rshift(;;score_rank;cum_pop_pct;-1);0)"/>
    <willbe name="prev_cum_true_pct" value="ifnull(g_rshift(;;score_rank;cum_true_pct;-1);0)"/>
    <willbe name="bucket_width" value="cum_pop_pct-prev_cum_pop_pct"/>
    <willbe name="bucket_auc" value="0.5*bucket_width*(prev_cum_true_pct+cum_true_pct-prev_cum_pop_pct-cum_pop_pct)"/>
    <willbe name="model_raw_auc" value="g_sum(;;bucket_auc)"/>
    <willbe name="auc" value="model_raw_auc/perfect_auc" format="dec:5" label="AUC"/>
    <colord cols="auc,{@score},cum_pop_pct,cum_true_pct"/>
    <note>*** For charting purposes, insert a row for the (0, 0) intercept ****</note>
    <willbe name="row_num" value="g_cumcnt(;;score_rank)"/>
    <sel value="if(row_num=1;2;1)" expand="1"/>
    <willbe name="origin_row" value="(row_num=1)*(ii_(0)=0)"/>
    <sort col="cum_pop_pct" dir="up"/>
    <willbe name="chart_score" value="if(origin_row=1;1;{@score})" label="Score"/>
    <willbe name="chart_pop_pct" value="if(origin_row=1;0;cum_pop_pct)" format="dec:5" label="Pct of Population"/>
    <willbe name="chart_true_pct" value="if(origin_row=1;0;cum_true_pct)" format="dec:5" label="Model Pct of Target"/>
    <willbe name="chart_random_model" value="chart_pop_pct" label="Random Model"/>
    <willbe name="chart_perfect_model" value="100*min(1;cum_pop/tot_num_true)" format="dec:5" label="Perfect Model"/>
    <colord cols="auc,chart_score,chart_pop_pct,chart_true_pct,chart_random_model,chart_perfect_model"/>
  </block>
</library>

2. Select the testing data.

Note: The following Macro Language code must be added after the calls to g_logreg(G;S;Y;XX;Z) and score(XX;M;Z).

<sel value="test=1"/>

3. Insert the cum_gains <block> in your query.

The value for the score variable should be the name of the column containing the results from score(XX;M;Z), which in our case is prob_score. The value for target should be the column name denoting the dependent variable specified to g_logreg(G;S;Y;XX;Z), which is the Y parameter. In our example, this is the column yy.

<insert block="cum_gains" score="prob_score" target="yy"/>

Note: If you have saved the <library> in an external file, you must also do an <import> before you do the <insert>.




4. We can chart the cumulative gains using the 1010data Chart Builder.

a) Click Chart > Line.
b) Drag the Pct of Population (chart_pop_pct) column to the DATA (X-AXIS) area.
c) Drag the Model Pct of Target (chart_true_pct), Random Model (chart_random_model), and Perfect Model (chart_perfect_model) columns to the DATA (Y-AXIS) area.
d) Click Update.

You should see a chart that looks like the one below:
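The normalization described in the notes of the cum_gains block can be sketched in Python as follows. This is an illustrative re-implementation rather than the library code: it works in fractions instead of percentages, and it assumes that rows are ranked from the highest score to the lowest, which is the usual convention for a gains chart. The function name normalized_auc is hypothetical.

```python
from itertools import groupby


def normalized_auc(scores, targets):
    """Area between the gains curve and the diagonal, divided by the same
    area for a model that ranks every 1 ahead of every 0."""
    n = len(scores)
    n_true = sum(targets)
    # Assumption: the highest scores come first.
    pairs = sorted(zip(scores, targets), key=lambda st: -st[0])
    area = 0.0
    prev_pop = prev_true = 0.0
    cum_pop = cum_true = 0
    # Rows sharing a score form one bucket, as in the cum_gains block.
    for _, bucket in groupby(pairs, key=lambda st: st[0]):
        bucket = list(bucket)
        cum_pop += len(bucket)
        cum_true += sum(t for _, t in bucket)
        pop, true = cum_pop / n, cum_true / n_true
        # Trapezoid of (gains curve minus diagonal) over this bucket.
        area += 0.5 * (pop - prev_pop) * ((prev_true - prev_pop) + (true - pop))
        prev_pop, prev_true = pop, true
    # A perfect model reaches 100% of targets after n_true / n of the rows.
    perfect_area = 0.5 * (1 - n_true / n)
    return area / perfect_area


print(normalized_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
```

With this definition, a model that ranks every 1 ahead of every 0 scores 1, a model whose scores carry no information scores 0, and a model that ranks every 0 ahead of every 1 scores -1, matching the range stated in the block's notes.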

Extract logistic regression fit statistics

For a particular model, you can extract various fit statistics such as deviance, AIC, p-values, z-values, and standard errors. These statistics can be calculated using a 1010data-supplied library and inserting the associated block code within your query.

You must have already generated a model using g_logreg(G;S;Y;XX;Z).

To extract the logistic regression fit statistics:



1. Import the regression_statistics library, which can be found in pub.lib.modeling.

This library contains the block logreg_stats, which we will use to calculate the fit statistics.

<import path="pub.lib.modeling" library="regression_statistics"/>

Note: It is best practice to put the <import> operation at the top of your Macro Language code; however, the only requirement is that it appears prior to any <insert> operation that references block code within the specified library.

2. Insert the logreg_stats block at the end of your query.

The block logreg_stats takes a number of parameters: model, target, arg_list, get, and ns. We will call this block and specify the parameters using the <insert> operation.

Note: For more information on the parameters or the actual block code, see Block code: logreg_stats on page 13.

<insert block="logreg_stats" model="model" target="yy" arg_list="age duration previous empvarrate hsng h_unk def d_unk loans l_unk nonxst succ blue tech j_unk svcs mgmt ret entr self maid unemp stud marr sgl m_unk" get="all" ns="_stats1"/>

A number of new columns containing the various fit statistics (as well as some intermediary columns that are used in the calculation of the fit statistics) will be added to the table.

For example, the following columns show deviance and AIC:

The columns below show the standard error, z-values, and p-values for both the intercept (indicated by const in the column names) and the age variable:

Block code: logreg_stats

The block named logreg_stats is contained in the library regression_statistics, which can be found in pub.lib.modeling.

<library name="regression_statistics">
  <block name="logreg_stats" group="" model="" target="" arg_list="" get="pv" ns="">
    <note>*********************************************************</note>
    <note>*** Extract Logistic Regression Fit Statistics: ****</note>
    <note>*** - Deviance (-2*Log Likelihood) ****</note>
    <note>*** - AIC ****</note>
    <note>*** AND, depending on value of get parameter: ****</note>
    <note>*** - Parameter p-values (get=pv) ****</note>
    <note>*** - Parameter standard errors (get=se) ****</note>
    <note>*** - Parameter z-values (get=zv) ****</note>
    <note>*** - All of the above (get=all) ****</note>
    <note>*** ****</note>
    <note>*** Allowed arguments: ****</note>
    <note>*** - group: pass same field(s) as used in G ****</note>
    <note>*** argument of g_logreg (may be blank) ****</note>
    <note>*** - model: Required, logreg model column ****</note>
    <note>*** - target: Required, logreg dependent variable ****</note>
    <note>*** - arg_list: Required, comma- or space- ****</note>
    <note>*** separated list of independent variables ****</note>
    <note>*** - get: Required, see options above ****</note>
    <note>*** - ns: Optional, namespace suffix which will be ****</note>
    <note>*** appended to all column names created by ****</note>
    <note>*** this block to prevent name collisions ****</note>
    <note>*********************************************************</note>
    <let arg_list_csl="{if(contains(@arg_list;',');@arg_list;strsubst(@arg_list;' ';0;','))}">
      <if test="{(strpick(@arg_list_csl;',';1)='1')|(strpick(@arg_list_csl;',';-1)='1')|(strfind(@arg_list_csl;',1,';1)<>na)}">
        <signal msg="logreg_stats Error: Invalid arg_list - arg_list should contain only predictor variables and no constant term (1)."/>
      </if>
      <willbe name="p{@ns}" value="score(1 {@arg_list_csl};{@model};)"/>
      <willbe name="z{@ns}" value="loge(p{@ns}/(1-p{@ns}))"/>
      <willbe name="likelihood_tmp{@ns}" value="-({@target}*loge(p{@ns}/(1-p{@ns}))+(loge(1-p{@ns})))"/>
      <willbe name="nlog_lik{@ns}" value="g_sum(seg_ {@group};;likelihood_tmp{@ns})"/>
      <willbe name="p_null{@ns}" value="g_avg(seg_ {@group};;{@target})"/>
      <willbe name="likelihood_null_tmp{@ns}" value="-({@target}*loge(p_null{@ns}/(1-p_null{@ns}))+(loge(1-p_null{@ns})))"/>
      <willbe name="nlog_lik_null{@ns}" value="g_sum(seg_ {@group};;likelihood_null_tmp{@ns})"/>
      <willbe name="deviance{@ns}" value="-2*nlog_lik{@ns}"/>
      <willbe name="aic{@ns}" value="2*{strcount(@arg_list_csl;',')+2}+2*nlog_lik{@ns}"/>
      <willbe name="x_0{@ns}" value="sqrt(p{@ns}*(1-p{@ns}))"/>
      <foreach i="{@arg_list_csl}">
        <willbe name="{@i}__x{@ns}" value="{@i}*sqrt(p{@ns}*(1-p{@ns}))"/>
      </foreach>
      <letseq tmp="{splice('__x' ',';@ns)}" lsq_arg_list="{strsubst(@arg_list_csl;',';0;@tmp)}__x{@ns}">
        <willbe name="lsq_model{@ns}" value="g_lsq(seg_ {@group};;z{@ns};x_0{@ns},{@lsq_arg_list})"/>
      </letseq>
      <let args="const,{@arg_list_csl}">
        <switch on="{@get}">
          <case when="se">
            <for i="1" to="{strcount(@arg_list_csl;',')+2}">
              <willbe name="se_{strpick(@args;',';@i)}{@ns}" value="sqrt(param(lsq_model{@ns};'g';{@i}))"/>
            </for>
          </case>
          <case when="zv">
            <for i="1" to="{strcount(@arg_list_csl;',')+2}">
              <willbe name="z_value_{strpick(@args;',';@i)}{@ns}" value="param({@model};'b';{@i})/sqrt(param(lsq_model{@ns};'g';{@i}))"/>
            </for>
          </case>
          <case when="pv">
            <for i="1" to="{strcount(@arg_list_csl;',')+2}">
              <willbe name="p_value_{strpick(@args;',';@i)}{@ns}" value="2*(1-normal_cdf(abs(param({@model};'b';{@i})/sqrt(param(lsq_model{@ns};'g';{@i})));0;1))"/>
            </for>
          </case>
          <case when="all">
            <for i="1" to="{strcount(@arg_list_csl;',')+2}">
              <willbe name="se_{strpick(@args;',';@i)}{@ns}" value="sqrt(param(lsq_model{@ns};'g';{@i}))"/>
            </for>
            <for i="1" to="{strcount(@arg_list_csl;',')+2}">
              <willbe name="z_value_{strpick(@args;',';@i)}{@ns}" value="param({@model};'b';{@i})/sqrt(param(lsq_model{@ns};'g';{@i}))"/>
            </for>
            <for i="1" to="{strcount(@arg_list_csl;',')+2}">
              <willbe name="p_value_{strpick(@args;',';@i)}{@ns}" value="2*(1-normal_cdf(abs(param({@model};'b';{@i})/sqrt(param(lsq_model{@ns};'g';{@i})));0;1))"/>
            </for>
          </case>
        </switch>
      </let>
    </let>
  </block>
</library>
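To make the statistics concrete, the following Python sketch computes the same kinds of quantities from a list of predicted probabilities and a coefficient with its standard error. It is a simplified illustration, not the library code: the function names are hypothetical, the deviance follows the block's header note (-2 times the log-likelihood), and the standard error is taken as given rather than derived from a weighted least squares fit.

```python
import math


def neg_log_likelihood(y, p):
    """Negative log-likelihood of 0/1 targets y under probabilities p."""
    return -sum(
        yi * math.log(pi / (1 - pi)) + math.log(1 - pi)
        for yi, pi in zip(y, p)
    )


def deviance(nll):
    """Deviance as -2 * log-likelihood, i.e. 2 * negative log-likelihood."""
    return 2 * nll


def aic(nll, n_params):
    """AIC = 2 * (number of parameters, including the intercept) + deviance."""
    return 2 * n_params + 2 * nll


def wald_test(coef, se):
    """Return (z-value, two-sided p-value) for one coefficient."""
    z = coef / se
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)) for the standard normal.
    return z, math.erfc(abs(z) / math.sqrt(2))


nll = neg_log_likelihood([1, 0], [0.5, 0.5])
print(deviance(nll), aic(nll, 2), wald_test(1.96, 1.0))
```

A coefficient that is about 1.96 standard errors away from zero yields a two-sided p-value of roughly 0.05, which is the conventional significance threshold.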

Bank Marketing Data Set

This data set was obtained from the UC Irvine Machine Learning Repository and contains information related to a direct marketing campaign of a Portuguese banking institution and its attempts to get its clients to subscribe for a term deposit.


Source

This data set was obtained by downloading bank-additional-full.csv (contained in bank-additional.zip) from https://archive.ics.uci.edu/ml/datasets/Bank+Marketing.

The table contains 41,188 rows and 21 columns.

The path to this data set is pub.demo.mleg.uci.bankmarketing.

Input Variables

There are 20 columns in the table that provide information about each client, such as age, marital status, and education level. A subset of these are related to the last contact of the current campaign, such as the month and day of the week the last contact was made as well as the number of days since the client was last contacted in a previous campaign. There are 10 columns in the table that are categorial, meaning that they contain textual values that correspond to a particular category for a given variable.

Column Name | Description | Type

age | Age of the client | Numeric

job | Client's occupation | Categorial: admin, blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, student, technician, unemployed, unknown

marital | Marital status | Categorial: divorced, married, single, unknown. Note: divorced means divorced or widowed

education | Client's education level | Categorial: basic.4y, basic.6y, basic.9y, high.school, illiterate, professional.course, university.degree, unknown

default | Indicates whether the client has credit in default | Categorial: no, yes, unknown

housing | Indicates whether the client has a housing loan | Categorial: no, yes, unknown

loan | Indicates whether the client has a personal loan | Categorial: no, yes, unknown

contact | Type of contact communication | Categorial: cellular, telephone

month | Month that last contact was made | Categorial: jan, feb, ..., dec

day_of_week | Day that last contact was made | Categorial: mon, tue, wed, thu, fri

duration | Duration of last contact in seconds | Numeric. Note: This attribute highly affects the output target (e.g., if duration=0 then y=no). Yet, the duration is not known before a call is performed. Also, after the end of the call, y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

campaign | Number of contacts performed during this campaign for this client (including last contact) | Numeric

pdays | Number of days since the client was last contacted in a previous campaign | Numeric. Note: 999 means client was not previously contacted

previous | Number of contacts performed before this campaign for this client | Numeric

poutcome | Outcome of the previous marketing campaign | Categorial: failure, nonexistent, success

empvarrate | Employment variation rate (quarterly indicator). Note: This column was named emp.var.rate in the original data set. | Numeric

conspriceidx | Consumer price index (monthly indicator). Note: This column was named cons.price.idx in the original data set. | Numeric

consconfidx | Consumer confidence index (monthly indicator). Note: This column was named cons.conf.idx in the original data set. | Numeric

euribor3m | Euribor 3-month rate (daily indicator) | Numeric

nremployed | Number of employees (quarterly indicator). Note: This column was named nr.employed in the original data set. | Numeric

Output Variable

There is one column in the table that corresponds to our target value.

Column Name | Description | Type

y | Indicates whether the client has subscribed for a term deposit | Binary (yes or no)

Dummy Variables

Since we cannot use textual data in our analysis, categorial variables are coded as dummy variables. Each dummy variable represents one of the categories in the categorial columns.


Column Name | Description | Definition | Type

yy | Client subscribes for a term deposit | y='yes' | Boolean (0 or 1)

hsng | Client has a housing loan | housing='yes' | Boolean (0 or 1)

h_unk | Unknown if the client has a housing loan | housing='unknown' | Boolean (0 or 1)

def | Client has credit in default | default='yes' | Boolean (0 or 1)

d_unk | Unknown if the client has credit in default | default='unknown' | Boolean (0 or 1)

loans | Client has a personal loan | loan='yes' | Boolean (0 or 1)

l_unk | Unknown if the client has a personal loan | loan='unknown' | Boolean (0 or 1)

nonxst | Previous outcome of marketing campaign is nonexistent | poutcome='nonexistent' | Boolean (0 or 1)

succ | Previous outcome of marketing campaign was a success | poutcome='success' | Boolean (0 or 1)

blue | Client occupation: blue-collar worker | job='blue-collar' | Boolean (0 or 1)

tech | Client occupation: technician | job='technician' | Boolean (0 or 1)

j_unk | Client occupation: unknown | job='unknown' | Boolean (0 or 1)

svcs | Client occupation: services | job='services' | Boolean (0 or 1)

mgmt | Client occupation: management | job='management' | Boolean (0 or 1)

ret | Client occupation: retired | job='retired' | Boolean (0 or 1)

entr | Client occupation: entrepreneur | job='entrepreneur' | Boolean (0 or 1)

self | Client occupation: self-employed | job='self-employed' | Boolean (0 or 1)

maid | Client occupation: housemaid | job='housemaid' | Boolean (0 or 1)

unemp | Client occupation: unemployed | job='unemployed' | Boolean (0 or 1)

stud | Client occupation: student | job='student' | Boolean (0 or 1)

marr | Marital status: married | marital='married' | Boolean (0 or 1)

sgl | Marital status: single | marital='single' | Boolean (0 or 1)

m_unk | Marital status: unknown | marital='unknown' | Boolean (0 or 1)


Principal Component Analysis

In this example, a principal component analysis is used as a dimension reduction technique to determine the principal components of a data set containing bank marketing information. These principal components are then used in a logistic regression to predict whether or not a customer subscribed for a term deposit.

The principal component analysis, using the 1010data function g_pca(G;S;XX;Z), will be performed on the Bank Marketing Data Set on page 14, which was used in the Logistic Regression on page 3 example. This data set contains information related to a campaign by a Portuguese banking institution to get its customers to subscribe for a term deposit.

The principal component analysis uses the following 10 variables in that data set:




• Run the principal component analysis (on the continuous variables in the original data set and the dummy variables that we created) using the correlation matrix of the data.
• Obtain the principal components.
• Extract the eigenvalues and eigenvectors from the PCA model and calculate the cumulative sum of the eigenvalues to show their distribution.
• Chart both the distribution of explained variance in the PCA and the cumulative distribution of the variance.
• Obtain various model statistics, including the number of observations, the mean value of a specific column, and the standard deviation of a specific column.
• Divide the data into a training set and a test set.
• Run the logistic regression on the training data set based on the first three principal components.
• Obtain the predicted probability that a customer has subscribed for a term deposit.
• Create a cumulative gains chart and calculate the area under the curve (AUC) for the test data.
• Chart the logistic curve for both the training and test data.

The results of the logistic regression should be similar to those found in the Logistic Regression on page 3 example, depending on the number of principal components used.





3. Using g_pca(G;S;XX;Z), we run the principal component analysis on the continuous variables in the original data set and the dummy variables that we created. We use the corr method, which means we standardize our data first.

<note>COMPUTE PCA MODEL WITH 26 VARIABLES</note>
<willbe name="model_pca" value="g_pca(;;age duration previous empvarrate hsng h_unk def d_unk loans l_unk nonxst succ blue tech j_unk svcs mgmt ret entr self maid unemp stud marr sgl m_unk;'method' 'corr')"/>

This creates a column named model_pca that contains the results of the principal component analysis:

Clicking on the > opens a window containing a summary of the principal component analysis:

4. We can then obtain the principal components by using the score(XX;M;Z) function.




<note>OBTAIN FIRST PRINCIPAL COMPONENT</note>
<willbe name="pc1" value="score(age duration previous empvarrate hsng h_unk def d_unk loans l_unk nonxst succ blue tech j_unk svcs mgmt ret entr self maid unemp stud marr sgl m_unk; model_pca ; 1)"/>
<note>OBTAIN SECOND PRINCIPAL COMPONENT</note>
<willbe name="pc2" value="score(age duration previous empvarrate hsng h_unk def d_unk loans l_unk nonxst succ blue tech j_unk svcs mgmt ret entr self maid unemp stud marr sgl m_unk; model_pca ; 2)"/>
<note>OBTAIN THIRD PRINCIPAL COMPONENT</note>
<willbe name="pc3" value="score(age duration previous empvarrate hsng h_unk def d_unk loans l_unk nonxst succ blue tech j_unk svcs mgmt ret entr self maid unemp stud marr sgl m_unk; model_pca ; 3)"/>

Note: For our example, we extract the first three principal components. However, we could use the distribution of the eigenvalues to more precisely determine the number of principal components we should use in the logistic regression in order to obtain the best results. We see how to do that in the following steps.

5. Let's use the param(M;P;I) function to extract the eigenvalues and eigenvectors from the PCA model. Note that we extract one value at a time.

<willbe name="eigen_value1" value="param(model_pca;'evals';1)"/>
<willbe name="eigen_value2" value="param(model_pca;'evals';2)"/>
<willbe name="eigen_vector_1_elem_1" value="param(model_pca;'evecs';1 1)"/>
<willbe name="eigen_vector_1_elem_2" value="param(model_pca;'evecs';1 2)"/>

6. Alternatively, we can calculate the eigenvalues for the PCA model all at once in one column and then calculate their cumulative sum to show their distribution. This distribution can then be used to determine how many principal components to use in the logistic regression.

Note: The number of eigenvalues we extract from the PCA model corresponds to the number of variables in our analysis. So, for our example, we extract 26 eigenvalues.

<note>CALCULATE EIGENVALUES IN ONE COLUMN</note>
<willbe name="temp_i" value="mod(i_(1);26)"/>
<willbe name="i" value="if(temp_i=0;26;temp_i)"/>
<willbe name="eigen_value" value="param(model_pca;'evals';i)"/>
<note>VARIANCE DISTRIBUTION FOR EIGENVALUE</note>
<willbe name="indicator" value="i_(1)<=26"/>
<willbe name="cum_variance" value="g_cumsum(;indicator;;eigen_value)/g_sum(;indicator;eigen_value)"/>



Using this information, we can determine how many principal components we should use based on the value in the cum_variance column. We can see that if we want to include 80% of the information from our original data, we need to use at least 16 principal components.
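The selection rule described above can be sketched outside 1010data in a few lines of Python. This is only an illustration of the arithmetic behind cum_variance, not the Macro Language implementation; the function names and the sample eigenvalues are hypothetical.

```python
def cumulative_variance(eigenvalues):
    """Return the running share of total variance, one entry per component."""
    total = sum(eigenvalues)
    shares = []
    running = 0.0
    for value in eigenvalues:
        running += value
        shares.append(running / total)
    return shares


def components_needed(eigenvalues, threshold):
    """Smallest number of leading components whose cumulative share reaches threshold."""
    for count, share in enumerate(cumulative_variance(eigenvalues), start=1):
        if share >= threshold:
            return count
    return len(eigenvalues)


# Hypothetical eigenvalues sorted from largest to smallest
# (made up for illustration, not the values from this example).
example_eigenvalues = [4.0, 3.0, 2.0, 1.0]
print(cumulative_variance(example_eigenvalues))
print(components_needed(example_eigenvalues, 0.8))
```

With the made-up eigenvalues above, three components are needed to reach 80% of the variance. In the Bank Marketing example the same rule, applied to the 26 eigenvalues, gives the 16 components mentioned in the text.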

7. To visualize the distribution of explained variance in the PCA, use the 1010data Chart Builder.

a) Click Chart > Bar.
b) Drag the eigen_value column to the DATA (BARS) area.
c) Click Update.

You should see a chart similar to the following:


8. You can also plot the cumulative distribution of the variance.

a) Click Chart > Scatter.
b) Drag the cum_variance column to the DATA (Y-AXIS) area.
c) Under the Settings section, enter 26 for X-Range (max).
d) Click Update.

You should see a chart similar to the following:


9. We can use the param(M;P;I) function to obtain various model statistics, depending on our analytical purposes. Some of these statistics include the number of observations, the mean value of a specific column, and the standard deviation of a specific column.

<note>OBTAIN VARIOUS MODEL STATISTICS</note>
<willbe name="num_observations" value="param(model_pca;'valcnt';)"/>
<willbe name="mean_1" value="param(model_pca;'center';1)"/>
<willbe name="mean_2" value="param(model_pca;'center';2)"/>
<willbe name="std_dev_1" value="param(model_pca;'scale';1)"/>
<willbe name="std_dev_2" value="param(model_pca;'scale';2)"/>

10. Next, we want to create a column that we will use to separate training data and test data. We want to use 90% of our data as training data.

<note>SELECT TRAINING DATA</note>
<willbe name="train" value="draw_(41185;0)<0.9"/>
<willbe name="test" value="train<>1"/>
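As a rough illustration of what the train and test flags represent, the following Python sketch assigns each row to the training set with probability 0.9. It assumes that the comparison with 0.9 is applied to a uniform random number between 0 and 1 for each row; that reading of draw_ is an assumption here, and split_flags is a hypothetical helper, not a 1010data function.

```python
import random


def split_flags(num_rows, train_fraction=0.9, seed=0):
    """Return (train, test) lists of booleans, one pair per row."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train = [rng.random() < train_fraction for _ in range(num_rows)]
    test = [not flag for flag in train]  # test is the complement of train
    return train, test


train, test = split_flags(1000)
print(sum(train), sum(test))
```

Because each row is drawn independently, the training share is close to 90% rather than exactly 90%.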

11. For demonstration purposes, we run the logistic model (g_logreg(G;S;Y;XX;Z)) using just the first three principal components we stored earlier (instead of the 26 variables we used in the Logistic Regression on page 3 example) and using the column yy as a response, which is 1 if a customer has subscribed for a term deposit.

We use the train column from the previous step as the second parameter of the g_logreg(G;S;Y;XX;Z) function. The train column acts as a selector, so that the model is fit using only the rows flagged as training data (about 90% of the data). We also specify options for the Z parameter that control the convergence criteria.

<note>COMPUTE LOGREG USING THE FIRST 3 PRINCIPAL COMPONENTS</note>
<willbe name="model" value="g_logreg(;train;yy;1 pc1 pc2 pc3;'cgdeveps' 0.0000001 'lreps' 0.000000001)"/>

Note: The first element of XX must be the special value 1 for the constant (intercept) term in the linear model.

This creates a column named model that contains the results of the logistic regression:






12. We can then use the score(XX;M;Z) function to obtain the predicted probability (prob_score) returned by the logistic model, which in our example represents the probability a person subscribed for a term deposit.

<note>OBTAIN MODEL SCORE WITH SCORE FUNCTION</note>
<willbe name="prob_score" value="score(1 pc1 pc2 pc3;model;)"/>

13. To create the cumulative gains chart and calculate the area under the curve, clone the current tab and follow the steps in Chart cumulative gains and calculate the AUC on page 10 in the cloned tab.


You should also see a chart that looks like the one below:




14. We can also calculate the logit of the predicted probability (prob_score), which we can use for purposes of further analysis such as visualization.


<willbe name="z_estimate" value="loge(prob_score / (1-prob_score))"/>
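For reference, the logit used for z_estimate and its inverse, the logistic function, can be written as the short Python sketch below. It is included only to make the relationship between prob_score and z_estimate concrete; the function names are illustrative.

```python
import math


def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def logistic(z):
    """Inverse of logit: maps a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


print(logit(0.5))            # a probability of 0.5 has log-odds 0
print(logistic(logit(0.2)))  # round trip returns approximately 0.2
```

Plotting the probability against its logit traces the S-shaped logistic curve that the charts in the next step display.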

15. We can chart the logistic curve for both the training and test data using the 1010data Chart Builder.

Let's chart the results for our training data:

a) Create computed columns that contain the prob_score and z_estimate for just the training data.

<willbe name="prob_score_train" value="prob_score*train"/>
<willbe name="z_estimate_train" value="z_estimate*train"/>

b) Click Chart > Scatter.
c) Drag the z_estimate_train column to the DATA (X-AXIS) area.
d) Drag the prob_score_train and yy columns to the DATA (Y-AXIS) area.
e) Change X-Range (max) to 3.5.
f) Click Update.

Let's chart the results for our test data:

a) Create computed columns that contain the prob_score and z_estimate for just the test data.


<willbe name="prob_score_test" value="prob_score*test"/>
<willbe name="z_estimate_test" value="z_estimate*test"/>

b) Click Chart > Scatter.
c) Drag the z_estimate_test column to the DATA (X-AXIS) area.
d) Drag the prob_score_test and yy columns to the DATA (Y-AXIS) area.
e) Change X-Range (max) to 3.5.
f) Click Update.

The results should look similar to the following charts:

Based on the chart and the error rate that we got, we can see that the PCA model successfully reduces the dimensions of our data as well as the computations for our logistic model. We can compare these results to those we achieved in the Logistic Regression on page 3 example.


Chart cumulative gains and calculate the AUC

Given a model score and target variable, you can produce a cumulative gains chart and calculate the Area Under the Curve (AUC).

You must have already generated a model using g_logreg(G;S;Y;XX;Z) and obtained the predicted probability using score(XX;M;Z). This example also assumes that the query has defined a set of testing data denoted by the column test.

To chart the cumulative gains and calculate the AUC:

1. Add the following <library> to your query.

Note: You can insert the following Macro Language code anywhere within your query, though it is best practice to include libraries at the top of queries. Alternatively, you can save the library to an external file and then use the <import> operation to import the library into your current query. See the section on Macro Language: Blocks in the 1010data Reference Manual for more information about libraries and blocks.

<library name="cum_gains">
  <block name="cum_gains" score="" target="">
    <note>*****************************************************************************************</note>
    <note>*** Given a model score and target variable, this block will produce the data for a ****</note>
    <note>*** cumulative gains chart and calculate the Area Under the Curve (AUC). ****</note>
    <note>*** ****</note>
    <note>*** In this implementation, AUC is defined to be between -1 and 1, where 0 indicates ****</note>
    <note>*** the model performs the same as a `model` which randomly assigns the probability ****</note>
    <note>*** of observing a target event. An AUC of 1 indicates perfect performance in the ****</note>
    <note>*** sense that ranking by the model score perfectly separates the `1` target events ****</note>
    <note>*** from `0` target events. A negative AUC indicates that the model is ****</note>
    <note>*** `anti-predictive` in the sense that `0` events are assigned a higher score than ****</note>
    <note>*** `1` events. ****</note>
    <note>*** ****</note>
    <note>*** Specifically, here the AUC is defined by integrating the area under the ****</note>
    <note>*** cumulative gains chart and normalizing by subtracting the area under the ****</note>
    <note>*** diagonal (which is the area of a random model) and dividing by the area that ****</note>
    <note>*** would be found for a model that perfectly separates `1`s and `0`s in the target ****</note>
    <note>*****************************************************************************************</note>
    <sel value="{@score}<>na"/>
    <willbe name="score_population" value="g_cnt({@score};)"/>
    <willbe name="score_num_true" value="g_sum({@score};;{@target})"/>
    <willbe name="tot_population" value="g_cnt(;)"/>
    <willbe name="tot_num_true" value="g_sum(;;{@target})"/>
    <sel value="g_first1({@score};;)"/>
    <willbe name="score_rank" value="g_rank(;;;{@score})"/>
    <willbe name="cum_pop" value="g_cumsum(;;score_rank;score_population)"/>
    <willbe name="cum_true" value="g_cumsum(;;score_rank;score_num_true)"/>
    <willbe name="cum_pop_pct" value="100*(cum_pop/tot_population)" format="dec:5" label="% of Population"/>
    <willbe name="cum_true_pct" value="100*(cum_true/tot_num_true)" format="dec:5" label="% of Target"/>
    <note>**** AUC by integration ****</note>
    <willbe name="true_pct_of_pop" value="100*(tot_num_true/tot_population)" format="dec:3"/>
    <willbe name="perfect_auc" value="0.5*(true_pct_of_pop*100)+100*(100-true_pct_of_pop)-0.5*(100^2)"/>
    <willbe name="prev_cum_pop_pct" value="ifnull(g_rshift(;;score_rank;cum_pop_pct;-1);0)"/>
    <willbe name="prev_cum_true_pct" value="ifnull(g_rshift(;;score_rank;cum_true_pct;-1);0)"/>
    <willbe name="bucket_width" value="cum_pop_pct-prev_cum_pop_pct"/>
    <willbe name="bucket_auc" value="0.5*bucket_width*(prev_cum_true_pct+cum_true_pct-prev_cum_pop_pct-cum_pop_pct)"/>
    <willbe name="model_raw_auc" value="g_sum(;;bucket_auc)"/>
    <willbe name="auc" value="model_raw_auc/perfect_auc" format="dec:5" label="AUC"/>
    <colord cols="auc,{@score},cum_pop_pct,cum_true_pct"/>
    <note>*** For charting purposes, insert a row for the (0, 0) intercept ****</note>
    <willbe name="row_num" value="g_cumcnt(;;score_rank)"/>
    <sel value="if(row_num=1;2;1)" expand="1"/>
    <willbe name="origin_row" value="(row_num=1)*(ii_(0)=0)"/>
    <sort col="cum_pop_pct" dir="up"/>
    <willbe name="chart_score" value="if(origin_row=1;1;{@score})" label="Score"/>
    <willbe name="chart_pop_pct" value="if(origin_row=1;0;cum_pop_pct)" format="dec:5" label="Pct of Population"/>
    <willbe name="chart_true_pct" value="if(origin_row=1;0;cum_true_pct)" format="dec:5" label="Model Pct of Target"/>
    <willbe name="chart_random_model" value="chart_pop_pct" label="Random Model"/>
    <willbe name="chart_perfect_model" value="100*min(1;cum_pop/tot_num_true)" format="dec:5" label="Perfect Model"/>
    <colord cols="auc,chart_score,chart_pop_pct,chart_true_pct,chart_random_model,chart_perfect_model"/>
  </block>
</library>
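To make the AUC definition in the block's notes easier to follow, here is a minimal Python sketch of the same arithmetic: the area between the cumulative gains curve and the diagonal, divided by the corresponding area for a perfect model. It assumes that rows are visited from the highest score to the lowest and that rows sharing a score form one bucket; the function name is hypothetical, and the sketch does not replace the Macro Language block.

```python
from itertools import groupby


def normalized_gains_auc(scores, targets):
    """AUC in [-1, 1]: 0 matches a random model, 1 a perfect ranking."""
    # Assumption: highest scores come first on the gains chart.
    rows = sorted(zip(scores, targets), key=lambda row: row[0], reverse=True)
    total_pop = len(rows)
    total_true = sum(target for _, target in rows)
    true_pct_of_pop = 100.0 * total_true / total_pop
    # Area between the perfect-model curve and the diagonal (percent units).
    perfect_area = (0.5 * true_pct_of_pop * 100.0
                    + 100.0 * (100.0 - true_pct_of_pop)
                    - 0.5 * 100.0 ** 2)

    raw_area = 0.0
    cum_pop = 0
    cum_true = 0
    prev_pop_pct = 0.0
    prev_true_pct = 0.0
    for _, group in groupby(rows, key=lambda row: row[0]):
        bucket = list(group)  # all rows that share one score value
        cum_pop += len(bucket)
        cum_true += sum(target for _, target in bucket)
        pop_pct = 100.0 * cum_pop / total_pop
        true_pct = 100.0 * cum_true / total_true
        width = pop_pct - prev_pop_pct
        # Trapezoid of (gains curve minus diagonal) over this bucket.
        raw_area += 0.5 * width * (prev_true_pct + true_pct - prev_pop_pct - pop_pct)
        prev_pop_pct, prev_true_pct = pop_pct, true_pct
    return raw_area / perfect_area


print(normalized_gains_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
```

The sketch requires at least one target event and at least one non-event; otherwise one of the divisions is by zero.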

2. Select the testing data.

Note: The following Macro Language code must be added after the calls to g_logreg(G;S;Y;XX;Z) and score(XX;M;Z).



http://www2.1010data.com/documentationcenter/beta/1010dataReferenceManual/index_frames.html?q=MLOperations/blocks/library.html

http://www2.1010data.com/documentationcenter/beta/1010dataReferenceManual/index_frames.html?q=MLOperations/blocks/MacroLanguageBlocks.html




<sel value="test=1"/>

3. Insert the cum_gains <block> in your query.

The value for the score variable should be the name of the column containing the results from score(XX;M;Z), which in our case is prob_score. The value for target should be the column name denoting the dependent variable specified to g_logreg(G;S;Y;XX;Z), which is the Y parameter. In our example, this is the column yy.

<insert block="cum_gains" score="prob_score" target="yy"/>

Note: If you have saved the <library> in an external file, you must also do an <import> before you do the <insert>.


4. We can chart the cumulative gains using the 1010data Chart Builder.

a) Click Chart > Line.
b) Drag the Pct of Population (chart_pop_pct) column to the DATA (X-AXIS) area.
c) Drag the Model Pct of Target (chart_true_pct), Random Model (chart_random_model), and Perfect Model (chart_perfect_model) columns to the DATA (Y-AXIS) area.
d) Click Update.

You should see a chart that looks like the one below:

http://www2.1010data.com/documentationcenter/beta/1010dataReferenceManual/index_frames.html?q=MLOperations/blocks/block.html





Source




Input Variables





• admin
• blue-collar
• entrepreneur
• housemaid
• management
• retired
• self-employed
• services
• student
• technician
• unemployed
• unknown







Categorial:



Categorial:



Categorial:



Categorial:



Categorial:





Categorial:



Numeric



Numeric


Numeric



Numeric


Categorial:




Numeric



Numeric


Numeric





Numeric



Numeric

Output Variable

Column Name    Description    Type
                              Binary (yes or no)

Dummy Variables

Column Name    Description                                       Type
yy             Client subscribes for a term deposit (y='yes')    Boolean (0 or 1)
               housing='yes'                                     Boolean (0 or 1)
               housing='unknown'                                 Boolean (0 or 1)
               default='yes'                                     Boolean (0 or 1)
               default='unknown'                                 Boolean (0 or 1)
               loan='yes'                                        Boolean (0 or 1)
l_unk          Client has a personal loan (loan='unknown')       Boolean (0 or 1)
               poutcome='success'                                Boolean (0 or 1)
               job='blue-collar'                                 Boolean (0 or 1)
               job='technician'                                  Boolean (0 or 1)
               job='unknown'                                     Boolean (0 or 1)
               job='services'                                    Boolean (0 or 1)
               job='management'                                  Boolean (0 or 1)
               job='retired'                                     Boolean (0 or 1)
               job='entrepreneur'                                Boolean (0 or 1)
               job='self-employed'                               Boolean (0 or 1)
               job='housemaid'                                   Boolean (0 or 1)
               job='unemployed'                                  Boolean (0 or 1)
               job='student'                                     Boolean (0 or 1)
               marital='married'                                 Boolean (0 or 1)
               marital='single'                                  Boolean (0 or 1)
m_unk          Marital status: unknown (marital='unknown')       Boolean (0 or 1)


Clustering

In this example, clustering is used to separate a data set containing bank marketing information into two classes. The clustering results are then examined to see if they accurately reflect the underlying pattern in the data set, which in this example is whether or not a customer subscribed for a term deposit.

Clustering, using the 1010data function g_cluster(G;S;XX;A;N;Z), will be performed on the Bank Marketing Data Set on page 14. This data set contains information related to a campaign by a Portuguese banking institution to get its customers to subscribe for a term deposit.

The clustering algorithm uses the following 10 variables in that data set:


Clustering is unsupervised learning, which means we assume there is no label for each observation. However, after running the clustering algorithm, we could compare the value of column y (whether someone subscribed for a term deposit) with our clustering results to see if our algorithm found the underlying pattern of data.



• Run k-means clustering on the continuous variables in the original data set and the dummy variables that we created, partitioning the data into two classes.
• Classify the results of the clustering.
• Obtain the center of each class in various dimensions.
• Chart the two classes and plot their centers.
• Calculate the error rate by determining the percentage of observations that are misclassified.


http://www2.1010data.com/documentationcenter/beta/1010dataReferenceManual/index_frames.html?q=Functions/GroupFunctions/g_cluster.html



3. Using g_cluster(G;S;XX;A;N;Z), we run k-means clustering on the continuous variables in the original data set and the dummy variables that we created. Using the N argument, we specify that we want to partition the data into two classes.

<note>CLUSTER WITH 26 VARIABLES</note>
<willbe name="model_cluster" value="g_cluster(;;age duration previous empvarrate hsng h_unk def d_unk loans l_unk nonxst succ blue tech j_unk svcs mgmt ret entr self maid unemp stud marr sgl m_unk;'kmeans';2;)"/>

This creates a column named model_cluster that contains the results of the clustering:




4. We can then classify the results of the clustering by using the classify(XX;M;Z) function.

<willbe name="class_estimate" value="classify(age duration previous empvarrate hsng h_unk def d_unk loans l_unk nonxst succ blue tech j_unk svcs mgmt ret entr self maid unemp stud marr sgl m_unk;model_cluster;)"/>

The two classes that result from our clustering algorithm have the values 0 and 1 in the class_estimate column.

5. Let's obtain the center of each class for the various dimensions (e.g., age, duration, previous) using the param(M;P;I) function.

http://www2.1010data.com/documentationcenter/beta/1010dataReferenceManual/index_frames.html?q=Functions/MiscellaneousFunctions/classify.html



For instance, param(model_cluster;'centers';2 1) gives us the second dimension of the first class. (In our example, the second dimension is duration.)

So the following will give us the first three dimensions of both classes:

<willbe name="center_class_1_1" value="param(model_cluster;'centers';1 1)"/>
<willbe name="center_class_2_1" value="param(model_cluster;'centers';2 1)"/>
<willbe name="center_class_3_1" value="param(model_cluster;'centers';3 1)"/>
<willbe name="center_class_1_2" value="param(model_cluster;'centers';1 2)"/>
<willbe name="center_class_2_2" value="param(model_cluster;'centers';2 2)"/>
<willbe name="center_class_3_2" value="param(model_cluster;'centers';3 2)"/>

6. Finally, we can visualize the results of the clustering algorithm using the 1010data Chart Builder.

a) Create two columns for each class corresponding to the age and duration columns.

Note: Rows that contain N/A values will not be charted in 1010data.

<willbe name="cluster_1_age" value="if(class_estimate=0;age;NA)"/>
<willbe name="cluster_1_duration" value="if(class_estimate=0;duration;NA)"/>
<willbe name="cluster_2_age" value="if(class_estimate=1;age;NA)"/>
<willbe name="cluster_2_duration" value="if(class_estimate=1;duration;NA)"/>

b) Click Chart > Scatter.
c) To visualize the first class, drag the cluster_1_age column to the DATA (X-AXIS) area and cluster_1_duration to the DATA (Y-AXIS) area.
d) To visualize the second class, drag cluster_2_age to DATA (X-AXIS) and cluster_2_duration to DATA (Y-AXIS).
e) To plot the center of the first class, drag center_class_1_1 to DATA (X-AXIS) and center_class_2_1 to DATA (Y-AXIS).
f) To plot the center of the second class, drag center_class_1_2 to DATA (X-AXIS) and center_class_2_2 to DATA (Y-AXIS).
g) To set the colors for the classes and their centers, under the DATA SERIES section of the Customization Settings panel, enter the following in the Colors field: #c0504d;#000000;#c0504d;#000000
h) To set the scatter point sizes for the classes and their centers, under the DATA SERIES section of the Customization Settings panel, enter the following in the Scatter point sizes field: small,small,large,large
i) Click Update.

The results should look similar to the following chart:

http://www2.1010data.com/documentationcenter/beta/1010dataUsersGuide/index_frames.html?q=Charting/CustomizationSettings.html


In the above chart, the red scatter points represent one class, and the black scatter points represent the other. The larger red and black scatter points represent the centers of those classes.

7. We can calculate the error rate by determining the percentage of observations that are misclassified; that is, we can compare the clustering results to see if they accurately reflect the underlying pattern in the data set, which in our example is whether a customer subscribed for a term deposit or not.

For the most part, the cluster with the value 0 for class_estimate appears to match the rows where y is no, and the cluster with the value 1 for class_estimate appears to match the rows where y is yes.

Let's find the number of observations in which neither of these cases is true to flag those observations as misclassified. Then we'll divide that number by the total number of observations to find the error rate.

<willbe name="misclassified" value="(y='no'&class_estimate=1)|(y='yes'&class_estimate=0)"/>
<willbe name="num_observations" value="n_"/>
<willbe name="error_rate" value="g_cnt(;misclassified)/num_observations"/>
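The error-rate calculation can be illustrated with a small Python sketch. The labels and cluster assignments below are made up, and error_rate is a hypothetical helper. One general property of clustering is that cluster numbers are arbitrary, so the sketch takes the cluster that is treated as "yes" as a parameter.

```python
def error_rate(labels, clusters, yes_cluster=1):
    """Share of rows whose cluster disagrees with the yes/no label."""
    misclassified = sum(
        1
        for label, cluster in zip(labels, clusters)
        if (label == "yes") != (cluster == yes_cluster)
    )
    return misclassified / len(labels)


# Made-up example data, not taken from the Bank Marketing Data Set.
labels = ["no", "no", "yes", "yes", "no"]
clusters = [0, 1, 1, 1, 0]
print(error_rate(labels, clusters))                 # cluster 1 treated as "yes"
print(error_rate(labels, clusters, yes_cluster=0))  # opposite mapping
```

If the error rate under one mapping is well above 0.5, the opposite mapping of clusters to labels is the better match.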



Source




Input Variables





• admin
• blue-collar
• entrepreneur
• housemaid
• management
• retired
• self-employed
• services
• student
• technician
• unemployed
• unknown










Categorial:



Categorial:



Categorial:



Categorial:



Categorial:



Categorial:





Numeric



Numeric


Numeric



Numeric


Categorial:




Numeric



Numeric



Numeric


Numeric


Numeric




Output Variable

Column Name    Description    Type
                              Binary (yes or no)

Dummy Variables

Column Name    Description                                       Type
yy             Client subscribes for a term deposit (y='yes')    Boolean (0 or 1)
               housing='yes'                                     Boolean (0 or 1)
               housing='unknown'                                 Boolean (0 or 1)
               default='yes'                                     Boolean (0 or 1)
               default='unknown'                                 Boolean (0 or 1)
               loan='yes'                                        Boolean (0 or 1)
l_unk          Client has a personal loan (loan='unknown')       Boolean (0 or 1)
               poutcome='success'                                Boolean (0 or 1)
               job='blue-collar'                                 Boolean (0 or 1)
               job='technician'                                  Boolean (0 or 1)
               job='unknown'                                     Boolean (0 or 1)
               job='services'                                    Boolean (0 or 1)
               job='management'                                  Boolean (0 or 1)
               job='retired'                                     Boolean (0 or 1)
               job='entrepreneur'                                Boolean (0 or 1)
               job='self-employed'                               Boolean (0 or 1)
               job='housemaid'                                   Boolean (0 or 1)
               job='unemployed'                                  Boolean (0 or 1)
               job='student'                                     Boolean (0 or 1)
               marital='married'                                 Boolean (0 or 1)
               marital='single'                                  Boolean (0 or 1)
m_unk          Marital status: unknown (marital='unknown')       Boolean (0 or 1)


Least Squares Regression

In this example, a least squares regression is performed on a data set containing the returns of a number of international stock exchanges and is used to show the linear relationship between the Istanbul Stock Exchange and the other exchanges.

The least squares regression, using the 1010data function g_lsq(G;S;Y;XX), is applied to the Istanbul Stock Exchange Data Set on page 58, which contains the returns of the Istanbul Stock Exchange as well as seven other international exchanges from June 5, 2009 to February 22, 2011.

The analysis uses the following 7 variables in that data set as predictors:

• sp
• dax
• ftse
• nikkei
• bovespa
• eu
• em

As a response, the column ise2 is used.

After applying the least squares technique, the results show the linear relationship between the seven international exchanges and the Istanbul Stock Exchange.

• Run the model on the seven predictors and the response.
• Obtain the predicted value of the linear model.
• Obtain the coefficients of the linear model.
• Obtain the p-values of the coefficients.
• Perform a stepwise regression using backward elimination until all the remaining predictors' p-values are less than 0.05.
• Visualize the results of the least squares regression.
• Obtain various statistics for the model such as the degrees of freedom, residual sum of squares, mean squared error, and number of observations.
• Calculate the standard error of the coefficients.
• Chart the residual plot.
• Plot the predicted value against the original response.
• Chart the QQ plot.
• Chart the PP plot.

1. Open the Istanbul Stock Exchange data set (pub.demo.mleg.uci.istanbul).

http://www2.1010data.com/documentationcenter/beta/1010dataReferenceManual/index_frames.html?q=Functions/GroupFunctions/g_lsq.html


2. Run the model on the seven predictors as well as the response that we have selected using the g_lsq(G;S;Y;XX) function.

<willbe name="model_1" value="g_lsq(;;ise2;1 sp dax ftse nikkei bovespa eu em)"/>

Note: As the first element of XX, we specify the special value 1 for the constant (intercept) term in the linear model.

This creates a column named model_1 that contains the results of the least squares regression:


3. We can then obtain the predicted value of the linear model using the score(XX;M;Z) function.

<note>OBTAIN MODEL SCORE WITH SCORE FUNCTION</note>




<willbe name="pred_1" value="score(1 sp dax ftse nikkei bovespa eu em;model_1;)" format="dec:7"/>


4. To obtain the coefficients of the linear model, we can use the param(M;P;I) function.

In the following Macro Language code, we obtain the parameters for the intercept (b0) and the first two variables (b1 and b2).

<note>OBTAIN MODEL COEFFICIENTS</note>
<willbe name="b0" value="param(model_1;'b';1)" format="dec:7"/>
<willbe name="b1" value="param(model_1;'b';2)" format="dec:7"/>
<willbe name="b2" value="param(model_1;'b';3)" format="dec:7"/>

5. We can also obtain the p-values of the coefficients using the param(M;P;I) function. We will use these to conduct the variable selection later in the analysis.

In the following Macro Language code, we obtain the p-value for the intercept (p0) and the first two variables (p1 and p2).

<note>OBTAIN P-VALUES</note>
<willbe name="p0" value="param(model_1;'p';1)" format="dec:7"/>
<willbe name="p1" value="param(model_1;'p';2)" format="dec:7"/>
<willbe name="p2" value="param(model_1;'p';3)" format="dec:7"/>

6. However, one might want to obtain all of the coefficients in one column and the p-values in another, which can be achieved with the following Macro Language code:

<note>CALCULATE ALL COEFFICIENTS IN ONE COLUMN AND P-VALUES IN ANOTHER</note>
<willbe name="var_names_1" value="'intercept,sp,dax,ftse,nikkei,bovespa,eu,em'"/>
<willbe name="temp_i_1" value="mod(i_(1);8)"/>
<willbe name="i_1" value="if(temp_i_1=0;8;temp_i_1)"/>
<willbe name="b_1" value="param(model_1;'b';i_1)" format="dec:7"/>
<willbe name="p_1" value="param(model_1;'p';i_1)" format="dec:7"/>
<willbe name="var_name_1" value="csl_pick(var_names_1;i_1)"/>
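The indexing trick in the code above (take the row number modulo 8 and map a remainder of 0 to 8) cycles through the indices 1 to 8 as the row number increases. The Python sketch below reproduces that arithmetic. It assumes that the row number is 1-based and that picking from the comma-separated list is also 1-based; both helper names are hypothetical.

```python
def cyclic_index(row_number, period):
    """Map a 1-based row number onto 1..period, repeating every period rows."""
    remainder = row_number % period
    return period if remainder == 0 else remainder


def pick(comma_separated, index):
    """Return the item at a 1-based position in a comma-separated list."""
    return comma_separated.split(",")[index - 1]


names = "intercept,sp,dax,ftse,nikkei,bovespa,eu,em"
for row in (1, 2, 8, 9):
    index = cyclic_index(row, 8)
    print(row, index, pick(names, index))
```

Only the first 8 rows are needed to read off all coefficients; later rows repeat the same values.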




Note: The number of coefficients and p-values we obtain from the linear model corresponds to the number of variables in our analysis. So, for our example, we obtain 8 coefficients and 8 p-values, which correspond to the intercept plus the 7 predictors.

7. We now perform a stepwise regression using backward elimination.

We start by eliminating the variable that has the largest p-value greater than 0.05. Then, we run the model with the remaining variables. We repeat this process until all the remaining predictors' p-values are less than 0.05.

a) In our example, sp has the largest p-value that is greater than 0.05, so we want to eliminate that variable and run the model with the remaining 6 variables.

<willbe name="model_2" value="g_lsq(;;ise2;1 dax ftse nikkei bovespa eu em)"/>

b) Next, we obtain the predicted value for this second model.

<willbe name="pred_2" value="score(1 dax ftse nikkei bovespa eu em;model_2;)" format="dec:7"/>

c) Then, we obtain all of the coefficients and p-values.

<willbe name="var_names_2" value="'intercept,dax,ftse,nikkei,bovespa,eu,em'"/>
<willbe name="temp_i_2" value="mod(i_(1);7)"/>
<willbe name="i_2" value="if(temp_i_2=0;7;temp_i_2)"/>
<willbe name="b_2" value="param(model_2;'b';i_2)" format="dec:7"/>
<willbe name="p_2" value="param(model_2;'p';i_2)" format="dec:7"/>
<willbe name="var_name_2" value="csl_pick(var_names_2;i_2)"/>

Note: Once again, the number of coefficients and p-values we obtain from the linear model corresponds to the number of variables in our analysis. So, for our example, we obtain 7 coefficients and 7 p-values, which correspond to the intercept plus the 6 predictors.

d) Now we eliminate the next variable which has the largest p-value greater than 0.05 (which, in our example, is nikkei).

Then run the model with the remaining variables, which results in the following:


e) The next variable to drop is the intercept. Then we run the model with the remaining variables.

f) Next, drop ftse and run the model again:

g) Now, drop dax and run the model again:

Finally, all of the remaining predictors' p-values are less than 0.05.

So, the final model is:

Y = -0.233*bovespa + 0.700*eu + 1.036*em

In our example, the final model is in the column named model_6 and the final predicted value is pred_6.

For clarity in the calculations going forward, let's put our results in the more generically named columns: model and pred.

<willbe name="model" value="model_6"/>
<willbe name="pred" value="pred_6"/>
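The backward elimination loop carried out by hand in this step can be summarized in a short Python sketch. It is a schematic of the procedure only: fit_p_values stands for whatever refits the model and returns a p-value per remaining term (in 1010data, a new g_lsq call followed by param), and the stub below returns fixed, made-up p-values so that the example is self-contained.

```python
def backward_eliminate(terms, fit_p_values, alpha=0.05):
    """Drop the term with the largest p-value until none exceeds alpha."""
    remaining = list(terms)
    dropped = []
    while remaining:
        p_values = fit_p_values(remaining)  # refit with the current terms
        worst = max(remaining, key=lambda term: p_values[term])
        if p_values[worst] <= alpha:
            break  # every remaining term is significant at level alpha
        remaining.remove(worst)
        dropped.append(worst)
    return remaining, dropped


# Made-up p-values for illustration only. A real refit would produce new
# p-values each time a term is removed.
FAKE_P = {"intercept": 0.7, "sp": 0.9, "dax": 0.5, "ftse": 0.6,
          "nikkei": 0.8, "bovespa": 0.01, "eu": 0.001, "em": 0.0001}


def stub_fit(terms):
    return {term: FAKE_P[term] for term in terms}


kept, removed = backward_eliminate(list(FAKE_P), stub_fit)
print(kept)     # terms that survive the elimination
print(removed)  # terms in the order they were dropped
```

The made-up p-values were chosen so that the drop order matches the walkthrough (sp, nikkei, the intercept, ftse, dax), leaving bovespa, eu, and em.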

8. We can visualize the results of the least squares regression using the 1010data Chart Builder.

a) Click Chart > Scatter.
b) Drag the eu column to the DATA (X-AXIS) area.
c) Drag the ise and pred columns to the DATA (Y-AXIS) area.
d) Click Update.



9. Using the param(M;P;I) function, we can obtain various statistics for this model such as the degrees of freedom of the model, residual sum of squares, mean squared error, number of observations, average of Y, R², and adjusted R².

<note>OBTAIN VARIOUS MODEL STATISTICS</note>
<willbe name="dof" value="param(model;'df';)"/>
<willbe name="sum_sq_resids" value="param(model;'chi2';)" format="dec:7"/>
<willbe name="mean_sq_err" value="sum_sq_resids/dof" format="dec:7"/>
<willbe name="num_observations" value="param(model;'valcnt';)"/>
<willbe name="avg_y" value="param(model;'ybar';)" format="dec:7"/>
<willbe name="R_squared" value="param(model;'r2';)" format="dec:7"/>
<willbe name="adjusted_R_squared" value="param(model;'adjr2';)" format="dec:7"/>

10. If we want to calculate the standard error of the three coefficients (se1, se2, and se3), we must first obtain g1, g2, and g3, the diagonal values of (XᵀX)⁻¹, where X is the matrix of input values. We can obtain g1, g2, and g3 using the param(M;P;I) function.

<note>CALCULATE STANDARD ERRORS</note>
<willbe name="g1" value="param(model;'g';1)" format="dec:7"/>
<willbe name="g2" value="param(model;'g';2)" format="dec:7"/>
<willbe name="g3" value="param(model;'g';3)" format="dec:7"/>
<willbe name="se1" value="sqrt(g1*mean_sq_err)" format="dec:7"/>
<willbe name="se2" value="sqrt(g2*mean_sq_err)" format="dec:7"/>
<willbe name="se3" value="sqrt(g3*mean_sq_err)" format="dec:7"/>
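The standard error formula used above, the square root of a diagonal element of (XᵀX)⁻¹ times the mean squared error, can be checked by hand with a small Python sketch. The input numbers are made up, and the function name is illustrative.

```python
import math


def standard_error(g_diagonal, sum_sq_resids, dof):
    """Standard error of one coefficient from its (X^T X)^-1 diagonal element."""
    mean_sq_err = sum_sq_resids / dof  # residual sum of squares / degrees of freedom
    return math.sqrt(g_diagonal * mean_sq_err)


# Made-up inputs: g = 4.0, residual sum of squares = 2.5, 10 degrees of freedom.
print(standard_error(4.0, 2.5, 10))
```

With these inputs the mean squared error is 0.25 and the standard error is 1.0.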

11. To check the assumption of the linear model, we might want to create a residual plot. The residual is the difference between the actual value (ise2) and the predicted value (pred).

<willbe name="residual" value="ise2-pred" format="dec:7"/>




We can then create a scatter chart in the 1010data Chart Builder with pred as the x-axis and residual as the y-axis.

For a residual plot, you want the distribution of points to be random, as in the chart above. If the distribution looks like a quadratic line or other non-linear form, you would probably need to transform your data in some way (e.g., using a log or square root function first).

As a comparison, you can also create a scatter chart with eu as the x-axis and residual as the y-axis:


12. To see how good our fit is, we might want to plot the predicted value against the original response.

Create a scatter chart with the predicted value (pred) as the x-axis and the original response (ise2) as the y-axis.

This should look fairly linear if our estimation is good.

13. Another useful visualization is the QQ plot, which shows the relationship between the theoretical quantile and the sample quantile.

<note>QQ plot</note>
<tabu label="Tabulation on Istanbul" breaks="residual">
<tcol source="residual" fun="cnt" name="count" label="Count"/>
</tabu>
<sort col="residual" dir="up"/>
<willbe name="residual_cdf" value="g_cumsum(;;;count)/g_sum(;;count)" format="dec:7"/>
<willbe name="theoretical_quantile" value="normal_cdf_inv(residual_cdf;0;1)" format="dec:7"/>

We can then chart the QQ plot as a scatter chart in 1010data using the theoretical quantile (theoretical_quantile) as the x-axis and the sample quantile (residual) as the y-axis.

We assume the residual follows a Gaussian distribution, in which case the QQ plot should be a straight line (as in the chart above). If the QQ plot is not a straight line, one should be careful when doing calculations that rely on normal assumptions, such as confidence intervals or p-values.
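The same idea can be sketched in Python with the standard library. This is an illustrative sketch, not the 1010data implementation: the function name qq_points is hypothetical, and it uses the plotting position (rank − 0.5)/n instead of the cumulative count divided by the total, because Python's inverse normal CDF is only defined for probabilities strictly between 0 and 1.

```python
from statistics import NormalDist

def qq_points(residuals):
    """Return (theoretical_quantile, sample_quantile) pairs for a normal QQ plot."""
    ordered = sorted(residuals)
    n = len(ordered)
    std_normal = NormalDist(0.0, 1.0)
    points = []
    for rank, value in enumerate(ordered, start=1):
        # (rank - 0.5) / n stays strictly inside (0, 1); rank / n would reach 1.0
        # for the largest residual, where the inverse CDF is unbounded.
        p = (rank - 0.5) / n
        points.append((std_normal.inv_cdf(p), value))
    return points
```

Plotting the first element of each pair on the x-axis and the second on the y-axis gives the QQ plot. Points close to a straight line are consistent with normally distributed residuals.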

14. You might also want to see the PP plot, which shows the theoretical cumulative probability vs. the sample cumulative probability.

<note>PP plot</note>
<note>The tabulation and sort from the previous step would go here if the previous step was not performed.</note>
<willbe name="sample_cumulative_distribution" value="g_cumsum(;;;count)/g_sum(;;count)"/>
<willbe name="resi_sd" value="sqrt(g_var(;;residual))"/>
<willbe name="theoretical_cumulative_distribution" value="normal_cdf(residual;0;resi_sd)"/>

Note: The <tabu> and <sort> operations from the previous step (QQ plot) need to have been performed before these <willbe> operations.


We can then chart the PP plot as a scatter chart in 1010data using the theoretical cumulative probability (theoretical_cumulative_distribution) as the x-axis and the sample cumulative probability (sample_cumulative_distribution) as the y-axis.

If the normal assumptions hold, the PP plot should be a straight line (as in the chart above).
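A minimal Python sketch of the PP plot calculation follows. The function name pp_points is hypothetical, the sketch assumes distinct residual values, and it uses the sample standard deviation (statistics.stdev). Whether g_var computes the sample or population variance is not stated here, so that choice is an assumption.

```python
from statistics import NormalDist, stdev

def pp_points(residuals):
    """Return (theoretical_probability, sample_probability) pairs for a normal PP plot."""
    ordered = sorted(residuals)
    n = len(ordered)
    # Normal distribution centered at 0 with the residuals' standard deviation.
    fitted = NormalDist(0.0, stdev(ordered))
    # Sample cumulative probability: share of residuals at or below each value.
    return [(fitted.cdf(value), rank / n) for rank, value in enumerate(ordered, start=1)]
```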

Istanbul Stock Exchange Data Set

This data set was obtained from the UC Irvine Machine Learning Repository and contains the returns of the Istanbul Stock Exchange (ISE) with seven other international indices: S&P 500, DAX, FTSE, Nikkei, Bovespa, MSCI Europe, and MSCI Emerging Markets from June 5, 2009 to February 22, 2011.

Source

This data set was obtained by downloading data_akbilgic.xlsx from https://archive.ics.uci.edu/ml/datasets/ISTANBUL+STOCK+EXCHANGE.

The table contains 536 rows and 10 columns.

The path to this data set is pub.demo.mleg.uci.istanbul.



Input Variables

There are 7 columns in the table that provide information about each stock market return index.


sp S&P 500 Index (New York Stock Exchange) Numeric

dax Deutscher Aktien Index (Frankfurt Stock Exchange) Numeric

ftse FTSE 100 Index (London Stock Exchange) Numeric

nikkei Nikkei Index (Tokyo Stock Exchange) Numeric

bovespa Bovespa Index (Brasil Sao Paulo Stock Exchange) Numeric

eu MSCI Europe Index Numeric

em MSCI Emerging Markets Index Numeric

Note: The first column in the table is named date and corresponds to the date of the returns.

Output Variable



ise2 ISE (Istanbul Stock Exchange) Numeric

Note: This column was named ise.2 in the original data set.

Note: The column ise2 is USD-based, whereas the column ise1 (which was named ise.1 in the original data set) is based on the Turkish lira.


Weighted Least Squares Regression

In this example, a weighted least squares regression is applied to a data set containing weighted census data to show the relationship between both the age and education level of a worker and that person's income.

The weighted least squares regression, using the 1010data function g_wlsq(G;S;Y;W;XX), is applied to the Census Income Data Set, which contains weighted census data extracted from the 1994 and 1995 Current Population Surveys conducted by the U.S. Census Bureau.

The analysis uses the following 2 variables in that data set as predictors:

• age
• edu_year

It also uses the square of the age, which we calculate in this tutorial.

For the weight, we will use the column instance_weight, which represents how each person in the survey relates demographically to the overall population.

As a response, the column wage_per_hour is used.

After applying the weighted least squares technique, the results show the linear relationship between both the age and education level of the worker and that person's wages per hour.


• Select only those rows where wages are not equal to zero, since we only want to do the regression for those people who have a job.
• Check the relationship of each of the predictors to the response and adjust for those that have a quadratic form.
• Fit the model on the three predictors and the response.
• Obtain the predicted value of the linear model.
• Obtain the coefficients of the linear model.
• Obtain the p-values of the coefficients.
• Obtain various statistics for the model such as the degrees of freedom, residual sum of squares, mean squared error, and number of observations.
• Calculate the AIC.
• Visualize the results of the weighted least squares regression by plotting the age against both the wages per hour and the predicted value of the linear model.

1. Open the Census Income data set (pub.demo.mleg.uci.censusincome).

http://www2.1010data.com/documentationcenter/beta/1010dataReferenceManual/index_frames.html?q=Functions/GroupFunctions/g_wlsq.html


2. Select only those rows where wage_per_hour is not equal to 0, since we only want to do the regression for those people who have a job.

<sel value="(wage_per_hour<>0)"/>

3. Since we're using age as a predictor, let's see if it has a linear relationship to wage_per_hour, our response.

a) For visualization purposes, we can look at the relationship between the average wage per hour and age.

Let's first calculate the average wage per hour grouping by age. We'll use the function g_avg(G;S;X) and we'll specify the age column for the G parameter (we're grouping by age) and wage_per_hour as the X parameter (since we want to calculate the average wage per hour). We'll omit the S parameter since we want to consider all rows in the table.

<willbe name="avg_wage_per_hour_age" value="g_avg(age;;wage_per_hour)" format="dec:2"/>

b) Click Chart > Line.
c) Drag the age column to the DATA (X-AXIS) area.
d) Drag the avg_wage_per_hour_age column to the DATA (Y-AXIS) area.
e) Click Update.

http://www2.1010data.com/documentationcenter/beta/1010dataReferenceManual/index_frames.html?q=Functions/GroupFunctions/g_avg.html


Since we can see from this chart that the relationship between age and wage_per_hour has a quadratic form, we will include the square of the age as one of the predictors.
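Conceptually, the grouped average used in this step can be pictured with the following Python sketch. It is only an illustration under the assumption that a group function places the group's result on every row of that group. The function name group_average and the sample values are hypothetical.

```python
from collections import defaultdict

def group_average(groups, values):
    """Return, for every row, the average of values over rows sharing its group key."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for key, value in zip(groups, values):
        totals[key] += value
        counts[key] += 1
    averages = {key: totals[key] / counts[key] for key in totals}
    # One result per input row, repeated within each group.
    return [averages[key] for key in groups]

# Hypothetical ages and wages (in hundredths of a dollar per hour).
ages = [25, 30, 25, 30, 40]
wages = [1000, 1500, 1200, 1700, 900]
avg_wage_by_age = group_average(ages, wages)
```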

4. Let's create a column containing the square of the value in the age column, which we will use as one of the predictors.

<willbe name="age_sq" value="age^2"/>

5. Since we're also using edu_year as a predictor, let's see if it has a linear relationship to wage_per_hour.

a) For visualization purposes, we can look at the relationship between the average wage per hour and years of education.

Let's first calculate the average wage per hour grouping by years of education. We'll use the function g_avg(G;S;X) and we'll specify the edu_year column for the G parameter (we're grouping by years of education) and wage_per_hour as the X parameter (since we want to calculate the average wage per hour). We'll omit the S parameter since we want to consider all rows in the table.

<willbe name="avg_wage_per_hour_edu_year" value="g_avg(edu_year;;wage_per_hour)" format="dec:2"/>

b) Click Chart > Line.
c) Drag the edu_year column to the DATA (X-AXIS) area.
d) Drag the avg_wage_per_hour_edu_year column to the DATA (Y-AXIS) area.
e) Click Update.


We can see from this chart that the relationship between edu_year and wage_per_hour has a linear form.

6. Now, let's show only those columns that we are going to use in the model (as well as those we used in the charts from the previous steps).

<colord cols="avg_wage_per_hour_age,avg_wage_per_hour_edu_year,wage_per_hour,age,edu_year,instance_weight,age_sq"/>

7. Next, we fit our model on the three predictors as well as the response that we have selected using the g_wlsq(G;S;Y;W;XX) function.

<willbe name="model_wlsq" value="g_wlsq(;;wage_per_hour;instance_weight;1 age age_sq edu_year)"/>

Note: As the first element of XX, we specify the special value 1 for the constant (intercept) term in the linear model.

This creates a column named model_wlsq that contains the results of the weighted least squares regression.
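For intuition about what a weighted least squares fit computes, the following Python sketch solves the weighted normal equations (XᵀWX)b = XᵀWy directly. It is a textbook illustration, not a description of how g_wlsq is implemented. The function name weighted_least_squares is hypothetical, and each input row is assumed to start with the constant 1 for the intercept, mirroring the XX argument above.

```python
def weighted_least_squares(rows, y, w):
    """Solve (X^T W X) b = X^T W y for the coefficient list b.

    rows: one list of predictor values per observation (leading 1 for the intercept)
    y:    response values
    w:    non-negative observation weights
    """
    k = len(rows[0])
    # Augmented matrix [X^T W X | X^T W y].
    a = [[0.0] * (k + 1) for _ in range(k)]
    for x, yi, wi in zip(rows, y, w):
        for i in range(k):
            for j in range(k):
                a[i][j] += wi * x[i] * x[j]
            a[i][k] += wi * x[i] * yi

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                for c in range(col, k + 1):
                    a[r][c] -= factor * a[col][c]
    return [a[i][k] / a[i][i] for i in range(k)]
```

With an intercept-only model, the fitted coefficient is the weighted mean of the response, which shows how observations with larger weights pull the fit toward them.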




8. We can then obtain the predicted value of the linear model using the score(XX;M;Z) function.

<note>OBTAIN MODEL SCORE WITH SCORE FUNCTION</note>
<willbe name="pred" value="score(1 age age_sq edu_year;model_wlsq;)" format="dec:7"/>


9. To obtain the coefficients of the linear model, we can use the param(M;P;I) function.

In the following Macro Language code, we obtain the parameters for the intercept (b0) and the first two variables (b1 and b2).

<note>OBTAIN MODEL COEFFICIENTS</note>
<willbe name="b0" value="param(model_wlsq;'b';1)" format="dec:7"/>
<willbe name="b1" value="param(model_wlsq;'b';2)" format="dec:7"/>
<willbe name="b2" value="param(model_wlsq;'b';3)" format="dec:7"/>




10. We can also obtain the p-values of the coefficients using the param(M;P;I) function. We will use these to conduct the variable selection later in the analysis.

In the following Macro Language code, we obtain the p-value for the intercept (p0) and the first two variables (p1 and p2).

<note>OBTAIN P-VALUES</note>
<willbe name="p0" value="param(model_wlsq;'p';1)"/>
<willbe name="p1" value="param(model_wlsq;'p';2)"/>
<willbe name="p2" value="param(model_wlsq;'p';3)"/>

11. However, one might want to obtain all of the coefficients in one column and the p-values in another, which can be achieved with the following Macro Language code:

<note>CALCULATE ALL COEFFICIENTS IN ONE COLUMN AND P-VALUES IN ANOTHER</note>
<willbe name="var_names_1" value="'intercept,age,age_sq,edu_year'"/>
<willbe name="temp_i_1" value="mod(i_(1);4)"/>
<willbe name="i_1" value="if(temp_i_1=0;4;temp_i_1)"/>
<willbe name="b_1" value="param(model_wlsq;'b';i_1)" format="dec:7"/>
<willbe name="p_1" value="param(model_wlsq;'p';i_1)"/>
<willbe name="var_name_1" value="csl_pick(var_names_1;i_1)"/>

Note: The number of coefficients and p-values we obtain from the linear model corresponds to the number of variables in our analysis. So, for our example, we obtain 4 coefficients and 4 p-values, which correspond to the intercept plus the 3 predictors.
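The index arithmetic in this step (a remainder that is mapped from 0 to 4) cycles through the values 1 to 4. The following Python sketch shows the same pattern, assuming that i_(1) yields a 1-based row number. The function name cycle_index is hypothetical.

```python
def cycle_index(row_number, num_params):
    """Map 1-based row numbers onto 1..num_params, repeating after num_params rows."""
    remainder = row_number % num_params
    # A remainder of 0 corresponds to the last parameter, as in if(temp_i_1=0;4;temp_i_1).
    return num_params if remainder == 0 else remainder

var_names = ["intercept", "age", "age_sq", "edu_year"]
# Labels for the first eight rows: the four names, repeated twice.
labels = [var_names[cycle_index(i, 4) - 1] for i in range(1, 9)]
```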

12. Using the param(M;P;I) function, we can obtain various statistics for this model, such as the degrees of freedom of the model, number of observations, average of Y, R², and adjusted R². In this example, the residual sum of squares is computed directly from the residuals with g_sum(G;S;X), and the mean squared error is derived from it.

<note>OBTAIN VARIOUS MODEL STATISTICS</note>
<willbe name="dof" value="param(model_wlsq;'df';)"/>
<willbe name="sse_temp" value="(wage_per_hour-pred)^2"/>
<willbe name="sum_sq_resids" value="g_sum(;;sse_temp)" format="dec:7"/>
<willbe name="mean_sq_err" value="sum_sq_resids/dof" format="dec:7"/>
<willbe name="num_observations" value="param(model_wlsq;'valcnt';)"/>
<willbe name="avg_y" value="param(model_wlsq;'ybar';)" format="dec:7"/>
<willbe name="R_squared" value="param(model_wlsq;'r2';)" format="dec:7"/>
<willbe name="adjusted_R_squared" value="param(model_wlsq;'adjr2';)" format="dec:7"/>

13. We can calculate the AIC using the following Macro Language code:

<note>CALCULATE AIC</note>
<willbe name="sum_log_likelihood" value="loge(sum_sq_resids/num_observations)"/>
<willbe name="num_of_var" value="3"/>
<willbe name="AIC" value="2*num_of_var+num_observations*sum_log_likelihood"/>
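The expression above has the form AIC = 2k + n·ln(RSS/n), which is the Gaussian-likelihood AIC up to an additive constant. A minimal Python sketch of the same arithmetic follows. The function name aic_from_rss is hypothetical, and because conventions differ on whether k should also count the intercept, the sketch takes k as an argument rather than fixing it.

```python
import math

def aic_from_rss(rss, n, k):
    """Return 2*k + n*ln(rss/n) for residual sum of squares rss and n observations."""
    return 2 * k + n * math.log(rss / n)
```

When comparing candidate models fitted to the same rows, the model with the lower value is preferred, provided k is counted the same way for every model.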

14. Finally, we can visualize the results of the weighted least squares regression using the 1010data Chart Builder.

a) Click Chart > Scatter.
b) Drag the age column to the DATA (X-AXIS) area.
c) Drag the wage_per_hour and pred columns to the DATA (Y-AXIS) area.
d) Click Update.

To see how to calculate the standard errors of the coefficients, or how to chart the residual plot, QQ plot, PP plot, or predicted value against the original response, see the Least Squares Regression tutorial.


Census Income Data Set

This data set was obtained from the UC Irvine Machine Learning Repository and contains weighted census data extracted from the 1994 and 1995 Current Population Surveys conducted by the U.S. Census Bureau.

Source

This data set was obtained by downloading census-income.data (contained in census-income.data.gz) from http://archive.ics.uci.edu/ml/datasets/Census-Income+(KDD).

The original table contains 199,523 rows and 42 columns. An additional column, edu_year, has been added to aid in the analysis (see Input Variables).

The path to this data set is pub.demo.mleg.uci.censusincome.

Input Variables

There are 42 columns in the table that provide demographic and employment-related information.


age Age of the worker Numeric

class_worker Class of worker Categorical:

• Not in universe
• Private
• Self-employed-not incorporated
• Local government
• State government
• Self-employed-incorporated
• Federal government
• Never worked
• Without pay

det_ind_code Industry code Numeric

det_occ_code Occupation code Numeric

education Level of education Categorical:

• Children
• Less than 1st grade
• 1st 2nd 3rd or 4th grade
• 5th or 6th grade
• 7th and 8th grade
• 9th grade
• 10th grade
• 11th grade
• 12th grade no diploma
• High school graduate
• Some college but no degree
• Associates degree-academic program
• Associates degree-occup /vocational
• Bachelors degree(BA AB BS)
• Masters degree(MA MS MEng MEd MSW MBA)
• Prof school degree (MD DDS DVM LLB JD)
• Doctorate degree(PhD EdD)

wage_per_hour Wage per hour Numeric

hs_college Enrolled in educational institution last week Categorical:

• Not in universe
• High school
• College or university

marital_stat Marital status Categorical:

• Never married
• Married-civilian spouse present
• Divorced
• Widowed
• Separated
• Married-spouse absent
• Married-A F spouse present

major_ind_code Major industry code Categorical:

• Not in universe or children
• Retail trade
• Manufacturing-durable goods
• Education
• Manufacturing-nondurable goods
• Finance insurance and real estate
• Construction
• Business and repair services
• Medical except hospital
• Public administration
• Other professional services
• Transportation
• Hospital services
• Wholesale trade
• Agriculture
• Personal services except private HH
• Social services
• Entertainment
• Communications
• Utilities and sanitary services
• Private household services
• Mining
• Forestry and fisheries
• Armed Forces

major_occ_code Major occupation code Categorical:

• Not in universe
• Adm support including clerical
• Professional specialty
• Executive admin and managerial
• Other service
• Sales
• Precision production craft & repair
• Machine operators assmblrs & inspctrs
• Handlers equip cleaners etc
• Transportation and material moving
• Farming forestry and fishing
• Technicians and related support
• Protective services
• Private household services
• Armed Forces

race Race Categorical:

• White
• Black
• Asian or Pacific Islander
• Other
• Amer Indian Aleut or Eskimo

hisp_origin Hispanic origin Categorical:

• All other
• Mexican-American
• Mexican (Mexicano)
• Central or South American
• Puerto Rican
• Other Spanish
• Cuban
• NA
• Do not know
• Chicano

sex Sex Categorical:

• Female
• Male

union_member Member of a labor union Categorical:

• Not in universe
• No
• Yes

unemp_reason Reason for unemployment Categorical:

• Not in universe
• Other job loser
• Re-entrant
• Job loser - on layoff
• Job leaver
• New entrant

full_or_part_emp Full- or part-time employment status Categorical:

• Children or Armed Forces
• Full-time schedules
• Not in labor force
• PT for non-econ reasons usually FT
• Unemployed full-time
• PT for econ reasons usually PT
• Unemployed part- time
• PT for econ reasons usually FT

capital_gains Capital gains Numeric

capital_losses Capital losses Numeric

stock_dividends Dividends from stocks Numeric

tax_filer_stat Tax filer status Categorical:

• Nonfiler
• Joint both under 65
• Single
• Joint both 65+
• Head of household
• Joint one under 65 & one 65+

region_prev_res Region of previous residence Categorical:

• Not in universe
• South
• West
• Midwest
• Northeast
• Abroad

state_prev_res State of previous residence Categorical:

• Not in universe
• California
• Utah
• Florida
• North Carolina
• ?
• Abroad
• Oklahoma
• Minnesota
• Indiana
• North Dakota
• New Mexico
• Michigan
• Alaska
• Kentucky
• Arizona
• New Hampshire
• Wyoming
• Colorado
• Oregon
• West Virginia
• Georgia
• Montana
• Alabama
• Ohio
• Texas
• Arkansas
• Mississippi
• Tennessee
• Pennsylvania
• New York
• Louisiana
• Vermont
• Iowa
• Illinois
• Nebraska
• Missouri
• Nevada
• Maine
• Massachusetts
• Kansas
• South Dakota
• Maryland
• Virginia
• Connecticut
• District of Columbia
• Wisconsin
• South Carolina
• New Jersey
• Delaware
• Idaho

det_hh_fam_stat Detailed household and family status Categorical:

• Householder
• Child <18 never marr not in subfamily
• Spouse of householder
• Nonfamily householder
• Child 18+ never marr Not in a subfamily
• Secondary individual
• Other Rel 18+ ever marr not in subfamily
• Grandchild <18 never marr child of subfamily RP
• Other Rel 18+ never marr not in subfamily
• Grandchild <18 never marr not in subfamily
• Child 18+ ever marr Not in a subfamily
• Child under 18 of RP of unrel subfamily
• RP of unrelated subfamily
• Child 18+ ever marr RP of subfamily
• Other Rel 18+ ever marr RP of subfamily
• Other Rel <18 never marr child of subfamily RP
• Other Rel 18+ spouse of subfamily RP
• Child 18+ never marr RP of subfamily
• Other Rel <18 never marr not in subfamily
• Grandchild 18+ never marr not in subfamily
• In group quarters
• Child 18+ spouse of subfamily RP
• Other Rel 18+ never marr RP of subfamily
• Child <18 never marr RP of subfamily
• Spouse of RP of unrelated subfamily
• Child <18 ever marr not in subfamily
• Grandchild 18+ ever marr not in subfamily
• Grandchild 18+ spouse of subfamily RP
• Child <18 ever marr RP of subfamily
• Grandchild 18+ ever marr RP of subfamily
• Grandchild 18+ never marr RP of subfamily
• Other Rel <18 ever marr RP of subfamily
• Other Rel <18 never married RP of subfamily
• Other Rel <18 spouse of subfamily RP
• Child <18 spouse of subfamily RP
• Grandchild <18 ever marr not in subfamily
• Grandchild <18 never marr RP of subfamily
• Other Rel <18 ever marr not in subfamily

det_hh_summ Detailed household summary in household Categorical:

• Householder
• Child under 18 never married
• Spouse of householder
• Child 18 or older
• Other relative of householder
• Nonrelative of householder
• Group Quarters- Secondary individual
• Child under 18 ever married

mig_chg_msa Migration code - change in MSA Categorical:

• ?
• Nonmover
• MSA to MSA
• NonMSA to nonMSA
• Not in universe
• MSA to nonMSA
• NonMSA to MSA
• Abroad to MSA
• Not identifiable
• Abroad to nonMSA

mig_chg_reg Migration code - change in region Categorical:

• ?
• Nonmover
• Same county
• Different county same state
• Not in universe
• Different region
• Different state same division
• Abroad
• Different division same region

mig_move_reg Migration code - move within region Categorical:

• ?
• Nonmover
• Same county
• Different county same state
• Not in universe
• Different state in South
• Different state in West
• Different state in Midwest
• Abroad
• Different state in Northeast

mig_same Live in this house one year ago Categorical:

• Not in universe under 1 year old
• Yes
• No

mig_prev_sunbelt Migration - previous residence in sunbelt Categorical:

• ?
• Not in universe
• No
• Yes

num_emp Number of persons that worked for employer Numeric

fam_under_18 Family members under 18 Categorical:

• Not in universe
• Both parents present
• Mother only present
• Father only present
• Neither parent present

country_father Country of birth father Categorical:

• United-States
• Mexico
• ?
• Puerto-Rico
• Italy
• Canada
• Germany
• Dominican-Republic
• Poland
• Philippines
• Cuba
• El-Salvador
• China
• England
• Columbia
• India
• South Korea
• Ireland
• Jamaica
• Vietnam
• Guatemala
• Japan
• Portugal
• Ecuador
• Haiti
• Greece
• Peru
• Nicaragua
• Hungary
• Scotland
• Iran
• Yugoslavia
• Taiwan
• Cambodia
• Honduras
• France
• Outlying-U S (Guam USVI etc)
• Laos
• Trinadad&Tobago
• Thailand
• Hong Kong
• Holand-Netherlands
• Panama

country_mother Country of birth mother Categorical:

• United-States
• Mexico
• ?
• Puerto-Rico
• Italy
• Canada
• Germany
• Philippines
• Poland
• Cuba
• El-Salvador
• Dominican-Republic
• England
• China
• Columbia
• South Korea
• Ireland
• India
• Vietnam
• Japan
• Jamaica
• Guatemala
• Ecuador
• Peru
• Haiti
• Portugal
• Nicaragua
• Hungary
• Greece
• Scotland
• Taiwan
• Honduras
• France
• Iran
• Yugoslavia
• Cambodia
• Outlying-U S (Guam USVI etc)
• Laos
• Thailand
• Hong Kong
• Trinadad&Tobago
• Holand-Netherlands
• Panama

country_self Country of birth Categorical:

• United-States
• Mexico
• ?
• Puerto-Rico
• Germany
• Philippines
• Cuba
• Canada
• Dominican-Republic
• El-Salvador
• China
• South Korea
• England
• Columbia
• Italy
• India
• Vietnam
• Poland
• Guatemala
• Japan
• Jamaica
• Peru
• Ecuador
• Haiti
• Nicaragua
• Taiwan
• Portugal
• Iran
• Greece
• Honduras
• Ireland
• France
• Outlying-U S (Guam USVI etc)
• Thailand
• Laos
• Hong Kong
• Cambodia
• Hungary
• Scotland
• Trinadad&Tobago
• Yugoslavia
• Panama
• Holand-Netherlands

citizenship Citizenship Categorical:

• Native- Born in the United States
• Foreign born- Not a citizen of U S
• Foreign born- U S citizen by naturalization
• Native- Born abroad of American Parent(s)
• Native- Born in Puerto Rico or U S Outlying

own_or_self Own business or self-employed? Numeric

vet_question Fill included questionnaire for Veterans Administration Categorical:

• Not in universe
• No
• Yes

vet_benefits Veterans benefits Numeric

weeks_worked Weeks worked in the year Numeric

year Year of survey Numeric

income_50k Income less than or greater than $50,000 Categorical:

• - 50000.
• 50000+.

edu_year Number of years of education Numeric

Note: The edu_year column is derived from the education column according to the following mapping:

education edu_year

Children 0

Less than 1st grade 0.5

1st 2nd 3rd or 4th grade 2.5

5th or 6th grade 5.5

7th and 8th grade 7.5

9th grade 9

10th grade 10

11th grade 11

12th grade no diploma 12

High school graduate 12

Some college but no degree 14

Associates degree-academic program 14

Associates degree-occup /vocational 14

Bachelors degree(BA AB BS) 16

Masters degree(MA MS MEng MEd MSW MBA) 18

Prof school degree (MD DDS DVM LLB JD) 20

Doctorate degree(PhD EdD) 21
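The mapping above can be expressed as a simple lookup table. The Python sketch below copies the values from the table. The names EDU_YEAR and edu_year_for are hypothetical, and the sketch assumes the education strings match the table exactly (raw source files may carry extra whitespace).

```python
# Years of education for each value of the education column, as listed in the table.
EDU_YEAR = {
    "Children": 0,
    "Less than 1st grade": 0.5,
    "1st 2nd 3rd or 4th grade": 2.5,
    "5th or 6th grade": 5.5,
    "7th and 8th grade": 7.5,
    "9th grade": 9,
    "10th grade": 10,
    "11th grade": 11,
    "12th grade no diploma": 12,
    "High school graduate": 12,
    "Some college but no degree": 14,
    "Associates degree-academic program": 14,
    "Associates degree-occup /vocational": 14,
    "Bachelors degree(BA AB BS)": 16,
    "Masters degree(MA MS MEng MEd MSW MBA)": 18,
    "Prof school degree (MD DDS DVM LLB JD)": 20,
    "Doctorate degree(PhD EdD)": 21,
}

def edu_year_for(education):
    """Return the number of years of education for an education category."""
    return EDU_YEAR[education]
```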

Weight Variable

There is one column in the table that corresponds to the weight value.


instance_weight Indicates the number of people in the population that each record represents due to stratified sampling Numeric

Output Variable



wage_per_hour Wage per hour (multiplied by 100) Numeric

For example, a value of 1200 would correspond to $12.00/hr.
