Teacher's guide

Part 1:

The first zip code grep is:

    grep "^12962" data/2010_income_by_zipcode.tsv

To get the number of zips that start with 3:

    grep "^3[0-9]\{4\}" data/2010_income_by_zipcode.tsv | wc -l

The flag to do a reverse grep is '-v'.

The awk command you'll need is:

    grep -v "^#" data/2010_income_by_zipcode.tsv | awk '{s=s+$4}; END{print s}'

A sed command that will work is:

    sed 's/^[0-9][0-9][0-9][0-9] /0&/' data/2010_income_by_zipcode.tsv

Note that the whitespace after the four digits is actually a single tab, which you can enter on the command line with ^v [TAB] (Ctrl-V, then the Tab key).

Part 3:

To get the number of rows, the syntax is going to be something like:

    select count(*) from t2_postd_trxn_BDA;

The min/max query is:

    select min(trxn_amt),max(trxn_amt) from t2_postd_trxn_BDA;

To count the restricted table, try:

    select count(*) from t2_postd_trxn_BDA
    where tsys_tcat_cd = '1' and tsys_tbal_cd = '1' and debit_cr_cd = 'D' and mrch_cntry_cd = 'USA';

To get a substring of the zip codes, the command is:

    substr(mrch_pstl_cd,1,5)

To cast to int in Hive, the command would be:

    cast(mrch_pstl_cd as int)

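If students ask what the substring and the cast do to a postal code, a rough Python analogue may help. This is only a sketch: zip_to_int is a hypothetical helper that is not part of the exercise, the sample postal codes are made up, and Python's int() is not identical to Hive's cast (for example, it tolerates surrounding whitespace). It does show the key effect, which is that the leading zero disappears once the value is an integer.

```python
def zip_to_int(postal_code):
    """Rough analogue of taking the first five characters and casting to int."""
    # Hive's substr(x, 1, 5) counts from 1; the Python slice [:5] counts from 0.
    prefix = postal_code[:5]
    try:
        return int(prefix)
    except ValueError:
        # Stands in for the NULL that Hive returns when a cast to int fails.
        return None


# Made-up sample values: a ZIP+4, a plain zip, and a non-US postal code.
for code in ["01001-1234", "12962", "K1A 0B1"]:
    print(code, "->", zip_to_int(code))

# Formatting with a width of 5 restores the leading zero for display.
print(f"{zip_to_int('01001-1234'):05d}")
```

Because 01001 becomes the integer 1001, any later filter on that zip code has to compare against 1001.
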
To combine this with the substring command, you'd do:

    cast(substr(mrch_pstl_cd,1,5) as int)

You can make the intermediate table as follows:

    create table pyw137_fixed_trxns
    stored as textfile
    location '/user/pyw137/fixed_trxns/'
    as select cast(substr(mrch_pstl_cd,1,5) as int) as mrch_pstl_cd,trxn_amt
    from t2_postd_trxn_BDA
    where tsys_tcat_cd = '1' and tsys_tbal_cd = '1' and debit_cr_cd = 'D' and mrch_cntry_cd = 'USA';

To get the number of transactions in the 01001 zip code, use:

    select count(*) from pyw137_fixed_trxns where mrch_pstl_cd = 1001;

The command to get the median of a column of floats is percentile_approx([COLUMN_NAME],0.5).

The full query to get the medians for each zip code is something like this:

    create table pyw137_median_trxn_by_zip
    stored as textfile
    location '/user/pyw137/median_trxn_by_zip/'
    as select mrch_pstl_cd,percentile_approx(trxn_amt,0.5) as median_amt,count(*) as n_trxn
    from pyw137_fixed_trxns
    group by mrch_pstl_cd;

To get the max median spend in a zip code the query is:

    select max(median_amt) from pyw137_median_trxn_by_zip;

To join the two tables into a new one:

    create table pyw137_median_trxn_with_income
    stored as textfile
    location '/user/pyw137/median_trxn_with_income/'
    as select trxn.*,income.median_income,income.num_people
    from pyw137_median_trxn_by_zip trxn
    inner join pyw137_income_by_zip income
    on trxn.mrch_pstl_cd = income.zip;

To extract the data into a CSV:

    insert overwrite local directory '/home/pyw137/project_data/joined_data'
    row format delimited
    fields terminated by ','
    select * from pyw137_median_trxn_with_income;

Then (outside of Hive):

    cd /home/pyw137/project_data/
    mv joined_data/000000_0 joined_data.csv

Part 4: Detailed analysis.

A working 'load_data_from_file' looks like this:

    def load_data_from_file(filename):
        '''
        This function loads data from a file into arrays, using np.loadtxt.

        Arguments:
            filename: A string representing the name of the data file.
        Returns:
            data_array: a 2D numpy array of the data.
        '''
        #Enter code here!
        data_array = np.loadtxt(filename,delimiter=',')
        return data_array

A working 'cut_array_on_column_min' looks like this:

    def cut_array_on_column_min(data_array,column_to_slice_idx,column_min_value):
        '''
        This function returns the data array, where all rows having a column
        with a value below a min value have been removed.

        Arguments:
            data_array: a 2D numpy array of the data.
            column_to_slice_idx: The index of the column you want to slice.
            column_min_value: The minimum value of the column for a row to stay in the data.
        Returns:
            cut_data_array: The input data array with all 'bad' values removed.
        '''
        #Enter code here!
        cut_data_array = data_array[data_array[:,column_to_slice_idx] > column_min_value]
        return cut_data_array

A working 'compute_spearman' looks like this:

    def compute_spearman(median_trxn,median_income):
        '''
        Compute the Spearman rank-order correlation between the median
        transaction amount and household income.

        Arguments:
            median_trxn: A 1D numpy array of median transactions.
            median_income: A 1D numpy array of median household incomes.
        Returns:
            rho: The correlation coefficient.
            pval: The p-value of the correlation.
        '''
        #Enter code here!
        rho,pval = spearmanr(median_trxn,median_income)
        #note that the Spearman function will be invoked as 'spearmanr([INPUTS])'
        return rho,pval

A working 'fit_line' looks like this:

    def fit_line(median_income,median_trxn):
        '''
        Compute the best fitting line to predict the median transaction
        amount as a function of household income.

        Arguments:
            median_income: A 1D numpy array of median household incomes.
            median_trxn: A 1D numpy array of median transactions.
        Returns:
            m: The slope of the line.
            b: The y-intercept.
        '''
        #Enter code here!
        A = np.vstack([median_income,np.ones(len(median_income))]).T
        m,b = np.linalg.lstsq(A,median_trxn)[0]
        #note that the least-squares fitter should be invoked as 'np.linalg.lstsq([INPUTS])'.
        return m,b
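
To check a student's Part 4 functions quickly, the same four steps can be run end to end on a tiny data set. The sketch below is an assumption-laden example, not part of the exercise: the five rows are synthetic, the cutoff of 10 transactions is a hypothetical threshold, and the column order (zip, median_amt, n_trxn, median_income, num_people) is inferred from the select list of the join query. It also passes rcond=None to np.linalg.lstsq, which recent NumPy versions accept to silence a warning about the default.

```python
import io

import numpy as np
from scipy.stats import spearmanr

# Synthetic rows (not real data): zip, median_amt, n_trxn, median_income, num_people
csv_text = """1001,20.5,150,52000,17000
12962,18.0,3,41000,900
30301,35.2,400,76000,25000
94105,48.9,900,120000,9000
60601,30.1,250,68000,12000
"""

# Step 1: load the CSV (np.loadtxt accepts a file-like object as well as a filename).
data = np.loadtxt(io.StringIO(csv_text), delimiter=',')

# Step 2: keep only rows whose n_trxn (column 2) exceeds a hypothetical cutoff of 10.
kept = data[data[:, 2] > 10]
median_trxn = kept[:, 1]
median_income = kept[:, 3]

# Step 3: Spearman rank-order correlation between spend and income.
rho, pval = spearmanr(median_trxn, median_income)

# Step 4: least-squares line median_trxn = m * median_income + b.
A = np.vstack([median_income, np.ones(len(median_income))]).T
m, b = np.linalg.lstsq(A, median_trxn, rcond=None)[0]

print("rows kept:", kept.shape[0])
print("rho:", rho, "pval:", pval)
print("slope:", m, "intercept:", b)
```

In this synthetic set the surviving rows have spend and income in the same rank order, so the correlation should come out at its maximum; real data will not be this clean.
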
