Exploring Data and Descriptive Statistics

(using Stata)

Oscar Torres-Reyna
Data [email protected]

http://dss.princeton.edu/training/

Data Analysis 101 Workshops

Agenda…

• What is Stata
• Transferring data to Stata
• Excel to Stata
• Exercise 1: Data from ICPSR using the Online Learning Center.
• Exercise 2: Data from the World Development Indicators & Global Development Finance from the World Bank

Basic commands (review)

• Stata’s screen
• First steps (working directory, log file, memory setting)
• Frequencies
• Crosstabulations
• Scatterplots/Histograms

This document is created from the following:
http://dss.princeton.edu/training/StataTutorial.pdf
http://dss.princeton.edu/training/DataPrep101.pdf
http://dss.princeton.edu/training/FindingData101.pdf

What is Stata?

• It is a multi‐purpose statistical package to help you explore, summarize and analyze datasets.

• A dataset is a collection of several pieces of information called variables (usually arranged by columns). A variable can have one or several values (information for one or several cases).

• Other statistical packages are SPSS, SAS and R.

• Stata is widely used in social science research and is the most used statistical software on campus.

Other data formats…

| Features | Stata | SPSS | SAS | R |
|---|---|---|---|---|
| Data extensions | *.dta | *.sav, *.por (portable file) | *.sas7bcat, *.sas#bcat, *.xpt (xport files) | *.Rdata |
| User interface | Programming/point-and-click | Mostly point-and-click | Programming | Programming |
| Data manipulation | Very strong | Moderate | Very strong | Very strong |
| Data analysis | Powerful | Powerful | Powerful/versatile | Powerful/versatile |
| Graphics | Very good | Very good | Good | Excellent |
| Cost | Affordable (perpetual licenses, renew only when upgrading) | Expensive (but no need to renew until upgrading, long term licenses) | Expensive (yearly renewal) | Open source |
| Program extensions | *.do (do-files) | *.sps (syntax files) | *.sas | *.txt (log files) |
| Output extension | *.log (text file, any word processor can read it), *.smcl (formatted log, only Stata can read it) | *.spo (only SPSS can read it) | (various formats) | *.R, *.txt (log files, any word processor can read) |

Stat/Transfer: Transferring data from one format to another (available in the DSS lab)

1) Select the current format of the dataset

2) Browse for the dataset

3) Select “Stata” or the data format you need

4) It will save the file in the same directory as the original but with the appropriate extension (*.dta for Stata)

5) Click on ‘Transfer’

Example of a dataset in Excel.

Path to the file: http://dss.princeton.edu/training/students.xls

Variables are arranged by columns and cases by rows. Each variable has more than one value.

Excel to Stata (copy-and-paste)

1 ‐ To go from Excel to Stata, copy the data in Excel and paste it into Stata’s “Data editor”, which you can open by clicking on the Data Editor icon in the toolbar.

2 ‐ The Data Editor window will open.

3 ‐ Press Ctrl‐v to paste the data from Excel…

Stata color‐coded system

An important step is to make sure variables are in their expected format.

Stata has a color‐coded system for each type. Black is for numbers, red is for text or string and blue is for labeled variables.

• Var1: a value of 2 has the label “Fairly well”. It is still a numeric variable.

• Var2 is a string variable even though you see numbers. You can’t do any statistical procedure with this variable other than simple frequencies.

• Var3 is numeric. You can do any statistical procedure with this variable.

• Var4 is clearly a string variable. You can do frequencies and crosstabulations with this but not statistical procedures.
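One possible way to fix variables that arrive in the wrong format is sketched below. It assumes the example variables are literally named var2 and var4, and the new name var4_num is made up for illustration.

```stata
* Hypothetical variable names taken from the example above (var2, var4).
* var2 holds digits stored as text: convert it to numeric in place.
destring var2, replace

* var4 holds category names: create a labeled numeric copy.
* var4_num is an assumed name for the new variable.
encode var4, generate(var4_num)

* Confirm the new storage types.
describe var2 var4_num
```

destring only succeeds when every value looks like a number; if some cells hold stray text, inspect them first with tab var2.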

Excel to Stata (using insheet) step 1

Another way to bring Excel data into Stata is to save the Excel file as *.csv (comma‐separated values) and import it into Stata using the insheet command. In Excel go to File->Save as and save the Excel file as *.csv.

You may get the following messages, click OK and YES…

Excel to Stata (insheet using *.csv) step 2

In Stata go to File->Import->”ASCII data created by spreadsheet”. Click on ‘Browse’ to find the file and then OK.

As an alternative to using the menu, you can type:

insheet using "c:\mydata\mydatafile.csv"
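In more recent releases (Stata 13 and later) insheet has been superseded by import delimited, and Excel files can be read directly with import excel. The sketch below reuses the hypothetical path from the line above; the .xls file name is an assumption.

```stata
* Same hypothetical path as above; clear drops any data already in memory.
import delimited using "c:\mydata\mydatafile.csv", clear

* Read an Excel file directly, taking variable names from the first row.
* The .xls path is an assumed example.
import excel using "c:\mydata\mydatafile.xls", firstrow clear
```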


Exercises

Exercise 1

Using the ICPSR Online Learning Center, go to the guide on Civic Participation and Demographics in Rural China (1990):
http://www.icpsr.umich.edu/icpsrweb/ICPSR/OLC/guides/China/sections/a01

Go to the tab ‘Dataset’ and download the data (http://www.icpsr.umich.edu/icpsrweb/ICPSR/OLC/guides/China/sections/a02)

We’ll focus on the first exercise on ‘Age and Participation’ and use the following variables:

• Respondent's year of birth (M1001)
• Village meeting attendance (M3090)

Activities:

• Create the variable ‘age’ for each respondent
• Create the variable ‘agegroup’ with the following categories: 16‐35, 36‐55 and 56‐79

Questions:

• What percentage of respondents reported attending a local village meeting?
• Which age group was most likely to report attending a village meeting?
• Which age group was most likely to report no village meeting attendance?

Source: Inter‐university Consortium for Political and Social Research. Civic Participation and Demographics in Rural China: A Data‐Driven Learning Guide. Ann Arbor, MI: Inter‐university Consortium for Political and Social Research [distributor], July 31, 2009. doi:10.3886/China
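A possible starting point for the Exercise 1 activities is sketched below. It assumes the survey year is 1990 (from the guide title), that M1001 holds a four-digit year of birth, and that the variable names are upper case as written above; check the codebook before relying on any of these.

```stata
* Assumption: survey year 1990 and a four-digit year of birth in M1001.
generate age = 1990 - M1001

* Group ages into the three categories requested in the exercise.
recode age (16/35 = 1 "16-35") (36/55 = 2 "36-55") (56/79 = 3 "56-79"), generate(agegroup)

* Overall attendance, then attendance within each age group (row percentages).
tabulate M3090
tabulate agegroup M3090, row
```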

Exercise 2

Go to the World Development Indicators (WDI) & Global Development Finance (GDF) from the World Bank (access from the library’s Articles and Databases, http://library.princeton.edu/catalogs/articles.php)

Direct link to WDI/GDF: http://databank.worldbank.org/ddp/home.do?Step=12&id=4&CNO=2

Get data for the United States and all available years on:

• Long‐term unemployment (% of total unemployment)
• Long‐term unemployment, female (% of female unemployment)
• Long‐term unemployment, male (% of male unemployment)
• Inflation, consumer prices (annual %)
• GDP per capita (constant 2000 US$)
• GDP per capita growth (annual %)

See here to arrange the data as panel data: http://dss.princeton.edu/training/FindingData101.pdf#page=21
For an example of what panel data looks like click here: http://dss.princeton.edu/training/DataPrep101.pdf#page=3

Activities:

• Rename the variables and explore the data (use describe, summarize)
• Create a variable called crisis that takes the value 17 for the following years: 1960, 1961, 1969, 1970, 1973, 1974, 1975, 1981, 1982, 1990, 1991, 2001, 2007, 2008, 2009. Replace missing with zeros (source: nber.org).
• Set as time series (see http://dss.princeton.edu/training/TS101.pdf#page=6)
• Create a line graph with unemployment rate (total, female and male) and crisis by year.

Questions:

• What do you see? Who tends to be more affected by economic recessions?
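The crisis variable and the line graph could be built roughly as follows. The variable names year, unemp, unempf and unempm are hypothetical; substitute whatever names you chose when renaming the World Bank series.

```stata
* Hypothetical variable names: year, unemp, unempf, unempm.
* Flag the recession years listed above with the value 17.
generate crisis = 17 if inlist(year, 1960, 1961, 1969, 1970, 1973, 1974, 1975)
replace crisis = 17 if inlist(year, 1981, 1982, 1990, 1991, 2001, 2007, 2008, 2009)

* Every other year gets a zero instead of missing.
replace crisis = 0 if missing(crisis)

* Declare the data as a yearly time series and plot all four series.
tsset year
twoway line unemp unempf unempm crisis year
```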

Basic commands

Stata’s screen

• Do‐file editor
• Stop a process
• Open data editor
• Open data browser
• Review window. Displays recent commands. Click to add to the command window.
• Output window. Displays output from the commands.
• Variables window. Click here to add variables to the command window.
• Command window. Type Stata commands here (hit enter).

First steps: Working directory

To see your working directory, type

pwd

To change the working directory to avoid typing the whole path when calling or saving files, type:

cd c:\mydata

Use quotes if the new directory has blank spaces, for example

cd "h:\stata and data"

If you want to use the menu go to (useful with Macs):

File -> Change Working Directory…

First steps: log file

Create a log file, a sort of built-in tape recorder for Stata, which lets you: 1) retrieve the output of your work and 2) keep a record of your work.

In the command line type:

log using mylog.log

This creates the file ‘mylog.log’ in your working directory. You can read it using any word processor (notepad, word, etc.).

To close a log file type:

log close

To add more output to an existing log file add the option append, type:

log using mylog.log, append

To replace a log file add the option replace, type:

log using mylog.log, replace

Note that the option replace will delete the contents of the previous version of the log.
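A common pattern, sketched here under the assumption that you work from a do-file, is to wrap the whole session between opening and closing the log. The auto dataset ships with Stata and is used only so the example runs anywhere.

```stata
* Close any log left open by an earlier run; capture ignores the error if none is open.
capture log close
log using mylog.log, replace text

* Commands to record go here; sysuse loads a dataset shipped with Stata.
sysuse auto, clear
summarize price mpg

log close
```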


First steps: set the correct memory allocation

If you get the following error message while opening a datafile or adding more variables:

no room to add more observations
    An attempt was made to increase the number of observations beyond what is currently possible. You have the following alternatives:
    1. Store your variables more efficiently; see help compress. (Think of Stata's data area as the area of a rectangle; Stata can trade off width and length.)
    2. Drop some variables or observations; see help drop.
    3. Increase the amount of memory allocated to the data area using the set memory command; see help memory.

You need to set the correct memory allocation for your data or the maximum number of variables allowed. Some big datasets need more memory; depending on the size you can type, for example:

set mem 700m

Note: If this does not work try a bigger number. Stata 12 and later manage memory automatically, so set memory is only needed in older versions.

*To allow more variables type set maxvar 10000

. set mem 700m

Current memory allocation

                    current                                 memory usage
    settable          value     description                 (1M = 1024k)
    set maxvar         5000     max. variables allowed           1.909M
    set memory         700M     max. data space                700.000M
    set matsize         400     max. RHS vars in models          1.254M
                                                               703.163M

First steps: Opening/saving Stata files (*.dta)

To open files already in Stata format (extension *.dta), run Stata and either:

• Use the menu: go to File -> Open, or
• In the command window type use "c:\mydata\mydatafile.dta"

If your working directory is already set to c:\mydata, just type

use mydatafile

To save changes to a data file that already exists, go to File -> Save or just type:

save, replace

If the dataset is new or was just imported from another format, go to File -> Save as or just type:

save mydatafile

For ASCII data please see http://dss.princeton.edu/training/DataPrep101.pdf
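use also accepts a web address, which gives a quick way to load the students data used in the rest of this tutorial. The URL below is the one reported by the describe output later in this document and may no longer resolve; the local file name is an arbitrary example.

```stata
* The URL comes from the describe output in this tutorial and may no longer resolve.
use "http://dss.princeton.edu/training/students.dta", clear

* Save a local copy; the file name is an arbitrary example.
save students_local, replace

* Reopen only selected variables from the local copy.
use id age sat using students_local, clear
```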

Command: describe

To get a general description of the dataset and the format for each variable type describe

Type help describe for more information…

. describe

Contains data from http://dss.princeton.edu/training/students.dta
  obs:            30
 vars:            14                          29 Sep 2009 17:12
 size:         2,580 (99.9% of memory free)

                 storage  display    value
variable name      type   format     label     variable label
id                 byte   %8.0g                ID
lastname           str5   %9s                  Last Name
firstname          str6   %9s                  First Name
city               str14  %14s                 City
state              str14  %14s                 State
gender             str6   %9s                  Gender
studentstatus      str13  %13s                 Student Status
major              str8   %9s                  Major
country            str9   %9s                  Country
age                byte   %8.0g                Age
sat                int    %8.0g                SAT
averagescoreg~e    byte   %8.0g                Average score (grade)
heightin           byte   %8.0g                Height (in)
newspaperread~k    byte   %8.0g                Newspaper readership

Command: summarize

Type summarize to get some basic descriptive statistics.

. summarize

    Variable |   Obs        Mean    Std. Dev.    Min     Max
-------------+----------------------------------------------
          id |    30        15.5    8.803408       1      30
    lastname |     0
   firstname |     0
        city |     0
       state |     0
      gender |     0
studentsta~s |     0
       major |     0
     country |     0
         age |    30        25.2    6.870226      18      39
         sat |    30      1848.9    275.1122    1338    2309
averagesco~e |    30    80.36667    10.11139      63      96
    heightin |    30    66.43333    4.658573      59      75
newspaperr~k |    30    4.866667    1.279368       3       7

Zeros in the Obs column indicate string variables.

Type help summarize for more information…

Use ‘min’ and ‘max’ values to check for a valid range in each variable. For example, ‘age’ should have the expected values (‘don’t know’ or ‘no answer’ are usually coded as 99 or 999).
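A range check of this kind could look like the sketch below. It assumes the students data is in memory and that 99 and 999 are the placeholder codes; the actual codes depend on the codebook of your dataset.

```stata
* Assumes the students data from this tutorial is in memory.
summarize age

* List any suspicious values, for example codes of 99 or above.
list id age if age >= 99 & !missing(age)

* Treat the placeholder codes as missing so they do not distort the mean.
replace age = . if inlist(age, 99, 999)
summarize age
```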

Exploring data: frequencies

Frequency refers to the number of times a value is repeated. Frequencies are used to analyze categorical data. The tables below are frequency tables; values are in ascending order. In Stata use the command tab varname.

. tab major

      Major |      Freq.     Percent        Cum.
------------+-----------------------------------
       Econ |         10       33.33       33.33
       Math |         10       33.33       66.67
   Politics |         10       33.33      100.00
------------+-----------------------------------
      Total |         30      100.00

• ‘Freq.’ provides a raw count of each value. In this case 10 students for each major.
• ‘Percent’ gives the relative frequency for each value. For example, 33.33% of the students in this group are econ majors.
• ‘Cum.’ is the cumulative frequency in ascending order of the values. For example, 66.67% of the students are econ or math majors.

. tab readnews

  Newspaper |
 readership |
 (times/wk) |      Freq.     Percent        Cum.
------------+-----------------------------------
          3 |          6       20.00       20.00
          4 |          5       16.67       36.67
          5 |          9       30.00       66.67
          6 |          7       23.33       90.00
          7 |          3       10.00      100.00
------------+-----------------------------------
      Total |         30      100.00

• ‘Freq.’ Here 6 students read the newspaper 3 days a week, 9 students read it 5 days a week.
• ‘Percent’. Those who read the newspaper 3 days a week represent 20% of the sample, 30% of the students in the sample read the newspaper 5 days a week.
• ‘Cum.’ 66.67% of the students read the newspaper 3 to 5 days a week.

Type help tab for more details.
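A few one-way options are often useful alongside the plain frequency table. The sketch below uses the auto dataset that ships with Stata, not the students data, so that it runs anywhere.

```stata
sysuse auto, clear

* Include missing values as their own row.
tabulate rep78, missing

* Order rows by descending frequency.
tabulate rep78, sort

* One-way tables for several variables at once.
tab1 rep78 foreign
```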

Exploring data: frequencies and descriptive statistics (using table)

The command table produces frequencies and descriptive statistics per category. For more info and a list of all statistics type help table. Here are some examples, type

table gender, contents(freq mean age mean score)

. table gender, contents(freq mean age mean score)

   Gender |      Freq.   mean(age)   mean(score)
----------+-------------------------------------
   Female |         15        23.2      78.73333
     Male |         15        27.2            82

The mean age of females is 23 years; for males it is 27. The mean score is about 79 for females and 82 for males. Here is another example:

table major, contents(freq mean age mean sat mean score mean readnews)

. table major, contents(freq mean age mean sat mean score mean readnews)

    Major |   Freq.   mean(age)   mean(sat)   mean(score)   mean(read~s)
----------+-------------------------------------------------------------
     Econ |      10        23.8        1806          76.2            4.4
     Math |      10          23        1844          79.8            5.3
 Politics |      10        28.8      1896.7          85.1            4.9

Exploring data: crosstabs

Also known as contingency tables, crosstabs help you to analyze the relationship between two or more categorical variables. Below is a crosstab between the variables ‘ecostatu’ and ‘gender’. We use the command tab var1 var2

The options ‘column’ and ‘row’ give you the column and row percentages.

. tab ecostatu gender, column row

Key: frequency
     row percentage
     column percentage

    Status of |  Gender of Respondent
    Nat'l Eco |      Male     Female |     Total
--------------+----------------------+----------
    Very well |        90         59 |       149
              |     60.40      39.60 |    100.00
              |     14.33       7.92 |     10.85
--------------+----------------------+----------
  Fairly well |       337        333 |       670
              |     50.30      49.70 |    100.00
              |     53.66      44.70 |     48.80
--------------+----------------------+----------
 Fairly badly |       139        209 |       348
              |     39.94      60.06 |    100.00
              |     22.13      28.05 |     25.35
--------------+----------------------+----------
   Very badly |        57        134 |       191
              |     29.84      70.16 |    100.00
              |      9.08      17.99 |     13.91
--------------+----------------------+----------
     Not sure |         2         10 |        12
              |     16.67      83.33 |    100.00
              |      0.32       1.34 |      0.87
--------------+----------------------+----------
      Refused |         3          0 |         3
              |    100.00       0.00 |    100.00
              |      0.48       0.00 |      0.22
--------------+----------------------+----------
        Total |       628        745 |     1,373
              |     45.74      54.26 |    100.00
              |    100.00     100.00 |    100.00

The first value in a cell tells you the number of observations for each xtab. In this case, 90 respondents are ‘male’ and said that the economy is doing ‘very well’, 59 are ‘female’ and believe the economy is doing ‘very well’.

The second value in a cell gives you row percentages for the first variable in the xtab. Out of those who think the economy is doing ‘very well’, 60.40% are males and 39.60% are females.

The third value in a cell gives you column percentages for the second variable in the xtab. Among males, 14.33% think the economy is doing ‘very well’ while 7.92% of females have the same opinion.

NOTE: You can use tab1 for multiple frequencies or tab2 to run all possible crosstab combinations. Type help tab for further details.

Exploring data: crosstabs (a closer look)

You can use crosstabs to compare responses among categories in relation to aggregate responses. In the table below we can see how opinions for males and females diverge from the national average.

. tab ecostatu gender, column row

    Status of |  Gender of Respondent
    Nat'l Eco |      Male     Female |     Total
--------------+----------------------+----------
    Very well |        90         59 |       149
              |     60.40      39.60 |    100.00
              |     14.33       7.92 |     10.85
--------------+----------------------+----------
  Fairly well |       337        333 |       670
              |     50.30      49.70 |    100.00
              |     53.66      44.70 |     48.80
--------------+----------------------+----------
 Fairly badly |       139        209 |       348
              |     39.94      60.06 |    100.00
              |     22.13      28.05 |     25.35
--------------+----------------------+----------
   Very badly |        57        134 |       191
              |     29.84      70.16 |    100.00
              |      9.08      17.99 |     13.91
--------------+----------------------+----------
     Not sure |         2         10 |        12
              |     16.67      83.33 |    100.00
              |      0.32       1.34 |      0.87
--------------+----------------------+----------
      Refused |         3          0 |         3
              |    100.00       0.00 |    100.00
              |      0.48       0.00 |      0.22
--------------+----------------------+----------
        Total |       628        745 |     1,373
              |     45.74      54.26 |    100.00
              |    100.00     100.00 |    100.00

As a rule‐of‐thumb, a margin of error of ±4 percentage points can be used to indicate a significant difference (some use ±3). For example, rounding up the percentages, 11% (10.85) answer ‘very well’ at the national level. With the margin of error, this gives a range roughly between 7% and 15%; anything beyond this range could be considered significantly different (remember this is just an approximation). There does not appear to be a significant bias between males and females for this answer.

In the ‘fairly well’ category we have 49%, with a range between 45% and 53%. The response for males is 54% and for females 45%. We could say here that males tend to be a bit more optimistic on the economy and females tend to be a bit less optimistic. If we aggregate responses, we get a better picture. In the recoded table below, 68% of males believe the economy is doing well (compared to 60% at the national level), while 46% of females think the economy is bad (compared to 39% in the aggregate). Males seem to be more optimistic than females.

recode ecostatu (1 2 = 1 "Well") (3 4 = 2 "Bad") (5 6=3 "Not sure/ref"), gen(ecostatu1) label(eco)

    RECODE of |
     ecostatu |
   (Status of |  Gender of Respondent
   Nat'l Eco) |      Male     Female |     Total
--------------+----------------------+----------
         Well |       427        392 |       819
              |     52.14      47.86 |    100.00
              |     67.99      52.62 |     59.65
--------------+----------------------+----------
          Bad |       196        343 |       539
              |     36.36      63.64 |    100.00
              |     31.21      46.04 |     39.26
--------------+----------------------+----------
 Not sure/ref |         5         10 |        15
              |     33.33      66.67 |    100.00
              |      0.80       1.34 |      1.09
--------------+----------------------+----------
        Total |       628        745 |     1,373
              |     45.74      54.26 |    100.00
              |    100.00     100.00 |    100.00
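The rule-of-thumb comparison can be checked with a few lines of arithmetic. This sketch types in the unrounded ‘fairly well’ column percentages from the crosstab by hand; the scalar names are made up for illustration, and the 4-point margin is the approximation described above, not a formal test.

```stata
* Percentages are read from the crosstab; the 4-point margin is the rule of thumb above.
scalar overall = 48.80
scalar males = 53.66
scalar females = 44.70
scalar margin = 4

* Range around the overall percentage.
display "range: " (overall - margin) " to " (overall + margin)

* 1 means the group falls outside the range, 0 means it falls inside.
display "males outside range:   " (abs(males - overall) > margin)
display "females outside range: " (abs(females - overall) > margin)
```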

Exploring data: crosstabs (test for associations)

To see whether there is a relationship between two variables you can choose among a number of tests. Some apply to nominal variables, others to ordinal variables. All of them are run here for presentation purposes.

tab ecostatu1 gender, column row nokey chi2 lrchi2 V exact gamma taub

– For nominal data use chi2, lrchi2, V
– For ordinal data use gamma and taub
– Use exact instead of chi2 when frequencies are less than 5 across the table.

. tab ecostatu1 gender, column row nokey chi2 lrchi2 V exact gamma taub

Enumerating sample-space combinations:
stage 3:  enumerations = 1
stage 2:  enumerations = 16
stage 1:  enumerations = 0

    RECODE of |
     ecostatu |
   (Status of |  Gender of Respondent
   Nat'l Eco) |      Male     Female |     Total
--------------+----------------------+----------
         Well |       427        392 |       819
              |     52.14      47.86 |    100.00
              |     67.99      52.62 |     59.65
--------------+----------------------+----------
          Bad |       196        343 |       539
              |     36.36      63.64 |    100.00
              |     31.21      46.04 |     39.26
--------------+----------------------+----------
 Not sure/ref |         5         10 |        15
              |     33.33      66.67 |    100.00
              |      0.80       1.34 |      1.09
--------------+----------------------+----------
        Total |       628        745 |     1,373
              |     45.74      54.26 |    100.00
              |    100.00     100.00 |    100.00

          Pearson chi2(2) =  33.5266   Pr = 0.000    (χ2, chi‐square)
 likelihood-ratio chi2(2) =  33.8162   Pr = 0.000    (likelihood‐ratio χ2)
               Cramér's V =   0.1563                 (Cramer’s V)
                    gamma =   0.3095  ASE = 0.050    (Goodman & Kruskal’s γ)
          Kendall's tau-b =   0.1553  ASE = 0.026    (Kendall’s τb)
           Fisher's exact =                 0.000    (Fisher’s exact test)

• χ2 (chi‐square) tests for relationships between variables. The null hypothesis (Ho) is that there is no relationship. To reject this we need a Pr < 0.05 (at 95% confidence). Here both chi2 are significant. Therefore we conclude that there is some relationship between perceptions of the economy and gender. lrchi2 reads the same way.

• Cramer’s V is a measure of association between two nominal variables. It goes from 0 to 1 where 1 indicates strong association (for rXc tables). In 2x2 tables, the range is ‐1 to 1. Here the V is 0.16, which shows a small association.

• Gamma and taub are measures of association between two ordinal variables (both have to be in the same direction, i.e. negative to positive, low to high). Both go from ‐1 to 1. A negative value shows an inverse relationship; values closer to 1 in absolute terms show a stronger relationship. Gamma is recommended when there are lots of ties in the data. Taub is recommended for square tables.

• Fisher’s exact test is used when there are very few cases in the cells (usually less than 5). It tests the relationship between two variables. The null is that the variables are independent. Here we reject the null and conclude that there is some kind of relationship between the variables.
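After tabulate runs, the test statistics are also available as stored results, which is handy when you want to reuse them instead of reading them off the screen. The sketch below uses the auto dataset shipped with Stata rather than the survey data shown above.

```stata
sysuse auto, clear

* Crosstab of repair record by origin with tests suited to nominal data.
tabulate rep78 foreign, chi2 lrchi2 V exact

* tabulate leaves its results in r(); display the ones of interest.
display "Pearson chi2   = " r(chi2) "  p = " r(p)
display "Cramer's V     = " r(CramersV)
display "Fisher exact p = " r(p_exact)
```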

Exploring data: descriptive statistics

For continuous data use descriptive statistics. These statistics are a collection of measurements of location and variability. Location tells you the central value of the variable (the mean is the most common measure of this). Variability refers to the spread of the data from the center value (i.e. variance, standard deviation). Statistics is basically the study of what causes such variability. We use the command tabstat to get these stats.

tabstat age sat score heightin readnews, s(mean median sd var count range min max)

. tabstat age sat score heightin readnews, s(mean median sd var count range min max)

   stats |       age       sat     score  heightin  readnews
---------+--------------------------------------------------
    mean |      25.2    1848.9  80.36667  66.43333  4.866667
     p50 |        23      1817      79.5      66.5         5
      sd |  6.870226  275.1122  10.11139  4.658573  1.279368
variance |      47.2  75686.71  102.2402   21.7023  1.636782
       N |        30        30        30        30        30
   range |        21       971        33        16         4
     min |        18      1338        63        59         3
     max |        39      2309        96        75         7

• The mean is the sum of the observations divided by the total number of observations.
• The median (p50 in the table) is the number in the middle. To get the median you have to order the data from lowest to highest. If the number of cases is odd the median is the single middle value; for an even number of cases the median is the average of the two numbers in the middle.
• The standard deviation is the square root of the variance. It indicates how close the data is to the mean. Assuming a normal distribution, about 68% of the values are within 1 sd of the mean, 95% within 2 sd and 99.7% within 3 sd.
• The variance measures the dispersion of the data from the mean. It is the average of the squared distances from the mean (Stata divides by n − 1 rather than n).
• Count (N in the table) refers to the number of observations per variable.
• Range is a measure of dispersion. It is the difference between the largest and smallest value, max – min.
• Min is the lowest value in the variable.
• Max is the largest value in the variable.

Type help tabstat for a complete list of descriptive statistics
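The relationships among these statistics can be verified from the stored results of summarize. The sketch below uses the auto dataset shipped with Stata and the mpg variable, chosen only so that it runs without the students file.

```stata
sysuse auto, clear

* detail adds percentiles (including the median) to the stored results.
quietly summarize mpg, detail

display "mean     = " r(mean)
display "median   = " r(p50)
display "variance = " r(Var)
* The standard deviation should equal the square root of the variance.
display "sd       = " r(sd) "   sqrt(variance) = " sqrt(r(Var))
display "range    = " (r(max) - r(min))
```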

Exploring data: descriptive statistics

You could also estimate descriptive statistics by subgroups (e.g. gender, age, etc.)

tabstat age sat score heightin readnews, s(mean median sd var count range min max) by(gender)

. tabstat age sat score heightin readnews, s(mean median sd var count range min max) by(gender)

Summary statistics: mean, p50, sd, variance, N, range, min, max
  by categories of: gender (Gender)

  gender |       age       sat     score  heightin  readnews
---------+--------------------------------------------------
  Female |      23.2    1871.8  78.73333      63.4       5.2
         |        20      1821        79        63         5
         |  6.581359   307.587  10.66012  3.112188  1.207122
         |  43.31429  94609.74  113.6381  9.685714  1.457143
         |        15        15        15        15        15
         |        20       971        32         9         4
         |        18      1338        63        59         3
         |        38      2309        95        68         7
---------+--------------------------------------------------
    Male |      27.2      1826        82  69.46667  4.533333
         |        28      1787        82        71         4
         |  6.773899  247.0752  9.613978  3.943651  1.302013
         |  45.88571  61046.14  92.42857  15.55238  1.695238
         |        15        15        15        15        15
         |        21       845        31        12         4
         |        18      1434        65        63         3
         |        39      2279        96        75         7
---------+--------------------------------------------------
   Total |      25.2    1848.9  80.36667  66.43333  4.866667
         |        23      1817      79.5      66.5         5
         |  6.870226  275.1122  10.11139  4.658573  1.279368
         |      47.2  75686.71  102.2402   21.7023  1.636782
         |        30        30        30        30        30
         |        21       971        33        16         4
         |        18      1338        63        59         3
         |        39      2309        96        75         7

Type help tabstat for more options.

Examples of frequencies and crosstabulations

Frequencies (tab command)

. tab gender

     Gender |      Freq.     Percent        Cum.
------------+-----------------------------------
     Female |         15       50.00       50.00
       Male |         15       50.00      100.00
------------+-----------------------------------
      Total |         30      100.00

In this sample we have 15 females and 15 males. Each represents 50% of the total cases.

Crosstabulations (tab with two variables)

. tab gender studentstatus, column row

           |    Student Status
    Gender |  Graduate  Undergrad |     Total
-----------+----------------------+----------
    Female |         5         10 |        15
           |     33.33      66.67 |    100.00
           |     33.33      66.67 |     50.00
-----------+----------------------+----------
      Male |        10          5 |        15
           |     66.67      33.33 |    100.00
           |     66.67      33.33 |     50.00
-----------+----------------------+----------
     Total |        15         15 |        30
           |     50.00      50.00 |    100.00
           |    100.00     100.00 |    100.00

. tab gender major, sum(sat)

Means, Standard Deviations and Frequencies of SAT

           |              Major
    Gender |      Econ       Math   Politics |     Total
-----------+---------------------------------+----------
    Female | 1952.3333     1762.5       2030 |    1871.8
           | 312.43773  317.99326  262.25052 | 307.58697
           |         3          8          4 |        15
-----------+---------------------------------+----------
      Male | 1743.2857       2170  1807.8333 |      1826
           |  155.6146  72.124892  288.99994 | 247.07518
           |         7          2          6 |        15
-----------+---------------------------------+----------
     Total |      1806       1844     1896.7 |    1848.9
           | 219.16559  329.76928  287.20687 | 275.11218
           |        10         10         10 |        30

Average SAT scores by gender and major. Notice that ‘sat’ is a continuous variable. The first cell reads: the average SAT score for a female whose major is econ is 1952.3333 with a standard deviation of 312.44; there are only 3 females with a major in econ.

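Three way crosstabs

bysort var3: tab var1 var2, column row

bysort studentstatus: tab gender major, column row

. bysort studentstatus: tab gender major, column row

-> studentstatus = Graduate

           |              Major
    Gender |      Econ       Math   Politics |     Total
-----------+---------------------------------+----------
    Female |         0          2          3 |         5
           |      0.00      40.00      60.00 |    100.00
           |      0.00      66.67      37.50 |     33.33
-----------+---------------------------------+----------
      Male |         4          1          5 |        10
           |     40.00      10.00      50.00 |    100.00
           |    100.00      33.33      62.50 |     66.67
-----------+---------------------------------+----------
     Total |         4          3          8 |        15
           |     26.67      20.00      53.33 |    100.00
           |    100.00     100.00     100.00 |    100.00

-> studentstatus = Undergraduate

           |              Major
    Gender |      Econ       Math   Politics |     Total
-----------+---------------------------------+----------
    Female |         3          6          1 |        10
           |     30.00      60.00      10.00 |    100.00
           |     50.00      85.71      50.00 |     66.67
-----------+---------------------------------+----------
      Male |         3          1          1 |         5
           |     60.00      20.00      20.00 |    100.00
           |     50.00      14.29      50.00 |     33.33
-----------+---------------------------------+----------
     Total |         6          7          2 |        15
           |     40.00      46.67      13.33 |    100.00
           |    100.00     100.00     100.00 |    100.00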

Three way crosstabs with summary statistics of a fourth variable

Average SAT scores by gender and major for graduate and undergraduate students. The third cell reads: the average SAT score of a female graduate student whose major is politics is 2092.6667 with a standard deviation of 282.14; there are 3 graduate female students with a major in politics.

. bysort studentstatus: tab gender major, sum(sat)

-> studentstatus = Graduate

           |              Major
    Gender |      Econ       Math   Politics |     Total
-----------+---------------------------------+----------
    Female |         .       1777  2092.6667 |    1966.4
           |         .  373.35238  282.13531 | 323.32924
           |         0          2          3 |         5
-----------+---------------------------------+----------
      Male |   1659.25       2221     1785.6 |    1778.6
           | 154.66819          0  317.32286 |  284.3086
           |         4          1          5 |        10
-----------+---------------------------------+----------
     Total |   1659.25       1925    1900.75 |    1841.2
           | 154.66819  367.97826   324.8669 | 300.38219
           |         4          3          8 |        15

-> studentstatus = Undergraduate

           |              Major
    Gender |      Econ       Math   Politics |     Total
-----------+---------------------------------+----------
    Female | 1952.3333  1757.6667       1842 |    1824.5
           | 312.43773  337.01197          0 | 305.36872
           |         3          6          1 |        10
-----------+---------------------------------+----------
      Male | 1855.3333       2119       1919 |    1920.8
           | 61.711695          0          0 | 122.23011
           |         3          1          1 |         5
-----------+---------------------------------+----------
     Total | 1903.8333  1809.2857     1880.5 |    1856.6
           | 208.30979  336.59952  54.447222 | 257.72682
           |         6          7          2 |        15
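The same bysort pattern works on any dataset with three categorical variables and one continuous variable. The sketch below uses the auto dataset shipped with Stata; the variable heavy is made up for illustration.

```stata
sysuse auto, clear

* heavy is a hypothetical grouping variable: 1 for cars above 3,000 lbs, 0 otherwise.
generate heavy = weight > 3000

* Mean, standard deviation and count of mpg in each rep78-by-heavy cell,
* repeated separately for domestic and foreign cars.
bysort foreign: tabulate rep78 heavy, summarize(mpg)
```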

First steps: Quick way of finding variables (lookfor)

You can use the command lookfor to find variables in a dataset, for example you want to see which variables refer to education, type:

lookfor educ

lookfor will look for the keyword ‘educ’ in the variable name and labels. You will need to be creative with your keyword searches to find the variables you need.

It is always recommended to use the codebook that comes with the dataset to have a better idea of where things are.

. lookfor educ

                 storage  display    value
variable name      type   format     label     variable label
educ               byte   %10.0g               Education of R.

First steps: Subsetting using conditional ‘if’

Sometimes you may want to get frequencies, crosstabs or run a model just for a particular group (let's say just for females or people younger than a certain age). You can do this by using the conditional ‘if’, for example:

/*Frequencies of var1 when gender = 1*/
tab var1 if gender==1

/*Frequencies of var1 when gender = 1 and age < 33*/
tab var1 if gender==1 & age<33

/*Frequencies of var1 when gender = 1 and marital status is 2, 3 or 4. The parentheses are needed because & is evaluated before | */
tab var1 if gender==1 & (marital==2 | marital==3 | marital==4)

/*You can do the same with crosstabs, where the options column and row apply*/
tab var1 var2 if gender==1, column row

/*Regression when gender = 1 and age < 33*/
regress y x1 x2 if gender==1 & age<33, robust

/*Scatterplots when gender = 1 and age < 33*/
scatter var1 var2 if gender==1 & age<33

“if” goes at the end of the command BUT before the comma that separates the options from the command.
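The effect of parentheses in a compound condition is easy to see by counting observations. The sketch below uses the auto dataset shipped with Stata, so the variables differ from the var1, gender and marital placeholders above.

```stata
sysuse auto, clear

* One-way frequencies for foreign cars only.
tabulate rep78 if foreign == 1

* Crosstab restricted to cheaper cars; options come after the if condition.
tabulate rep78 foreign if price < 6000, column row

* With parentheses: foreign cars whose repair record is 4 or 5.
count if foreign == 1 & (rep78 == 4 | rep78 == 5)

* Without parentheses & is evaluated first, so this also counts
* every car with rep78 == 5 regardless of origin.
count if foreign == 1 & rep78 == 4 | rep78 == 5
```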

Graphs: scatterplot

Scatterplots are good to explore possible relationships or patterns between variables and to identify outliers. Use the command scatter (sometimes adding twoway is useful when adding more graphs). The format is scatter y x. Below we check the relationship between SAT scores and age. For more details type help scatter.

twoway scatter sat age

twoway scatter sat age, mlabel(last)

twoway scatter sat age, mlabel(last) || lfit sat age

twoway scatter sat age, mlabel(last) || lfit sat age, yline(1800) xline(30)

[Figures: four scatterplots of SAT (y-axis, 1400 to 2400) against Age (x-axis, 20 to 40). Points are labeled DOE01 to DOE30 when mlabel(last) is used; the last two plots add a fitted line (legend: SAT, Fitted values).]

Graphs: scatterplot

twoway scatter sat age, mlabel(last) by(major, total)

By categories

[Figure: scatterplots of SAT (y-axis, 1000 to 2500) against Age (x-axis, 20 to 40) in four panels: Econ, Math, Politics and Total. Points are labeled DOE01 to DOE30. Legend: SAT, Fitted values. Graphs by Major.]

Go to http://www.princeton.edu/~otorres/Stata/ for additional tips
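The scatterplot commands on these slides can be tried on any dataset. The sketch below uses the auto dataset shipped with Stata, with mpg and weight standing in for sat and age; the reference line values 20 and 3000 are arbitrary.

```stata
sysuse auto, clear

* Basic scatterplot: y variable first, then x.
twoway scatter mpg weight

* Label each point and overlay a fitted line.
twoway scatter mpg weight, mlabel(make) || lfit mpg weight

* Reference lines and one panel per category, plus a panel for the total.
twoway (scatter mpg weight) (lfit mpg weight), yline(20) xline(3000) by(foreign, total)
```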

Graphs: histogram

Histograms are another good way to visually explore data, especially to check for a normal distribution. Type help histogram for details.

histogram age, frequency

histogram age, frequency normal

[Figures: two histograms of Age (x-axis, 20 to 40) with Frequency on the y-axis (0 to 15); the second overlays a normal density curve.]
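A few common variations are sketched below on the auto dataset shipped with Stata; the choice of 10 bins is arbitrary.

```stata
sysuse auto, clear

* Frequencies on the y-axis with a normal curve overlaid.
histogram mpg, frequency normal

* Fix the number of bins and show percentages instead of counts.
histogram mpg, bin(10) percent

* One histogram per category.
histogram mpg, by(foreign)
```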

Frequently used Stata commands

| Category | Stata commands |
|---|---|
| Getting on‐line help | help, search |
| Operating‐system interface | pwd, cd, sysdir, mkdir, dir / ls, erase, copy, type |
| Using and saving data from disk | use, clear, save, append, merge, compress |
| Inputting data into Stata | input, edit, infile, infix, insheet |
| The Internet and Updating Stata | update, net, ado, news |
| Basic data reporting | describe, codebook, inspect, list, browse, count, assert, summarize, table, tabulate (tab) |
| Data manipulation | generate, replace, egen, recode, rename, drop, keep, sort, encode, decode, order, by, reshape |
| Formatting | format, label |
| Keeping track of your work | log, notes |
| Convenience | display |

Type help [command name] in the command window for details

Source: http://www.ats.ucla.edu/stat/stata/notes2/commands.htm

Is my model OK? (links)

Regression diagnostics: A checklist
http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm

Logistic regression diagnostics: A checklist
http://www.ats.ucla.edu/stat/stata/webbooks/logistic/chapter3/statalog3.htm

Time series diagnostics: A checklist (pdf)
http://homepages.nyu.edu/~mrg217/timeseries.pdf

Time series: dfuller test for unit roots (for R and Stata)
http://www.econ.uiuc.edu/~econ472/tutorial9.html
http://dss.princeton.edu/training/TS101.pdf#page=19

Panel data tests: heteroskedasticity and autocorrelation

– http://www.stata.com/support/faqs/stat/panel.html
– http://www.stata.com/support/faqs/stat/xtreg.html
– http://www.stata.com/support/faqs/stat/xt.html
– http://dss.princeton.edu/online_help/analysis/panel.htm

I can’t read the output of my model!!! (links)

Data Analysis: Annotated Output
http://www.ats.ucla.edu/stat/AnnotatedOutput/default.htm

Data Analysis Examples
http://www.ats.ucla.edu/stat/dae/

Regression with Stata
http://www.ats.ucla.edu/STAT/stata/webbooks/reg/default.htm

Regression
http://www.ats.ucla.edu/stat/stata/topics/regression.htm

How to interpret dummy variables in a regression
http://www.ats.ucla.edu/stat/Stata/webbooks/reg/chapter3/statareg3.htm

How to create dummies
http://www.stata.com/support/faqs/data/dummy.html
http://www.ats.ucla.edu/stat/stata/faq/dummy.htm

Logit output: what are the odds ratios?
http://www.ats.ucla.edu/stat/stata/library/odds_ratio_logistic.htm

Topics in Statistics (links)

What statistical analysis should I use?
http://www.ats.ucla.edu/stat/mult_pkg/whatstat/default.htm

Statnotes: Topics in Multivariate Analysis, by G. David Garson
http://www2.chass.ncsu.edu/garson/pa765/statnote.htm

Elementary Concepts in Statistics
http://www.statsoft.com/textbook/stathome.html

Introductory Statistics: Concepts, Models, and Applications
http://www.psychstat.missouristate.edu/introbook/sbk00.htm

Statistical Data Analysis
http://math.nicholls.edu/badie/statdataanalysis.html

Stata Library. Graph Examples (some may not work with STATA 10)
http://www.ats.ucla.edu/STAT/stata/library/GraphExamples/default.htm

Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS
http://www.indiana.edu/~statmath/stat/all/ttest/

Useful links / Recommended books

• DSS Online Training Section http://dss.princeton.edu/training/
• UCLA Resources to learn and use STATA http://www.ats.ucla.edu/stat/stata/
• DSS help‐sheets for STATA http://dss/online_help/stats_packages/stata/stata.htm
• Introduction to Stata (PDF), Christopher F. Baum, Boston College, USA. “A 67‐page description of Stata, its key features and benefits, and other useful information.” http://fmwww.bc.edu/GStat/docs/StataIntro.pdf
• STATA FAQ website http://stata.com/support/faqs/
• Princeton DSS Libguides http://libguides.princeton.edu/dss

Books

• Introduction to econometrics / James H. Stock, Mark W. Watson. 2nd ed., Boston: Pearson Addison Wesley, 2007.

• Data analysis using regression and multilevel/hierarchical models / Andrew Gelman, Jennifer Hill. Cambridge ; New York : Cambridge University Press, 2007.

• Applied Regression Analysis and Generalized Linear Models, Second Edition. John Fox, Sage, 2008

• Econometric analysis / William H. Greene. 6th ed., Upper Saddle River, N.J. : Prentice Hall, 2008.

• Designing Social Inquiry: Scientific Inference in Qualitative Research / Gary King, Robert O. Keohane, Sidney Verba, Princeton University Press, 1994.

• Unifying Political Methodology: The Likelihood Theory of Statistical Inference / Gary King, Cambridge University Press, 1989

• Statistical Analysis: an interdisciplinary introduction to univariate & multivariate methods / Sam Kachigan, New York : Radius Press, c1986

• Statistics with Stata (updated for version 9) / Lawrence Hamilton, Thomson Brooks/Cole, 2006
