
© 2003-Present Hun Myoung Park (1/26/2013) Statistical and Econometric Data Analyses in Stata: 1

http://www.sonsoo.org

CHAPTER 1

INTRODUCTION

This chapter introduces Stata by describing its basic features, installation, updates, memory configuration, and online help.

1.1 What is Stata?

Stata is an integrated data analysis package for managing, analyzing, and graphing data. Like SAS and SPSS, Stata supports most statistical and econometric analyses that are frequently used in various fields. It also has many features for data management, graphics, matrix operations, and programming. The statistical and econometric analyses Stata supports include:

  Linear Regression Models (Ordinary Least Squares)
  Generalized Linear Models
  Categorical Dependent Variable Models (Logit/Probit Models)
  Panel Data Models
  Event Count Data Models
  Time Series Analysis
  Tobit and Survival Analyses
  T-test and ANOVA
  Multivariate Analyses
  Nonparametric Methods
  Sampling and Simulations

Unlike SAS and SPSS, Stata is basically a command-driven package in which users type in a command and hit ENTER to run it. This interactive mode provides highly flexible and efficient ways of communication. Stata also supports a point-and-click graphical user interface (GUI), batch processing (non-interactive mode), and programming; for instance, users can write their own commands.

1.2 Stata Flavors

Stata is available on a variety of platforms and in several flavors. Stata runs on UNIX and UNIX-like systems (e.g., Mac OS X, Linux, AIX, HP-UX, Irix, and Solaris) as well as Microsoft Windows. Stata has four different flavors. Stata/MP (Multiprocessor) and Stata/SE (Special Edition) are the most powerful in the sense that they can handle large data sets and matrices fast. Stata/MP supports parallel processing using multiprocessors or multi-core processors (e.g., dual core and quad core). Intercooled Stata (Stata/IC), a standard version between Stata/SE and Small Stata, provides moderate capacity for ordinary users. Small Stata is very limited in its capacity; for instance, it supports up to 99 variables. Table 1.1 summarizes major features of the three major flavors.

Table 1.1 Comparison of Three Major Flavors

                        Stata/MP               Stata/SE               Stata/IC
Observations            Limited by resources   Limited by resources   Limited by resources
Max # Variables         32,767                 32,767                 2,047
Max # Right-hand Vars   10,998                 10,998                 798
Dataset Width           393,192                393,192                24,564
Command                 1,081,527 characters   1,081,527 characters   67,800 characters
Macro                   1,081,511 characters   1,081,511 characters   67,800 characters
String Variable         244 characters         244 characters         80 characters
Matrices                11,000 by 11,000       11,000 by 11,000       800 by 800
One-way Table           12,000                 12,000                 3,000
Two-way Table           12,000 by 80           12,000 by 80           300 by 20

You may check the current version by executing the .about or .version command. Type in “about” in the Stata command window and hit ENTER to get the result. Note that the period (.) is the Stata prompt.

. about
Stata/SE 12.1 for Mac (64-bit Intel)
Revision 18 Dec 2012
Copyright 1985-2011 StataCorp LP
45-user Stata network perpetual license:
…

1.3 Installing Stata

Make sure you have the serial number, license code, and authorization key. This information is not required during installation, but must be provided when you first run Stata after installation. Once the Stata installer begins running, just follow the instructions provided. You are asked to choose the directory in which Stata is installed; the default is C:\STATA8. Then, you need to choose the flavor of Stata. If your license is for Intercooled Stata, Stata/SE and Small Stata will not work. Click the icon of your license. Before copying Stata files to your hard disk, Stata may ask you to choose the working directory; the default is C:\DATA (see footnote 1).

Once installation is complete, run Stata and type in the serial number, license code, and authorization key. You are also asked to type in your name and institution so that they appear on the screen when Stata is launched. If you wish to verify that Stata is successfully installed, execute the .verinst command. If installation was performed correctly, you may see the following message.

. verinst
Stata/SE 12.1 for Mac (64-bit Intel)
Revision 18 Dec 2012
Copyright 1985-2011 StataCorp LP
45-user Stata network perpetual license:

1 You may check the working directory at the left bottom of the Stata window.



There may be user-written ado files that you are interested in. You may search for them and install them using the .net command. The following commands download and install the SPOST module written by J. Scott Long for categorical dependent variable models.

. net from http://www.indiana.edu/~jslsoc/stata/
. net install spostado

See Section 1.9, Managing User-Written Files, for the details.

1.4 Starting and Terminating Stata

In Microsoft Windows, you can launch Stata by clicking the Stata icon in the Windows Start menu. Under the X Window System, you have to execute xstata at the X terminal prompt to get Stata’s main windows.

$ xstata

On UNIX machines, you need to type stata at the UNIX prompt to start Stata in interactive mode.

$ stata

If you want to run a batch job in non-interactive mode, type the following at the UNIX prompt.

$ stata -b do cigar.do

You need to replace “cigar.do” with your do-file name. The default extension “.do” can be omitted. To terminate Stata, type exit in the command window (GUI) or at the command line (UNIX). Alternatively, you may choose File → Exit (Alt+F4) or click the close button at the upper right corner of the Stata window under Microsoft Windows and X Window.

. exit
. exit, clear

Note that if you wish to terminate Stata without saving any changes, add the clear option as in the second command above.

1.5 Stata Windows

In X Window and Microsoft Windows, you have four default windows when Stata is first launched: the Command window, the Results window, the Variables window, and the Review window. The Command window is where you type in commands. The Results window is where results are displayed. The Variables window at the left bottom lists the variable names in the current dataset. The Review window lists the commands executed so far.

There are three more windows. You may click the Window menu to see all windows available in Stata (see the left screenshot). The Viewer window shows the contents of Stata help or text files. The Data Editor window displays data in a spreadsheet style so that users can check and correct them. Finally, the Do-file Editor window allows users to write .do or .ado programs.

1.6 Managing Memory

Stata loads a dataset into computer memory (including virtual memory), but it does not automatically use all the memory available in your computer. Stata/SE by default assigns 10MB. To check the memory currently reserved for Stata, run the .memory command.

. memory

When you try to read a dataset larger than the current memory size, Stata might give you a warning message such as:

No room to add more observations

If this happens, you will need to increase the memory size using the .set memory command so that Stata has enough room for the large data set. Current versions of Stata manage memory automatically by default. You can also change the maximum number of variables and the matrix size.

. set maxvar 10000
. set matsize 1000

However, increasing the memory size does not always improve the overall performance of Stata. The optimal memory size depends upon computing resources and the size of the dataset.

1.7 Default Extensions

The following table summarizes Stata’s default extensions, which are often omitted.

Default Extension   File Types                     Related Commands
.dta                Stata data file                .use and .save
.do                 Stata do file                  .do and .doedit
.ado                Automatically loaded do file   .doedit
.log                Log file in text mode          .log
.smcl               Log file in SMCL format        .log
.raw                ASCII text file                .infile, .infix, and .insheet
.out                Files saved by the .outsheet   .outsheet
.dct                Stata dictionary file          .infix
.gph                Graph image                    .graph
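As a small illustration of how these default extensions are filled in, the sketch below loads the auto example dataset that ships with Stata and writes a few files to the current working directory. The file names auto_copy and session are hypothetical, and the session is only an assumed one, not part of the original examples.

```stata
* Sketch only: auto_copy and session are hypothetical file names
sysuse auto, clear
* With the text option, log using writes session.log
log using session, text replace
describe, short
* save and use assume the .dta extension when it is omitted
save auto_copy, replace
use auto_copy, clear
* outsheet writes a text file with the default .out extension
outsheet make price mpg using auto_copy, replace
log close
```

Typing the extensions explicitly (e.g., save auto_copy.dta, replace) should give the same files; omitting them simply saves keystrokes.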

1.8 Updating Official Stata Files (.update)

Stata supports Internet functionality through the .update, .net, and .ado commands. These commands make it easier to update official Stata files and user-written files through the Internet. Accordingly, your computer needs to be connected to the network in order to use these commands. The .update command is used to update official Stata files, that is, the Stata executable file and the ado files produced by StataCorp. The .update command without any option reports the current update status and gives a recommendation, if necessary.



. update
Stata executable
    folder:               C:\Wins\Stata8\
    name of file:         wstata.exe
    currently installed:  24 Apr 2003

Ado-file updates
    folder:               C:\Wins\Stata8\ado\updates\
    names of files:       (various)
    currently installed:  09 Sep 2003

Recommendation
    Type -update query- to compare these dates with what is available from http://www.stata.com.

The all option compares the current ado files and the Stata executable file with those available from StataCorp, and then downloads and installs the update files, if necessary.

. update all

You may check ado files or the Stata executable separately. The executable option compares the current Stata executable file with the corresponding official update, and then downloads it, if necessary. The ado option updates ado files only. The all option is therefore more convenient than the executable and ado options and is recommended.

. update ado
. update executable
(contacting http://www.stata.com)

Executable update log
    1. verifying "C:\Wins\Stata8\" is writeable
    2. downloading new executable

New executable successfully downloaded

Instructions
    1. Type -update swap-

Stata stores the new executable as wstata.bin in the folder where wstata.exe is located. You may manually delete the old executable file and rename the new executable, but the .update swap command performs that task for you.

1.9 Managing User-Written Files (.net and .ado)

The .net command allows users to find useful user-written resources on the Internet or other media, and then download and install them in Stata. The resources include packages and ancillary files. A package in Stata is a collection of ado and help files that provides a new feature in the form of a command. Ancillary files are additional files, such as datasets and example files.

. net search spost

The above command searches packages associated with the keyword “spost,” and then lists them with their URLs.



Once you find a useful one, you may view the contents of the package using the .net from command.

. net from http://www.indiana.edu/~jslsoc/stata/

From the list of contents, choose the one you need. Then, use the .net describe command to get more information about the package.

. net describe spostado
. net describe spostst8

You may check that “spostado” is a package including ado files and their help files, and that “spostst8” is a set of example files. You need to install the package “spostado” using the .net install command and copy the set of ancillary files “spostst8” using the .net get command. The two commands are not interchangeable.

. net install spostado
. net get spostst8

If you know the right URLs, specify them using the from option.

. net install spostado, from(http://www.indiana.edu/~jslsoc/stata)
. net get spostst8, from(http://www.indiana.edu/~jslsoc/stata)

Now you are ready to use the commands supported by the package. You may double-check whether the package is available using the .ado command. The .ado command manages the packages you have installed using the .net command; it allows you to list and remove installed packages.

. ado
. ado describe spostado
. ado uninstall spostado

The first command lists the packages you have installed. The second shows the contents of the package, while the third removes it.

You may load or copy a dataset from web pages.

. use http://mypage.iu.edu/~kucc625/documents/cancer.dta
. copy http://mypage.iu.edu/~kucc625/documents/cancer.dta c:\cancer.dta
. type http://mypage.iu.edu/~kucc625/documents/cancer.txt

Note that the last command displays the contents of the ASCII text file “cancer.txt.”

1.10 Backward Compatibility

Stata 8.0 introduced new or remarkably enhanced features that were not supported in previous releases. Among these features are the point-and-click command mode and the .graph command. You may check the differences across Stata releases by running the following command.

. help version

The .graph command in Stata 8.0 provides higher quality graphs at the expense of changes in its syntax. In other words, the old .graph syntax does not work correctly in release 8. If you still wish to use the old-style syntax in Stata 8, you can either set the command interpreter to an earlier version with the .version command, or use the .graph7 command instead of .graph.

. version 7
. graph score, bin(10) normal

The above commands draw a histogram of the variable “score” with a normal curve overlaid. They are equivalent to the following command.

. graph7 score, bin(10) normal

You may change the command interpreter back to release 8.0 by specifying the version number. The following .histogram command is equivalent to the above graph commands.

. version 8
. histogram score, bin(10) normal
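The .version command can also be used as a prefix, which runs a single command under an earlier interpreter without changing the version for the rest of the session. The lines below are a minimal sketch, assuming the auto example dataset shipped with Stata; the variable mpg merely stands in for “score.”

```stata
* Assumption: the auto example dataset replaces the author's data
sysuse auto, clear
* Run only this command as release 8 would interpret it
version 8: histogram mpg, bin(10) normal
* Display the interpreter version in effect for the session
version
```

Placing a .version line at the top of a do-file serves a similar purpose for a whole batch job.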

1.11 On-line Help and Internet Resources

Stata’s help system is well organized and resourceful. The .help command lists help contents for commands and functions in the Results window. You may add the name of a command whose usage you want to know. The second command below, for instance, shows the syntax, explanation, and examples of the .regress command.

. help
. help regress

You may look at help in the point-and-click mode by running the .view help command. The Stata Viewer allows you to navigate the entire command system organized in a hierarchical order.

. view help

You may check what has been added since release 8.0 by running the following command.

. help whatsnew

Stata provides various services through the Internet.

http://www.stata.com (Stata Webpage)
http://www.stata.com/support/ (Support)
http://www.stata.com/support/faqs/ (Frequently Asked Questions)
http://www.stata-journal.com/ (Stata Journal)
http://www.stata-press.com/ (Stata Bookstore)

You may also find very useful Internet resources for using Stata.

http://www.princeton.edu/~erp/stata/main.html (Princeton University)
http://www.indiana.edu/~jslsoc (J. Scott Long, Indiana University)
http://sobek.colorado.edu/LAB/STATS/stata_help.html (University of Colorado)



CHAPTER 2

COMMAND, OPERATOR, AND FUNCTION

This chapter explains how to communicate with Stata in three different ways. Then, major commands, operators, functions, and data types are listed. In addition, it covers how to specify a subset of a dataset, how to repeat a command on groups, and how to use the .display command as a calculator and a probability distribution table.

2.1 Interface Modes

Stata supports interactive, non-interactive (batch), and graphical user interface modes.

2.1.1 Interactive mode

Stata is basically a command-driven application. In other words, users need to type in a command and hit ENTER to run it, as in UNIX and DOS. Then, Stata interprets the command, processes the job, and returns its result to users. This interactive mode enables users to communicate with Stata step by step. GAUSS, S-Plus, Matlab, and Maple also use this interactive mode of communication. Stata’s systematic grammar structure and abbreviation rules provide highly flexible and efficient ways of communication. The following command runs a linear regression of “lung” on “cigar.”

. regress lung cigar

This mode has several advantages. First, it makes many tasks efficient, such as recoding variables and listing observations. Stata comes in handy especially for “data cooking.” Imagine that you can use Stata as a calculator or as a set of probability distribution tables; see Section 2.11 for the details. The second strength originates from the way that interpreters work. Unlike compilers, the Stata command interpreter keeps analysis results in memory even after executing commands so that users can conduct necessary follow-up analyses without running the entire analysis again. For example, you can run a linear regression model and check its results. You may then get predicted values using the .predict command and conduct hypothesis tests using the .test command. This is possible because the coefficient matrix and the variance-covariance matrix remain in memory. In SAS and SPSS, by contrast, you have to run the regression again after making proper changes for predicted values and hypothesis testing.

2.1.2 Non-interactive mode (batch mode)

The non-interactive mode runs a set of commands written in a text file. Classical statistical software such as SAS and SPSS uses this mode of communication. Stata’s non-interactive mode supports two kinds of programs: “.do” and “.ado” files. Users can write a “.do” file, a batch file in which a set of Stata commands is organized. You may run all the commands individually in the Stata command window, but writing a do file is more efficient, especially when you have a bundle of commands to be repeated many times.

Like C and Java program sources, Stata programs (i.e., do and ado files) may be written in a text editor (e.g., Notepad) or a word processor (e.g., WordPerfect), but they should be stored in plain ASCII text format. Of course, you may use the Stata Do-file Editor by clicking Window → Do-file Editor or pressing Ctrl+8. Alternatively, run the .doedit command or click the Do-file Editor icon.

. doedit
. doedit cigar.do

The first command above creates a new .do file, while the second opens an existing .do file, “cigar.do,” for editing.

Once a .do file is ready (edited and saved), you can execute the batch job by running the .do command in the command window. Alternatively, as in SAS, you may choose Tools → Do (Ctrl+D) or click the corresponding icon in the Do-file Editor window. When you wish to execute only a part of the commands, highlight the block of commands using a mouse, and choose Tools → Do Selection.

. do cancer.do

Another type of program is the “.ado” file, a source of Stata commands. Put differently, many Stata commands, such as .logit, .regress, and .recode, are based on ado files. StataCorp provides basic .ado files that are installed under the “…stata8\ado\base” directory. Users can write .ado programs as well; that is, users can add their own commands to Stata. Unlike .do files, .ado programs need to be written in the Stata ado language, which looks similar to C.



2.1.3 GUI (Graphical User Interface)

Users can benefit from the point-and-click environment, which has been supported since Stata 8.0. Users pull down Stata menus and select an item to invoke a dialog box that runs a command. Stata’s GUI builds a command on the basis of the information provided in the dialog box. The command is then echoed in the Results window, allowing users to compare the point-and-click operation with its corresponding command. The GUI mode is therefore quite useful, in particular for Stata beginners. Most statistical software (e.g., SAS and SPSS) nowadays supports this mode of communication.

Users may use shortcuts instead of pointing and clicking menus. For example, Ctrl+S (pressing the S key while holding down the Ctrl key) is equivalent to choosing File → Save. Interestingly, you may invoke a dialog box by executing the .db command instead of using pull-down menus.

. db regress

The above command is equivalent to choosing Statistics → Linear Regression and related → Linear Regression (see the screenshot).

2.2 Rules of Commands

2.2.1 Stata is case-sensitive. Commands are in lowercase. In other words, “REGRESS” and “Regress” do not work at all; use “regress.”



2.2.2 Stata commands, variable names, and options can be abbreviated to the shortest string of characters that uniquely identifies them. This highly flexible abbreviation is one of Stata’s fascinating features. The minimum abbreviations are underlined in help and manuals (e.g., ta for tabulate). For example, the .regress command can be abbreviated as .reg, .regr, .regre, and .regres. Similarly, the nolabel option can be reduced to nol. However, some commands such as .replace cannot be abbreviated. A variable name “gender” may be referred to as “gen” unless other variable names in the current dataset begin with “gen” (e.g., “gene” and “genre”). You may use wildcards (i.e., ?, *, and ~) when abbreviating variable names. See Section 2.4 for the details of wildcards.

2.2.3 Syntax Structure: In general, a Stata command consists of (a) a command, (b) a list of variables, (c) qualifiers, and (d) options. Some commands may have their own subcommands. The in and if qualifiers are used to specify the subset of a dataset to which a command is applied. See Section 2.8 for the details of the qualifiers.

. list
. list state-lung k*
. list if area==4
. list in 10/l
. list, nolabel noobs separator(10)
. list state-lung k* in 10/l if area==4, nol noo sep(10)

Omitting a list of variables implies all variables (the first command). You may use wildcards when listing variables (the second). The third and fourth are examples of the if and in qualifiers. The fifth shows how a series of options is listed. The last combines all of these components of a command. See Chapter 7 for the details of the .list command.

2.2.4 A dependent variable precedes a set of independent variables. In the following example, “yesno” is the dependent variable, whereas “income,” “education,” and “occupation” are independent variables.

. logit yesno income education occupation if gender==1, robust
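The rules in 2.2.2 through 2.2.4 can be tried out on any dataset. The lines below are a sketch that assumes the auto example dataset shipped with Stata rather than the author's data, so the variable names (make, price, mpg, weight, foreign) are not from the original text.

```stata
* Assumption: the auto example dataset shipped with Stata
sysuse auto, clear
* Command and in qualifier only: all variables are listed
list in 1/3
* Variable list with a range (make-mpg) and a wildcard (t*)
list make-mpg t* in 1/3
* if qualifier
list make price if foreign==1 & mpg>=30
* Options follow the comma
list make price in 1/5, noobs separator(5)
* All components together, using abbreviations
l make-mpg t* in 1/20 if foreign==0, noo sep(10)
* Dependent variable first, then the independent variables
regress price mpg weight if foreign==0, robust
```

In this dataset make-mpg expands to make, price, and mpg, and t* matches trunk and turn, which illustrates how ranges and wildcards follow the order and names of the variables in memory.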

2.2.5 Comma: A command and its options should be separated by a comma. However, there is no comma within the list of variables or the list of options.

. tabulate grade degree, chi2 expected gamma

Note that in the above .tabulate command, chi2, expected, and gamma are all options that may be omitted (see footnote 2).

2 The “chi2” conducts chi-square test; the “expected” computes the expected frequencies of cells; the “gamma” shows the gamma statistic, a measure of association for ordinal variables.

2.3 Major Commands

This section classifies major Stata commands in comparison with those of SAS and SPSS.

2.3.1 Descriptive Statistics

Stata Commands                SAS Procedures
summarize; tabstat; inspect   UNIVARIATE; CAPABILITY
sktest; swilk; sfrancia       UNIVARIATE
summarize; tabstat            MEANS; SUMMARY
tabulate                      FREQ
tabulate                      TABULATE
list; browse                  PRINT; REPORT
graph; dotplot; histogram     CHART; PLOT

2.3.2 Data Management

Stata Commands              SAS Procedures
describe                    CONTENTS
use; save; edit             DATA (SET)
generate; replace; recode   DATA
keep; drop                  DATA (KEEP; DROP)
label; format               DATA; FORMAT
append; merge               DATA (MERGE)
rename                      DATA (RENAME)
collapse; reshape           MEANS
infile; insheet; infix      DATA (INFILE); IMPORT
outfile; outsheet           EXPORT
odbc                        SQL (SAS/SQL)
sort; order                 SORT

2.3.3 Regression Models

Stata Commands               SAS Procedures
regress                      REG
logistic; logit; probit      LOGISTIC; PROBIT
ologit; mlogit; clogit       GENMOD; CATMOD; MDC
nl                           NLIN
tobit; streg; stcox          LIFEREG; PHREG
ivreg; mvreg; reg3; sureg    SYSLIN
poisson; nbreg; zip; zinb    GENMOD

2.3.4 ANOVA and Multivariate

Stata Commands              SAS Procedures
ttest                       TTEST
oneway; anova               ANOVA
glm; manova                 GLM; CATMOD; GENMOD
factor; pca                 FACTOR; PRINCOMP
correlate; pwcorr; alpha    CORR
canon                       CANCORR
cluster                     CLUSTER

* ANCOVA is conducted by .anova in Stata, but by GLM in SAS and SPSS.

2.3.5 Nonparametrics and Others

Stata Commands               SAS Procedures
ksmirnov; kwallis; ranksum   NPAR1WAY
tab; tabi; kappa             FREQ
                             DISCRIM
matrix                       IML (SAS/IML)

2.4 Operators, Wildcards, System Variables

This section summarizes operators, wildcards, and system variables.

2.4.1 Operators

Types        Operators
Arithmetic   + (addition), - (subtraction), * (multiplication), / (division), ^ (raise to a power)
Relational   > (greater than), >= (greater than or equal), < (less than), <= (less than or equal), == (equal), != or ~= (not equal)
Logical      & (and), | (or), ~ (not)
Others       = (assignment), + (string concatenation), L#.variable (backward shift for time series data)

2.4.2 Wildcards and Other Symbols

Symbol     Meaning                                                  Examples
*          Any characters                                           re*
?          Any character                                            measure?
~          Zero or more characters                                  mil~um
-          Specifying range of variables                            gender-rank
/          Specifying range of observations (in the in qualifier)   in 1/100
//         Comments (Programming)                                   // to explain
///        Join the next line with the current line in do and       .regress y x1 x2 x3, ///
           ado programs (Programming)                                   beta robust // options
||         Overlapping graphs (Graphics)                            || scatter …
/*... */   Comments (Programming)                                   /* to explain */

2.4.3 System Variables

Variable        Example        Meaning
_all            _all           All variables
_n              _n             Current observation number
_N              _N             Total number of observations
_coef (or _b)   _coef[cigar]   Coefficient of the variable “cigar”
_se             _se[cigar]     Standard error of the coefficient of the variable
_cons           _b[_cons]      Equal to 1 or the intercept term
_pi             _pi            Value of π
_pred           _pred
_rc             _rc            Return code from the capture command
_skip           _skip

2.5 Functions

The following are lists of major functions that are commonly used.

2.5.1 Mathematic Functions

Functions                Meaning
abs(x)                   Absolute value
sin(x), cos(x), tan(x)   Sine, cosine, tangent
ceil(x), floor(x)        Rounding up and down to an integer
int(x), round(x)         Truncation to an integer and rounding
comb(n, k)               Combinatorial function
exp(x)                   Exponential function
ln(x) or log(x)          Natural logarithm
logit(x), invlogit(x)    Log of the odds and its inverse
max(x), min(x)           Maximum and minimum values
mod(x,y)                 Modulus of x with respect to y
sign(x)                  Sign
sqrt(x)                  Square root
sum(x)                   Running sum

2.5.2 String functions

Functions                 Meaning
char(n)                   Character corresponding to ASCII code n
index(s, key)             Position in s at which key is first found; otherwise zero
length(s)                 The length of a string
lower(s)                  Lowercase string
ltrim(s)                  A string without leading blanks
real(s)                   To convert a string to a number
reverse(s)                A reversed string
rtrim(s)                  A string without trailing blanks
string(n), string(n, s)   To convert a number to a (formatted) string
substr(s, n1, n2)         Substring of s starting at n1 for a length of n2
trim(s)                   String without leading and trailing blanks
upper(s)                  Uppercase string
wordcount(s)              The number of words in a string

* Also see Chapter 9, Section 6, Handling String Variables.

2.5.3 Probability Functions

Functions          Meaning
binorm(h, k, p)    Joint cumulative distribution of bivariate normal
chi2(d, x)         Cumulative chi square distribution
chi2tail(d, x)     Reverse cumulative (upper-tail) chi square distribution
F(d1, d2, f)       Cumulative F distribution
Fden(d1, d2, f)    Probability density function of the F distribution
Ftail(d1, d2, f)   Reverse cumulative (upper-tail) F distribution
norm(z)            Cumulative standard normal distribution
normden(z)         Standard normal density
normden(z, s)      Rescaled standard normal density
tden(d, t)         Probability density function of Student’s t distribution
ttail(d, t)        Reverse cumulative (upper-tail) Student’s t distribution

* Also see Section 2.11, Using the .display Command.

2.5.4 Other Useful Functions

Functions    Meaning
autocode()   Grouping observations
group(#)     Grouping observations
recode()     Grouping observations
uniform()    Uniform pseudo-random numbers
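Functions are typically used inside expressions of commands such as .generate. The following sketch applies a few of the mathematical and string functions listed above to the auto example dataset shipped with Stata (an assumption; it is not the author's data). The new variable names are illustrative, and word(s, n), which returns the n-th word of a string, is used in addition to the functions in the tables.

```stata
* Assumption: the auto example dataset shipped with Stata
sysuse auto, clear
* Mathematical functions
generate lprice = ln(price)
generate price_k = round(price/1000, 0.1)
generate odd_mpg = mod(mpg, 2)
* String functions: first word of make in uppercase, and name length
generate brand = upper(word(make, 1))
generate namelen = length(make)
list make brand namelen price lprice price_k in 1/5
```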

2.6 Data Types

Stata has six different data types, which are grouped into real numbers, integers, and strings. For efficient memory management, use the appropriate type for your data. For example, int and byte are better than float and double if you have five-point Likert scale variables, since the latter types consume more memory than the former.

Keyword   Type          Range
float     Real number   8.5 digits of precision
double    Real number   16.5 digits of precision
int       Integer       -32767 ~ 32740
long      Integer       -2,147,483,647 ~ 2,147,483,620
byte      Integer       -127 ~ 100
str#      String        str1 through str244*

* Intercooled Stata and Small Stata support up to str80.

2.7 Rules of Naming Variables

Naming is the starting point of data analysis. Bad names may bother you frequently during the analysis. Please take enough time to choose good names; it will pay off soon.

(1) Use letters (a through z and A through Z), numbers (0 through 9), or the underscore (_). Do not use special characters, such as space, -, $, #, @, &, and ~.
(2) A variable name should begin with a letter. A number cannot come first. Using an underscore as the first character is not recommended, especially when the variable name is similar to a system variable (i.e., _all, _b, _coef, _cons, _n, _N, _pi, _pred, _rc, _skip, and _se).
(3) Avoid reserved words or keywords, such as byte, double, float, in, int, long, using, with, regress, anova, display, and tabulate.
(4) Variable names need to be meaningful, indicating what the variable is for.
(5) The shorter the better, although Stata allows up to 32 characters.
(6) Use lowercase unless uppercase is necessary or required. Keep in mind that Stata is case-sensitive.
(7) Use group names so that you can take advantage of wildcards (e.g., score? and score1-score9).

Category   Good                      Bad, If not Invalid
(1)        gnp_2002                  gnp of 2002; gnp-2002; gnp#2002; gnp~2002
(2)        score1; gnp_2003          1st_score; 2003_gnp
(2)        interest                  _interest
(3)        gender; education         double; int; using; logit; glm; ttest; tabulate
(4)        invest_2003               x; y; z; xxx; yyy; zmdje; ej93nx6
(5)        rInt_2003                 real_interest_rate_of_in_2003
(6)        income; sales_IBM         INCOME; InCoMe; sales_ibm
(6)        rInt_US2003               rint_us2003; RINT_US2003
(7)        score1; score2; score3…   math, physics, history, management…

2.8 Specifying a Subset of a Dataset (if and in Qualifiers)

The if and in qualifiers specify a subset of a dataset in different ways. The if qualifier selects the observations to which a command is applied by imposing conditions that the observations need to satisfy. You may use the logical operators & (and) and | (or) to combine more than one condition. Consider the following examples.



. sum cigar-kidney if area==1

. list state cigar lung if (area==4) & (lung >= 10)

. regress bladder cigar if (area==2) | (area==3)

The in qualifier specifies the range of observations to which a command is applied. You may use observation numbers or keywords indicating particular observations.

. sum cigar-kidney in f/10
. list state cigar lung in 7
. regress bladder cigar in f/l

The first command returns summary statistics for the first ten observations. The second lists the values of the 7th observation. The third command, which regresses “bladder” on “cigar,” is applied to all observations (from the first through the last), so you may omit “in f/l.”

Keywords    Example             Meaning
n           in 10               The 10th observation
-n          in -10              The 10th observation from the last
1 (or f)    in 1/10; in f/10    From the first observation through the 10th
-1 (or l)   in 15/-1; in 15/l   From the 15th observation through the last
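The two qualifiers can be compared side by side on a dataset that every installation has. The sketch below assumes the auto example dataset shipped with Stata, so the variables (price, mpg, rep78, foreign) are not those of the author's cancer data; the .count command is added here only to show how many observations each qualifier selects.

```stata
* Assumption: the auto example dataset shipped with Stata
sysuse auto, clear
* if qualifier: conditions on values
summarize price-mpg if foreign==1
list make price mpg if (foreign==1) & (mpg >= 30)
regress price mpg if (rep78==3) | (rep78==4)
* in qualifier: positions in the current sort order
summarize price mpg in f/10
list make price mpg in 7
list make price in -5/l
* count reports how many observations a qualifier selects
count if foreign==1
count in -5/l
```

Because the in qualifier depends on the current order of the observations, its selection changes after sorting, whereas the if qualifier selects the same observations regardless of order.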

However, you may not list more than one observation number without the / operator, nor specify observation numbers together with a range of observations. Accordingly, the following commands do not work at all.

. list state cigar lung in 7 9 18     (invalid command)
. regress bladder cigar in 7 9/-5     (invalid command)

2.9 Repeating a Command on Groups (.bysort and .by)

You may wish to run the same command on each group instead of the entire dataset. Let us get the summary statistics (e.g., mean and standard deviation) of the variables “cigar” and “lung” in each area.

. sum cigar lung if area==1
. sum cigar lung if area==2
. sum cigar lung if area==3
. sum cigar lung if area==4

This approach works, but it becomes burdensome when there are many groups. This is why the .bysort (or .bys) and .by commands are needed. The .bysort command repeats a Stata command on each group without the if qualifier. The group variable needs to be sorted beforehand. There are three equivalent ways of repeating a command group by group.

. bysort area: sum cigar lung
_______________________________________________________________________________
-> area = 1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       cigar |         8    27.94625    2.297881      23.78       31.1
        lung |         8    21.72375    4.262283      12.11      25.95
_______________________________________________________________________________
-> area = 2

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       cigar |        12    23.70667    2.762431      19.96      27.91
        lung |        12    18.31667     3.68153      12.12       22.8



...

The .bysort command first sorts the data by the variable “area” in ascending order, and then repeats the command on each group. Note that a colon (:) separates the .bysort or .by prefix from the command to be repeated.

. by area, sort: sum cigar lung

The .by command above gives the same result as the first command. You may omit the sort (or s) option if you sort the data separately as follows.

. sort area
. by area: sum cigar lung

You can also run various analysis commands with the .bysort and .by commands.

. bys gender: regress income education age occupation
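The three equivalent forms can be compared side by side in a do-file. The sketch below assumes Stata's bundled auto.dta example dataset rather than the chapter's data; foreign plays the role of the group variable, and price, mpg, and weight are simply variables that happen to exist in that file.

```stata
* Assumption: auto.dta stands in for the chapter's data; foreign is the group variable
sysuse auto, clear
bysort foreign: summarize price mpg     // sorts by foreign, then repeats summarize
by foreign, sort: summarize price mpg   // equivalent: .by with the sort option
sort foreign                            // sort separately ...
by foreign: summarize price mpg         // ... so the sort option can be omitted
by foreign: regress price mpg weight    // an estimation command repeated per group
```

All three summarize lines should print one block of output per value of foreign.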

However, not every Stata command can be used with the .bysort and .by commands. The .sktest command, for example, cannot be combined with them. When writing a .do or .ado program, you may need to repeat a set of commands. See Chapter 5, Stata Programming, for looping commands (i.e., .while, .forvalues, .foreach) and the .if command.

2.10 Using Explicit Subscripts

You may wish to refer to individual observations of a variable. For example, “What is the value of cigar in the tenth observation?” You may add a subscript enclosed in brackets to a variable name as follows.

. display cigar[10]
20.1

The first command below creates the variable “cigar2” and copies the value of “cigar” in the 10th observation into it. That is, the single value 20.1 is copied to “cigar2” for all observations. The second command differs from the first in that it copies each observation's own value of “cigar” to a new variable “cigar3.”

. generate cigar2=cigar[10]
. generate cigar3=cigar

Explicit subscripts work together with the system variable _n, which holds the number of the current observation. You can easily create a variable that contains observation numbers using _n, and it is straightforward to generate a lagged variable if you use _n-1 as a subscript.

. gen serial=_n
. gen cigar_lag=cigar[_n-1]

However, subscripts that are zero, negative, or larger than _N (i.e., the total number of observations) result in a missing value.

2.11 Using the .display Command



The .display (or .di) command displays strings and values of various expressions. The first example below displays the values of system variables _pi and _cons.

. display _pi " and " _cons
3.1415927 and 1

The .display command can list values of variables using explicit subscripts mentioned in the previous section.

. display "The Cigar Consumption of " state[12] " State is " cigar[12]
The Cigar Consumption of IN State is 26.18
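Explicit subscripts, _n, _N, and .display can be combined in a few lines. The following is a hedged sketch on Stata's bundled auto.dta example dataset (an assumption; the variables make and price replace state and cigar), and the new variable names serial and price_lag are chosen only for illustration.

```stata
* Assumption: auto.dta stands in for the chapter's data
sysuse auto, clear
display "The price of " make[1] " is " price[1]   // values in the 1st observation
generate serial = _n               // observation number: 1, 2, ..., _N
generate price_lag = price[_n-1]   // previous observation's price
display price_lag[1]               // subscript 0 for the first row, so missing (.)
display price[_N+1]                // beyond the last observation, so missing (.)
```

The last two lines illustrate the rule that an out-of-range subscript yields a missing value instead of an error.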

You can use Stata as a calculator with the .display command. The following three examples return 78.5, .02210445, and .44271887, respectively.

. display 5*5*3.14
. display (1.3)^(1/12)-1
. di (6.4-5.0)/sqrt(10)

You can also get p-values without referring to probability distribution tables. See Section 5 for detailed probability functions.

. di norm(1.96)
. di (1-norm(1.96))*2

The norm(z) function returns the cumulative probability of the standard normal distribution. So the second command gives you the p-value of the z score 1.96 for a two-tail test. The above commands return .9750021 and .04999579, respectively.

. di ttail(20, 2.086)
. di ttail(20, 2.086)*2

The ttail(df, t) function returns the reverse cumulative (upper-tail only) Student’s t distribution. Because ttail() is already an upper-tail probability, the second command simply doubles it to give the p-value of the t value 2.086 with 20 degrees of freedom for a two-tail test. The above commands give you .02499818 and .04999636, respectively.

. di F(5, 10, 3.325)
. di Ftail(5, 10, 3.325)
. di Ftail(5, 10, 3.325)*2

The F(df1, df2, F) function returns the cumulative F distribution, while Ftail(df1, df2, F) returns the reverse cumulative (upper-tail only) F distribution. Note that the F is uppercase and that the first and second degrees of freedom are those of the numerator and denominator, respectively. Thus, the third command gives you the p-value of the F value 3.325 for a two-tail test. The three commands above return .94996612, .05003388, and .10006777, respectively.

. disp chi2(10, 18.307)
. disp chi2tail(10, 18.307)

Similarly, chi2(df, c) returns the cumulative chi-square distribution, while chi2tail(df, c) gives you the reverse cumulative (upper-tail) chi-square distribution. The commands give you .94999941 and .05000059, respectively.

2.12 Using the .format Command



The .format command specifies the format in which variables are displayed. This command does not affect the actual values of variables. When a variable is copied, its format is also copied. You may check the current display format of each variable by executing the .describe command, which shows variable names, types, formats, and labels.

. describe

In general, a format begins with % followed by a number (the total width), a period, a number (the number of digits after the decimal point), and letters indicating the type of format. Let us put commas in a variable in order to make its numbers more readable. In the following example, the first number “10” indicates the total width in characters, including the decimal point, while the second “2” sets the number of digits after the decimal point. The letters “f” and “c” mean “fixed format” and “comma format,” respectively.

. format gnp2 gdp2 %10.2fc
. list gnp gnp2 gdp gdp2

     +-----------------------------------------------+
     |       gnp        gnp2         gdp        gdp2 |
     |-----------------------------------------------|
  1. |  1600.929    1,600.93     3420.02    3,420.02 |
  2. |  251.0714      251.07    3559.387    3,559.39 |
  3. |       469      469.00    3569.177    3,569.18 |
  4. |  227.7857      227.79    3910.404    3,910.40 |
  5. |  339.8571      339.86    4649.005    4,649.00 |
     ...

Note that “gnp” and “gdp” are displayed in their default format. The following is an example of a numeric format without any digits after the decimal point.

. format l* %5.0f

If you wish to pad numbers with leading zeros, add “0” right after the %. Note that the wildcard * and the range operator - are used to list variables efficiently.

. format cigar-kidney %010.2f

Now, you may want string variables to be left-justified. Use “-” and “s” to indicate “left-justified format” and “string format,” respectively. Again, the “15” indicates the total number of characters to be displayed.

. format last_name first_name %-15s

You may take the “-” out in order to get back to the default right-justified format. But do not use “+.”

. format last_name first_name %15s
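The numeric and string formats described in this section can be checked quickly in a do-file. This is a sketch under the assumption that Stata's bundled auto.dta example dataset is used in place of the chapter's data; price is a numeric variable and make a string variable in that file, and the width 18 is an arbitrary choice.

```stata
* Assumption: auto.dta stands in for the chapter's data
sysuse auto, clear
format price %10.2fc        // width 10, two decimals, comma separators
format make %-18s           // left-justified string, 18 characters wide
list make price in 1/3
format price %010.2f        // same width, padded with leading zeros
format make %18s            // back to right-justified
list make price in 1/3
describe make price         // the format column shows the current formats
```

Only the display changes between the two .list commands; the stored values of price are untouched.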

For detailed formats, run the .help format command.

2.13 Handling Missing Values



The missing value of a numeric variable is denoted by a single period (.). In a string variable, a missing value is expressed as “”. Any arithmetic operation on a missing value results in a missing value. You may wish to exclude missing values using the if qualifier. Because Stata treats a missing value as larger than any number, you may ask whether a variable is less than a period (.).

. sum cigar lung kidney if cigar<.
. list cigar lung kidney if (cigar==.) | (lung>.) | (kidney>=.)

The first command above produces summary statistics of those observations whose variable “cigar” is not missing. The second lists the values of the three variables when any one of them satisfies its missing-value test. Note that the three relational operators are not fully equivalent: ==. matches only the system missing value (.), >. matches only the extended missing values (.a through .z), and >=. matches both, so >=. is the safest test.

You may want to detect observations that have missing values in any of the variables specified. The .mark and .markout commands are useful for flagging such observations. The former creates a dummy (marking) variable to be used by the latter. The .markout command sets the marking variable created by the .mark command to 0 if an observation has a missing value in any of the listed variables.

. mark yn_miss // to create a marking variable (dummy)
. markout yn_miss cigar lung kidney
. tab yn_miss, missing // to double-check flagging marks
. drop if yn_miss==0 // to drop observations with missing values

2.14 Using Comments

Using comments in Do-files is very useful for documenting the files. Comments can also be used to debug Do-files. Stata offers three ways of commenting. An asterisk (*) and a double slash (//) put a comment on a single command line, while /*…*/, as in C and Java, can enclose multiple lines of comments. No command should come before *, whereas // may follow a command on the same line; // does not affect the command preceding it. If an asterisk is placed in front of a command, Stata simply ignores the command. Consider the following examples.

. * This document is for statistical and econometric data analyses.

. recode year (1 2=0 ) (3 4=1), gen(class) // recoding to low and upper classes

. *recode year (1 2=0 ) (3 4=1), gen(class)

Note that // does not work in the command window (interactive mode); it works only in a Do-file. Sometimes you may wish to include information longer than a single line in a Do-file. As in SAS, /* … */ serves this purpose. Any command or comment between /* and */ is ignored when Stata executes the Do-file.

/* This Do-file is to recode several key variables (recode_01142004.do)
   Date: Wednesday, January 14, 2004 */
use project003.dta,clear



...
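The three comment styles can be seen together in one short do-file, here wrapped around the .mark and .markout commands from Section 2.13. This is a sketch only: it assumes Stata's bundled auto.dta example dataset instead of the chapter's data, and the marking variable name complete is hypothetical.

```stata
* Sketch of the three comment styles in a do-file
/* Assumption: auto.dta stands in for the chapter's data;
   its variable rep78 contains some missing values. */
sysuse auto, clear
mark complete                      // create the marking variable (all 1s)
markout complete price mpg rep78   // set to 0 where any listed variable is missing
tab complete, missing              // check the flags
*drop if complete==0
count if complete==0               // how many observations would be dropped
```

The .drop line is commented out with an asterisk, so Stata skips it; removing the asterisk would actually drop the flagged observations.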

2.15 Macros (.global and .local)

Like SAS/Macro, Stata macros let you use programming variables in do and ado programs. As such, users can reduce human errors and tedious typing. A macro consists of a macro name and its content. Once a macro is called, its content is substituted for the macro name. A macro can hold a string or a number. Macros are either global or local.

. global js = 625 // to declare a numeric macro
. local fruit="Grape Pear Apple" // to declare a string macro

Local macros, which are used in most cases, exist only within the program or module in which they are declared. A local macro is called by enclosing its name in a left quote and a right quote. Note that the left quote is typed with the ` key (the same key as the tilde), while the right quote is the ordinary apostrophe ('). A global macro needs a $ in front of the macro name. You may use {} to clarify meaning or form nested constructions.

. di "`fruit'"
. local fruit=$js
. gen ph$js=id // equal to ph625=id

Macros are used in both expressions and commands.

. local LHS "gnp"
. local RHS "interest consume inflate"
. regress `LHS' `RHS'

If you want to see the list of macros declared, type in the .macro list command. The .macro drop command removes the macro specified.

. macro list
. macro drop fruit
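The pieces of this section fit together as in the do-file sketch below. It assumes Stata's bundled auto.dta example dataset in place of the chapter's data, and the macro name cutoff and its value 6000 are hypothetical choices made only for illustration. Run it as a whole, since local macros exist only within the do-file in which they are declared.

```stata
* Assumption: auto.dta stands in for the chapter's data
sysuse auto, clear
global cutoff = 6000                       // global macro, called as $cutoff
local LHS "price"                          // local macros, called as `LHS' and `RHS'
local RHS "mpg weight"
display "Model: `LHS' on `RHS'"            // quotes needed: the content is a string
regress `LHS' `RHS'                        // expands to: regress price mpg weight
generate high${cutoff} = price > $cutoff   // braces mark where the macro name ends
macro list                                 // show the macros currently defined
macro drop cutoff                          // remove the global macro
```

After the last line, $cutoff expands to nothing, while the variable high6000 created earlier remains in the dataset.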

2.16 Looping (.foreach and .forvalues)

If you wish to repeat a set of commands, take advantage of looping structures, such as the .foreach, .forvalues, and .while commands (see footnote 3). The .foreach command executes a set of commands enclosed in braces for each element of the macro, variables, or numbers specified. Let us list numbers from 1 to 100. The numlist keyword indicates that what follows is a list of numbers. You may also list all numbers as 1 2 3 4 … 100.

. foreach n of numlist 1/100 {
    disp `n'
}

Alternatively, you may benefit from using the .forvalues command if repetition is determined not by macros or variables but by numbers. The following .forvalues command works in exactly the same manner as the above.

. forvalues n= 1/100 { // from 1 through 100 in step of 1
    disp `n'
}

3 The usage of the .foreach command is quite similar to that in PHP and Perl.
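To confirm that the two loops visit the same numbers, one can accumulate a running total in a local macro instead of displaying each value. This is an illustrative do-file sketch; the macro names total_each and total_values are hypothetical.

```stata
* Sum the integers 1 through 100 with each looping command
local total_each = 0
foreach n of numlist 1/100 {
    local total_each = `total_each' + `n'
}

local total_values = 0
forvalues n = 1/100 {
    local total_values = `total_values' + `n'
}

display `total_each'      // 5050
display `total_values'    // 5050
```

Both totals equal 100*101/2 = 5050, which shows that 1/100 is read the same way by .foreach with numlist and by .forvalues.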



You may use three alternative ways of specifying the range of numbers, all of which can equally be used in the .foreach command. Note that .forv is an abbreviation of .forvalues.

forvalues n= 1(1)100 { // from 1 through 100 in step of 1
forv n= 1 2 : 100 { // from 1 through 100 in step of 2-1
forv n= 1 2 to 100 { // from 1 through 100 in step of 2-1

Let us move on to variables. The following commands produce identical results.

foreach var in cigar bladder lung kidney leukemi {
    sum `var' if area==1
}

foreach var of varlist cigar-leukemi {
    sum `var' if area==1
}

What is the difference between the in subcommand and the of subcommand? The former is a general way of listing values, variables, or macros, while the latter must specify the type (i.e., global, local, varlist, newlist, or numlist) of the argument. Thus, varlist in the second command should not be omitted. Consider the following commands that create five random variables.

foreach var of newlist random1-random5 {
    gen `var' = uniform()
}

foreach var in random1 random2 random3 random4 random5 {
    gen `var' = uniform()
}

Note that in the first command the newlist keyword, which creates new variables, cannot be omitted, and that the shorthand “random1-random5” is not allowed in the second command. Now, it is time for macros. The following three commands produce identical results. The double quotes around `str' cannot be omitted since the values are strings.

local fruit "Grape Pear Apple"

foreach str of local fruit {
    di "`str'"
}

foreach str in `fruit' {
    di "`str'"
}

foreach str in "Grape" "Pear" "Apple" {
    di "`str'"
}

Note that the macro name in the first command is not enclosed in single quotes, whereas it is in the second command. For information about the .while loop command and the .if conditional command, see the chapter on Stata programming.

2.17 Using Operating System Commands



The following table summarizes the useful operating system commands available in Stata.

Command           Meaning                                      Examples
.cd (or pwd)      to change a directory                        . cd ..\data
                                                               . cd ~/data
.copy             to copy files                                . copy a.dta b.dta
.dir (or ls)      to list directories and files                . dir *.dta
                                                               . ls ~/data/*.do
.erase (or .rm)   to remove files                              . erase ..\temp.dta
                                                               . rm ../data/temp.dta
.mkdir            to create a directory                        . mkdir cancer
.shell            to invoke the operating system temporarily   . shell
.type             to view contents of a text file              . type cancer.dct

* Note that .pwd and .rm work only in Stata for Mac OS and UNIX.
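Several of these commands can be chained in a do-file to set up and clean a scratch folder. The sketch below rests on a few assumptions: the directory name scratch_demo is hypothetical and must not already exist, the working directory must be writable, Stata's bundled auto.dta supplies a file to copy, and .rmdir (a Stata command that is not listed in the table) is used to remove the empty directory at the end.

```stata
* Assumptions: scratch_demo does not exist yet; the working directory is writable
mkdir scratch_demo          // create a directory
cd scratch_demo             // move into it
sysuse auto, clear          // bundled example data, used only to have a file to save
save a.dta                  // write a.dta in the new directory
copy a.dta b.dta            // copy the file
dir *.dta                   // list both .dta files
erase a.dta                 // remove the files again
erase b.dta
cd ..                       // back to the original directory
rmdir scratch_demo          // remove the now-empty directory (not in the table)
```

Using .erase instead of .rm keeps the sketch portable, since .erase is the form shown in the table with a Windows-style path.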
