Being Productive with Stata and VA Data
Give me six hours to chop down a tree and I will spend the first four sharpening the axe.
--Abraham Lincoln
Todd Wagner
August 2008
Outline
Database manipulation in Stata
Data analysis in Stata
Working Interactively and .do files
You can issue commands directly into the command line.
Unless you save your commands into a batch file (a .do file), you’ll lose your code once you close Stata.
I often work interactively and then save the “right” commands in a do file.
Editing a .do file in Stata
Any ASCII text editor will work
Stata has a built in text editor, but it is limited.
I recommend exploring your options
http://fmwww.bc.edu/repec/bocode/t/textEditors.html
Handling Data
SAS processes one record at a time
Stata processes all the records at the same time
– Loops are commonly used in SAS
– Loops are very rarely used in Stata
Loading Data into Memory
Stata reads the data into memory
– set mem 100m (before you load the data)
You must have enough memory for your dataset
With large datasets:
– drop unnecessary variables
– Use the compress command (but don’t compress SCRSSN)
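A minimal sketch of the large-dataset steps above, using a made-up five-record dataset. Every variable name other than scrssn is hypothetical, and the scrssn values are invented. Stata 12 and later manage memory automatically and ignore set mem, so that line matters only on older releases.

```stata
* Toy data standing in for a large VA extract (all values are made up)
clear
set mem 100m                       // older Stata only; newer releases ignore it
set obs 5
gen double scrssn = 100000000 + _n // patient identifier, kept as double
gen long   visits = _n
gen double cost   = _n * 10
gen str6   notes  = "unused"       // hypothetical variable we do not need

* Drop what you do not need
drop notes

* Compress everything except scrssn
ds scrssn, not                     // lists all variables other than scrssn
compress `r(varlist)'
describe
```

Here ds with the not option builds the list of variables to compress, so scrssn keeps its original storage type.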
Stata Abbreviations
Many Stata commands can be abbreviated, often to the first three letters
– regress income education female
could be written
– reg income education female
Stata Help
Stata’s built in help is great
– help <command>
Stata manuals are great because they review theory
Stata and the Web
Stata is “web aware”
Check for updates periodically
– update all
You can search for user-written programs
– findit output
– findit outreg (click to install)
Stata in Windows
Page up scrolls through the previous commands
There is a graphical user interface (menus) if you forget a command
In Unix, you can use all of Stata’s functionality if you use X Windows (e.g., Cygwin).
Sysdir, ls and cd
Stata recognizes some unix commands, such as ls and cd
Sysdir provides a listing of Stata’s working directories
sysdir
STATA: C:\Program Files\Stata9\
UPDATES: C:\Program Files\Stata9\ado\updates\
BASE: C:\Program Files\Stata9\ado\base\
SITE: C:\Program Files\Stata9\ado\site\
PLUS: c:\ado\stbplus\
PERSONAL: c:\ado\personal\
OLDPLACE: c:\ado\
Store your data on a VA server, not on your PC or laptop!
Delimiters
SAS recognizes “;” as a delimiter
Stata recognizes the carriage return
– Always add a carriage return after your last command
You can change delimiters to ;
#delimit ;
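A short sketch of switching delimiters inside a .do file (#delimit has no effect at the interactive command line). It assumes the auto example dataset that ships with Stata; the regression itself is only there to show a command spanning two lines.

```stata
* Run this from a .do file
sysuse auto, clear

#delimit ;
* With ; as the delimiter, one command can span several lines ;
regress price mpg weight
    foreign ;
#delimit cr

* Back to carriage-return delimiting
display e(N)
```

Switching back with #delimit cr keeps the rest of the file in the usual one-command-per-line form.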
Missing Data
Stata and SAS both use “.” as missing
Stata implicitly values a missing as a very large number
SAS implicitly values a missing as a very small number
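Because Stata treats missing as larger than any number, a "greater than" test is true for missing values. A minimal sketch with a hypothetical age variable and made-up values:

```stata
clear
input age
45
70
.
end

* Missing counts as "very large", so the third record is flagged too
gen old_wrong = age > 65

* Excluding missing values leaves the flag missing for that record
gen old_right = age > 65 if !missing(age)

list
```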
Generating and Recoding Variables
In SAS you type
quality=0;
If VA=1 then quality=1;
In Stata you type
gen quality=0
recode quality 0=1 if VA==1
or
replace quality=1 if VA==1
Boolean Logic
Stata is picky about Boolean logic
gen y=x if a==b (must use two ==)
gen y=x if a>b & b>10 (must use &)
gen y=x if a<=b (< or > must be before =)
Creating Dummy Variables
Goal: create dummy variable for gender
gen male=sex=="M"
tab sex, gen(sex_)
This second command automatically creates 2 dummy variables
Be careful about missing data: with the first command, missing data are assigned to 0, unless you use “if” or “recode”
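A small sketch of the missing-data caution, using three made-up records in which one has no recorded sex. The names male_wrong and male are hypothetical.

```stata
clear
input str1 sex
"M"
"F"
""
end

* The missing record is silently coded 0 (looks like "not male")
gen male_wrong = sex == "M"

* Adding an if qualifier leaves the missing record as missing
gen male = sex == "M" if sex != ""

* tab, gen() creates one dummy per category and leaves missing records missing
tab sex, gen(sex_)
list
```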
Drop
drop <varnames> (drops variables)
drop if X==1 (drops cases where value is 1)
egen Commands
You want to generate total costs for a medical center
In SAS this is done by proc summary
In Stata, you can type
collapse (sum) cost, by (sta3n)
or
sort sta3n
by sta3n: egen sumcost=total(cost)
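The two approaches give the same totals but leave different datasets behind. A minimal sketch with made-up station numbers and costs:

```stata
clear
input sta3n cost
640 100
640 250
662 75
end

* egen keeps every record and attaches the station total to each one
sort sta3n
by sta3n: egen sumcost = total(cost)
list

* collapse replaces the data in memory with one record per station
collapse (sum) cost, by(sta3n)
list
```

Use egen when you still need the encounter-level records, and collapse when you want a station-level file.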
ICD-9 Codes
Stata has capabilities to handle ICD-9 diagnosis and procedure codes
You can
– check to see if codes are valid
– generate identifiers based on codes or ranges of codes
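A hedged sketch of the two tasks above using Stata's icd9 command for diagnosis codes (icd9p is the counterpart for procedure codes). The variable names dx and ulcer and the three records are made up for illustration, and the exact options available may differ by Stata release.

```stata
clear
input str6 dx
"250.00"
"707.10"
"401.9"
end

* Report any values that are not valid ICD-9 diagnosis codes
icd9 check dx

* Flag records whose code starts with 707 (chronic ulcer of skin)
icd9 generate ulcer = dx, range(707*)
list
```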
Combining Data
Merge
– this automatically creates a variable called _merge
– _merge==1 obs. from master data
– _merge==2 obs. from only one using dataset
– _merge==3 obs. from at least two datasets, master or using
merge scrssn admitday disday using data_y
Append (stacking data)
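A minimal sketch of what _merge looks like after a match merge, with two made-up person-level files. It uses the merge 1:1 syntax introduced in Stata 11; under older releases you would sort both files and type merge scrssn using as on the slide. The file contents are hypothetical.

```stata
* Using dataset: made-up cancer cases
clear
input scrssn cancer
1 1
3 1
end
tempfile neo
save `neo'

* Master dataset: made-up ulcer patients
clear
input scrssn ulcer
1 1
2 1
end

merge 1:1 scrssn using `neo'

* scrssn 1 is in both files, 2 only in master, 3 only in using
tab _merge
list
```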
Explicit Subscripting
Identify the most recent encounter in an encounter database
gsort id -date (ascending sort by ID and reverse by date)
by id : gen n=_n (record counter from 1 to N per person)
by id : gen N=_N (total number of records per person)
gen select=n==1
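The same steps on five made-up encounters for two people, so you can see which record gets selected. The dates are arbitrary Stata date values.

```stata
clear
input id date
1 17000
1 17400
2 16900
2 17100
2 17050
end
format date %td

gsort id -date           // newest encounter first within each person
by id: gen n = _n        // 1 = most recent record
by id: gen N = _N        // number of records for the person
gen select = n == 1

list, sepby(id)
```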
Set, Clear and More
Set: sets system parameters
– Need to set memory size to open a database
set mem 100m
Clear erases data from memory
When output is >1 page, you are asked to continue (set more off)
Summarizing Data
. sum gender age educ

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      gender |      4085    1.496206    .5000468          1          2
         age |      4085     64.5601    9.451724         50         94
        educ |      4085    4.398286    1.662883          1          9

sum <varname>, d provides more details on each variable
Tabstat provides summary info, including totals
Tabulating Data
. tab gender

     gender |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      2,058       50.38       50.38
          2 |      2,027       49.62      100.00
------------+-----------------------------------
      Total |      4,085      100.00

. table gender

----------------------
   gender |      Freq.
----------+-----------
        1 |      2,058
        2 |      2,027
----------------------
Tabulating Data
tab gender age
too many values
r(134);

tab age gender

           |        gender
       age |         1          2 |     Total
-----------+----------------------+----------
        50 |        49         69 |       118
        51 |        72         71 |       143
…
        94 |         1          0 |         1
-----------+----------------------+----------
     Total |     2,058      2,027 |     4,085
Tabstat
. tabstat age, by (gender)

  gender |      mean
---------+----------
       1 |  64.77454
       2 |  64.34238
---------+----------
   Total |   64.5601
--------------------

. table gender, c(mean age)

----------------------
   gender |  mean(age)
----------+-----------
        1 |   64.77454
        2 |   64.34238
----------------------
Graphing
Diagnostic graphics
Presenting results
[Figure: density of wtp, one panel per stage (stage 1 through stage 5)]
Basic Analytical Functions
OLS (reg)
Logistic, probit, count data (e.g., CLAD)
Multinomials
GLM/HLM
Duration models
Semi and non-parametric models
Creating Publishable Tables
Outreg command
Outputs data to a delimited file
Delimited file can be read into Excel
Very flexible
Creates publishable tables easily
Becaplermin
June 2006, FDA issued a Boxed Warning for becaplermin (a treatment for lower extremity diabetic ulcers)
Warning raised potential risk of cancer related mortality
Analytical Goal
Case-control study for becaplermin
Sample is all patients with a diabetic ulcer of the lower extremity
Exposure is quantity of becaplermin prescriptions
Multivariate analysis, stratifying for patients with prior history of cancer
Pulling VA Data
VA utilization data extracts reside in SAS. I extracted my sample using SAS and then moved the data into Stata.
VA Data:
– Sample: All encounters with a diabetic ulcer principal diagnosis in NPCD and PTF (FY02-07)
– Exposure: All prescriptions from DSS pharmacy FY02-07 for Becaplermin feeder code
– Outcome: All encounters with a neoplasm principal diagnosis (FY97-07)
Transferring Data
Stat/Transfer or DBMS/Copy work
Stat/Transfer often seeks to optimize the Stata dataset by default
– If transferring data with SCRSSN, FORCE Stat/Transfer to transfer SCRSSN as double precision
– http://www.stata.com/support/faqs/data/prec.html
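Why double precision matters for an identifier: a float holds only about seven digits exactly, so a nine-digit ID can change value when stored as float. A minimal sketch using an invented nine-digit number (not a real SCRSSN); the variable names are hypothetical.

```stata
clear
set obs 1

* The same nine-digit number stored two ways
gen float  id_float  = 123456789
gen double id_double = 123456789

format id_float id_double %12.0f
list    // id_float shows 123456792, id_double shows 123456789
```

Two different people can end up with the same float value, which silently breaks sorts and merges on the identifier.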
Diabetic Ulcer Sample
Goal: turn encounter level data into person level data
cd R:\twagner\customer\becap
use ulcer, clear
sort scrssn
by scrssn: gen n=_n
tab n
keep if n==1
keep scrssn
sort scrssn
gen ulcer=1
save finder, replace
Alternative Code
sort scrssn vizday
by scrssn: gen n=_n
by scrssn: gen num_ulcervisits=_N
by scrssn: gen newepisode=vizday[_n]-vizday[_n-1]>60
recode newepisode .=1 if n==1
by scrssn: egen episodes=total(newepisode)
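A worked sketch of the episode count on five made-up visits, assuming a new episode starts when more than 60 days have passed since the previous visit. The gap variable is a hypothetical helper added to make the logic visible.

```stata
clear
input scrssn vizday
1 100
1 120
1 300
2 50
2 55
end

* Visits must be in date order within person for the gaps to mean anything
sort scrssn vizday
by scrssn: gen gap = vizday - vizday[_n-1]   // missing for each first visit

* First visit (missing gap) or a gap over 60 days starts a new episode
gen byte newepisode = missing(gap) | gap > 60
by scrssn: egen episodes = total(newepisode)

list, sepby(scrssn)
```

Person 1 has visits on days 100, 120 and 300, so the 180-day gap starts a second episode; person 2 has a single episode.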
Step 2: Merge Ulcer Sample and Cancer Cases
use neo, clear
gen cancer=1
sort scrssn
merge scrssn using finder
drop if _m==1
Merge command creates a new variable:
_m==1 data only in master data
_m==2 data only in using data
_m==3 data merged in both
drop _m
sort scrssn admitday disday sta3n adtime
by scrssn: egen firstcancer=min(admitday) if cancer==1
gen diedihcan=disto==-2 & cancer==1
gen dod_can=disday if diedihcan==1
gen cancerstays=1 if cancer==1
recode cancerstays .=0
collapse (min) firstcancer (sum) cancerstays (max) diedihcan dod_can cancer, by (scrssn)
sort scrssn
save diabcancer, replace
Step 2: continued
Merge in Exposure data
use becap, clear
gen numrx=1
sort scrssn svc_dte
by scrssn: egen firstbecap=min(svc_dte)
by scrssn: egen lastbecap=max(svc_dte)
collapse (min) firstbecap (max) lastbecap (sum) day_supply numrx , by (scrssn)
gen becap=1
sort scrssn
save becapsum, replace
use diabcancer
merge scrssn using becapsum