TA: Natalia Shestakova October, 2007
Labor Economics
Exercise session # 1
Artificial Data Generation
Overview
Generating random variables
Graphing
Throwing seeds
Generating random dummy variables from sample
Drawing from multivariate distributions
Loops and distribution of estimated coefficients
Generating random variables - 1
Random-number functions:
uniform() returns uniformly distributed pseudorandom numbers on the interval [0,1). uniform() takes no arguments, but the parentheses must be typed.
invnormal(uniform()) returns normally distributed random numbers with mean 0 and standard deviation 1.
Reminder:
Discrete uniform distribution: all values of a finite set of possible values are equally probable. Continuous uniform distribution: all intervals of the same length are equally probable.
Normal distribution: a family of continuous probability distributions. Each member of the family is defined by two parameters, location and scale: the mean ("average") and the standard deviation ("variability"), respectively.
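The expression invnormal(uniform()) is an instance of inverse-transform sampling: a uniform draw on [0,1) is passed through the inverse of the standard normal CDF. The sketch below makes the two steps explicit and, as an assumption about the Stata version in use (10.1 or later), compares the result with the built-in rnormal(). The seed value and the variable names u, z and z_alt are illustrative only.

```stata
clear
set obs 500
set seed 12345
* step 1: uniform draws on [0,1)
generate u = uniform()
* step 2: the inverse normal CDF turns each u into a standard normal draw
generate z = invnormal(u)
* assumption: Stata 10.1 or later, where rnormal() draws N(0,1) directly
generate z_alt = rnormal()
* u should have mean near 0.5; z and z_alt should have mean near 0 and sd near 1
summarize u z z_alt
```

z and z_alt are different draws, so only their summary statistics are expected to be similar.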
Generating random variables - 2
Examples:
500 draws from the uniform distribution on [0,1):
    set obs 500
    gen x1 = uniform()
500 draws from the standard normal distribution (mean 0, variance 1):
    gen x2 = invnormal(uniform())
500 draws from the distribution N(1,2), i.e. mean 1 and variance 2; the standard normal draw is scaled by the standard deviation, sqrt(2):
    gen x3 = 1 + sqrt(2)*invnormal(uniform())
500 draws from the uniform distribution between 3 and 12:
    gen x4 = 3 + 9*uniform()
500 observations of a variable that is a linear combination of other variables:
    gen z = 4 - 3*x4 + 8*x2
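A quick way to check generated data is to compare sample moments with the theoretical ones. For x4, uniform between 3 and 12, the mean is 7.5 and the variance is 9^2/12 = 6.75. For z = 4 - 3*x4 + 8*x2, with x4 and x2 drawn independently, the mean is 4 - 3*7.5 = -18.5 and the variance is 9*6.75 + 64*1 = 124.75 (standard deviation about 11.17). The sketch below regenerates the variables and prints their summary statistics; the seed is an arbitrary choice made for illustration.

```stata
clear
set seed 2007
set obs 500
* standard normal draw and uniform draw on [3,12)
generate x2 = invnormal(uniform())
generate x4 = 3 + 9*uniform()
* linear combination of the two
generate z = 4 - 3*x4 + 8*x2
* compare the reported means and standard deviations with the values above
summarize x2 x4 z
```

With 500 observations the sample moments will differ somewhat from the theoretical values.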
Graphing
[Figure: four panels. (1) Histogram of x1 on a density scale. (2) Histogram of x1 on a frequency scale. (3) x1 and x2 plotted side by side, vertical axis from -4 to 4. (4) cx1 plotted against x1, both ranging from 0 to 1.]
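The commands behind the four panels are not shown on the slide. The following is a guess at how similar graphs could be produced, assuming x1 and x2 have been generated as in the earlier examples, that the third panel is a dotplot, and that cx1 is the empirical cumulative distribution of x1 created with cumul.

```stata
* panel 1: histogram of x1 on a density scale (the default)
histogram x1
* panel 2: the same histogram with frequencies on the vertical axis
histogram x1, frequency
* panel 3 (assumed): x1 and x2 side by side
dotplot x1 x2
* panel 4 (assumed): empirical cumulative distribution of x1
cumul x1, generate(cx1)
line cx1 x1, sort
```

For uniform draws, the last graph should lie close to the 45-degree line.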
Throwing seeds
=> Setting the seed allows you to regenerate a particular sample at any time:
    set obs 500
    set seed 2
    gen z1 = invnormal(uniform())
    set seed 2
    gen z2 = invnormal(uniform())
    set seed 19840607
    gen z3 = invnormal(uniform())
    dotplot z1 z2 z3
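Besides looking at the dotplot, the effect of the seed can be checked directly. In this sketch, assert stops with an error unless z1 and z2 agree in every observation, and count reports how many observations of z3 happen to equal z1 (none would be expected).

```stata
clear
set obs 500
set seed 2
generate z1 = invnormal(uniform())
* same seed again: the generator restarts from the same state
set seed 2
generate z2 = invnormal(uniform())
* different seed: a different sequence
set seed 19840607
generate z3 = invnormal(uniform())
* passes silently only if z1 and z2 are identical
assert z1 == z2
count if z1 == z3
```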
Generating random dummy variables from sample
Task: generate a variable that indicates whether an individual smokes (smoke=1) or does not smoke (smoke=0).
(a) for period 1, assume that (s)he smokes with probability 30%;
(b) for each of the following 30 periods, there is a 65% chance that a smoker keeps smoking and a 5% chance that a non-smoker starts smoking.
Solution:
(a) Note that a variable uniformly distributed on [0,1) is less than 0.3 with probability 30%. Then:
    gen smoke = uniform()<.3
(b) First, give every individual an ID (pid) and create one observation per individual for each of the following 30 periods (at this point the copies are identical). Then, within each ID, update the smoking status period by period. The probability of smoking is .05+.6*smoke[_n-1], which equals 65% if the individual smoked in the previous period and 5% otherwise:
    by pid: replace smoke=uniform()<(.05+.6*smoke[_n-1]) if _n>1
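The slide leaves out the steps that create the IDs and the repeated observations. One way to fill them in is sketched below, assuming 1000 individuals, an arbitrary seed, and a period counter named period (all three are illustrative choices, not taken from the slide).

```stata
clear
set seed 2007
* assumed sample size: 1000 individuals
set obs 1000
generate pid = _n
* period 1: smoke with probability 0.30
generate smoke = uniform() < .3
* 31 rows per individual: period 1 plus the 30 following periods
expand 31
bysort pid: generate period = _n
* replace works through the rows in order, so smoke[_n-1] already holds
* the updated status of the previous period
by pid: replace smoke = uniform() < (.05 + .6*smoke[_n-1]) if _n > 1
* share of smokers in the first and in the last period
summarize smoke if period == 1
summarize smoke if period == 31
```

With these transition probabilities the long-run share of smokers solves p = .65*p + .05*(1-p), i.e. p = .05/.4 = 0.125, so the share should fall from about 0.30 in period 1 towards roughly 0.125 in later periods.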
Drawing from multivariate distributions
Task: generate several variables that are correlated with each other (i.e. that follow a multivariate distribution).
Solution:
(a) drawnorm: draws a sample from a multivariate normal distribution with the desired means and covariance matrix (given here as a correlation matrix through corr())
    drawnorm x y, n(1000) means(m) corr(C)
(b) corr2data: creates an artificial dataset with a specified correlation structure (the result is not a sample from an underlying population with the specified summary statistics)
    corr2data x y, n(1000) means(m) corr(C)
Note: the matrices m and C can be defined beforehand with the matrix command (abbreviated mat).
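As an illustration of how m and C might be set up, the sketch below uses assumed values (means 0 and 1, correlation 0.5) and an arbitrary seed, then runs both commands and prints the resulting moments.

```stata
clear
* assumed values for illustration: means 0 and 1, correlation 0.5
matrix m = (0, 1)
matrix C = (1, .5 \ .5, 1)
set seed 2007
* random sample: sample means and correlation only approximate m and C
drawnorm x y, n(1000) means(m) corr(C)
summarize x y
correlate x y
* clear drops the data but keeps the matrices m and C
clear
* constructed data: the requested means and correlation are matched
corr2data x y, n(1000) means(m) corr(C)
summarize x y
correlate x y
```

Comparing the two sets of output shows the difference between the commands: the drawnorm sample has sampling error, while the corr2data dataset is built to reproduce the specified statistics.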
Loops and distribution of estimated coefficients
Why use loops?
-> a single randomly drawn sample is unlikely to reproduce the true coefficient exactly
-> drawing many samples, estimating the coefficient of interest in each of them, and averaging these estimates brings the result closer to the true value
How to use loops?
    set more off                /* so we do not have to hit enter every time the regression runs */
    set obs 500                 /* b1 needs one observation per repetition */
    gen b1=0                    /* all observations of b1 are assigned the value 0 */
    local i=1                   /* i is the counter in the following loop */
    while `i'<=500 {            /* start a loop of 500 repetitions */
        capture drop x1 y       /* drop the previously generated variables (names assumed here) so they can be generated again; drop _all would also delete b1 */
        /* generate random variables */
        /* regression */
        scalar d = _b[x1]       /* store the estimated coefficient on x1 in a scalar */
        replace b1 = scalar(d) if _n==`i'   /* put the coefficient from the ith regression into the ith observation of b1 */
        local i=`i'+1           /* add 1 to the counter */
    /* end of the loop */
    }
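The loop on the slide leaves the data-generating step and the regression as placeholders. A complete version might look like the sketch below. The model y = 1 + 2*x1 + e, the variable names e and y, the sample size and the seed are all assumptions made for illustration.

```stata
clear
set more off
set seed 2007
* 500 observations: used both as sample size and as storage for 500 estimates
set obs 500
generate b1 = .
local i = 1
while `i' <= 500 {
    * remove the previous sample but keep b1
    capture drop x1 e y
    * assumed data-generating process with true slope 2
    generate x1 = uniform()
    generate e = invnormal(uniform())
    generate y = 1 + 2*x1 + e
    quietly regress y x1
    * store the slope from repetition i in observation i of b1
    quietly replace b1 = _b[x1] in `i'
    local i = `i' + 1
}
* distribution of the 500 estimated slopes
summarize b1
histogram b1
```

Under these assumptions the mean of b1 should be close to the true slope of 2, and the histogram shows how much a single estimate can deviate from it.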
Any questions???