TA: Natalia Shestakova October, 2007 Labor Economics Exercise session # 1 Artificial Data Generation


TA: Natalia Shestakova October, 2007

Labor Economics

Exercise session # 1

Artificial Data Generation


Overview

Generating random variables

Graphing

Throwing seeds

Generating random dummy variables from sample

Drawing from multivariate distributions

Loops and distribution of estimated coefficients

Generating random variables-1

Random-number functions:

uniform() returns uniformly distributed pseudorandom numbers on the interval [0,1). uniform() takes no arguments, but the parentheses must be typed.

invnormal(uniform()) returns normally distributed random numbers with mean 0 and standard deviation 1. invnorm() is the older name of invnormal().

Reminder:

Uniform distribution: in the discrete case, all values of a finite set of possible values are equally probable; in the continuous case, all intervals of the same length are equally probable.

Normal distribution: a family of continuous probability distributions. Each member of the family is defined by two parameters, location and scale: the mean ("average") and the standard deviation ("variability"), respectively.

Generating random variables-2

Examples:

500 draws from the uniform distribution on [0,1)
set obs 500
gen x1 = uniform()

500 draws from the standard normal distribution (mean 0, variance 1)
gen x2 = invnorm(uniform())

500 draws from the normal distribution with mean 1 and standard deviation 4, i.e. N(1,16)
gen x3 = 1 + 4*invnorm(uniform())

500 draws from the uniform distribution between 3 and 12
gen x4 = 3 + 9*uniform()

500 observations of a variable that is a linear combination of other variables
gen z = 4 - 3*x4 + 8*x2
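The shift-and-scale rule behind these examples can be checked against sample moments. The sketch below assumes a larger sample (10000 observations) and an arbitrary seed, neither of which comes from the slides. If the rule holds, x4 = 3 + 9*uniform() should stay within [3,12) with a mean close to 7.5, and x3 = 1 + 4*invnormal(uniform()) should have a mean close to 1 and a standard deviation close to 4.

```stata
* Sketch only: the seed and the sample size are assumptions, not from the slides
clear
set seed 2007
set obs 10000

* a + (b-a)*U is uniform on [a,b) when U is uniform on [0,1)
gen x4 = 3 + 9*uniform()

* mu + sigma*Z has mean mu and standard deviation sigma when Z is standard normal
gen x3 = 1 + 4*invnormal(uniform())

* compare the sample moments with the theoretical ones
summarize x4 x3
```

The multiplier applied to the standard normal draw is the standard deviation, not the variance, so a target variance v needs the multiplier sqrt(v).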

Graphing

[Figure: four panels. Density histogram of x1 (x-axis 0 to 1); frequency histogram of x1 (x-axis 0 to 1); plot of x1 and x2 on a common scale from -4 to 4; cx1 (0 to 1) against x1 (0 to 1).]
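The commands behind the four panels are not shown on the slide. The sketch below is one plausible way to produce similar graphs and rests on several assumptions: that the third panel is a box plot of x1 and x2, that cx1 is the empirical cumulative distribution of x1 created with cumul, and an arbitrary seed.

```stata
* Sketch only: the plot types, the meaning of cx1 and the seed are assumptions
clear
set seed 2007
set obs 500
gen x1 = uniform()
gen x2 = invnormal(uniform())

* histogram of x1 on the density scale (the default)
histogram x1

* the same histogram with frequencies on the vertical axis
histogram x1, frequency

* x1 and x2 side by side on a common scale
graph box x1 x2

* empirical cumulative distribution of x1, plotted against x1
cumul x1, generate(cx1)
line cx1 x1, sort
```

For a uniform variable the last graph should lie close to the 45-degree line, which gives a quick visual check of the random-number generator.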

Throwing seeds

=> Allows you to regenerate exactly the same sample at any time:

set obs 500

set seed 2

gen z1 = invnorm(uniform())

set seed 2

gen z2 = invnorm(uniform())

set seed 19840607

gen z3 = invnorm(uniform())

dotplot z1 z2 z3
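Besides comparing the dot plots by eye, the effect of the seed can be checked directly. The sketch below repeats the commands from this slide (with invnormal() in place of its older name invnorm()) and adds two checks that are not on the slide: assert stops with an error unless z1 and z2 are identical in every observation, and count reports in how many observations z3 differs from z1.

```stata
* Sketch only: the assert and count checks are additions, not from the slides
clear
set obs 500

set seed 2
gen z1 = invnormal(uniform())

* same seed again, so the same sequence of draws is produced
set seed 2
gen z2 = invnormal(uniform())

* a different seed gives a different sequence
set seed 19840607
gen z3 = invnormal(uniform())

* z1 and z2 must coincide observation by observation
assert z1 == z2

* number of observations in which z3 differs from z1
count if z1 != z3
```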

Generating random dummy variables from sample

Task: generate a variable that indicates whether an individual smokes (smoke=1) or does not smoke (smoke=0).

(a) in period 1, assume that (s)he smokes with probability 30%;
(b) in each of the following 30 periods, there is a 65% chance that a smoker keeps smoking and a 5% chance that a non-smoker starts smoking.

Solution:

(a) Note that a variable uniformly distributed on [0,1) is less than 0.3 with probability 30%. Then:
gen smoke = uniform()<.3

(b) First, give every individual an ID (pid) and create one observation per period for each ID (initially all copies of the period-1 observation). Then, with the data sorted by pid, update the smoking status period by period within each ID; the probability of smoking is .05 + .6 = .65 if the individual smoked in the previous period and .05 otherwise:
by pid: replace smoke=uniform()<(.05+.6*smoke[_n-1]) if _n>1
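The whole solution can be put together as a short do-file. This is a minimal sketch under stated assumptions: 500 individuals, an arbitrary seed, a period variable named period, and 31 observations per individual (period 1 plus the 30 following periods). None of these choices comes from the slides. It relies on replace working through the observations in order, so that smoke[_n-1] already holds the updated value of the previous period.

```stata
* Sketch only: the number of individuals, the seed and the variable name
* period are assumptions
clear
set seed 2007
set obs 500

* one row per individual with the period-1 smoking status
gen pid = _n
gen smoke = uniform() < .3

* 31 copies of every row: period 1 plus the 30 following periods
expand 31
bysort pid: gen period = _n

* within each individual, update the status from period 2 onwards:
* probability .65 after a smoking period, .05 otherwise
by pid: replace smoke = uniform() < (.05 + .6*smoke[_n-1]) if _n > 1

* share of smokers by period
tabulate period smoke, row
```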

Drawing from multivariate distributions

Task: generate several variables that are correlated with each other (i.e. follow a multivariate distribution)

Solution:

(a) drawnorm: draws a sample from a multivariate normal distribution with desired means and covariance matrix (specified here through the correlation matrix, corr(C))
drawnorm x y, n(1000) means(m) corr(C)

(b) corr2data: creates an artificial dataset with a specified correlation structure (it is not a sample from an underlying population with the specified summary statistics)
corr2data x y, n(1000) means(m) corr(C)

Note: the matrices m and C can be defined beforehand with the mat (matrix) command
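The difference between the two commands shows up when the results are summarized. In the sketch below the values of m and C (means 1 and 2, correlation 0.5) and the seed are illustrative assumptions. With drawnorm the sample means and the sample correlation should only be close to the requested values, whereas corr2data constructs data that match them up to rounding.

```stata
* Sketch only: the values in m and C and the seed are assumptions
clear
set seed 2007

* means (row vector) and correlation matrix
matrix m = (1, 2)
matrix C = (1, .5 \ .5, 1)

* (a) a random sample: the moments are close to, but not equal to, the targets
drawnorm x y, n(1000) means(m) corr(C)
summarize x y
correlate x y

* (b) constructed data: the moments match the targets
clear
corr2data x y, n(1000) means(m) corr(C)
summarize x y
correlate x y
```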

Loops and distribution of estimated coefficients

Why use loops?

-> a coefficient estimated from a single randomly drawn sample is unlikely to coincide with the true one

-> drawing many samples, estimating the coefficient of interest in each, and averaging these estimates brings the result closer to the true value

How to use loops?

gen b1=0 /* all observations of b1 are assigned the value 0; the dataset needs at least 500 observations */
local i=1 /* i is the counter in the following loop */
set more off /* useful command so we do not have to hit a key every time the regression runs */
while `i'<=500 { /* start a loop of 500 repetitions */
    capture drop x1 y /* drop only the regenerated variables (here x1 and a placeholder dependent variable y) so we can randomly generate them again; drop _all would also delete b1 */
    /* generate random variables */
    /* regression */
    scalar d =_b[x1] /* store the estimated coefficient on x1 in a scalar */
    replace b1 = scalar(d) if _n==`i' /* put the coefficient estimated in the ith regression into the ith observation of b1 */
    local i=`i'+1 /* add 1 to the counter */
} /* end of the loop */
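A complete version of such a loop might look as follows. This is a minimal sketch, not the exercise's own solution: the data-generating process y = 1 + 2*x1 + e with standard normal x1 and e, the seed, and the use of forvalues instead of while with a manual counter are all assumptions made for illustration.

```stata
* Sketch only: the model y = 1 + 2*x1 + e, the seed and the use of
* forvalues are assumptions
clear
set seed 2007
set obs 500
set more off

* b1 will hold one estimated slope per repetition
gen b1 = .
gen x1 = .
gen y = .

forvalues i = 1/500 {
    * draw a fresh sample of 500 observations
    quietly replace x1 = invnormal(uniform())
    quietly replace y = 1 + 2*x1 + invnormal(uniform())

    * estimate the slope and store it in the ith observation of b1
    quietly regress y x1
    quietly replace b1 = _b[x1] in `i'
}

* distribution of the 500 estimated slopes
summarize b1
histogram b1
```

Under these assumptions the mean of b1 should be close to the true slope of 2, and the histogram shows how much a single estimate can deviate from it.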


Any questions???