In their simplest form, weights are the inverse of the probability of selection. If \(p_i\) is the probability that observation \(i\) is selected into the sample, then its weight is defined as \(w_i = \frac{1}{p_i}\).
Simulate Population Data
clear all
set seed 3846 // random seed
set obs 10000 // observations
generate x = rnormal(100, 10) // random normal x
generate z = rbinomial(1, .25) // dichotomous z
generate e = rnormal(0, 10) // random error
replace x = x - 20 if z == 1 // x is 20 lower for z=1
generate y = 2 * x + z + e // TRUE relationship in population
drop e // drop error
save population.dta, replace // save population data
quietly: regress y x i.z // population
est store population
use population.dta, clear
dtable x i.z y // descriptive statistics
--------------------------
                   Summary
--------------------------
N                   10,000
x          94.945 (13.360)
z
  0          7,493 (74.9%)
  1          2,507 (25.1%)
y         190.089 (28.239)
--------------------------
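The simulated means can be checked against the data-generating process. The short derivation below uses only the parameters in the simulation code (a 25% chance that \(z = 1\), a mean of 100 for \(x\) that is lowered by 20 when \(z = 1\), and a mean-zero error):

\[
E[x] = 0.75 \times 100 + 0.25 \times 80 = 95, \qquad
E[y] = 2\,E[x] + E[z] + E[e] = 2 \times 95 + 0.25 + 0 = 190.25
\]

The simulated means of 94.945 for x and 190.089 for y are close to these expected values.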
Random Sample
use population.dta, clear
sample 100, count by(z) // same count from each group
save sample.dta, replace // sample data
dtable x i.z y // descriptive statistics
(9,800 observations deleted)
file sample.dta saved
--------------------------
                   Summary
--------------------------
N                      200
x          90.014 (15.310)
z
  0            100 (50.0%)
  1            100 (50.0%)
y         179.519 (30.502)
--------------------------
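Drawing the same count from each group means the two groups are sampled at very different rates. A rough calculation, using the population counts reported earlier (7,493 with \(z = 0\) and 2,507 with \(z = 1\)):

\[
p_{z=0} = \frac{100}{7{,}493} \approx 0.0133, \qquad
p_{z=1} = \frac{100}{2{,}507} \approx 0.0399
\]

Observations with \(z = 1\) are therefore about three times as likely to be selected. Because x is 20 lower when \(z = 1\), the unweighted sample mean of x is expected to be pulled down:

\[
E[\bar{x}_{\text{unweighted}}] = 0.5 \times 100 + 0.5 \times 80 = 90
\]

This agrees with the sample mean of 90.014, compared with the population mean of 94.945.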
Generate Weights
* p is probability of selection
* w = 1/p
use sample.dta, clear
generate p = . // initialize to missing
replace p = 100/250 if z == 1
replace p = 100/750 if z == 0
generate w = 1/p
dtable x i.z y p w
save sample.dta, replace
(200 missing values generated)
(100 real changes made)
(100 real changes made)
--------------------------
                   Summary
--------------------------
N                      200
x          90.014 (15.310)
z
  0            100 (50.0%)
  1            100 (50.0%)
y         179.519 (30.502)
p            0.267 (0.134)
w            5.000 (2.506)
--------------------------
file sample.dta saved
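The denominators 250 and 750 in the code above appear to be the rounded group shares per 1,000 people, so the resulting weights (2.5 and 7.5) are relative weights rather than exact inverse probabilities. That is enough for estimating a mean, because only the ratio of the weights matters there. As a minimal sketch, assuming population.dta from the simulation step is available, exact probabilities could instead be computed from the population group counts. The variable names N_z, p_exact, and w_exact are hypothetical, and the code draws a new sample, so its results will not match the saved sample.dta exactly.

```stata
* Sketch: exact selection probabilities from population group counts
* (N_z, p_exact, and w_exact are illustrative names, not from the original)
use population.dta, clear
bysort z: generate N_z = _N      // population count within each z group
sample 100, count by(z)          // draw 100 observations per group
generate p_exact = 100 / N_z     // exact probability of selection
generate w_exact = 1 / p_exact   // exact inverse-probability weight
tabulate z, summarize(w_exact)   // weight by group
```

With the population counts shown earlier, these exact weights are about ten times the relative weights, since they sum to the full population of 10,000 rather than to 1,000.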
Descriptive Statistics
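For descriptive statistics, the weighted estimate is much closer to the population value than the unweighted estimate. The population mean of x is 94.945; the output below gives an unweighted sample mean of 90.014 and a weighted sample mean of 95.488.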
use sample.dta, clear
svyset [pweight=w] // svyset the data
mean x // unweighted estimate
svy: mean x // weighted estimate
Sampling weights: w
             VCE: linearized
     Single unit: missing
        Strata 1: <one>
 Sampling unit 1: <observations>
           FPC 1: <zero>

Mean estimation                            Number of obs = 200

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
           x |   90.01397   1.082594      87.87914     92.1488
--------------------------------------------------------------
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =   1          Number of obs   =          200
Number of PSUs   = 200          Population size =   999.999952
                                Design df       =          199

--------------------------------------------------------------
             |             Linearized
             |       Mean   std. err.     [95% conf. interval]
-------------+------------------------------------------------
           x |   95.48824   .9906796      93.53466    97.44181
--------------------------------------------------------------
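The weighted point estimate is the weighted average \(\bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i}\), and the reported population size of roughly 1,000 is the sum of the weights (100 × 2.5 + 100 × 7.5). As a minimal sketch, assuming sample.dta contains the weight variable w created above, the point estimate can be reproduced by hand. The names wx, sum_wx, and sum_w are illustrative.

```stata
* Sketch: reproduce the weighted mean of x by hand
* (wx, sum_wx, and sum_w are illustrative names, not from the original)
use sample.dta, clear
generate double wx = w * x       // weight times value
quietly summarize wx
scalar sum_wx = r(sum)           // numerator: sum of w*x
quietly summarize w
scalar sum_w = r(sum)            // denominator: sum of weights
display "Weighted mean of x: " sum_wx / sum_w
mean x [pweight = w]             // same point estimate without svyset
```

This only reproduces the point estimate; the linearized standard error still comes from the estimation command.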
Regressions
For regressions, it is less clear whether the unweighted or the weighted results are better.