Weighted Data

Author

Andy Grogan-Kaylor

Published

January 30, 2025

Weights

In their simplest form, weights are the inverse of the probability of selection. If \(p_i\) is the probability that observation \(i\) is selected into the sample, then its weight is defined as \(w_i = \frac{1}{p_i}\).
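
As a small illustration with made-up numbers (not taken from the data below): a unit selected with probability 0.02 gets a weight of 50, so that one sampled unit stands in for 50 units of the population.

\[
p_i = 0.02 \quad \Rightarrow \quad w_i = \frac{1}{p_i} = \frac{1}{0.02} = 50
\]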

Simulate Population Data

clear all

set seed 3846 // random seed

set obs 10000 // observations

generate x = rnormal(100, 10) // random normal x

generate z = rbinomial(1, .25) // dichotomous z

generate e = rnormal(0, 10) // random error

replace x = x - 20 if z == 1 // x is 20 lower for z=1

generate y = 2 * x + z + e // TRUE relationship in population

drop e // drop error

save population.dta, replace // save population data

quietly: regress y x i.z // population

est store population


use population.dta, clear

dtable x i.z y // descriptive statistics

--------------------
         Summary    
--------------------
N             10,000
x    94.945 (13.360)
z                   
  0    7,493 (74.9%)
  1    2,507 (25.1%)
y   190.089 (28.239)
--------------------

Random Sample


use population.dta, clear

sample 100, count by(z) // draw 100 observations at random within each z group

save sample.dta, replace // sample data

dtable x i.z y // descriptive statistics

(9,800 observations deleted)

file sample.dta saved


--------------------
         Summary    
--------------------
N                200
x    90.014 (15.310)
z                   
  0      100 (50.0%)
  1      100 (50.0%)
y   179.519 (30.502)
--------------------
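
The design above can be written out as selection probabilities. Taking the group counts from the population table (7,493 observations with z = 0 and 2,507 with z = 1) and the 100 observations drawn per group, the exact probabilities and weights work out to roughly the following. The subscripted symbols are introduced here for illustration only.

\[
p_{z=0} = \frac{100}{7493} \approx 0.0133, \qquad w_{z=0} = \frac{7493}{100} = 74.93
\]

\[
p_{z=1} = \frac{100}{2507} \approx 0.0399, \qquad w_{z=1} = \frac{2507}{100} = 25.07
\]

An observation with z = 1 is therefore about three times as likely to be sampled as one with z = 0. Because x is 20 lower when z = 1, this oversampling is consistent with the unweighted sample mean of x (90.014) falling below the population mean (94.945).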

Generate Weights


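* p is probability of selection
* w = 1/p
* note: 100/250 and 100/750 treat the population as 1,000 units split 25%/75%;
* the simulated population has 10,000, so w is about one tenth of the exact
* inverse probability, but the two groups keep nearly the same 1:3 ratio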

use sample.dta, clear

generate p = . // initialize to missing

replace p = 100/250 if z == 1

replace p = 100/750 if z == 0

generate w = 1/p

dtable x i.z y p w

save sample.dta, replace

(200 missing values generated)

(100 real changes made)

(100 real changes made)



--------------------
         Summary    
--------------------
N                200
x    90.014 (15.310)
z                   
  0      100 (50.0%)
  1      100 (50.0%)
y   179.519 (30.502)
p      0.267 (0.134)
w      5.000 (2.506)
--------------------

file sample.dta saved
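
The probabilities above are typed in by hand. As an alternative, here is a minimal sketch that derives them from the group counts in the two files, assuming population.dta and sample.dta exist as saved earlier. The file popcounts.dta and the variables N_z, n_z, p_exact and w_exact are hypothetical names introduced for this illustration, not part of the original code.

```stata
* Sketch: derive selection probabilities from group counts
* (popcounts.dta, N_z, n_z, p_exact, w_exact are hypothetical names)

use population.dta, clear
contract z, freq(N_z)                        // one row per z with its population count
save popcounts.dta, replace

use sample.dta, clear
merge m:1 z using popcounts.dta, nogenerate  // attach population counts to the sample
bysort z: generate n_z = _N                  // sample count within each z group
generate p_exact = n_z / N_z                 // probability of selection
generate w_exact = 1 / p_exact               // inverse probability weight
tabulate z, summarize(w_exact)               // weight by group
```

Weights built this way sum to the population size (10,000 here), whereas the hand-typed weights of 2.5 and 7.5 sum to about 1,000. Rescaling every weight by the same constant does not change a weighted mean or weighted regression coefficients, so the main difference is in estimated population totals.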

Descriptive Statistics

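In terms of descriptive statistics, the weighted estimate is much closer to the population value than the unweighted estimate. The population mean of x is 94.945; the unweighted sample mean is 90.014, while the weighted sample mean is 95.488.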


use sample.dta, clear

svyset [pweight=w] // svyset the data

mean x // unweighted estimate

svy: mean x // weighted estimate

Sampling weights: w
             VCE: linearized
     Single unit: missing
        Strata 1: <one>
 Sampling unit 1: <observations>
           FPC 1: <zero>


Mean estimation                            Number of obs = 200

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
           x |   90.01397   1.082594      87.87914     92.1488
--------------------------------------------------------------

(running mean on estimation sample)

Survey: Mean estimation

Number of strata =   1            Number of obs   =        200
Number of PSUs   = 200            Population size = 999.999952
                                  Design df       =        199

--------------------------------------------------------------
             |             Linearized
             |       Mean   std. err.     [95% conf. interval]
-------------+------------------------------------------------
           x |   95.48824   .9906796      93.53466    97.44181
--------------------------------------------------------------
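
The weighted point estimate is a weighted average, \(\bar{x}_w = \sum_i w_i x_i / \sum_i w_i\). A minimal sketch that recomputes it by hand, assuming sample.dta contains w as generated above; the variable wx and the scalars num and den are hypothetical names used only for this check.

```stata
* Sketch: recompute the weighted mean of x by hand
* (wx, num, den are hypothetical names)

use sample.dta, clear
generate wx = w * x                   // weight times value
quietly summarize wx
scalar num = r(sum)                   // sum of w_i * x_i
quietly summarize w
scalar den = r(sum)                   // sum of w_i
display "weighted mean of x = " num / den
mean x [pweight = w]                  // same point estimate without svyset
```

Both results should agree with the mean reported by svy: mean x, up to rounding in the display.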

Regressions

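Here it is less clear whether the unweighted or the weighted results are better. Both coefficients on x (1.858 unweighted, 1.892 weighted) are close to the population value of 1.986, and neither sample regression estimates the coefficient on z precisely.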


use sample.dta, clear

quietly: regress y x i.z // unweighted

est store unweighted

quietly: regress y x i.z [pweight = w] // weighted

est store weighted

etable, ///
estimates(population unweighted weighted) ///
column(estimate) ///
showstars showstarsnote

                       population unweighted  weighted 
-------------------------------------------------------
x                        1.986 **   1.858 **   1.892 **
                       (0.010)    (0.066)    (0.075)   
z                                                      
  1                      0.440     -1.122     -0.382   
                       (0.303)    (2.019)    (1.984)   
Intercept                1.446     12.844      9.431   
                       (0.989)    (6.746)    (7.474)   
Number of observations   10000        200        200   
-------------------------------------------------------
** p<.01, * p<.05
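
For reference, the point estimates from a regression with probability weights minimize a weighted sum of squared residuals. This is the standard weighted least squares criterion, written in generic notation: \(\mathbf{x}_i\) (the regressors for observation \(i\), including the constant) and \(\boldsymbol{\beta}\) are symbols introduced here, not taken from the code above.

\[
\hat{\boldsymbol{\beta}}_w = \arg\min_{\boldsymbol{\beta}} \sum_{i} w_i \left( y_i - \mathbf{x}_i' \boldsymbol{\beta} \right)^2
\]

Setting every \(w_i = 1\) gives the unweighted fit. One plausible reading of the table, offered as an interpretation rather than a result: selection in this example depends only on z, and z is itself a regressor, so both fits target the same coefficients and the weights mainly affect precision.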