Weighted Data


Andy Grogan-Kaylor


January 30, 2025


In their simplest form, weights are the inverse probability of selection. If \(p_i\) is the probability of selection, then the weight is defined as \(w_i = \frac{1}{p_i}\).
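
As a small worked illustration (the numbers here are hypothetical and are not taken from the simulation below), suppose one unit in every ten is selected:

\[
p_i = \frac{1}{10} = 0.1, \qquad w_i = \frac{1}{p_i} = 10
\]

Each sampled unit then stands in for 10 units of the population. Units that are less likely to be selected receive larger weights.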

Simulate Population Data

clear all

set seed 3846 // random seed

set obs 10000 // observations

generate x = rnormal(100, 10) // random normal x

generate z = rbinomial(1, .25) // dichotomous z

generate e = rnormal(0, 10) // random error

replace x = x - 20 if z == 1 // x is 20 lower for z=1

generate y = 2 * x + z + e // TRUE relationship in population

drop e // drop error

save population.dta, replace // save population data

quietly: regress y x i.z // population

est store population

use population.dta, clear

dtable x i.z y // descriptive statistics
N             10,000
x    94.945 (13.360)
z
  0    7,493 (74.9%)
  1    2,507 (25.1%)
y   190.089 (28.239)
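
The simulated means can be checked against the data-generating process. The following is a sketch of the expected values implied by the simulation code, assuming the population is large enough for its averages to sit near their expectations:

\[
\begin{aligned}
E[x] &= 100 - 20 \cdot P(z = 1) = 100 - 20(0.25) = 95 \\
E[y] &= 2\,E[x] + E[z] + E[e] = 2(95) + 0.25 + 0 = 190.25
\end{aligned}
\]

These agree closely with the simulated means of 94.945 and 190.089.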

Random Sample

use population.dta, clear

sample 100, count by(z) // same count from each group

save sample.dta, replace // sample data

dtable x i.z y // descriptive statistics
(9,800 observations deleted)

file sample.dta saved

N                200
x    90.014 (15.310)
z
  0      100 (50.0%)
  1      100 (50.0%)
y   179.519 (30.502)
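
Sampling 100 observations from each group makes z = 1 half of the sample rather than a quarter of it. A rough expectation for the unweighted sample mean of x, using the group means of 100 and 80 implied by the simulation code:

\[
E[\bar{x}_{\text{unweighted}}] = 0.5(100) + 0.5(80) = 90
\]

This is consistent with the observed sample mean of 90.014, which is about 5 below the population mean of 94.945.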

Generate Weights

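* p is probability of selection
* w = 1/p
* 100/250 and 100/750 use the approximate 25%/75% split of z,
* scaled to a population of 1,000; the exact probabilities in
* this population would be 100/2,507 and 100/7,493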

use sample.dta, clear

generate p = . // initialize to missing

replace p = 100/250 if z == 1

replace p = 100/750 if z == 0

generate w = 1/p

dtable x i.z y p w

save sample.dta, replace
(200 missing values generated)

(100 real changes made)

(100 real changes made)

N                200
x    90.014 (15.310)
z
  0      100 (50.0%)
  1      100 (50.0%)
y   179.519 (30.502)
p      0.267 (0.134)
w      5.000 (2.506)

file sample.dta saved
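
The probabilities used here are relative: they assume 250 and 750 units per group, while the simulated population has 2,507 and 7,493. Sampling weights affect means and regression coefficients only through their relative sizes, so those estimates should change very little, but the weights sum to 100(2.5) + 100(7.5) = 1,000 rather than 10,000. A minimal sketch of weights built from the actual group counts follows; the variable names p_exact and w_exact are hypothetical and are not part of the original analysis.

```stata
use sample.dta, clear

* exact selection probabilities from the population counts in the
* descriptive table (2,507 with z == 1; 7,493 with z == 0)
* p_exact and w_exact are hypothetical names used for illustration
generate p_exact = cond(z == 1, 100/2507, 100/7493)

generate w_exact = 1/p_exact // weight = inverse probability

total w_exact // sum of weights estimates the population size

svyset [pweight = w_exact] // declare the alternative weights

svy: mean x // weighted estimate using the exact weights
```

The data are not saved in this sketch, so sample.dta and the weight w used in the rest of the document are left unchanged.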

Descriptive Statistics

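For descriptive statistics, the weighted estimate is much better than the unweighted estimate. The population mean of x is 94.945. The unweighted sample mean (90.014) understates it because the sample over-represents z = 1, whose x values are 20 lower on average, while the weighted mean (95.488) is much closer.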

use sample.dta, clear

svyset [pweight=w] // svyset the data

mean x // unweighted estimate

svy: mean x // weighted estimate 
Sampling weights: w
             VCE: linearized
     Single unit: missing
        Strata 1: <one>
 Sampling unit 1: <observations>
           FPC 1: <zero>

Mean estimation                            Number of obs = 200

             |       Mean   Std. err.     [95% conf. interval]
           x |   90.01397   1.082594      87.87914     92.1488

(running mean on estimation sample)

Survey: Mean estimation

Number of strata =   1            Number of obs   =        200
Number of PSUs   = 200            Population size = 999.999952
                                  Design df       =        199

             |             Linearized
             |       Mean   std. err.     [95% conf. interval]
           x |   95.48824   .9906796      93.53466    97.44181
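
The weighted estimate is a ratio of weighted sums. A sketch using the weights defined earlier (7.5 for z = 0 and 2.5 for z = 1), where \(\bar{x}_0\) and \(\bar{x}_1\) denote the sample means of x within each group (notation introduced here for illustration):

\[
\bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i}
= \frac{100(7.5)\,\bar{x}_0 + 100(2.5)\,\bar{x}_1}{100(7.5) + 100(2.5)}
= 0.75\,\bar{x}_0 + 0.25\,\bar{x}_1
\]

The weights restore the 75/25 mix of the population, which is why the weighted mean lands near the population mean while the unweighted mean does not.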


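In regression, it is less clear whether the unweighted or the weighted results are better. Selection here depends only on z, and z is in the model, so both the unweighted and the weighted coefficients on x (1.858 and 1.892) are in the neighborhood of the population value (1.986), while weighting slightly increases the standard error (0.075 versus 0.066).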

use sample.dta, clear

quietly: regress y x i.z // unweighted

est store unweighted

quietly: regress y x i.z [pweight = w] // weighted

est store weighted

etable, ///
estimates(population unweighted weighted) ///
column(estimate) ///
showstars showstarsnote
                       population unweighted  weighted 
x                        1.986 **   1.858 **   1.892 **
                       (0.010)    (0.066)    (0.075)   
z
  1                      0.440     -1.122     -0.382   
                       (0.303)    (2.019)    (1.984)   
Intercept                1.446     12.844      9.431   
                       (0.989)    (6.746)    (7.474)   
Number of observations   10000        200        200   
** p<.01, * p<.05
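
For comparison, the weighted model can also be fit with the survey prefix once the data are svyset. This is a sketch rather than part of the original analysis, and the stored-estimate name svyweighted is hypothetical. The point estimates should match those from regress with [pweight = w], while the linearized standard errors may differ slightly.

```stata
use sample.dta, clear

svyset [pweight = w] // declare the sampling weights

svy: regress y x i.z // survey-weighted regression

est store svyweighted // hypothetical name used for illustration

* compare with the pweight regression stored earlier as "weighted"
etable, ///
estimates(weighted svyweighted) ///
column(estimate)
```

The sketch assumes the stored estimates named weighted from the preceding block are still in memory.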