Cleaning Data With Stata

Andy Grogan-Kaylor

25 Sep 2024 10:34:07

Background

It sometimes seems like 80% of our work as data analysts is cleaning the data, while only 20% is the actual analysis. Here are some Stata commands that are useful in cleaning data.

First we simulate some data to work with, and to clean.

Simulate Some Data

This section is provided for illustration only, as it may be helpful to see how the data was simulated, and the decisions that went into simulating the data. You may also safely ignore this section if you like.

. clear all 
. set obs 100 // 100 observations
Number of observations (_N) was 0, now 100.
. generate id = _n // random id 
. generate age = rnormal(50,10) // random generated age
. replace age = 200 in 1 // someone is 200 years old!
(1 real change made)
. generate happy = runiformint(1,5) // randomly generated happiness
. replace happy = 999 in 10 // simulate a missing value
(1 real change made)
. generate somethingelse = rnormal(0, 1) // something else!

Look At Some Of The Data

. list in 1/10 // list first 10 observations

     ┌───────────────────────────────────┐
     │ id        age   happy   somethi~e │
     ├───────────────────────────────────┤
  1. │  1        200       1      .06784 │
  2. │  2   60.42631       3    1.459888 │
  3. │  3   52.97712       4   -.7060056 │
  4. │  4   32.77868       5    .0529082 │
  5. │  5     42.708       1   -.2976007 │
     ├───────────────────────────────────┤
  6. │  6   58.61826       3   -.2165111 │
  7. │  7   47.60646       2    1.299809 │
  8. │  8   55.16549       5    .1418742 │
  9. │  9   31.87984       3    1.550615 │
 10. │ 10   39.84875     999    .2515597 │
     └───────────────────────────────────┘

Clean The Data!

Look at The Data and Think About The Data (describe, summarize, tabulate, codebook, browse)

When we look at variables we are looking for values that don’t make sense, or that are outside the plausible range. As we are working with the data, it may sometimes be helpful to browse the data.

. describe // describe the data

Contains data
 Observations:           100                  
    Variables:             4                  
───────────────────────────────────────────────────────────────────────────────────────────
Variable      Storage   Display    Value
    name         type    format    label      Variable label
───────────────────────────────────────────────────────────────────────────────────────────
id              float   %9.0g                 
age             float   %9.0g                 
happy           float   %9.0g                 
somethingelse   float   %9.0g                 
───────────────────────────────────────────────────────────────────────────────────────────
Sorted by: 
     Note: Dataset has changed since last saved.
. summarize // descriptive statistics

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
          id │        100        50.5    29.01149          1        100
         age │        100    49.85175    17.83505   24.33709        200
       happy │        100        12.9    99.61598          1        999
somethinge~e │        100   -.0085413    1.072256  -2.685881   2.794861
. tabulate happy // tabulation of this particular categorical variable

      happy │      Freq.     Percent        Cum.
────────────┼───────────────────────────────────
          1 │         21       21.00       21.00
          2 │         19       19.00       40.00
          3 │         23       23.00       63.00
          4 │         17       17.00       80.00
          5 │         19       19.00       99.00
        999 │          1        1.00      100.00
────────────┼───────────────────────────────────
      Total │        100      100.00
. codebook happy // VERY detailed view of this particular categorical variable

───────────────────────────────────────────────────────────────────────────────────────────
happy                                                                           (unlabeled)
───────────────────────────────────────────────────────────────────────────────────────────

                  Type: Numeric (float)

                 Range: [1,999]                       Units: 1
         Unique values: 6                         Missing .: 0/100

            Tabulation: Freq.  Value
                           21  1
                           19  2
                           23  3
                           17  4
                           19  5
                            1  999

Notice that…

Remember that the command lookfor is often very helpful in looking for a particular variable. e.g. lookfor happy.

Only keep The Variables Of Interest

We may only be interested in keeping some variables to keep our analytic data set more manageable.

For this particular analysis we may wish to drop the variable called somethingelse.

. keep id age happy // keep only relevant variables

We could also have said drop somethingelse.

Add Variable Labels (label variable "...")

. label variable id "ID" // label variable
. label variable age "Age in Years" // label variable
. label variable happy "Happiness Scale" // label variable

Create Value Labels (label define ...)

. label define happy 1 "Rarely" 2 "Sometimes" 3 "Often" 4 "Always" // create value label

Attach Value Labels To Variables (label values ...)

Variables and value labels can have the same names but are different things. We add the variable label happy to the variable named happy.

. label values happy happy // attach VALUE LABEL happy to VARIABLE happy

Recode Outliers, Values That Are Errors, Or Values That Should Be Coded As Missing (recode)

. recode happy (999 = .) // recode values as missing
(1 changes made to happy)
. recode age (100/max = 100) // age is topcoded at 100 (may or may not be plausible)
(1 changes made to age)

We describe and summarize The Data And See The Changes That Have Been Made

. describe

Contains data
 Observations:           100                  
    Variables:             3                  
───────────────────────────────────────────────────────────────────────────────────────────
Variable      Storage   Display    Value
    name         type    format    label      Variable label
───────────────────────────────────────────────────────────────────────────────────────────
id              float   %9.0g                 ID
age             float   %9.0g                 Age in Years
happy           float   %9.0g      happy      Happiness Scale
───────────────────────────────────────────────────────────────────────────────────────────
Sorted by: 
     Note: Dataset has changed since last saved.
. summarize

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
          id │        100        50.5    29.01149          1        100
         age │        100    48.85175    10.71257   24.33709        100
       happy │         99    2.939394    1.412901          1          5