Cleaning Data With Stata

Andy Grogan-Kaylor

Background

It sometimes seems like 80% of our work as data analysts is cleaning the data, while only 20% is the actual analysis. Here are some Stata commands that are useful in cleaning data.

First we simulate some data to work with, and to clean.

Simulate Some Data

This section is provided for illustration only, as it may be helpful to see how the data was simulated, and the decisions that went into simulating the data. You may also safely ignore this section if you like.

clear all 

set obs 100 // 100 observations

generate id = _n // random id 

generate age = rnormal(50, 10) // random generated age

replace age = 200 in 1 // someone is 200 years old!

generate happy = runiformint(1,5) // randomly generated happiness
    
replace happy = 999 in 10 // simulate a missing value
    
generate somethingelse = rnormal(0, 1) // something else!

Look At Some Of The Data

list in 1/10 // list first 10 observations

     | id        age   happy   somethi~e |
     |-----------------------------------|
  1. |  1        200       1      .06784 |
  2. |  2   60.42631       3    1.459888 |
  3. |  3   52.97712       4   -.7060056 |
  4. |  4   32.77868       5    .0529082 |
  5. |  5     42.708       1   -.2976007 |
     |-----------------------------------|
  6. |  6   58.61826       3   -.2165111 |
  7. |  7   47.60646       2    1.299809 |
  8. |  8   55.16549       5    .1418742 |
  9. |  9   31.87984       3    1.550615 |
 10. | 10   39.84875     999    .2515597 |
     +-----------------------------------+

Clean The Data!

Look at The Data and Think About The Data (`describe`, `summarize`, `tabulate`, `codebook`, `browse`)

Tip

When we look at variables we are looking for values that don’t make sense, or that are outside the plausible range. As we are working with the data, it may sometimes be helpful to browse the data.

describe // describe the data

Contains data
 Observations:           100                  
    Variables:             4                  
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
id              float   %9.0g                 
age             float   %9.0g                 
happy           float   %9.0g                 
somethingelse   float   %9.0g                 
-------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.

summarize // descriptive statistics

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
          id |        100        50.5    29.01149          1        100
         age |        100    49.85175    17.83505   24.33709        200
       happy |        100        12.9    99.61598          1        999
somethinge~e |        100   -.0085413    1.072256  -2.685881   2.794861

tabulate happy // tabulation of this particular categorical variable

      happy |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         21       21.00       21.00
          2 |         19       19.00       40.00
          3 |         23       23.00       63.00
          4 |         17       17.00       80.00
          5 |         19       19.00       99.00
        999 |          1        1.00      100.00
------------+-----------------------------------
      Total |        100      100.00

codebook happy // VERY detailed view of this particular categorical variable

happy                                                               (unlabeled)
-------------------------------------------------------------------------------

                  Type: Numeric (float)

                 Range: [1,999]                       Units: 1
         Unique values: 6                         Missing .: 0/100

            Tabulation: Freq.  Value
                           21  1
                           19  2
                           23  3
                           17  4
                           19  5
                            1  999

Notice that…

There are variables in which we may not have interest.
None of the variables are labelled informatively.
Variables do not seem to have informative value labels.
Someone appears to 200 years old.
There appear to be missing values in the variable happy that need to be re-coded.

Tip

Remember that the command lookfor is often very helpful in looking for a particular variable. e.g. lookfor happy.

Only `keep` The Variables Of Interest

We may only be interested in keeping some variables to keep our analytic data set more manageable.

For this particular analysis we may wish to drop the variable called somethingelse.

keep id age happy // keep only relevant variables

We could also have said drop somethingelse.

Add Variable Labels (`label variable "..."`)

label variable id "ID" // label variable

label variable age "Age in Years" // label variable

label variable happy "Happiness Scale" // label variable

Create Value Labels (`label define ...`)

label define happy 1 "Rarely" 2 "Sometimes" 3 "Often" 4 "Always" // create value label

Attach Value Labels To Variables (`label values ...`)

Variables and value labels can have the same names but are different things. We add the variable label happy to the variable named happy.

label values happy happy // attach VALUE LABEL happy to VARIABLE happy

Recode Outliers, Values That Are Errors, Or Values That Should Be Coded As Missing (`recode`)

recode happy (999 = .) // recode values as missing

recode age (100/max = 100) // age is topcoded at 100 (may or may not be plausible)

(1 changes made to happy)

(1 changes made to age)

We `describe` and `summarize` The Data And See The Changes That Have Been Made

describe

Contains data
 Observations:           100                  
    Variables:             3                  
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
id              float   %9.0g                 ID
age             float   %9.0g                 Age in Years
happy           float   %9.0g      happy      Happiness Scale
-------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.

summarize

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
          id |        100        50.5    29.01149          1        100
         age |        100    48.85175    10.71257   24.33709        100
       happy |         99    2.939394    1.412901          1          5

Background

Look At Some Of The Data

Clean The Data!

Look at The Data and Think About The Data (describe, summarize, tabulate, codebook, browse)

Notice that…

Only keep The Variables Of Interest

Add Variable Labels (label variable "...")

Create Value Labels (label define ...)

Attach Value Labels To Variables (label values ...)

Recode Outliers, Values That Are Errors, Or Values That Should Be Coded As Missing (recode)

We describe and summarize The Data And See The Changes That Have Been Made