It sometimes seems like 80% of our work as data analysts is cleaning
the data, while only 20% is the actual analysis. Here are some Stata
commands that are useful in cleaning data.
First we simulate some data to work with, and to clean.
Simulate Some Data
This section is provided for illustration only, as it may be helpful
to see how the data was simulated, and the decisions that went
into simulating the data. You may also safely ignore this
section if you like.
. clear all
. set obs 100 // 100 observations
Number of observations (_N) was 0, now 100.
. generate id = _n // random id
. generate age = rnormal(50,10) // random generated age
. replace age = 200 in 1 // someone is 200 years old!
(1 real change made)
Look
at The Data and Think About The Data (describe,
summarize, tabulate, codebook,
browse)
When we look at variables we are looking for values that don’t make
sense, or that are outside the plausible range. As we are working with
the data, it may sometimes be helpful to browse the
data.
. describe // describe the data
Contains data
Observations: 100
Variables: 4
───────────────────────────────────────────────────────────────────────────────────────────
Variable Storage Display Value
name type format label Variable label
───────────────────────────────────────────────────────────────────────────────────────────
id float %9.0g
age float %9.0g
happy float %9.0g
somethingelse float %9.0g
───────────────────────────────────────────────────────────────────────────────────────────
Sorted by:
Note: Dataset has changed since last saved.
. summarize // descriptive statistics
Variable │ Obs Mean Std. dev. Min Max
─────────────┼─────────────────────────────────────────────────────────
id │ 100 50.5 29.01149 1 100
age │ 100 49.85175 17.83505 24.33709 200
happy │ 100 12.9 99.61598 1 999
somethinge~e │ 100 -.0085413 1.072256 -2.685881 2.794861
Attach
Value Labels To Variables
(label values ...)
Variables and value labels can have the same names
but are different things. We add the variable label happy
to the variable named happy.
. label values happy happy // attach VALUE LABEL happy to VARIABLE happy
Recode
Outliers, Values That Are Errors, Or Values That Should Be Coded As
Missing (recode)
. recode happy (999 = .) // recode values as missing
(1 changes made to happy)
. recode age (100/max = 100) // age is topcoded at 100 (may or may not be plausible)
(1 changes made to age)
We
describe and summarize The Data And See The
Changes That Have Been Made
. describe
Contains data
Observations: 100
Variables: 3
───────────────────────────────────────────────────────────────────────────────────────────
Variable Storage Display Value
name type format label Variable label
───────────────────────────────────────────────────────────────────────────────────────────
id float %9.0g ID
age float %9.0g Age in Years
happy float %9.0g happy Happiness Scale
───────────────────────────────────────────────────────────────────────────────────────────
Sorted by:
Note: Dataset has changed since last saved.
. summarize
Variable │ Obs Mean Std. dev. Min Max
─────────────┼─────────────────────────────────────────────────────────
id │ 100 50.5 29.01149 1 100
age │ 100 48.85175 10.71257 24.33709 100
happy │ 99 2.939394 1.412901 1 5