clear all
set obs 100 // 100 observations
generate id = _n // random id
generate age = rnormal(50, 10) // random generated age
replace age = 200 in 1 // someone is 200 years old!
generate happy = runiformint(1,5) // randomly generated happiness
replace happy = 999 in 10 // simulate a missing value
generate somethingelse = rnormal(0, 1) // something else!Background
It sometimes seems like 80% of our work as data analysts is cleaning the data, while only 20% is the actual analysis. Here are some Stata commands that are useful in cleaning data.
First we simulate some data to work with, and to clean.
This section is provided for illustration only, as it may be helpful to see how the data was simulated, and the decisions that went into simulating the data. You may also safely ignore this section if you like.
Look At Some Of The Data
list in 1/10 // list first 10 observations | id age happy somethi~e |
|-----------------------------------|
1. | 1 200 1 .06784 |
2. | 2 60.42631 3 1.459888 |
3. | 3 52.97712 4 -.7060056 |
4. | 4 32.77868 5 .0529082 |
5. | 5 42.708 1 -.2976007 |
|-----------------------------------|
6. | 6 58.61826 3 -.2165111 |
7. | 7 47.60646 2 1.299809 |
8. | 8 55.16549 5 .1418742 |
9. | 9 31.87984 3 1.550615 |
10. | 10 39.84875 999 .2515597 |
+-----------------------------------+
Clean The Data!
Look at The Data and Think About The Data (describe, summarize, tabulate, codebook, browse)
When we look at variables we are looking for values that don’t make sense, or that are outside the plausible range. As we are working with the data, it may sometimes be helpful to browse the data.
describe // describe the dataContains data
Observations: 100
Variables: 4
-------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
id float %9.0g
age float %9.0g
happy float %9.0g
somethingelse float %9.0g
-------------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
summarize // descriptive statistics
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
id | 100 50.5 29.01149 1 100
age | 100 49.85175 17.83505 24.33709 200
happy | 100 12.9 99.61598 1 999
somethinge~e | 100 -.0085413 1.072256 -2.685881 2.794861
tabulate happy // tabulation of this particular categorical variable happy | Freq. Percent Cum.
------------+-----------------------------------
1 | 21 21.00 21.00
2 | 19 19.00 40.00
3 | 23 23.00 63.00
4 | 17 17.00 80.00
5 | 19 19.00 99.00
999 | 1 1.00 100.00
------------+-----------------------------------
Total | 100 100.00
codebook happy // VERY detailed view of this particular categorical variablehappy (unlabeled)
-------------------------------------------------------------------------------
Type: Numeric (float)
Range: [1,999] Units: 1
Unique values: 6 Missing .: 0/100
Tabulation: Freq. Value
21 1
19 2
23 3
17 4
19 5
1 999
Notice that…
- There are variables in which we may not have interest.
- None of the variables are labelled informatively.
- Variables do not seem to have informative value labels.
- Someone appears to 200 years old.
- There appear to be missing values in the variable
happythat need to be re-coded.
Remember that the command lookfor is often very helpful in looking for a particular variable. e.g. lookfor happy.
Only keep The Variables Of Interest
We may only be interested in keeping some variables to keep our analytic data set more manageable.
For this particular analysis we may wish to drop the variable called somethingelse.
keep id age happy // keep only relevant variablesWe could also have said drop somethingelse.
Add Variable Labels (label variable "...")
label variable id "ID" // label variable
label variable age "Age in Years" // label variable
label variable happy "Happiness Scale" // label variableCreate Value Labels (label define ...)
label define happy 1 "Rarely" 2 "Sometimes" 3 "Often" 4 "Always" // create value labelAttach Value Labels To Variables (label values ...)
Variables and value labels can have the same names but are different things. We add the variable label happy to the variable named happy.
label values happy happy // attach VALUE LABEL happy to VARIABLE happyRecode Outliers, Values That Are Errors, Or Values That Should Be Coded As Missing (recode)
recode happy (999 = .) // recode values as missing
recode age (100/max = 100) // age is topcoded at 100 (may or may not be plausible)(1 changes made to happy)
(1 changes made to age)
We describe and summarize The Data And See The Changes That Have Been Made
describeContains data
Observations: 100
Variables: 3
-------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
id float %9.0g ID
age float %9.0g Age in Years
happy float %9.0g happy Happiness Scale
-------------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
summarize
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
id | 100 50.5 29.01149 1 100
age | 100 48.85175 10.71257 24.33709 100
happy | 99 2.939394 1.412901 1 5