In Excel, Google Sheets, R and Stata
A great deal of data analysis and visualization involves the same core set of steps.
\[\text{have a question} \rightarrow \text{get data} \rightarrow \text{process and clean data} \rightarrow \text{analyze data}\] Below we describe some simple data cleaning, and simple analysis with 4 tools: Excel, Google Sheets, R, and Stata.
Tool | Cost | Ease of Use | Analysis Capabilities | Suitability for Large Data | Keep Track of Complicated Workflows |
---|---|---|---|---|---|
Excel | Comes installed on many computers | Easy | Limited | Difficult when N > 100 | Difficult to Impossible |
Google Sheets | Free with a Google account | Easy | Limited | Difficult when N > 100 | Difficult to Impossible |
R | Free | Challenging | Extensive | Excellent with large datasets | Yes, with script |
Stata | Some cost | Learning Curve but Intuitive | Extensive | Excellent with large datasets | Yes, with command file |
We take a look at our simulated data, which has an id
number, age
, and happiness
(on a 5 point scale, with 5 being the happiest.)
id | group | age | happy | somethingelse |
---|---|---|---|---|
1 | Group B | 40.25 | -99 | -1.175 |
2 | Group A | 200 | 3 | -0.3674 |
3 | Group B | 51.55 | -99 | 0.9082 |
4 | Group A | 56.84 | 4 | -1.038 |
5 | Group A | 45.79 | 2 | -0.2735 |
6 | Group B | 55.77 | 3 | -1.102 |
Notice that…
somethingelse
).There are some basic data cleaning steps that are common to many projects.
Select the column, or columns, of data that you wish to remove, and right click, or control click, to delete them.
We are unable to add informative labels to variables in Excel or Google Sheets.
We are unable to add informative labels to values in Excel or Google Sheets.
We are likely going to have to use find and replace to manually replace problematic values. For example, we will want to replace the
200
in theage
column with a.
orNA
for missing. Similarly, we will want to replace the values of-99
in thehappy
column with a.
orNA
for missing.
For small data sets, this will not be difficult, but for larger data sets–especially data with many different kinds of values that need to be recoded–this process will become more difficult and cumbersome.
Much of R’s functionality is accomplished through writing code, that is saved in a script. Notice how–as our tasks get more and more complicated–the saved script provides documentation for the decisions that we have made with the data.
We can easily accomplish this with the
subset
function
id | group | age | happy |
---|---|---|---|
1 | Group B | 40.25 | -99 |
2 | Group A | 200 | 3 |
3 | Group B | 51.55 | -99 |
4 | Group A | 56.84 | 4 |
5 | Group A | 45.79 | 2 |
6 | Group B | 55.77 | 3 |
Adding variable labels is not well established in R. There are libraries that can add variable labels for some purposes, but not every library in R recognizes variable labels.
In contrast, value labels are straightforward in R, and can be accomplished by creating a factor variable. Below we demonstrate how to do this with the happy variable.
id | group | age | happy | happyFACTOR |
---|---|---|---|---|
1 | Group B | 40.25 | -99 | NA |
2 | Group A | 200 | 3 | Neutral |
3 | Group B | 51.55 | -99 | NA |
4 | Group A | 56.84 | 4 | Somewhat Happy |
5 | Group A | 45.79 | 2 | Somewhat Unhappy |
6 | Group B | 55.77 | 3 | Neutral |
We can easily accomplish this using Base R’s syntax for recoding:
data$variable[rule] <- newvalue
.
mynewdata$age[mynewdata$age >= 100] <- NA # recode > 100 to NA
mynewdata$happy[mynewdata$happy == -99] <- NA # recode -99 to NA
id | group | age | happy | happyFACTOR |
---|---|---|---|---|
1 | Group B | 40.25 | NA | NA |
2 | Group A | NA | 3 | Neutral |
3 | Group B | 51.55 | NA | NA |
4 | Group A | 56.84 | 4 | Somewhat Happy |
5 | Group A | 45.79 | 2 | Somewhat Unhappy |
6 | Group B | 55.77 | 3 | Neutral |
Much of Stata’s functionality is accomplished through writing code, that is saved in a script, which Stata calls a do file. Notice how–as our tasks get more and more complicated–the saved script provides documentation for the decisions that we have made with the data.
This is easily accomplished with Stata’s
drop
command. We could also choose tokeep
our variables of interest.
drop somethingelse // drop extraneous variable(s)
Variable labels can easily be added in Stata.
label variable age "Respondent's Age" // variable label for age
label variable happy "Happiness Score" // variable label for happy
describe // describe the data
data from mydata.dta
Contains
Observations: 100
Variables: 4
-------------------------------------------------------------------------------
Variable Storage Display Valuename type format label Variable label
-------------------------------------------------------------------------------long %9.0g id
id group long %9.0g group group
double %9.0g Respondent's Age
age double %9.0g Happiness Score
happy -------------------------------------------------------------------------------
Value labels are a natural part of Stata.
label define happy /// create value label for happy
"Very Unhappy" ///
5 "Somewhat Unhappy" ///
4 "Neutral" ///
3 "Somewhat Happy" ///
2 "Very Happy"
1
label values happy happy // assign value label happy to variable happy
Note that the code above is actually a mistake.
5
should be assigned to Very Happy, not to Very Unhappy.1
should be assigned to Very Unhappy. Having our thinking written down in the form of code or syntax allows us to even see that this mistake has been made–and to ultimately correct it–when we are double checking our workflow.
list in 1/10 // list first 10 lines of data
+--------------------------------------------+group age happy |
| id
|--------------------------------------------|
1. | 1 Group A 47.99102 -99 |
2. | 2 Group A 200 Somewhat Unhappy |
3. | 3 Group B 49.13296 -99 |
4. | 4 Group A 36.66538 Somewhat Unhappy |
5. | 5 Group B 57.67317 Neutral |
|--------------------------------------------|
6. | 6 Group A 54.10485 Very Happy |
7. | 7 Group A 39.23244 Very Unhappy |
8. | 8 Group B 44.18384 Somewhat Unhappy |
9. | 9 Group A 35.08472 Neutral |
10. | 10 Group A 52.86285 Somewhat Unhappy | +--------------------------------------------+
recode age (100 / max = .) // recode ages > 100
recode happy (-99 = .) // recode -99 to missing
list in 1/10 // list first 10 lines of data
+--------------------------------------------+group age happy |
| id
|--------------------------------------------|
1. | 1 Group A 47.99102 . |
2. | 2 Group A . Somewhat Unhappy |
3. | 3 Group B 49.13296 . |
4. | 4 Group A 36.66538 Somewhat Unhappy |
5. | 5 Group B 57.67317 Neutral |
|--------------------------------------------|
6. | 6 Group A 54.10485 Very Happy |
7. | 7 Group A 39.23244 Very Unhappy |
8. | 8 Group B 44.18384 Somewhat Unhappy |
9. | 9 Group A 35.08472 Neutral |
10. | 10 Group A 52.86285 Somewhat Unhappy |
+--------------------------------------------+
Our first step in analysis is to discover what kind of variables we have. We need to make a distinction between continuous variables that measure things like mental health or neighborhood safety, or age, and categorical variables that measure non-ordered categories like religious identity or gender identity.
In Excel and Google Sheets, our ability to do data analysis is very limited. In general, we are only able to easily calculate the average of continuous variables. There are various add-ins that can calculate other quantities, but their availability, usability, and ongoing stable development, tends to be inconsistent.
As a mostly command based language, R relies on the idea of
do_something(dataset$variable)
.
summary(mynewdata$age) # descriptive statistics for age
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
24.24 45.59 50.52 50.94 57.25 83.32 1
table(mynewdata$group) # frequency table of group
Group A Group B
53 47
As a mostly command based language, Stata relies on the idea of
do_something variable(s), options
.
summarize age // descriptive statistics for age
dev. Min Max
Variable | Obs Mean Std.
-------------+--------------------------------------------------------- age | 99 47.30858 9.615174 29.53682 71.51345
tabulate group // frequency table of group
group | Freq. Percent Cum.
------------+-----------------------------------
Group A | 46 46.00 46.00
Group B | 54 54.00 100.00
------------+----------------------------------- Total | 100 100.00