In Excel, Google Sheets, R and Stata
A great deal of data analysis and visualization involves the same core set of steps.
\[\text{have a question} \rightarrow \text{get data} \rightarrow \text{process and clean data} \rightarrow \text{analyze data}\] Below we describe some simple data cleaning, and simple analysis with 4 tools: Excel, Google Sheets, R, and Stata.
| Tool | Cost | Ease of Use | Analysis Capabilities | Suitability for Large Data | Keep Track of Complicated Workflows | 
|---|---|---|---|---|---|
| Excel | Comes installed on many computers | Easy | Limited | Difficult when N > 100 | Difficult to Impossible | 
| Google Sheets | Free with a Google account | Easy | Limited | Difficult when N > 100 | Difficult to Impossible | 
| R | Free | Challenging | Extensive | Excellent with large datasets | Yes, with script | 
| Stata | Some cost | Learning Curve but Intuitive | Extensive | Excellent with large datasets | Yes, with command file | 
We take a look at our simulated data, which has an id number, age, and happiness (on a 5 point scale, with 5 being the happiest.)
| id | group | age | happy | somethingelse | 
|---|---|---|---|---|
| 1 | Group B | 40.25 | -99 | -1.175 | 
| 2 | Group A | 200 | 3 | -0.3674 | 
| 3 | Group B | 51.55 | -99 | 0.9082 | 
| 4 | Group A | 56.84 | 4 | -1.038 | 
| 5 | Group A | 45.79 | 2 | -0.2735 | 
| 6 | Group B | 55.77 | 3 | -1.102 | 
Notice that…
somethingelse).There are some basic data cleaning steps that are common to many projects.
Select the column, or columns, of data that you wish to remove, and right click, or control click, to delete them.
We are unable to add informative labels to variables in Excel or Google Sheets.
We are unable to add informative labels to values in Excel or Google Sheets.
We are likely going to have to use find and replace to manually replace problematic values. For example, we will want to replace the
200in theagecolumn with a.orNAfor missing. Similarly, we will want to replace the values of-99in thehappycolumn with a.orNAfor missing.
For small data sets, this will not be difficult, but for larger data sets–especially data with many different kinds of values that need to be recoded–this process will become more difficult and cumbersome.
Much of R’s functionality is accomplished through writing code, that is saved in a script. Notice how–as our tasks get more and more complicated–the saved script provides documentation for the decisions that we have made with the data.
We can easily accomplish this with the
subsetfunction
| id | group | age | happy | 
|---|---|---|---|
| 1 | Group B | 40.25 | -99 | 
| 2 | Group A | 200 | 3 | 
| 3 | Group B | 51.55 | -99 | 
| 4 | Group A | 56.84 | 4 | 
| 5 | Group A | 45.79 | 2 | 
| 6 | Group B | 55.77 | 3 | 
Adding variable labels is not well established in R. There are libraries that can add variable labels for some purposes, but not every library in R recognizes variable labels.
In contrast, value labels are straightforward in R, and can be accomplished by creating a factor variable. Below we demonstrate how to do this with the happy variable.
| id | group | age | happy | happyFACTOR | 
|---|---|---|---|---|
| 1 | Group B | 40.25 | -99 | NA | 
| 2 | Group A | 200 | 3 | Neutral | 
| 3 | Group B | 51.55 | -99 | NA | 
| 4 | Group A | 56.84 | 4 | Somewhat Happy | 
| 5 | Group A | 45.79 | 2 | Somewhat Unhappy | 
| 6 | Group B | 55.77 | 3 | Neutral | 
We can easily accomplish this using Base R’s syntax for recoding:
data$variable[rule] <- newvalue.
mynewdata$age[mynewdata$age >= 100] <- NA # recode > 100 to NA
mynewdata$happy[mynewdata$happy == -99] <- NA # recode -99 to NA
| id | group | age | happy | happyFACTOR | 
|---|---|---|---|---|
| 1 | Group B | 40.25 | NA | NA | 
| 2 | Group A | NA | 3 | Neutral | 
| 3 | Group B | 51.55 | NA | NA | 
| 4 | Group A | 56.84 | 4 | Somewhat Happy | 
| 5 | Group A | 45.79 | 2 | Somewhat Unhappy | 
| 6 | Group B | 55.77 | 3 | Neutral | 
Much of Stata’s functionality is accomplished through writing code, that is saved in a script, which Stata calls a do file. Notice how–as our tasks get more and more complicated–the saved script provides documentation for the decisions that we have made with the data.
This is easily accomplished with Stata’s
dropcommand. We could also choose tokeepour variables of interest.
drop somethingelse // drop extraneous variable(s)Variable labels can easily be added in Stata.
label variable age "Respondent's Age" // variable label for age
label variable happy "Happiness Score" // variable label for happy
describe // describe the data
Contains data from mydata.dta
 Observations:           100                              
    Variables:             4                  
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
id              long    %9.0g                 id
group           long    %9.0g      group      group
age             double  %9.0g                 Respondent's Age
happy           double  %9.0g                 Happiness Score
-------------------------------------------------------------------------------Value labels are a natural part of Stata.
label define happy /// create value label for happy
5 "Very Unhappy" ///
4 "Somewhat Unhappy" ///
3 "Neutral" ///
2 "Somewhat Happy" ///
1 "Very Happy"
label values happy happy // assign value label happy to variable happy Note that the code above is actually a mistake.
5should be assigned to Very Happy, not to Very Unhappy.1should be assigned to Very Unhappy. Having our thinking written down in the form of code or syntax allows us to even see that this mistake has been made–and to ultimately correct it–when we are double checking our workflow.
list in 1/10 // list first 10 lines of data
     +--------------------------------------------+
     | id     group        age              happy |
     |--------------------------------------------|
  1. |  1   Group A   47.99102                -99 |
  2. |  2   Group A        200   Somewhat Unhappy |
  3. |  3   Group B   49.13296                -99 |
  4. |  4   Group A   36.66538   Somewhat Unhappy |
  5. |  5   Group B   57.67317            Neutral |
     |--------------------------------------------|
  6. |  6   Group A   54.10485         Very Happy |
  7. |  7   Group A   39.23244       Very Unhappy |
  8. |  8   Group B   44.18384   Somewhat Unhappy |
  9. |  9   Group A   35.08472            Neutral |
 10. | 10   Group A   52.86285   Somewhat Unhappy |
     +--------------------------------------------+
recode age (100 / max = .) // recode ages > 100
recode happy (-99 = .) // recode -99 to missing
list in 1/10 // list first 10 lines of data
     +--------------------------------------------+
     | id     group        age              happy |
     |--------------------------------------------|
  1. |  1   Group A   47.99102                  . |
  2. |  2   Group A          .   Somewhat Unhappy |
  3. |  3   Group B   49.13296                  . |
  4. |  4   Group A   36.66538   Somewhat Unhappy |
  5. |  5   Group B   57.67317            Neutral |
     |--------------------------------------------|
  6. |  6   Group A   54.10485         Very Happy |
  7. |  7   Group A   39.23244       Very Unhappy |
  8. |  8   Group B   44.18384   Somewhat Unhappy |
  9. |  9   Group A   35.08472            Neutral |
 10. | 10   Group A   52.86285   Somewhat Unhappy |
     +--------------------------------------------+
  Our first step in analysis is to discover what kind of variables we have. We need to make a distinction between continuous variables that measure things like mental health or neighborhood safety, or age, and categorical variables that measure non-ordered categories like religious identity or gender identity.
In Excel and Google Sheets, our ability to do data analysis is very limited. In general, we are only able to easily calculate the average of continuous variables. There are various add-ins that can calculate other quantities, but their availability, usability, and ongoing stable development, tends to be inconsistent.
As a mostly command based language, R relies on the idea of
do_something(dataset$variable).
summary(mynewdata$age) # descriptive statistics for age
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  24.24   45.59   50.52   50.94   57.25   83.32       1 
table(mynewdata$group) # frequency table of group
Group A Group B 
     53      47 
As a mostly command based language, Stata relies on the idea of
do_something variable(s), options.
summarize age // descriptive statistics for age
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         age |         99    47.30858    9.615174   29.53682   71.51345
tabulate group // frequency table of group
      group |      Freq.     Percent        Cum.
------------+-----------------------------------
    Group A |         46       46.00       46.00
    Group B |         54       54.00      100.00
------------+-----------------------------------
      Total |        100      100.00