%%{
  init: {
    'theme': 'base',
    'themeVariables': {
      'primaryColor': '#FFC20E',
      'primaryTextColor': '#000000',
      'primaryBorderColor': '#2D2926',
      'lineColor': '#2D2926',
      'secondaryColor': '#2D2926',
      'secondaryTextColor': '#000000',
      'tertiaryColor': '#F2F2F2',
      'tertiaryBorderColor': '#2D2926'
    }
  }
}%%
flowchart LR
  A(have a <br>question) --> B(get data)
  B --> B2(select <br>variables)
  B2 --> C(process and <br>clean data) 
  C --> D(visualize <br>data)
  D --> E(analyze <br>data)
  E --> F(make <br>conclusions)
  F --> G(share <br>ideas)
  
6 Quantitative Data Analysis
6.1 Introduction
A great deal of data analysis (and visualization) involves the same core set of steps.
6.2 Some Tools for Analysis
Below we describe some simple data cleaning with R. We begin, however, by comparing several tools for analysis: Excel, Google Sheets, R, and Stata.
| Tool | Cost | Ease of Use | Analysis Capabilities | Suitability for Large Data | Keep Track of Complicated Workflows | 
|---|---|---|---|---|---|
| Excel | Comes installed on many computers | Easy | Limited | Difficult when N > 100 | Difficult to Impossible | 
| Google Sheets | Free with a Google account | Easy | Limited | Difficult when N > 100 | Difficult to Impossible | 
| R | Free | Challenging | Extensive | Excellent with large datasets | Yes, with script | 
| Stata | Some cost | Learning Curve but Intuitive | Extensive | Excellent with large datasets | Yes, with command file | 
6.3 Working With R
6.3.1 Our Data
We take a look at our simulated data.
load("./simulate-data/MICSsimulated.RData") # data in R format
labelled::look_for(MICSsimulated) # look at variables and variable labels
 pos variable   label                   col_type missing values
 1   id         id                      int      0             
 2   country    country                 int      0             
 3   GII        Gender Inequality Index int      0             
 4   HDI        Human Development Index int      0             
 5   cd1        spank                   int      0             
 6   cd2        beat                    int      0             
 7   cd3        shout                   int      0             
 8   cd4        explain                 int      0             
 9   aggression aggression              int      0             
head(MICSsimulated) # look at top (head) of data
  id country GII HDI cd1 cd2 cd3 cd4 aggression
1  1       1  20  24   0   0   1   1          1
2  2       1  20  24   0   0   1   1          1
3  3       1  20  24   0   0   1   1          1
4  4       1  20  24   0   0   0   0          1
5  5       1  20  24   1   0   1   1          0
6  6       1  20  24   0   0   1   1          1
6.3.2 Cleaning Data
There are some basic data cleaning steps that are common to many projects.
- Only keep the variables of interest (Section 6.3.2.1).
- Add variable labels, if we can (Section 6.3.2.2).
- Add value labels, if we can (Section 6.3.2.3).
- Recode outliers, values that are errors, or values that should be coded as missing (Section 6.3.2.4).
Much of R’s functionality is accomplished by writing code that is saved in a script. Notice how, as our tasks get more and more complicated, the saved script documents the decisions that we have made with the data. A sample R script for the steps in this chapter can be found in Appendix A.
6.3.2.1 Only keep the variables of interest.
We can easily accomplish this with the subset function.
mynewdata <- subset(MICSsimulated,
                    select = c(id, country, aggression)) # subset of data
head(mynewdata) # look at top (head) of data
  id country aggression
1  1       1          1
2  2       1          1
3  3       1          1
4  4       1          1
5  5       1          0
6  6       1          1
6.3.2.2 Add variable labels (if we can).
Adding variable labels is still somewhat new in R. The labelled library allows us to add or change variable labels. However, not every library in R recognizes variable labels.
library(labelled) # variable labels
var_label(MICSsimulated$id) <- "id"
var_label(MICSsimulated$country) <- "country"
var_label(MICSsimulated$cd4) <- "explain"
6.3.2.3 Add value labels (if we can).
In contrast, value labels are straightforward in R and can be added by creating a factor variable. Below we demonstrate how to do this with the cd4 variable.
MICSsimulated$cd4 <- factor(MICSsimulated$cd4,
                             levels = c(0, 1),
                             labels = c("Did not explain",
                                        "Explained"))
head(MICSsimulated) # head (top) of data
  id country GII HDI cd1 cd2 cd3             cd4 aggression
1  1       1  20  24   0   0   1       Explained          1
2  2       1  20  24   0   0   1       Explained          1
3  3       1  20  24   0   0   1       Explained          1
4  4       1  20  24   0   0   0 Did not explain          1
5  5       1  20  24   1   0   1       Explained          0
6  6       1  20  24   0   0   1       Explained          1
6.3.2.4 Recode outliers, values that are errors, or values that should be coded as missing.
We can easily accomplish this using Base R’s syntax for recoding:
data$variable[rule] <- newvalue.
MICSsimulated$aggression[MICSsimulated$aggression > 1] <- NA # recode > 1 to NA
MICSsimulated$GII[MICSsimulated$GII > 100] <- NA # recode > 100 to NA
head(MICSsimulated) # head (top) of data
  id country GII HDI cd1 cd2 cd3             cd4 aggression
1  1       1  20  24   0   0   1       Explained          1
2  2       1  20  24   0   0   1       Explained          1
3  3       1  20  24   0   0   1       Explained          1
4  4       1  20  24   0   0   0 Did not explain          1
5  5       1  20  24   1   0   1       Explained          0
6  6       1  20  24   0   0   1       Explained          1
6.3.3 Simple Analysis
Our first step in analysis is to discover what kind of variables we have. We need to distinguish between continuous variables, which measure things like mental health, neighborhood safety, or age, and categorical variables, which measure non-ordered categories like religious identity or gender identity.
Sometimes deciding whether a variable is continuous or categorical involves some hard thinking, or referring to the documentation for the data. In this data, all of the forms of discipline, as well as aggression, are 1/0 variables, so they are likely best conceptualized as categorical variables. In contrast, GII and HDI are best conceptualized as continuous variables.
- For continuous variables, it is most appropriate to take the average or mean.
- For categorical variables, it is most appropriate to generate a frequency table.
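When it is unclear how to treat a variable, one informal check is to count how many distinct values it takes. The sketch below uses a small made-up data frame (toy, with hypothetical values, not the MICS data), and the cutoff of two distinct values is only an illustrative assumption, not a formal rule. The documentation for the data remains the better guide.

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  spank = c(0, 1, 0, 0, 1, 0),        # a 1/0 indicator
  GII   = c(20, 22, 24, 27, 31, 15)   # an index score
)

# Count the distinct values in each variable
n_distinct <- sapply(toy, function(x) length(unique(x)))
n_distinct

# Informal rule of thumb (an assumption, not a formal test):
# very few distinct values suggests a categorical variable
likely_type <- ifelse(n_distinct <= 2, "categorical", "continuous")
likely_type
```

A variable flagged as categorical here would then be summarized with a frequency table, and one flagged as continuous with a mean.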
As a mostly command-based language, R relies on the idea of do_something(dataset$variable).
summary(MICSsimulated$GII) # descriptive statistics for GII
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   15.0    22.0    24.0    24.2    27.0    31.0 
table(MICSsimulated$cd4) # frequency table of cd4
Did not explain       Explained 
            674            2326
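Once values have been recoded to NA during cleaning, it helps to know how these functions treat missing data. The following is a minimal sketch on a small made-up data frame (toy, with hypothetical values, not the MICS data). It uses only base R: mean() with na.rm, table() with useNA, and prop.table().

```r
# Toy data (hypothetical values), with NA where values were recoded as missing
toy <- data.frame(
  GII = c(20, 22, NA, 27, 31),
  cd4 = factor(c(1, 1, 0, NA, 1),
               levels = c(0, 1),
               labels = c("Did not explain", "Explained"))
)

mean(toy$GII)                     # returns NA, because one value is missing
mean(toy$GII, na.rm = TRUE)       # mean of the non-missing values

table(toy$cd4)                    # NA is dropped from the table by default
table(toy$cd4, useNA = "ifany")   # also show the count of NA, if there are any
prop.table(table(toy$cd4))        # proportions instead of counts
```

Reporting proportions alongside counts often makes a frequency table easier to read, and showing the NA count makes clear how many cases were set to missing.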