# 4 A Quick Introduction to R

## 4.1 Why Use R?

R (R Core Team, 2023) has a reputation for being difficult to learn, and a lot of that reputation is deserved. However, it is possible to teach R in an accessible way, and **a little bit of R can take you a long way**.

R is open source, and therefore free, statistical software that is particularly good at obtaining, analyzing and visualizing data.

R commands are stored in a *script* or *code* file that usually ends in .R, e.g. `myscript.R`. The command file is distinct from your actual data, stored in an .RData file, e.g. `mydata.RData`.

A great deal of data analysis and visualization involves the same core set of steps.

Given the fact that we often want to apply the same core set of tasks to new questions and new data, there are ways to overcome the steep learning curve and learn a replicable set of commands that can be applied to problem after problem. **The same 5 to 10 lines of R code can often be tweaked over and over again for multiple projects.**

## 4.2 Get R

R is available at https://www.r-project.org/. R is a lot easier to run if you run it from RStudio, http://www.rstudio.com.

## 4.3 Get Data

Data may already be in R format, or may come from other types of data files like SPSS, Stata, or Excel. Especially in beginning R programming, getting the data into R can be the most complicated part of your program.

### 4.3.1 Data in R Format

`load("./simulate-data/MICSsimulated.RData") # data in R format`
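As a minimal sketch of how `.RData` files behave, the following round trip saves a small data frame and loads it back. The data frame and the temporary file path are hypothetical stand-ins, not part of the MICS example.

```r
# Hypothetical toy data frame standing in for a real dataset
mydata <- data.frame(id = 1:3, x = c(2.5, 3.1, 4.8))

# Save to a temporary .RData file (a real project would use its own path)
path <- tempfile(fileext = ".RData")
save(mydata, file = path)

# Remove the object from the workspace, then restore it from the file
rm(mydata)
loaded <- load(path) # load() returns the names of the objects it restored

loaded       # which objects came back
head(mydata) # the data frame is available again under its original name
```

Note that `load()` restores objects under the names they had when they were saved, so the object name may differ from the file name.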

### 4.3.2 Data in Other Formats

If data are in other formats, slightly different code may be required.

```
library(haven) # library for importing data

# use whichever line matches your file type
mydata <- read_sav("the/path/to/mySPSSfile.sav") # SPSS
mydata <- read_dta("the/path/to/myStatafile.dta") # Stata

library(readxl) # library for importing Excel files
mydata <- read_excel("the/path/to/mySpreadsheet.xls")

save(mydata, file = "mydata.RData") # save in R format
```

## 4.4 Process and Clean Data

### 4.4.1 The `$` Sign

The `$` sign is a kind of “connector”. `mydata$x` means: “The variable `x` in the dataset called `mydata`”.
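A short sketch of the `$` connector on a toy data frame may help. The dataset and the variable names `x`, `y`, and `z` here are hypothetical.

```r
# Hypothetical toy dataset with two variables
mydata <- data.frame(x = c(10, 20, 30),
                     z = c("a", "b", "a"))

mydata$x       # the variable x in the dataset mydata
mean(mydata$x) # a variable can be passed straight to a function

# Assigning to a name that does not exist yet creates a new variable
mydata$y <- mydata$x * 2
names(mydata)
```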

### 4.4.2 Recoding Data

Data sometimes need to be recoded. For example, outliers may need to be changed to missing, or a value that is supposed to indicate missing data (e.g. `-9`) may need to be changed to missing.

Recoding uses the following construction:

`data$variable[condition] <- new value`

For example, change an outlier value: when `cd1` is `2`, change it to missing (`NA`).

`MICSsimulated$cd1[MICSsimulated$cd1 == 2] <- NA # outlier (2) to NA`

Change variable `cd1` to missing (`NA`) when it is `-9`.

`MICSsimulated$cd1[MICSsimulated$cd1 == -9] <- NA # missing (-9) to NA`
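The recoding construction can be tried out on a small example before it is applied to real data. The sketch below uses a hypothetical data frame `toy` whose `cd1` values are invented for illustration.

```r
# Hypothetical data: 2 is treated as an outlier, -9 is a missing-data code
toy <- data.frame(cd1 = c(0, 1, 2, -9, 1, 0))

toy$cd1[toy$cd1 == 2] <- NA  # outlier (2) to NA
toy$cd1[toy$cd1 == -9] <- NA # missing (-9) to NA

toy$cd1                         # inspect the recoded variable
table(toy$cd1, useNA = "ifany") # counts, including the new NA values
```

Checking the result with `table()` after each recode is a cheap way to confirm that only the intended values changed.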

### 4.4.3 Numeric and Factor Variables

R makes a strong distinction between *continuous* *numeric* variables that measure scales like mental health or neighborhood safety, and *categorical* *factor variables* that measure non-ordered categories like religious identity or gender identity.

Many statistical and graphical procedures are designed to recognize and work with different variable types. You often *don’t* need to use all of the options of `factor()`; e.g. `mydata$w <- factor(mydata$z)` will often work just fine. **Changing variables from factor to numeric, and vice versa, can sometimes be the simple solution that solves a lot of problems when you are trying to graph your variables.**

```
MICSsimulated$aggression <-
  factor(MICSsimulated$aggression, # original numeric variable
         levels = c(0, 1),
         labels = c("no aggression", "aggression"),
         ordered = TRUE) # whether order matters

# MICSsimulated$z <- as.numeric(MICSsimulated$w) # factor to numeric
```
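One caveat about converting a factor back to numeric is worth illustrating: `as.numeric()` applied to a factor returns the internal level codes (1, 2, ...), not the original values. The sketch below uses hypothetical vectors `z`, `w`, and `f`.

```r
# Hypothetical 0/1 variable converted to a labelled factor
z <- c(0, 1, 1, 0, 1)
w <- factor(z,
            levels = c(0, 1),
            labels = c("no aggression", "aggression"))

as.numeric(w)     # level codes 1 and 2, not the original 0 and 1
as.numeric(w) - 1 # recovers the 0/1 coding in this particular case

# When the factor labels are themselves numbers, go through as.character()
f <- factor(c("10", "20", "10"))
as.numeric(f)               # level codes: 1 2 1
as.numeric(as.character(f)) # original values: 10 20 10
```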

## 4.5 Visualize Data

### 4.5.1 Histogram

```
hist(MICSsimulated$GII, # what I'm graphing
     main = "Gender Inequality Index", # title
     xlab = "GII", # label for x axis
     col = "blue") # color
```

You often *don’t* need to use all of the options, e.g. `hist(mydata$x)` will work just fine.
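For readers without the MICS data at hand, a histogram can be sketched with simulated values. The vector `x` below is hypothetical, drawn from a normal distribution purely for illustration.

```r
set.seed(123)                      # make the simulated values reproducible
x <- rnorm(200, mean = 24, sd = 3) # hypothetical continuous variable

hist(x) # all defaults

hist(x,
     main = "Simulated Index", # title
     xlab = "Index value",     # label for x axis
     col = "blue",             # color
     breaks = 20)              # suggested number of bins

# plot = FALSE returns the bin counts without drawing anything
h <- hist(x, plot = FALSE)
h$counts
```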

### 4.5.2 Barplot

```
barplot(table(MICSsimulated$aggression), # what I'm graphing
        main = "Child Displays Aggression", # title
        xlab = "Aggression", # label for x axis
        col = "gold") # color
```

You often *don’t* need to use all of the options, e.g. `barplot(table(mydata$z))` will work just fine.
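The call to `table()` inside `barplot()` matters: `barplot()` expects bar heights, so the categories are counted first. A minimal sketch with a hypothetical factor `z`:

```r
# Hypothetical categorical variable
z <- factor(c("no aggression", "aggression", "aggression",
              "no aggression", "aggression"),
            levels = c("no aggression", "aggression"))

counts <- table(z) # one count per category
counts

barplot(counts, # bar heights come from the counts
        main = "Toy Example", # title
        xlab = "Aggression",  # label for x axis
        col = "gold")         # color
```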

## 4.6 Analyze Data: Descriptive Statistics

```
summary(mydata$x) # for continuous or factor variables
table(mydata$z) # especially suitable for factor variables
```

`summary(MICSsimulated$GII)`

```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   15.0    22.0    24.0    24.2    27.0    31.0
```

`table(MICSsimulated$aggression)`

```
no aggression    aggression
         1316          1684
```
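The same two commands can be tried on toy data to see how they treat missing values. The vectors `x` and `z` below are hypothetical.

```r
# Hypothetical continuous variable with one missing value
x <- c(15, 22, 24, 27, 31, NA)

# Hypothetical factor variable with one missing value
z <- factor(c("no aggression", "aggression", "aggression",
              NA, "no aggression", "aggression"))

s <- summary(x) # quantiles and mean, plus a count of NA values
s

summary(z)                # for a factor, summary() gives counts per level
table(z)                  # table() drops NA values by default
table(z, useNA = "ifany") # ask for NA values to be counted as well
```

By default `table()` omits missing values, so the `useNA` argument is a useful check after recoding.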