Background
dplyr is a powerful R library for managing and processing data, but learning to use it can be confusing. This guide presents some of the most common dplyr functions and commands in the form of a brief cheatsheet.
Simulated Data
| year | x | y | z |
|------|---|---|---|
| 2019 | NA | Group B | 90 |
| 2020 | 50 | Group C | 90 |
| 2020 | NA | Group A | 100 |
| 2021 | 70 | Group A | 110 |
| 2021 | 80 | Group B | 120 |
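The code that generated the simulated data is not shown here. As a minimal sketch, the table can be rebuilt by hand so that the later examples have something to run on. The column names `year`, `x`, `y`, and `z` are taken from the code in the rest of this guide; the column types are an assumption.

```r
library(dplyr)

# Hand-built copy of the simulated data shown in the table
# (column types are assumed: numeric year, x, z and character y)
mydata <- tibble(
  year = c(2019, 2020, 2020, 2021, 2021),
  x    = c(NA, 50, NA, 70, 80),
  y    = c("Group B", "Group C", "Group A", "Group A", "Group B"),
  z    = c(90, 90, 100, 110, 120)
)

mydata
```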
Piping
Pipes (%>%) connect the pieces of a command, e.g. data to data wrangling to a graph command. dplyr commands will often look something like the outline below.
mydata %>%
  data_wrangling %>%
  more_data_wrangling %>%
  graph_command
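To make the outline concrete, here is one possible chain built only from commands covered in this guide. It is a sketch rather than a recommended workflow: the particular steps and the name `mean_scale` are illustrative choices, and `mydata` is rebuilt by hand from the simulated data table.

```r
library(dplyr)

# Hand-built copy of the simulated data
mydata <- tibble(
  year = c(2019, 2020, 2020, 2021, 2021),
  x    = c(NA, 50, NA, 70, 80),
  y    = c("Group B", "Group C", "Group A", "Group A", "Group B"),
  z    = c(90, 90, 100, 110, 120)
)

result <- mydata %>%
  filter(!is.na(x)) %>%                  # data wrangling: keep rows where x is observed
  mutate(myscale = x + z) %>%            # more data wrangling: add a derived column
  group_by(year) %>%                     # then aggregate by year
  summarise(mean_scale = mean(myscale))  # one row per remaining year

result
```

Each step receives the output of the step before it as its first argument, which is why no intermediate data sets need to be named.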
Aggregate Data: group_by() & summarise()
When aggregating data, we need to be explicit about whether we are removing NA values.
mynewdata <- mydata %>%
  group_by(year) %>% # group by year
  summarise(mean_x = mean(x, na.rm = TRUE), # mean of x, removing NAs
            n = n()) # count the rows in each group
| year | mean_x | n |
|------|--------|---|
| 2019 | NA | 1 |
| 2020 | 50 | 2 |
| 2021 | 75 | 2 |
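The effect of `na.rm` is easiest to see by computing the mean both ways. The sketch below uses a hand-built copy of the simulated data; the summary column names are illustrative. Note that in base R the mean of a group whose values are all missing is `NaN` when `na.rm = TRUE`, which some table renderers may display as NA.

```r
library(dplyr)

# Hand-built copy of the simulated data
mydata <- tibble(
  year = c(2019, 2020, 2020, 2021, 2021),
  x    = c(NA, 50, NA, 70, 80),
  y    = c("Group B", "Group C", "Group A", "Group A", "Group B"),
  z    = c(90, 90, 100, 110, 120)
)

by_year <- mydata %>%
  group_by(year) %>%
  summarise(
    mean_keep_na = mean(x),                # NA whenever the group has any missing x
    mean_drop_na = mean(x, na.rm = TRUE),  # mean of observed x; NaN if none are observed
    n_total      = n(),                    # rows in the group, missing or not
    n_observed   = sum(!is.na(x))          # rows with a non-missing x
  )

by_year
```

Reporting `n_observed` alongside `n()` makes it clear how many values each mean is actually based on.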
Select A Subset of Variables: select()
mynewdata <- mydata %>%
  select(x, y) # select only x and y

| x | y |
|---|---|
| NA | Group B |
| 50 | Group C |
| NA | Group A |
| 70 | Group A |
| 80 | Group B |
Filter A Subset of Rows: filter()
mynewdata <- mydata %>%
  filter(year > 2020) # filter on year

| year | x | y | z |
|------|---|---|---|
| 2021 | 70 | Group A | 110 |
| 2021 | 80 | Group B | 120 |
Create New Variables: mutate()
mynewdata <- mydata %>%
  mutate(myscale = x + z) # create a new variable e.g. a scale

| year | x | y | z | myscale |
|------|---|---|---|---------|
| 2019 | NA | Group B | 90 | NA |
| 2020 | 50 | Group C | 90 | 140 |
| 2020 | NA | Group A | 100 | NA |
| 2021 | 70 | Group A | 110 | 180 |
| 2021 | 80 | Group B | 120 | 200 |
Recode Variables: mutate()
Continuous Into Categorical: mutate() & cut()
mynewdata <- mydata %>%
  mutate(zcategorical = cut(z, # cut at breaks
                            breaks = c(-Inf, 100, Inf),
                            labels = c("low", "high")))

| year | x | y | z | zcategorical |
|------|---|---|---|--------------|
| 2019 | NA | Group B | 90 | low |
| 2020 | 50 | Group C | 90 | low |
| 2020 | NA | Group A | 100 | low |
| 2021 | 70 | Group A | 110 | high |
| 2021 | 80 | Group B | 120 | high |
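In the table above, z = 100 lands in "low" because `cut()` closes its intervals on the right by default, so the first interval is (-Inf, 100]. The sketch below, using a hand-built copy of the simulated data and illustrative column names, shows how `right = FALSE` moves that boundary value into the upper category.

```r
library(dplyr)

# Hand-built copy of the simulated data
mydata <- tibble(
  year = c(2019, 2020, 2020, 2021, 2021),
  x    = c(NA, 50, NA, 70, 80),
  y    = c("Group B", "Group C", "Group A", "Group A", "Group B"),
  z    = c(90, 90, 100, 110, 120)
)

cut_demo <- mydata %>%
  mutate(
    # default (right = TRUE): intervals are (-Inf, 100] and (100, Inf]
    z_right_closed = cut(z, breaks = c(-Inf, 100, Inf),
                         labels = c("low", "high")),
    # right = FALSE: intervals are [-Inf, 100) and [100, Inf)
    z_left_closed  = cut(z, breaks = c(-Inf, 100, Inf),
                         labels = c("low", "high"), right = FALSE)
  )

cut_demo %>% select(z, z_right_closed, z_left_closed)
```

Which side to close is a substantive decision about where a boundary score belongs, so it is worth stating explicitly when a cutoff coincides with observed values.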
Categorical Into Categorical: mutate() & recode()
mynewdata <- mydata %>%
  mutate(yrecoded = dplyr::recode(y, # recode values
                                  "Group A" = "Red Group",
                                  "Group B" = "Blue Group",
                                  .default = "Other"))

| year | x | y | z | yrecoded |
|------|---|---|---|----------|
| 2019 | NA | Group B | 90 | Blue Group |
| 2020 | 50 | Group C | 90 | Other |
| 2020 | NA | Group A | 100 | Red Group |
| 2021 | 70 | Group A | 110 | Red Group |
| 2021 | 80 | Group B | 120 | Blue Group |
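The same recoding can also be written with `case_when()`, another dplyr function, which some readers find easier to extend when conditions involve more than one variable. This is an alternative sketch, not the approach used above, and it runs on a hand-built copy of the simulated data.

```r
library(dplyr)

# Hand-built copy of the simulated data
mydata <- tibble(
  year = c(2019, 2020, 2020, 2021, 2021),
  x    = c(NA, 50, NA, 70, 80),
  y    = c("Group B", "Group C", "Group A", "Group A", "Group B"),
  z    = c(90, 90, 100, 110, 120)
)

mynewdata <- mydata %>%
  mutate(yrecoded = case_when(
    y == "Group A" ~ "Red Group",   # conditions are checked in order
    y == "Group B" ~ "Blue Group",
    TRUE           ~ "Other"        # fallback for everything else
  ))

mynewdata
```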
Rename Variables: rename()
newdata <- mydata %>%
  rename(age = x, # rename
         mental_health = z)

| year | age | y | mental_health |
|------|-----|---|---------------|
| 2019 | NA | Group B | 90 |
| 2020 | 50 | Group C | 90 |
| 2020 | NA | Group A | 100 |
| 2021 | 70 | Group A | 110 |
| 2021 | 80 | Group B | 120 |
Drop Missing Values: filter()
newdata <- mydata %>%
  filter(!is.na(x)) # filter by x is not missing

| year | x | y | z |
|------|---|---|---|
| 2020 | 50 | Group C | 90 |
| 2021 | 70 | Group A | 110 |
| 2021 | 80 | Group B | 120 |
Random Sample
newdata <- mydata %>%
  sample_frac(.5) # fraction of data to sample

| year | x | y | z |
|------|---|---|---|
| 2020 | 50 | Group C | 90 |
| 2020 | NA | Group A | 100 |
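Because the sample is random, the rows returned will generally differ from run to run. A minimal sketch of making the draw repeatable with `set.seed()` follows; the seed value is arbitrary and `mydata` is a hand-built copy of the simulated data. The last step uses `slice_sample()`, the newer dplyr interface for sampling, and assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hand-built copy of the simulated data
mydata <- tibble(
  year = c(2019, 2020, 2020, 2021, 2021),
  x    = c(NA, 50, NA, 70, 80),
  y    = c("Group B", "Group C", "Group A", "Group A", "Group B"),
  z    = c(90, 90, 100, 110, 120)
)

set.seed(2024) # arbitrary seed, chosen only for illustration
sample_a <- mydata %>% sample_frac(.5)

set.seed(2024) # resetting the seed repeats the same draw
sample_b <- mydata %>% sample_frac(.5)

identical(sample_a, sample_b) # checks that both draws picked the same rows

# Newer interface (assumes dplyr >= 1.0.0)
set.seed(2024)
sample_c <- mydata %>% slice_sample(prop = 0.5)
```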
Connecting To Other Packages Like ggplot
Notice how, in the code below, I never actually create the new data set mynewdata. I simply pipe mydata into a dplyr command, and pipe the result directly to ggplot2.