In this post, we will discuss the family of apply() functions. These functions allow you to apply a function repeatedly to every element of a vector, list, matrix, or data.frame. The apply() family is an interesting alternative to the for loop because it wraps the loop into a single function call.
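To make the comparison with the for loop concrete, here is a minimal sketch on a toy list (the objects values, totals_loop and totals_apply are made up for illustration):

```r
## Toy data (hypothetical) ----
values <- list(a = 1:3, b = 4:6, c = 7:9)

## With a for loop: create the output first, then fill it ----
totals_loop <- vector("list", length(values))
names(totals_loop) <- names(values)
for (i in seq_along(values)) {
  totals_loop[[i]] <- sum(values[[i]])
}

## With lapply(): the loop is hidden inside the function ----
totals_apply <- lapply(values, sum)

## Compare both results ----
all.equal(totals_loop, totals_apply)
```

Both versions compute the sum of each element; the lapply() version simply needs no index and no manually created output object.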
The functions in the apply() family differ in their input and output types:
Function | Description |
---|---|
apply() | Applies a function to the margins of an array, matrix or data.frame (2D objects) |
lapply() | Applies a function over a list or vector and returns a list |
sapply() | Wrapper of lapply() that returns a vector or matrix when possible (the output type can vary) |
vapply() | Similar to sapply() but safer, because the type of the output is declared in advance |
tapply() | Applies a function to subsets of a vector defined by one or more factors and returns an array |
NB. Here we won’t talk about sapply() and vapply() as they are similar to lapply().
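For reference only, the relationship between these three functions can be sketched on a toy list (the object x is made up for illustration):

```r
## Toy list (hypothetical) ----
x <- list(a = 1:3, b = 4:6)

## lapply() always returns a list ----
res_l <- lapply(x, mean)

## sapply() simplifies the list to a vector when it can ----
res_s <- sapply(x, mean)

## vapply() does the same, but the expected type is declared ----
res_v <- vapply(x, mean, FUN.VALUE = numeric(1))

res_l
res_s
res_v
```

With vapply(), an error is raised if one iteration does not return a single numeric value, which is why it is considered safer than sapply().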
Dataset
To illustrate the use of the apply() functions, we will use the palmerpenguins package. It contains the penguins dataset with size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica.
These data were collected from 2007 to 2009 by Dr. Kristen Gorman and are released under the CC0 license.
Let’s install the released version of the palmerpenguins package from CRAN:
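A minimal sketch of this step, assuming a CRAN mirror is configured (the requireNamespace() check is an addition that avoids reinstalling the package when it is already there):

```r
## Install 'palmerpenguins' from CRAN (only if missing) ----
if (!requireNamespace("palmerpenguins", quietly = TRUE)) {
  install.packages("palmerpenguins")
}
```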
Now, let’s import the dataset:
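A minimal sketch of the import, assuming the package is installed: attaching palmerpenguins makes the penguins tibble available, and printing it gives an overview of the data.

```r
## Attach the package (makes the 'penguins' dataset available) ----
library("palmerpenguins")

## Print the dataset ----
penguins
```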
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
For this post, we will use a subset of this dataset:
## Columns to keep ----
cols <- c("species", "island", "bill_length_mm", "bill_depth_mm",
          "body_mass_g")
## Subset data ----
penguins <- penguins[ , cols]
penguins
# A tibble: 344 × 5
species island bill_length_mm bill_depth_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int>
1 Adelie Torgersen 39.1 18.7 3750
2 Adelie Torgersen 39.5 17.4 3800
3 Adelie Torgersen 40.3 18 3250
4 Adelie Torgersen NA NA NA
5 Adelie Torgersen 36.7 19.3 3450
6 Adelie Torgersen 39.3 20.6 3650
7 Adelie Torgersen 38.9 17.8 3625
8 Adelie Torgersen 39.2 19.6 4675
9 Adelie Torgersen 34.1 18.1 3475
10 Adelie Torgersen 42 20.2 4250
# ℹ 334 more rows
The apply() function
The apply() function lets you apply a function across the rows or columns of a data.frame (or any other two-dimensional object).
- the first argument X specifies the data
- the second argument MARGIN specifies the direction (1 for rows, 2 for columns)
- the third argument FUN is the function to apply
Let’s compute the arithmetic mean of the columns bill_length_mm, bill_depth_mm and body_mass_g by applying the mean() function across columns 3 to 5 of the penguins dataset.
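A call along these lines would do it. This is a sketch that rebuilds the subset used in this post so that it runs on its own, assuming palmerpenguins is installed:

```r
## Rebuild the subset used in this post ----
cols <- c("species", "island", "bill_length_mm", "bill_depth_mm",
          "body_mass_g")
penguins <- palmerpenguins::penguins[ , cols]

## Mean of columns 3, 4 and 5 (MARGIN = 2 means column-wise) ----
apply(penguins[ , 3:5], 2, mean)
```

The result is NA for each column because mean() returns NA as soon as a column contains a missing value.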
bill_length_mm bill_depth_mm body_mass_g
NA NA NA
We can pass arguments to the function mean() by using the argument ... of the function apply(). Let’s remove missing values before computing the mean.
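As a sketch (again rebuilding the subset so the block is self-contained, and assuming palmerpenguins is installed), any argument placed after FUN is forwarded to mean():

```r
## Rebuild the subset used in this post ----
cols <- c("species", "island", "bill_length_mm", "bill_depth_mm",
          "body_mass_g")
penguins <- palmerpenguins::penguins[ , cols]

## 'na.rm = TRUE' is passed on to mean() through '...' ----
apply(penguins[ , 3:5], 2, mean, na.rm = TRUE)
```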
bill_length_mm bill_depth_mm body_mass_g
43.92193 17.15117 4201.75439
Note that the apply() functions are pipe-friendly.
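A possible pipe version, assuming R >= 4.1 for the native pipe |> and an installed palmerpenguins package; the left-hand side becomes the first argument X of apply():

```r
## Rebuild the subset used in this post ----
cols <- c("species", "island", "bill_length_mm", "bill_depth_mm",
          "body_mass_g")
penguins <- palmerpenguins::penguins[ , cols]

## Same computation written with the native pipe ----
penguins[ , 3:5] |>
  apply(2, mean, na.rm = TRUE)
```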
bill_length_mm bill_depth_mm body_mass_g
43.92193 17.15117 4201.75439
We can also use a custom function.
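For instance, an anonymous function can be written directly inside the call. The sketch below is one way to do it, assuming palmerpenguins is installed:

```r
## Rebuild the subset used in this post ----
cols <- c("species", "island", "bill_length_mm", "bill_depth_mm",
          "body_mass_g")
penguins <- palmerpenguins::penguins[ , cols]

## Anonymous function: 'x' is one column at a time ----
apply(penguins[ , 3:5], 2, function(x) mean(x, na.rm = TRUE))
```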
bill_length_mm bill_depth_mm body_mass_g
43.92193 17.15117 4201.75439
Finally, we can define a custom function outside the call to apply().
## Custom function ----
my_mean <- function(x, na_rm = FALSE) {
  mean(x, na.rm = na_rm)
}
apply(penguins[ , 3:5], 2, my_mean, na_rm = TRUE)
bill_length_mm bill_depth_mm body_mass_g
43.92193 17.15117 4201.75439
The output is a vector when the function returns a single value per column. When it returns several values, the output is a matrix (or an array).
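As a sketch of the matrix case, assuming the function applied is range() (any function returning two values per column would behave the same way), each column yields its minimum and its maximum:

```r
## Rebuild the subset used in this post ----
cols <- c("species", "island", "bill_length_mm", "bill_depth_mm",
          "body_mass_g")
penguins <- palmerpenguins::penguins[ , cols]

## range() returns two values, so the output has two rows ----
apply(penguins[ , 3:5], 2, range, na.rm = TRUE)
```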
bill_length_mm bill_depth_mm body_mass_g
[1,] 32.1 13.1 2700
[2,] 59.6 21.5 6300
The lapply() function
The lapply() function applies a function to each element of a list or vector.
- the first argument X specifies the list or the vector
- the second argument FUN is the function to apply
Let’s try to compute the arithmetic mean of the columns bill_length_mm, bill_depth_mm and body_mass_g.
## Column names ----
columns <- c("bill_length_mm", "bill_depth_mm", "body_mass_g")
## Mean of columns 3, 4 and 5 ----
lapply(columns, function(x) {
  penguins[ , x, drop = TRUE] |>
    mean(na.rm = TRUE)
})
[[1]]
[1] 43.92193
[[2]]
[1] 17.15117
[[3]]
[1] 4201.754
The output is a list of the same length as X, and we can simplify it by using unlist(). We can do this because the output for each iteration is a single value.
## Column names ----
columns <- c("bill_length_mm", "bill_depth_mm", "body_mass_g")
## Mean of columns 3, 4 and 5 ----
values <- lapply(columns, function(x) {
  penguins[ , x, drop = TRUE] |>
    mean(na.rm = TRUE)
})
## Simplify output ----
unlist(values)
[1] 43.92193 17.15117 4201.75439
We can also name the values.
## Column names ----
columns <- c("bill_length_mm", "bill_depth_mm", "body_mass_g")
## Mean of columns 3, 4 and 5 ----
values <- lapply(columns, function(x) {
  penguins[ , x, drop = TRUE] |>
    mean(na.rm = TRUE)
})
## Simplify output ----
values <- unlist(values)
## Name elements ----
names(values) <- columns
values
bill_length_mm bill_depth_mm body_mass_g
43.92193 17.15117 4201.75439
The lapply() function also lets you perform more complex tasks, such as returning several summary statistics for each column.
## Column names ----
columns <- c("bill_length_mm", "bill_depth_mm", "body_mass_g")
## Mean, min and max of columns 3, 4 and 5 ----
values <- lapply(columns, function(x) {
  column <- penguins[ , x, drop = TRUE]
  data.frame("trait" = x,
             "mean"  = mean(column, na.rm = TRUE),
             "min"   = min(column, na.rm = TRUE),
             "max"   = max(column, na.rm = TRUE))
})
values
[[1]]
trait mean min max
1 bill_length_mm 43.92193 32.1 59.6
[[2]]
trait mean min max
1 bill_depth_mm 17.15117 13.1 21.5
[[3]]
trait mean min max
1 body_mass_g 4201.754 2700 6300
Let’s simplify the output into a single data.frame by using do.call() to call rbind.data.frame() with every data.frame of the list as an argument.
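A self-contained sketch of this step, assuming palmerpenguins is installed (the list values is rebuilt here so that the block runs on its own):

```r
## Column names ----
columns <- c("bill_length_mm", "bill_depth_mm", "body_mass_g")

## One data.frame of summary statistics per column ----
values <- lapply(columns, function(x) {
  column <- palmerpenguins::penguins[[x]]
  data.frame("trait" = x,
             "mean"  = mean(column, na.rm = TRUE),
             "min"   = min(column, na.rm = TRUE),
             "max"   = max(column, na.rm = TRUE))
})

## Same as rbind.data.frame(values[[1]], values[[2]], values[[3]]) ----
do.call(rbind.data.frame, values)
```

The advantage of do.call() is that it works whatever the length of the list, without writing each element by hand.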
trait mean min max
1 bill_length_mm 43.92193 32.1 59.6
2 bill_depth_mm 17.15117 13.1 21.5
3 body_mass_g 4201.75439 2700.0 6300.0
NB. Here the object penguins is retrieved from the global environment. It is safer to pass it explicitly as an argument, like this:
## Column names ----
columns <- c("bill_length_mm", "bill_depth_mm", "body_mass_g")
## Mean, min and max of columns 3, 4 and 5 ----
values <- lapply(columns, function(x, data) {
  column <- data[ , x, drop = TRUE]
  data.frame("trait" = x,
             "mean"  = mean(column, na.rm = TRUE),
             "min"   = min(column, na.rm = TRUE),
             "max"   = max(column, na.rm = TRUE))
}, data = penguins)
do.call(rbind.data.frame, values)
trait mean min max
1 bill_length_mm 43.92193 32.1 59.6
2 bill_depth_mm 17.15117 13.1 21.5
3 body_mass_g 4201.75439 2700.0 6300.0
The tapply() function
The tapply() function allows you to apply a function across specified groups in your data. For dplyr users, it’s similar to combining the group_by() and summarize() functions.
- the first argument X specifies the values
- the second argument INDEX specifies the groups
- the third argument FUN is the function to apply
Let’s compute the mean of bill_length_mm for each species.
## Average bill length for each species ----
tapply(penguins$"bill_length_mm", penguins$"species", function(x) {
  mean(x, na.rm = TRUE)
})
Adelie Chinstrap Gentoo
38.79139 48.83382 47.50488
We can group values according to two variables.
## Average bill length for each species and island ----
tapply(penguins$"bill_length_mm",
       list(penguins$"species", penguins$"island"),
       function(x) mean(x, na.rm = TRUE))
Biscoe Dream Torgersen
Adelie 38.97500 38.50179 38.95098
Chinstrap NA 48.83382 NA
Gentoo 47.50488 NA NA
Here the output is a matrix. We can convert it to a long data.frame with tidyr::pivot_longer().
## Load 'tidyr' package ----
library("tidyr")
## Average bill length for each species and island ----
values <- tapply(penguins$"bill_length_mm",
                 list(penguins$"species", penguins$"island"),
                 function(x) mean(x, na.rm = TRUE))
## Convert to data.frame ----
values <- data.frame(values)
values$"species" <- rownames(values)
## Pivot data ----
values |>
  pivot_longer(cols = !species,
               values_to = "bill_length_mm",
               names_to = "island")
# A tibble: 9 × 3
species island bill_length_mm
<chr> <chr> <dbl>
1 Adelie Biscoe 39.0
2 Adelie Dream 38.5
3 Adelie Torgersen 39.0
4 Chinstrap Biscoe NA
5 Chinstrap Dream 48.8
6 Chinstrap Torgersen NA
7 Gentoo Biscoe 47.5
8 Gentoo Dream NA
9 Gentoo Torgersen NA
This is close to the dplyr approach, except that dplyr only returns the species and island combinations that are present in the data.
## Load 'dplyr' package ----
library("dplyr")
## Summarise data ----
penguins %>%
  group_by(species, island) %>%
  summarize(bill_length_mm = mean(bill_length_mm,
                                  na.rm = TRUE)) %>%
  ungroup()
# A tibble: 5 × 3
species island bill_length_mm
<fct> <fct> <dbl>
1 Adelie Biscoe 39.0
2 Adelie Dream 38.5
3 Adelie Torgersen 39.0
4 Chinstrap Dream 48.8
5 Gentoo Biscoe 47.5