The apply() function family – Tips and tricks

In this post, we will discuss the apply() family of functions. These functions allow you to repeatedly apply a function to each element of a vector, list, matrix, or data.frame. The apply() family is an interesting alternative to the for loop because it wraps the loop into a single function call.
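
To make the comparison with the for loop concrete, here is a minimal sketch on a toy list (the object x and its values are made up for illustration and are not part of the penguins data). Both versions should give the same result; the lapply() version simply hides the loop bookkeeping.

```r
## Toy data (hypothetical values) ----
x <- list(a = 1:3, b = 4:6, c = 7:9)

## Sum of each element with a for loop ----
res_loop <- vector("list", length(x))
names(res_loop) <- names(x)

for (i in seq_along(x)) {
  res_loop[[i]] <- sum(x[[i]])
}

## Same computation with lapply() ----
res_apply <- lapply(x, sum)

## Both approaches return the same named list ----
identical(res_loop, res_apply)
```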

The functions in the apply() family differ in their input and output types:

Function	Description
`apply()`	Applies a function to the margins (rows or columns) of a `matrix`, `array` or `data.frame`
`lapply()`	Applies a function to each element of a `list` or `vector` and returns a `list`
`sapply()`	Wrapper around `lapply()` that tries to simplify the result to a `vector` or `matrix` (the output type depends on the results, so it is less predictable)
`vapply()`	Similar to `sapply()` but safer: the expected type and length of each result are declared in advance
`tapply()`	Applies a function to subsets of a vector defined by one or more factors and returns an `array`

NB. We won’t cover sapply() and vapply() in detail here, as they are similar to lapply().
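
For readers curious about the difference mentioned in the table, the following brief sketch contrasts the two functions on a toy list (the values are hypothetical). It is only meant to show why vapply() is described as safer: with an empty input, sapply() returns an empty list, whereas vapply() still returns the declared type.

```r
## Toy list (hypothetical values) ----
x <- list(a = 1:3, b = 4:6)

## sapply() simplifies the result when it can ----
res_sapply <- sapply(x, mean)

## vapply() requires the expected type and length of each result ----
res_vapply <- vapply(x, mean, FUN.VALUE = numeric(1))

## With an empty input the two functions differ ----
empty_sapply <- sapply(list(), mean)                          # an empty list
empty_vapply <- vapply(list(), mean, FUN.VALUE = numeric(1))  # an empty numeric vector
```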

Dataset

To illustrate the use of the apply() functions, we will use the palmerpenguins package. It contains the penguins dataset with size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica.

These data were collected between 2007 and 2009 by Dr. Kristen Gorman and are released under the CC0 license.

Let’s install the released version of the palmerpenguins package from CRAN:

## Install 'palmerpenguins' package ----
install.packages("palmerpenguins")

Now, let’s import the dataset:

## Import 'penguins' dataset ----
library("palmerpenguins")

penguins

# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

For this post, we will use a subset of this dataset:

## Columns to keep ----
cols <- c("species", "island", "bill_length_mm", "bill_depth_mm", 
          "body_mass_g")

## Subset data ----
penguins <- penguins[ , cols]
penguins

# A tibble: 344 × 5
   species island    bill_length_mm bill_depth_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>       <int>
 1 Adelie  Torgersen           39.1          18.7        3750
 2 Adelie  Torgersen           39.5          17.4        3800
 3 Adelie  Torgersen           40.3          18          3250
 4 Adelie  Torgersen           NA            NA            NA
 5 Adelie  Torgersen           36.7          19.3        3450
 6 Adelie  Torgersen           39.3          20.6        3650
 7 Adelie  Torgersen           38.9          17.8        3625
 8 Adelie  Torgersen           39.2          19.6        4675
 9 Adelie  Torgersen           34.1          18.1        3475
10 Adelie  Torgersen           42            20.2        4250
# ℹ 334 more rows

The `apply()` function

The apply() function lets you apply a function across the rows or columns of a matrix or data.frame (or any two-dimensional object).

the first argument X specifies the data
the second argument MARGIN specifies the direction (1 for rows, 2 for columns)
the third argument FUN is the function to apply
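
As a quick illustration of the MARGIN argument, here is a small sketch on a toy matrix (the object m is made up for this example), applying sum() first across rows and then across columns.

```r
## Toy matrix with 2 rows and 3 columns (filled column by column) ----
m <- matrix(1:6, nrow = 2)

## MARGIN = 1: one result per row ----
row_sums <- apply(m, 1, sum)
row_sums
# [1]  9 12

## MARGIN = 2: one result per column ----
col_sums <- apply(m, 2, sum)
col_sums
# [1]  3  7 11
```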

Let’s compute the arithmetic mean of the columns bill_length_mm, bill_depth_mm and body_mass_g by applying the mean() function across columns 3 to 5 of the penguins dataset.

## Mean of columns 3, 4 and 5 ----
apply(penguins[ , 3:5], 2, mean)

bill_length_mm  bill_depth_mm    body_mass_g 
            NA             NA             NA

We can pass arguments to the function mean() by using the argument ... of the function apply(). Let’s remove missing values before computing the mean.

## Use additional arguments ----
apply(penguins[ , 3:5], 2, mean, na.rm = TRUE)

bill_length_mm  bill_depth_mm    body_mass_g 
      43.92193       17.15117     4201.75439

Note that the apply() functions are pipe-friendly. The native pipe |> used here requires R 4.1 or later.

## Pipe version ----
penguins[ , 3:5] |> 
  apply(2, mean, na.rm = TRUE)

bill_length_mm  bill_depth_mm    body_mass_g 
      43.92193       17.15117     4201.75439

We can also use a custom function.

## Custom function ----
apply(penguins[ , 3:5], 2, function(x) mean(x, na.rm = TRUE))

bill_length_mm  bill_depth_mm    body_mass_g 
      43.92193       17.15117     4201.75439
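
As a side note, R 4.1 and later also accept the shorthand \(x) in place of function(x). The sketch below uses a toy matrix with made-up values rather than the penguins data, so that it can be run on its own.

```r
## Toy matrix with one missing value (hypothetical values) ----
m <- cbind(a = c(1, 2, NA), b = c(4, 5, 6))

## Anonymous function written with the \(x) shorthand (R >= 4.1) ----
col_means <- apply(m, 2, \(x) mean(x, na.rm = TRUE))
col_means
#   a   b 
# 1.5 5.0
```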

Finally, we can define a custom function outside the apply() function.

## Custom function ----
my_mean <- function(x, na_rm = FALSE) {
  mean(x, na.rm = na_rm)
}

apply(penguins[ , 3:5], 2, my_mean, na_rm = TRUE)

bill_length_mm  bill_depth_mm    body_mass_g 
      43.92193       17.15117     4201.75439

The output is a vector when the function returns a single value per column. When the function returns several values (as range() does), the output is a matrix (or an array).

## Different output class ----
apply(penguins[ , 3:5], 2, range, na.rm = TRUE)

     bill_length_mm bill_depth_mm body_mass_g
[1,]           32.1          13.1        2700
[2,]           59.6          21.5        6300
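
One caveat that explains why only the numeric columns were selected in the examples above: apply() first converts a data.frame to a matrix, and a matrix can hold a single type only. The sketch below, on a toy data.frame with made-up values, shows that a numeric column is seen as character as soon as a character column is included.

```r
## Toy data.frame mixing character and numeric columns (hypothetical values) ----
df <- data.frame(group = c("a", "b"), value = c(1.5, 2.5))

## All columns are coerced to character before FUN is applied ----
mixed_classes <- apply(df, 2, class)
mixed_classes
#       group       value 
# "character" "character"

## Keeping numeric columns only preserves the numeric type ----
numeric_classes <- apply(df[ , "value", drop = FALSE], 2, class)
numeric_classes
#     value 
# "numeric"
```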

The `lapply()` function

The lapply() function applies a function to each element of a list or vector.

the first argument X specifies the list or the vector
the second argument FUN is the function to apply

Let’s try to compute the arithmetic mean of the columns bill_length_mm, bill_depth_mm and body_mass_g.

## Column names ----
columns <- c("bill_length_mm", "bill_depth_mm", "body_mass_g")

## Mean of columns 3, 4 and 5 ----
lapply(columns, function(x) {
  penguins[ , x, drop = TRUE] |> 
    mean(na.rm = TRUE)
})

[[1]]
[1] 43.92193

[[2]]
[1] 17.15117

[[3]]
[1] 4201.754

The output is a list of the same length as X, and we can simplify it by using unlist(). We can do this because the output of each iteration is a single value.

## Column names ----
columns <- c("bill_length_mm", "bill_depth_mm", "body_mass_g")

## Mean of columns 3, 4 and 5 ----
values <- lapply(columns, function(x) {
  penguins[ , x, drop = TRUE] |> 
    mean(na.rm = TRUE)
})

## Simplify output ----
unlist(values)

[1]   43.92193   17.15117 4201.75439

And we can name values.

## Column names ----
columns <- c("bill_length_mm", "bill_depth_mm", "body_mass_g")

## Mean of columns 3, 4 and 5 ----
values <- lapply(columns, function(x) {
  penguins[ , x, drop = TRUE] |> 
    mean(na.rm = TRUE)
})

## Simplify output ----
values <- unlist(values)

## Name elements ----
names(values) <- columns

values

bill_length_mm  bill_depth_mm    body_mass_g 
      43.92193       17.15117     4201.75439
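
A possible shortcut, sketched here on a toy data.frame (the object df and its values are hypothetical): because a data.frame is a list of columns, lapply() can iterate over the selected columns directly, and the column names are then carried over to the output without a separate naming step.

```r
## Toy data.frame (hypothetical values) ----
df <- data.frame(a     = c(1, 2, NA),
                 b     = c(10, 20, 30),
                 label = c("x", "y", "z"))

## Column names ----
columns <- c("a", "b")

## lapply() over the selected columns keeps their names ----
values <- lapply(df[columns], mean, na.rm = TRUE)

## Simplify output ----
values <- unlist(values)
values
#    a    b 
#  1.5 20.0
```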

The lapply() function also lets you perform more complex tasks, such as returning several statistics per column.

## Column names ----
columns <- c("bill_length_mm", "bill_depth_mm", "body_mass_g")

## Mean, min and max of columns 3, 4 and 5 ----
values <- lapply(columns, function(x) {
  column <- penguins[ , x, drop = TRUE]
  data.frame("trait" = x,
             "mean"  = mean(column, na.rm = TRUE),
             "min"   = min(column, na.rm = TRUE),
             "max"   = max(column, na.rm = TRUE))
})

values

[[1]]
           trait     mean  min  max
1 bill_length_mm 43.92193 32.1 59.6

[[2]]
          trait     mean  min  max
1 bill_depth_mm 17.15117 13.1 21.5

[[3]]
        trait     mean  min  max
1 body_mass_g 4201.754 2700 6300

Let’s simplify the output into a single data.frame. The do.call() function calls rbind.data.frame() once, passing every data.frame of the list as an argument.

## Simplify output ----
values <- do.call(rbind.data.frame, values)

values

           trait       mean    min    max
1 bill_length_mm   43.92193   32.1   59.6
2  bill_depth_mm   17.15117   13.1   21.5
3    body_mass_g 4201.75439 2700.0 6300.0
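
To see what do.call() does here, the following sketch on two toy data.frames (made-up values) compares it with writing the rbind.data.frame() call by hand. The advantage of do.call() is that it works whatever the length of the list.

```r
## Toy list of data.frames (hypothetical values) ----
parts <- list(
  data.frame(trait = "a", mean = 1.5),
  data.frame(trait = "b", mean = 20)
)

## do.call() passes each element of the list as an argument ----
combined_do_call <- do.call(rbind.data.frame, parts)

## Equivalent call written by hand ----
combined_direct <- rbind.data.frame(parts[[1]], parts[[2]])

identical(combined_do_call, combined_direct)
```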

NB. Here the object penguins is retrieved from the global environment. It’s safer to pass it explicitly as an argument, like this:

## Column names ----
columns <- c("bill_length_mm", "bill_depth_mm", "body_mass_g")

## Mean, min and max of columns 3, 4 and 5 ----
values <- lapply(columns, function(x, data) {
  column <- data[ , x, drop = TRUE]
  data.frame("trait" = x,
             "mean"  = mean(column, na.rm = TRUE),
             "min"   = min(column, na.rm = TRUE),
             "max"   = max(column, na.rm = TRUE))
}, data = penguins)

do.call(rbind.data.frame, values)

           trait       mean    min    max
1 bill_length_mm   43.92193   32.1   59.6
2  bill_depth_mm   17.15117   13.1   21.5
3    body_mass_g 4201.75439 2700.0 6300.0

The `tapply()` function

The tapply() function lets you apply a function to groups of your data. For dplyr users, it’s similar to combining the group_by() and summarize() functions.

the first argument X specifies the values
the second argument INDEX specifies the groups
the third argument FUN is the function to apply
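
Before moving to the penguins data, here is a minimal sketch of these three arguments on toy vectors (the values and group labels are made up). As with apply(), additional arguments such as na.rm are forwarded to the function.

```r
## Toy values and groups (hypothetical) ----
values <- c(10, 20, 30, 40, NA)
groups <- factor(c("a", "a", "b", "b", "b"))

## Mean of 'values' within each level of 'groups' ----
group_means <- tapply(values, groups, mean, na.rm = TRUE)
group_means
#  a  b 
# 15 35
```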

Let’s compute the mean of bill_length_mm for each species.

## Average bill length for each species ----
tapply(penguins$"bill_length_mm", penguins$"species", function(x) {
  mean(x, na.rm = TRUE)
})

   Adelie Chinstrap    Gentoo 
 38.79139  48.83382  47.50488

We can group values according to two variables.

## Average bill length for each species and island ----
tapply(penguins$"bill_length_mm", 
       list(penguins$"species", penguins$"island"), 
       function(x) mean(x, na.rm = TRUE))

            Biscoe    Dream Torgersen
Adelie    38.97500 38.50179  38.95098
Chinstrap       NA 48.83382        NA
Gentoo    47.50488       NA        NA

Here the output is a matrix. We can convert it to a long data.frame with tidyr::pivot_longer().

## Load 'tidyr' package ----
library("tidyr")

## Average bill length for each species and island ----
values <- tapply(penguins$"bill_length_mm", 
                 list(penguins$"species", penguins$"island"), 
                 function(x) mean(x, na.rm = TRUE))

## Convert to data.frame ----
values <- data.frame(values)
values$"species" <- rownames(values)


## Pivot data ----
values |> 
  pivot_longer(cols      = !species,
               values_to = "bill_length_mm",
               names_to  = "island")

# A tibble: 9 × 3
  species   island    bill_length_mm
  <chr>     <chr>              <dbl>
1 Adelie    Biscoe              39.0
2 Adelie    Dream               38.5
3 Adelie    Torgersen           39.0
4 Chinstrap Biscoe              NA  
5 Chinstrap Dream               48.8
6 Chinstrap Torgersen           NA  
7 Gentoo    Biscoe              47.5
8 Gentoo    Dream               NA  
9 Gentoo    Torgersen           NA
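
If you prefer to stay in base R, one possible alternative (a sketch on toy vectors with made-up values, not the approach used in this post) is to name the elements of INDEX and convert the result with as.table() and as.data.frame(). The names given to INDEX become the column names, and responseName sets the name of the value column.

```r
## Toy data (hypothetical values) ----
size    <- c(1, 2, 3, 4)
species <- c("x", "x", "y", "y")
island  <- c("u", "v", "u", "u")

## Naming the elements of INDEX names the dimensions of the output ----
values <- tapply(size, list(species = species, island = island), mean)

## Convert the matrix to a long data.frame ----
long <- as.data.frame(as.table(values), responseName = "size")
long
```

As in the pivot_longer() output, combinations without observations are kept with a missing value.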

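The tapply() and pivot_longer() result above is similar to the dplyr approach, except that dplyr returns only the combinations observed in the data (5 rows instead of 9) and keeps species and island as factors.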

## Load 'dplyr' package ----
library("dplyr")

## Summarise data ----
penguins %>%
  group_by(species, island) %>%
  summarize(bill_length_mm = mean(bill_length_mm, 
                                  na.rm = TRUE)) %>%
  ungroup()

# A tibble: 5 × 3
  species   island    bill_length_mm
  <fct>     <fct>              <dbl>
1 Adelie    Biscoe              39.0
2 Adelie    Dream               38.5
3 Adelie    Torgersen           39.0
4 Chinstrap Dream               48.8
5 Gentoo    Biscoe              47.5
