A common task when working with data is split-apply-combine: splitting a dataset up, applying a function to each piece and then recombining the results. It's often used to build summary tables around a factor.
First let's define the data to be used. I'm interested in a scenario where I have nS samples at nC concentrations, repeated nR times. For each of these I'd like nF pieces of data (features). The table to construct will have nS x nC x nR rows and nF columns of data.
library(data.table)
library(dplyr)
library(tidyr)
library(naturalsort)

### Generate table
nF <- 50 # number of features
nS <- 20 # number of samples
nR <- 5 # number of replicates
conc <- c(0, 5, 10, 20) # concentrations per sample
normalizeConc <- conc[1] # concentration for normalization

dt <- data.table(
  Sample = rep(paste0('Sample_', seq(nS)), each = nR * length(conc)),
  Concentration = rep(rep(conc, each = nR), times = nS)
)
for (i in seq(nF)) dt[, paste0('F', i) := runif(n = nS * nR * length(conc), min = 0, max = 100)]
df <- as.data.frame(dt)
Table dimensions = 400 x 52 (nS x nC x nR = 20 x 4 x 5 = 400 rows; Sample, Concentration and 50 feature columns).
The data is stored as both a data frame (df) and a data table (dt) for comparison later.
For my application, I'd like to take a table, split or group it by Sample and Concentration, calculate the mean of the replicates for each sample/concentration pair and then divide each sample's data by that sample's values at concentration = 0. This can readily be accomplished in base R using the split.data.frame function, which takes a data frame and splits it into a list of data frames defined by a grouping vector. The method below splits the data frame by Sample and then again by Concentration, performing the ratio calculation for each Concentration within each Sample. It then recombines all the data frames using rbind.
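As a minimal illustration of split.data.frame (on a hypothetical toy data frame, separate from the data above), it returns a named list with one data frame per level of the grouping vector:
toy <- data.frame(g = c('a', 'a', 'b'), x = 1:3)
split.data.frame(toy, toy$g) ## named list: $a holds rows 1-2, $b holds row 3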
## Using base R
base_method <- function(df, normalizeConc) {
  split.data <- split.data.frame(df, df$Sample) ## split by sample
  features <- names(df)[grep('F[0-9]+', names(df))] ## grab column names corresponding to features
  l.split <- lapply(split.data, function(x) { ## loop over samples
    v.means <- colMeans(x[x$Concentration == normalizeConc, features]) ## calculate average of normalizing concentration
    split.data2 <- split.data.frame(x, x$Concentration) ## split by concentration
    l.splitConc <- lapply(split.data2, function(y) { ## loop over concentrations within a sample
      df.ratio <- colMeans(y[, features]) / v.means ## calc ratio of average of each concentration to average of normalizing concentration
      df.ratio
    })
    df.splitConc <- data.frame(Sample = x[['Sample']][1],
                               Concentration = as.numeric(names(l.splitConc)),
                               do.call('rbind', l.splitConc),
                               stringsAsFactors = FALSE) ## construct a data frame of all concentration ratios for a single sample
    df.splitConc <- df.splitConc[df.splitConc$Concentration != normalizeConc, ] ## get rid of the normalizing concentration (ratio = 1)
    df.splitConc
  })
  names(l.split) <- NULL
  do.call('rbind', l.split) ## combine all into a single data frame
}
The method under base R works, but it's fairly long and not very efficient, using nested lapply loops. A quick look with profvis suggests that the run time is distributed fairly evenly throughout the function, so it would be difficult to optimize.
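For reference, profiling it is a one-liner; something like the sketch below opens an interactive flame graph in RStudio or a browser:
library(profvis)
profvis(base_method(df, 0)) ## interactive breakdown of where the time goes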
Cleaner code can be written using the data.table package. In this case the table is grouped by Sample and Concentration and the means are calculated in a single step:
## Using data.table
dt_method <- function(dt, normalizeConc) {
  features <- names(dt)[grep('F[0-9]+', names(dt))] ## grab column names corresponding to features
  dt1 <- dt[, lapply(.SD, mean), by = list(Sample, Concentration), .SDcols = features] ## calculate means for each sample/concentration pair
  dt2 <- dt1[, lapply(.SD, function(x) {x / x[Concentration == normalizeConc]}), by = Sample, .SDcols = features] ## calculate ratio to the normalizing concentration
  dt2[, Concentration := dt1[, Concentration]] ## add the concentration back as a column
  dt3 <- dt2[Concentration != normalizeConc] ## get rid of the normalizing concentration (ratio = 1)
  setcolorder(dt3, c('Sample', 'Concentration', features)) ## reorder columns
  return(dt3)
}
The beauty is in the use of .SD, which stands for "Subset of Data": within each group, .SD holds the columns listed in .SDcols, so a single lapply applies the function to every feature column at once.
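A minimal sketch of the idea on a hypothetical toy table:
d <- data.table(g = c('a', 'a', 'b'), x = c(1, 2, 3), y = c(4, 5, 6))
d[, lapply(.SD, mean), by = g, .SDcols = c('x', 'y')] ## one row per group: x = 1.5, 3.0; y = 4.5, 6.0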
Of course, all this can be accomplished in a single pipeline using dplyr and tidyr. I'm a big fan of the tidyverse, and the more I use dplyr, the happier I become.
## Using dplyr and tidyr
tidyverse_method <- function(df, normalizeConc) {
  df %>%
    gather(feature, value, starts_with('F')) %>% ## wide -> long
    group_by(Sample, feature, Concentration) %>% ## group by Sample, feature and Concentration
    summarise(meanVal = mean(value)) %>% ## calculate the mean value for each sample/feature/concentration
    mutate(ratio = meanVal / meanVal[Concentration == normalizeConc]) %>% ## determine ratio to the normalizing concentration
    filter(Concentration != normalizeConc) %>% ## get rid of the rows with ratio = 1
    select(-meanVal) %>% ## remove meanVal column
    spread(feature, ratio) %>% ## long -> wide
    arrange(as.numeric(sub('.*_', '', Sample))) %>% ## rearrange sample ID by number
    select(Sample, Concentration, one_of(names(df)[3:ncol(df)])) ## restore original feature order
}
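As a quick sanity check, all three methods should return the same values; each orders its rows differently, though, so a small helper (reorder_rows, introduced here just for the comparison) puts them in a common order first:
reorder_rows <- function(d) {
  d <- as.data.frame(d)
  d[order(d$Sample, d$Concentration), ]
}
all.equal(reorder_rows(base_method(df, normalizeConc)),
          reorder_rows(dt_method(dt, normalizeConc)),
          check.attributes = FALSE) ## should be TRUE
all.equal(reorder_rows(dt_method(dt, normalizeConc)),
          reorder_rows(tidyverse_method(df, normalizeConc)),
          check.attributes = FALSE) ## should be TRUE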
data.table wins out on speed, but the tidyverse version is cleaner and more straightforward to modify.
library(microbenchmark)
mb <- microbenchmark(
  base_method(df, 0),
  dt_method(dt, 0),
  tidyverse_method(df, 0),
  times = 10
)
print(mb)
Timing (milliseconds):

| expr | min | lq | mean | median | uq | max | neval |
|---|---|---|---|---|---|---|---|
| base_method | 186.05 | 206.71 | 266.95 | 239.96 | 330.96 | 399.50 | 10 |
| dt_method | 8.59 | 8.90 | 17.13 | 16.84 | 20.22 | 36.87 | 10 |
| tidyverse_method | 72.06 | 75.83 | 101.01 | 82.10 | 106.67 | 194.53 | 10 |