```r
## Libraries
library(drc)            # dose-response curve fitting
library(dplyr)          # data manipulation
library(multidplyr)     # parallel backend for dplyr
library(microbenchmark) # timing
library(ggplot2)        # plotting
library(tidyr)          # reshaping
```
```r
## Create a dummy dataset: the spinach data plus 1000 jittered copies
data <- spinach
for (i in 1:1000) {
  addData <- spinach
  # Offset the curve IDs so every copy contributes 5 new, unique curves
  addData$CURVE <- addData$CURVE + 5 * i
  # Add a little noise to the response
  addData$SLOPE <- sapply(addData$SLOPE, function(x) jitter(x, factor = 10))
  data <- rbind(data, addData)
}
```
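As a quick sanity check (my addition, not part of the original script), the loop should leave us with 1001 copies of the 105-row spinach data, i.e. 5005 distinct curves:

```r
## Sanity check: 1001 copies of 105 rows and 5 curves each
nrow(data)                    # expected: 105 * 1001 = 105105
dplyr::n_distinct(data$CURVE) # expected: 5 * 1001 = 5005
```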
```r
## Define some functions
# Fit a four-parameter log-logistic model; return NA if the fit fails
makeFit <- function(d) {
  tryCatch(drm(SLOPE ~ DOSE, data = d, fct = LL.4()), error = function(e) NA)
}

# Fit the first n curves sequentially with dplyr
fit_dplyr <- function(data, n) {
  data %>%
    filter(CURVE <= n) %>%
    group_by(CURVE) %>%
    do(fit = makeFit(.))
}

# Fit the first n curves in parallel with multidplyr:
# partition() spreads the groups across a local cluster, which needs
# makeFit and the drc package copied to each worker
fit_multidplyr <- function(data, n) {
  data %>%
    filter(CURVE <= n) %>%
    partition(CURVE) %>%
    cluster_copy(makeFit) %>%
    cluster_library('drc') %>%
    do(fit = makeFit(.)) %>%
    collect(unique_indexes = 'CURVE')
}
```
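To see what makeFit produces, here's a minimal single-curve example (my sketch, not from the original post; for LL.4() the coefficients b, c, d and e are the slope, lower limit, upper limit and ED50):

```r
## Example: fit one curve and inspect the parameter estimates
fit <- makeFit(subset(spinach, CURVE == 1))
coef(fit)  # LL.4 parameters: b (slope), c (lower), d (upper), e (ED50)
```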
```r
## Benchmark our data
microbenchmark(fit_dplyr(data, 10), times = 3)
microbenchmark(fit_dplyr(data, 100), times = 3)
microbenchmark(fit_dplyr(data, 1000), times = 3)
microbenchmark(fit_dplyr(data, 5000), times = 3)
microbenchmark(fit_multidplyr(data, 10), times = 3)
microbenchmark(fit_multidplyr(data, 100), times = 3)
microbenchmark(fit_multidplyr(data, 1000), times = 3)
microbenchmark(fit_multidplyr(data, 5000), times = 3)
```
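The timings reported below were copied into df.graph by hand; as a sketch, the medians could instead be pulled out programmatically via microbenchmark's summary() method:

```r
## Sketch: capture the median timing in seconds rather than transcribing it
bm <- microbenchmark(fit_dplyr(data, 10), times = 3)
summary(bm, unit = 's')$median
```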
```r
## Conclude with a table and graph
df.graph <- data.frame(
  n       = rep(c(10, 100, 1000, 5000), 2),
  library = rep(c('dplyr', 'multidplyr'), each = 4),
  timing  = c(0.20, 3.04, 39.07, 212.89, 0.13, 1.13, 10.13, 49.29)
)

ggplot(df.graph, aes(x = n, y = timing, colour = library)) +
  geom_point(size = 2) +
  geom_line() +
  labs(x = 'number of groups', y = 'timing (seconds)')

df.table <- df.graph %>%
  spread(library, timing) %>%
  mutate(enhancement = round(dplyr / multidplyr, 2))
```
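The table in the conclusion can then be rendered from df.table, for example with knitr (an assumption on my part about how the original table was generated):

```r
## Render the summary table (requires the knitr package)
knitr::kable(df.table)
```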
## Introduction
I’m a big fan of the R packages in the tidyverse, and dplyr in particular, for performing routine data analysis. I’m currently working on a project that requires fitting dose-response curves to data. All works well when the amount of data is small, but as the dataset grows, so does the computation time. Fortunately there’s a library (multidplyr) for performing dplyr operations in parallel. By way of example, we’ll create a set of dummy data and compare curve fitting with dplyr and multidplyr.
The dataset used is the spinach dataset that comes with the drc package. It’s a data frame containing 105 observations in 5 groups; each group consists of 7 concentrations run in triplicate. To compare dplyr and multidplyr we’ll take these measurements and copy them 1000 times, adding some jitter.
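For orientation, here's a quick look at the raw data (per the drc documentation, spinach has the columns CURVE, HERBICIDE, DOSE and SLOPE):

```r
## Inspect the spinach dose-response data shipped with drc
library(drc)
head(spinach)  # columns: CURVE, HERBICIDE, DOSE, SLOPE
str(spinach)   # 105 observations across 5 curves
```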
## Conclusion
In this case, multidplyr runs up to 4.3 times faster on a 16-core PC. The speed enhancement increases with the size of the dataset.
| n    | dplyr (secs) | multidplyr (secs) | enhancement |
|-----:|-------------:|------------------:|------------:|
|   10 |         0.20 |              0.13 |        1.54 |
|  100 |         3.04 |              1.13 |        2.69 |
| 1000 |        39.07 |             10.13 |        3.86 |
| 5000 |       212.89 |             49.29 |        4.32 |