parallel dplyr
Introduction
I’m a big fan of the R packages in the tidyverse, and of dplyr in particular, for routine data analysis. I'm currently working on a project that requires fitting dose-response curves to data. All works well when the amount of data is small, but as the dataset grows so does the computation time. Fortunately there’s a package, multidplyr, that runs dplyr operations in parallel across multiple cores. By way of example, we’ll create a set of dummy data and compare curve fitting using dplyr and multidplyr.
The dataset used is the spinach dataset that comes with the drc package: a data frame of 105 observations in 5 groups, each group consisting of 7 concentrations run in triplicate. To compare dplyr and multidplyr, we'll take these measurements and copy them 1000 times, adding some jitter to each copy.
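As a quick sanity check of that structure, counting the observations per curve and dose should show 35 curve/dose combinations with three replicates each (a minimal sketch; the CURVE and DOSE column names are taken from the drc documentation):

## Quick look at the structure of the raw spinach data
library(drc)
library(dplyr)
spinach %>%
  count(CURVE, DOSE) %>%   # number of replicates per curve and concentration
  head()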
Code
## Libraries
library(drc)
library(dplyr)
library(multidplyr)
library(microbenchmark)
library(ggplot2)
library(tidyr)

## Create a dummy dataset
data <- spinach
for (i in 1:1000) {
  addData <- spinach
  addData$CURVE <- addData$CURVE + 5 * i
  addData$SLOPE <- sapply(addData$SLOPE, function(x) jitter(x, factor = 10))
  data <- rbind(data, addData)
}

## Define some functions
## makeFit: fit a four-parameter log-logistic model, returning NA if the fit fails
makeFit <- function(d) {
  tryCatch(drm(SLOPE ~ DOSE, data = d, fct = LL.4()), error = function(e) NA)
}

## fit_dplyr: fit the first n curves sequentially
fit_dplyr <- function(data, n) {
  data %>%
    filter(CURVE <= n) %>%
    group_by(CURVE) %>%
    do(fit = makeFit(.))
}

## fit_multidplyr: fit the first n curves in parallel
fit_multidplyr <- function(data, n) {
  data %>%
    filter(CURVE <= n) %>%
    partition(CURVE) %>%                  # spread the groups across the cluster workers
    cluster_copy(makeFit) %>%             # copy the fitting function to each worker
    cluster_library('drc') %>%            # load drc on each worker
    do(fit = makeFit(.)) %>%
    collect(unique_indexes = 'CURVE')     # gather the fitted models back locally
}

## Benchmark both approaches at increasing numbers of groups
microbenchmark(fit_dplyr(data, 10), times = 3)
microbenchmark(fit_dplyr(data, 100), times = 3)
microbenchmark(fit_dplyr(data, 1000), times = 3)
microbenchmark(fit_dplyr(data, 5000), times = 3)
microbenchmark(fit_multidplyr(data, 10), times = 3)
microbenchmark(fit_multidplyr(data, 100), times = 3)
microbenchmark(fit_multidplyr(data, 1000), times = 3)
microbenchmark(fit_multidplyr(data, 5000), times = 3)

## Conclude with a table and graph
df.graph <- data.frame(
  n = rep(c(10, 100, 1000, 5000), 2),
  library = rep(c('dplyr', 'multidplyr'), each = 4),
  timing = c(0.20, 3.04, 39.07, 212.89, 0.13, 1.13, 10.13, 49.29)
)

ggplot(df.graph, aes(x = n, y = timing, colour = library)) +
  geom_point(size = 2) +
  geom_line() +
  labs(x = 'number of groups', y = 'timing (seconds)')

df.table <- df.graph %>%
  spread(library, timing) %>%
  mutate(enhancement = round(dplyr / multidplyr, 2))

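The timings below were recorded on a 16-core PC; partition() sets up its own default cluster the first time it is called. If you want to control the number of workers explicitly, a minimal sketch using the create_cluster() and set_default_cluster() helpers (assumed here to be available in the GitHub version of multidplyr that this post relies on) could look like this:

## Optional: pick the number of workers yourself (create_cluster() and
## set_default_cluster() are assumed from the GitHub version of multidplyr)
cluster <- create_cluster(8)   # e.g. 8 workers
set_default_cluster(cluster)   # partition() calls without an explicit cluster use this
fit_multidplyr(data, 1000)     # same function as above, now spread over 8 workers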
Conclusion
In this case, multidplyr runs up to 4.3 times faster than plain dplyr on a 16-core PC, and the speed enhancement increases with the size of the dataset.
n | dplyr (secs) | multidplyr (secs) | enhancement |
---|---|---|---|
10 | 0.20 | 0.13 | 1.54 |
100 | 3.04 | 1.13 | 2.69 |
1000 | 39.07 | 10.13 | 3.86 |
5000 | 212.89 | 49.29 | 4.32 |