Rapid ANOVA p-value Calculation

Fast calculation of multiple p-values
R
Author

Harvey

Published

March 20, 2017

There are times when we need to perform ANOVA calculations on multiple sets of features in order to generate a list of p-values. A traditional approach is to use the base function aov and extract the p-value from it. This can be slow when analyzing thousands of features. An alternative is to use the multtest library and calculate the p-values from the F-values.

An example is shown below using a matrix of 20 samples (5 groups of 4) and 10000 features. Calculating p-values within an apply loop takes about 18 seconds. Calculating p-values from the F-values using the multtest library takes about 15 milliseconds. Both were determined on a Z640.

## create an (g x s) x f matrix where g is the number of groups, s is the number of samples in each group and f is the number of features
g <- 5
s <- 4
f <- 10000
df.raw <- replicate(g * s, sample(100:100000, f, rep = TRUE))  ## matrix of data
groups <- factor(rep(seq(g), each = s))  ## column ids of groups
## using an apply loop
multAnova1 <- function(df.raw, groups) {
 pVector <- apply(df.raw, 1, function(x) {
   submodel <- lm(x~groups)
   subanova <- aov(submodel)
   summary(subanova)[[1]][['Pr(>F)']][[1]]
 })
 p.adjust(pVector, method='BH')
}

library(multtest)
## using multtest
multAnova2 <- function(df.raw, groups) {
 F_new <- mt.teststat(df.raw, groups, test='f')
 P_new <- pf(F_new, length(levels(groups))-1, length(groups)-length(levels(groups)), lower.tail = F)
 p.adjust(P_new, method='BH')
}

## benchmarking
library(microbenchmark)
library(ggplot2)
mb <- microbenchmark(
 multAnova1(df.raw, groups),
 multAnova2(df.raw, groups),
 times = 10
)
print(mb)
autoplot(mb)

Timing (milliseconds)

expr min lq mean median uq max nval
multAnova1(df.raw, groups) 17467.11 17609.23 18085.73 17882.93 18203.27 19628.12 10
multAnova2(df.raw, groups) 13.89 14.25 15.66 15.68 16.31 18.72 10