Rapid ANOVA p-value Calculation

Mar 20, 2017 · 2 min read · R ·

There are times when we need to perform ANOVA calculations on multiple sets of features in order to generate a list of p-values. A traditional approach is to use the base function aov and extract the p-value from it. This can be slow when analyzing thousands of features. An alternative is to use the multtest library and calculate the p-values from the F-values.

An example is shown below using a matrix of 20 samples (5 groups of 4) and 10000 features. Calculating p-values within an apply loop takes about 18 seconds. Calculating p-values from the F-values using the multtest library takes about 15 milliseconds. Both were determined on a Z640.

1## create an (g x s) x f matrix where g is the number of groups, s is the number of samples in each group and f is the number of features
2g <- 5
3s <- 4
4f <- 10000
5df.raw <- replicate(g * s, sample(100:100000, f, rep = TRUE))  ## matrix of data
6groups <- factor(rep(seq(g), each = s))  ## column ids of groups

 1## using an apply loop
 2multAnova1 <- function(df.raw, groups) {
 3 pVector <- apply(df.raw, 1, function(x) {
 4   submodel <- lm(x~groups)
 5   subanova <- aov(submodel)
 6   summary(subanova)[[1]][['Pr(>F)']][[1]]
 7 })
 8 p.adjust(pVector, method='BH')
 9}
10
11library(multtest)
12## using multtest
13multAnova2 <- function(df.raw, groups) {
14 F_new <- mt.teststat(df.raw, groups, test='f')
15 P_new <- pf(F_new, length(levels(groups))-1, length(groups)-length(levels(groups)), lower.tail = F)
16 p.adjust(P_new, method='BH')
17}
18
19## benchmarking
20library(microbenchmark)
21library(ggplot2)
22mb <- microbenchmark(
23 multAnova1(df.raw, groups),
24 multAnova2(df.raw, groups),
25 times = 10
26)
27print(mb)
28autoplot(mb)

Timing (milliseconds)

expr	min	lq	mean	median	uq	max	nval
multAnova1(df.raw, groups)	17467.11	17609.23	18085.73	17882.93	18203.27	19628.12	10
multAnova2(df.raw, groups)	13.89	14.25	15.66	15.68	16.31	18.72	10