Rapid ANOVA p-value Calculation
There are times when we need to perform ANOVA calculations on multiple sets of features in order to generate a list of p-values. A traditional approach is to use the base function aov and extract the p-value from it. This can be slow when analyzing thousands of features. An alternative is to use the multtest library and calculate the p-values from the F-values.
An example is shown below using a matrix of 20 samples (5 groups of 4) and 10000 features. Calculating p-values within an apply loop takes about 18 seconds. Calculating p-values from the F-values using the multtest library takes about 15 milliseconds. Both were determined on a Z640.
1## create an (g x s) x f matrix where g is the number of groups, s is the number of samples in each group and f is the number of features
2g <- 5
3s <- 4
4f <- 10000
5df.raw <- replicate(g * s, sample(100:100000, f, rep = TRUE)) ## matrix of data
6groups <- factor(rep(seq(g), each = s)) ## column ids of groups
1## using an apply loop
2multAnova1 <- function(df.raw, groups) {
3 pVector <- apply(df.raw, 1, function(x) {
4 submodel <- lm(x~groups)
5 subanova <- aov(submodel)
6 summary(subanova)[[1]][['Pr(>F)']][[1]]
7 })
8 p.adjust(pVector, method='BH')
9}
10
11library(multtest)
12## using multtest
13multAnova2 <- function(df.raw, groups) {
14 F_new <- mt.teststat(df.raw, groups, test='f')
15 P_new <- pf(F_new, length(levels(groups))-1, length(groups)-length(levels(groups)), lower.tail = F)
16 p.adjust(P_new, method='BH')
17}
18
19## benchmarking
20library(microbenchmark)
21library(ggplot2)
22mb <- microbenchmark(
23 multAnova1(df.raw, groups),
24 multAnova2(df.raw, groups),
25 times = 10
26)
27print(mb)
28autoplot(mb)
Timing (milliseconds)
expr | min | lq | mean | median | uq | max | nval |
---|---|---|---|---|---|---|---|
multAnova1(df.raw, groups) | 17467.11 | 17609.23 | 18085.73 | 17882.93 | 18203.27 | 19628.12 | 10 |
multAnova2(df.raw, groups) | 13.89 | 14.25 | 15.66 | 15.68 | 16.31 | 18.72 | 10 |