*TLDR:* A q value defines the proportion of false positives there are within all discoveries that were called significant up to the current item.

**High throughput techniques** like microarrays with its successor RNA-Seq and mass spectrometry proteomics lead to an huge data amount.
Thousands of features (e.g. transcripts or proteins) are measured simultaneously. **Differential expression analysis** aims to identify features, that change significantly
between two conditions. A common experimental setup is the analysis of which genes are over- or underexpressed between e.g. a wild type and a mutant.

Hypothesis tests aim to identify differences between two or more samples. The most common statistical test is the **t test** that tests a difference of means. Hypothesis tests report
a p value, that correspond the probability of obtaining results at least as extreme as the observed results, assuming that the null hypothesis is correct. In other words:

In [4]:

```
// required for auxiliary functions
#r "nuget: FSharpAux, 2.0.0"
// required for all calculations
#r "nuget: FSharp.Stats, 0.4.12-preview.3"
// required to read the pvalue set
#r "nuget: FSharp.Data, 4.2.7"
// required for charting
#r "nuget: Plotly.NET.Interactive, 4.1.0"
open FSharpAux
open FSharp.Stats
open FSharp.Data
open Plotly.NET
open Plotly.NET.StyleParam
open Plotly.NET.LayoutObjects
```

- FSharp.Data, 4.2.7
- FSharp.Stats, 0.4.12-preview.3
- FSharpAux, 2.0.0
- Plotly.NET.Interactive, 4.1.0

**same** mean and standard deviation.

In [5]:

```
let distributionA = Distributions.Continuous.Normal.Init 10.0 1.0
let distributionB = Distributions.Continuous.Normal.Init 10.0 1.0
let distributionChartAB =
[
Chart.Area(xy = ([5. .. 0.01 .. 15.] |> List.map (fun x -> x,distributionA.PDF x)), Name = "distA")
Chart.Area(xy = ([5. .. 0.01 .. 15.] |> List.map (fun x -> x,distributionB.PDF x)), Name = "distB")
]
|> Chart.combine
|> Chart.withTemplate ChartTemplates.lightMirrored
|> Chart.withXAxisStyle "variable X"
|> Chart.withYAxisStyle "relative count"
|> Chart.withSize (900.,600.)
|> Chart.withTitle "null hypothesis"
distributionChartAB
```

**if a mean difference exist** using a two sample t test where equal variances of the underlying population distribution are assumed.

In [6]:

```
let getSample n (dist: Distributions.ContinuousDistribution<float,float>) =
Vector.init n (fun _ -> dist.Sample())
let sampleA = getSample 5 distributionA
let sampleB = getSample 5 distributionB
let pValue = (Testing.TTest.twoSample true sampleA sampleB).PValue
pValue
```

0.24680735077408789

10,000 tests are performed, each with new randomly drawn samples. This corresponds to an experiment in which **none of the features changed**.
Note, that the mean intensities are arbitrary and must not be the same for all features! In the presented case all feature intensities are in average 10.
The same simulation can be performed with pairwise comparisons from distributions that differ for each feature, but are the same within the feature.
**The resulting p values are uniformly distributed between 0 and 1**.

*Fig 1: p value distribution of the null hypothesis.*

**significant (red box), even though the populations do not differ**. They are called **false
positives (FP)**. Now lets repeat the same experiment, but this time sample 70% of the time from null features (no difference) and **add 30% samples of truly
differing** distributions. Therefore a third populations is generated, that differ in mean, but has an equal standard deviation:

In [8]:

```
let distributionC = Distributions.Continuous.Normal.Init 11.5 1.0
let distributionChartAC =
[
Chart.Area(xy = ([5. .. 0.01 .. 15.] |> List.map (fun x -> x,distributionA.PDF x)), Name = "distA")
Chart.Area(xy = ([5. .. 0.01 .. 15.] |> List.map (fun x -> x,distributionC.PDF x)), Name = "distC")
]
|> Chart.combine
|> Chart.withTemplate ChartTemplates.lightMirrored
|> Chart.withXAxisStyle "variable X"
|> Chart.withYAxisStyle "relative count"
|> Chart.withSize (1000.,600.)
|> Chart.withTitle "alternative hypothesis"
```

*Fig 2: p value distribution of the alternative hypothesis. Blue coloring indicate p values deriving from distribution A and B (null).
Orange coloring indicate p values deriving from distribution A and C (truly differing).*

The pvalue distribution of the tests resulting from truly differing populations are **right skewed**, while the null tests again show a homogeneous distribution between 0 and 1.
Many, but not all of the tests that come from the truly differing populations are below 0.05, and therefore would be reported as significant.
In average 350 null features would be reported as significant even though they derive from null features (blue bars, 10,000 x 0.7 x 0.05 = 350).

The hypothesis testing framework with the p value definition given above was **developed for performing just one test. If many tests are performed, like in modern high throughput studies, the probability to obtain a
false positive result increases.** The probability of at least one false positive is called Familywise error rate (FWER) and can be determined by $FWER=1-(1-\alpha)^m$ where
$\alpha$ corresponds to the significance threshold (here 0.05) and $m$ is the number of performed tests.

In [9]:

```
let bonferroniLine =
Shape.init(ShapeType=ShapeType.Line,X0=0.,X1=35.,Y0=0.05,Y1=0.05,Line=Line.init(Dash=DrawingStyle.Dash))
let fwer =
[1..35]
|> List.map (fun x ->
x,(1. - (1. - 0.05)**(float x))
)
|> Chart.Point
|> Chart.withTemplate ChartTemplates.lightMirrored
|> Chart.withXAxisStyle "#tests"
|> Chart.withYAxisStyle("p(at least one FP)",MinMax=(0.,1.))
|> Chart.withShape bonferroniLine
|> Chart.withTitle "FWER"
fwer
```

*Fig 3: Family wise error rate depending on number of performed tests. The black dashed line indicates the Bonferroni corrected FWER by $p^* = \frac{\alpha}{m}$ .*

When 10,000 null features are tested with a p value threshold of 0.05, in average 500 tests are reported significant even if there is not a single comparisons in which the
population differ. **If some of the features are in fact different, the number of false positives consequently decreases (remember, the p value is defined for tests of null features).**

Why the interpretation of high throughput data based on p values is difficult: The more features are measured, the more false positives you can expect. If 100 differentially expressed genes are identified by p value thresholding, without further information about the magnitude of expected changes and the total number of measured transcripts, the data is useless.

The p value threshold has no straight-forward interpretation when many tests are performed. Of course you could restrict the family wise error rate to 0.05, regardless how many tests are performed. This is realized by dividing the $\alpha$ significance threshold by the number of tests, which is known as Bonferroni correction: $p^* = \frac{\alpha}{m}$. This correction drastically limit the false positive rate, but in an experiment with a huge count of expected changes, it additionally would result in many false negatives. The FWER should be chosen if the costs of follow up studies to tests the candidates are dramatic or there is a huge waste of time to potentially study false positives.

A more reasonable measure of significance with a simple interpretation is the so called false discovery rate (FDR). **It describes the rate of expected false positives within the
overall reported significant features.** The goal is to identify as many sig. features as possible while incurring a relatively low proportion of false positives.
Consequently a set of reported significant features together with the **FDR describes the confidence of this set**, without the requirement to
somehow incorporate the uncertainty that is introduced by the total number of tests performed. In the simulated case of 7,000 null tests and 3,000 tests resulting from truly
differing distributions above, the FDR can be calculated exactly. Therefore at e.g. a p value of 0.05 the number of false positives (blue in red box) are divided by the number
of significant reported tests (false positives + true positives).

*Fig 4: p value distribution of the alternative hypothesis.*

Given the conditions described in the first chapter, the FDR of this experiment with a p value threshold of 0.05 is 0.173. Out of the 2019 reported significant comparisons, in average 350
are expected to be false positives, which gives an straight forward interpretation of the data confidence. In real-world experiments the proportion of null tests and tests
deriving from an actual difference is of course unknown. **The proportion of null tests however tends to be distributed equally in the p value histogram.** By identification of
the average null frequency, a proportion of FP and TP can be determined and the FDR can be defined. This frequency estimate is called $\pi_0$, which leads to an FDR definition of:

*Fig 5: FDR calculation on simulated data.*

Consequently for each presented p value a corresponding FDR can be calculated. The minimum local FDR at each p value is called q value.

$$\hat q(p_i) = min_{t \geq p_i} \hat{FDR}(p_i)$$Since the q value is not monotonically increasing, it is smoothed by assigning the lowest FDR of all p values, that are equal or higher the current one.

**By defining $\pi_0$, all other parameters can be calculated from the given p value distribution to determine the all q values.** The most prominent
FDR-controlling method is known as Benjamini-Hochberg correction. It sets $\pi_0$ as 1, assuming that all features are null. In studies with an expected high proportion of true
positives, a $\pi_0$ of 1 is too conservative, since there definitely are true positives in the data.

A better estimation of $\pi_0$ is given in the following:

**True positives are assumed to be right skewed while null tests are distributed equally between 0 and 1**. Consequently the right flat region of the p value histogram tends to correspond
to the true frequency of null comparisons (Fig 5). As **real world example** 9856 genes were measured in triplicates under two conditions (control and treatment). The p value distribution of two
sample t tests looks as follows:

In [10]:

```
let examplePVals =
let rawData = Http.RequestString @"https://raw.githubusercontent.com/fslaborg/datasets/main/data/pvalExample.txt"
rawData.Split '\n'
|> Array.tail
|> Array.map float
//number of tests
let m =
examplePVals
|> Array.length
|> float
let nullLine =
Shape.init(ShapeType=ShapeType.Line,X0=0.,X1=1.,Y0=1.,Y1=1.,Line=Line.init(Dash=DrawingStyle.Dash))
let empLine =
Shape.init(ShapeType=ShapeType.Line,X0=0.,X1=1.,Y0=0.4,Y1=0.4,Line=Line.init(Dash=DrawingStyle.DashDot,Color=Color.fromHex "#FC3E36"))
let exampleDistribution =
[
[
examplePVals
|> Distributions.Frequency.create 0.025
|> Map.toArray
|> Array.map (fun (k,c) -> k,float c / (m * 0.025))
|> Chart.Column
|> Chart.withTraceInfo "density"
|> Chart.withTemplate ChartTemplates.lightMirrored
|> Chart.withXAxisStyle "p value"
|> Chart.withYAxisStyle "density"
|> Chart.withShapes [nullLine;empLine]
examplePVals
|> Distributions.Frequency.create 0.025
|> Map.toArray
|> Array.map (fun (k,c) -> k,float c)
|> Chart.Column
|> Chart.withTraceInfo "gene count"
|> Chart.withTemplate ChartTemplates.lightMirrored
|> Chart.withXAxisStyle "p value"
|> Chart.withYAxisStyle "gene count"
]
]
|> Chart.Grid()
|> Chart.withSize(1100.,550.)
exampleDistribution
```

*Fig 6: p value distributions of real world example. The frequency is given on the right, its density on the left. The black dashed line indicates the distribution, if all features
were null. The red dash-dotted line indicates the visual estimated pi0.*

By performing t tests for all comparisons 3743 (38 %) of the genes lead to a pvalue lower than 0.05. By eye, you would estimate $\pi_0$ as 0.4, indicating, only a small fraction of the genes are unaltered (null). After q value calculations, you would filter for a specific FDR (e.g. 0.05) and end up with an p value threshold of 0.04613, indicating a FDR of max. 0.05 in the final reported 3642 genes.

```
pi0 = 0.4
m = 9856
D(p) = number of sig. tests at given p
FP(p) = p*0.4*9856
FDR(p) = FP(p) / D(p)
```

FDR(0.04613) = 0.4995

Now let's compare p- to its corresponding q values with ensuring continuity

In [12]:

```
let pi0 = 0.4
let getD p =
examplePVals
|> Array.sumBy (fun x -> if x <= p then 1. else 0.)
let getFP p = p * pi0 * m
let getFDR p = (getFP p) / (getD p)
let qvaluesNotSmoothed =
examplePVals
|> Array.sort
|> Array.map (fun x ->
x, getFDR x)
|> Chart.Line
|> Chart.withTraceInfo "not smoothed"
let qvaluesSmoothed =
let pValsSorted =
examplePVals
|> Array.sortDescending
let rec loop i lowest acc =
if i = pValsSorted.Length then
acc |> List.rev
else
let p = pValsSorted.[i]
let q = getFDR p
if q > lowest then
loop (i+1) lowest ((p,lowest)::acc)
else loop (i+1) q ((p,q)::acc)
loop 0 1. []
|> Chart.Line
|> Chart.withTraceInfo "smoothed"
let eXpos = examplePVals |> Array.filter (fun x -> x <= 0.046135) |> Array.length
let p2qValeChart =
[qvaluesNotSmoothed;qvaluesSmoothed]
|> Chart.combine
|> Chart.withYAxisStyle("",MinMax=(0.,1.))
|> Chart.withTemplate ChartTemplates.lightMirrored
|> Chart.withXAxisStyle "p value"
|> Chart.withYAxisStyle "q value"
|> Chart.withShape empLine
|> Chart.withTitle (sprintf "#[genes with q value < 0.05] = %i" eXpos)
p2qValeChart
```

*Fig 7: FDR calculation on experiment data. Please zoom into the very first part of the curve to inspect the monotonicity.*

For a range of $\lambda$ in e.g. $\{0.0 .. 0.05 .. 0.95\}$, calculate $\hat \pi_0 (\lambda) = \frac {\#[p_j > \lambda]}{m(1 - \lambda)}$

In [13]:

```
let pi0Est =
[|0. .. 0.05 .. 0.95|]
|> Array.map (fun lambda ->
let num =
examplePVals
|> Array.sumBy (fun x -> if x > lambda then 1. else 0.)
let den = float examplePVals.Length * (1. - lambda)
lambda, num/den
)
let pi0EstChart =
pi0Est
|> Chart.Point
|> Chart.withYAxisStyle("",MinMax=(0.,1.))
|> Chart.withXAxisStyle("",MinMax=(0.,1.))
|> Chart.withTemplate ChartTemplates.lightMirrored
|> Chart.withXAxisStyle "$\lambda$"
|> Chart.withYAxisStyle "$\hat \pi_0(\lambda)$"
|> Chart.withMathTex(true)
|> Chart.withConfig(
Config.init(
Responsive=true,
ModeBarButtonsToAdd=[
ModeBarButton.DrawLine
ModeBarButton.DrawOpenPath
ModeBarButton.EraseShape
]
)
)
pi0EstChart
```

*Fig 8: pi0 estimation.*

The resulting diagram shows, that with increasing $\lambda$ its function value $\hat \pi_0(\lambda)$ tends to $\pi_0$. The calculation **relates the actual proportion of tests greater than $\lambda$ to the proportion of $\lambda$ range the corresponding p values are in**.
In Storey & Tibshirani 2003 this curve is fitted with a **cubic spline**. A weighting of the knots by $(1 - \lambda)$ is recommended
but not specified in the final publication. Afterwards the function value at $\hat \pi_0(1)$ is defined as final estimator of $\pi_0$. This is often referred to as the *smoother method*.

Another method (*bootstrap method*) (Storey et al., 2004), that does not depend on fitting is based on **bootstrapping** and was introduced in Storey et al. (2004). It is implemented in FSharp.Stats:

Determine the minimal $\hat \pi_0 (\lambda)$ and call it $min \hat \pi_0$ .

For each $\lambda$, bootstrap the p values (e.g. 100 times) and calculate the mean squared error (MSE) from the difference of resulting $\hat \pi_0^b$ to $min \hat \pi_0$. The minimal MSE indicates the best $\lambda$. With $\lambda$ defined, $\pi_0$ can be determined.

**Note: When bootstrapping an data set of size n, n elements are drawn with replacement.**

In [14]:

```
let getpi0Bootstrap (lambda:float[]) (pValues:float[]) =
let rnd = System.Random()
let m = pValues.Length |> float
let getpi0hat lambda pVals=
let hits =
pVals
|> Array.sumBy (fun x -> if x > lambda then 1. else 0.)
hits / (m * (1. - lambda))
//calculate MSE for each lambda
let getMSE lambda =
let mse =
//generate 100 bootstrap samples of p values and calculate the MSE at given lambda
Array.init 100 (fun b ->
Array.sampleWithReplacement rnd pValues pValues.Length
|> getpi0hat lambda
)
mse
lambda
|> Array.map (fun l -> l,getMSE l)
let minimalpihat =
//FSharp.Stats.Testing.MultipleTesting.Qvalues.pi0hats [|0. .. 0.05 .. 0.96|] examplePVals |> Array.minBy snd |> snd
0.3686417749
let minpiHatShape =
Shape.init(ShapeType=ShapeType.Line,X0=0.,X1=1.,Y0=minimalpihat,Y1=minimalpihat,Line=Line.init(Dash=DrawingStyle.Dash))
let bootstrappedPi0 =
getpi0Bootstrap [|0. .. 0.05 .. 0.95|] examplePVals
|> Array.map (fun (l,x) ->
Chart.BoxPlot(data=x,orientation=Orientation.Vertical,FillColor=Color.fromHex"#1F77B4",MarkerColor=Color.fromHex"#1F77B4",Name=sprintf "%.2f" l))
|> Chart.combine
|> Chart.withShape minpiHatShape
|> Chart.withTemplate ChartTemplates.lightMirrored
|> Chart.withXAxisStyle "lambda"
|> Chart.withYAxisStyle "pi0 hat"
bootstrappedPi0
```

*Fig 9: Bootstrapping for pi0 estimation. The dashed line indicates the minimal pi0 from Fig. 8.
The bootstrapped pi0 distribution that shows the least variation to the dashed line is the optimal. In the presented example it is either 0.8 or 0.85.*

For an $\lambda$, range of $\{0.0 .. 0.05 .. 0.95\}$ the bootstrapping method determines either 0.8 or 0.85 as optimal $\lambda$ and therefore $optimal \hat \pi_0$ is either $0.3703$ or $0.3686$.

The **automated estimation** of $\pi_0$ based on bootstrapping is implemented in `FSharp.Stats.Testing.MultipleTesting.Qvalues`

.

In [15]:

```
open Testing.MultipleTesting
let pi0Stats = Qvalues.pi0BootstrapWithLambda [|0.0 .. 0.05 .. 0.95|] examplePVals
pi0Stats
```

0.3703327922077925

Subsequent to $\pi_0$ estimation the **q values can be determined** from a list of p values.

In [16]:

```
let qValues = Qvalues.ofPValues pi0Stats examplePVals
qValues
```

A robust variant of q value determination exists, that is more conservative for small p values when the total number of p values is low. Here the number of false positives is divided by the number of total discoveries multiplied by the FWER at the current p value. The correction takes into account the probability of a false positive being reported in the first place.

Especially when the population distributions do not follow a perfect normal distribution or the p value distribution looks strange, the usage of the robust version is recommended.

In [17]:

```
let qvaluesRobust =
Testing.MultipleTesting.Qvalues.ofPValuesRobust pi0Stats examplePVals
let qChart =
[
Chart.Line(Array.sortBy fst (Array.zip examplePVals qValues),Name="qValue")
Chart.Line(Array.sortBy fst (Array.zip examplePVals qvaluesRobust),Name="qValueRobust")
]
|> Chart.combine
|> Chart.withTemplate ChartTemplates.lightMirrored
|> Chart.withXAxisStyle "p value"
|> Chart.withYAxisStyle "q value"
```

*Fig 10: Comparison of q values and robust q values, that is more conservative at low p values.*

In [18]:

```
let pi0Line =
Shape.init(ShapeType=ShapeType.Line,X0=0.,X1=1.,Y0=pi0Stats,Y1=pi0Stats,Line=Line.init(Dash=DrawingStyle.Dash))
// relates the q value to each p value
let p2q =
Array.zip examplePVals qValues
|> Array.sortBy fst
|> Chart.Line
|> Chart.withShape pi0Line
|> Chart.withTemplate ChartTemplates.lightMirrored
|> Chart.withXAxisStyle "p value"
|> Chart.withYAxisStyle "q value"
// shows the p values distribution for an visual inspection of pi0 estimation
let pValueDistribution =
let frequencyBins = 0.025
let m = examplePVals.Length |> float
examplePVals
|> Distributions.Frequency.create frequencyBins
|> Map.toArray
|> Array.map (fun (k,c) -> k,float c / frequencyBins / m)
|> Chart.StackedColumn
|> Chart.withTraceInfo "p values"
|> Chart.withTemplate ChartTemplates.lightMirrored
|> Chart.withXAxisStyle "p value"
|> Chart.withYAxisStyle "frequency density"
|> Chart.withShape pi0Line
// shows pi0 estimation in relation to lambda
let pi0Estimation =
//Testing.MultipleTesting.Qvalues.pi0hats [|0. .. 0.05 .. 0.96|] examplePVals
[|0. .. 0.05 .. 0.95|]
|> Array.map (fun lambda ->
let num =
examplePVals
|> Array.sumBy (fun x -> if x > lambda then 1. else 0.)
let den = float examplePVals.Length * (1. - lambda)
lambda, num/den
)
|> Chart.Point
|> Chart.withTemplate ChartTemplates.lightMirrored
|> Chart.withXAxisStyle "$\lambda$"
|> Chart.withYAxisStyle "$\hat \pi_0(\lambda)$"
|> Chart.withMathTex(true)
p2q
```