Basics

Summary: This tutorial gives an overview of how to compute some basic statistical measures with FSharp.Stats.

Central tendency

A central tendency (or measure of central tendency) is a central or typical value for a probability distribution. It may also be called a center or location of the distribution. Colloquially, measures of central tendency are often called averages.

Mean

For a data set, the arithmetic mean, also called the expected value or average, is the central value of a discrete set of numbers: specifically, the sum of the values divided by the number of values:

\(\bar{x} = \frac{1}{n}\left (\sum_{i=1}^n{x_i}\right ) = \frac{x_1+x_2+\cdots +x_n}{n}\)

mean is available as a sequence (and other collections) extension, as well as meanBy, which takes an additional converter function:

open FSharp.Stats

let mean1 = 
    [10; 2; 19; 24; 6; 23; 47; 24; 54; 77;]
    |> Seq.meanBy float
28.6
let mean2 = 
    [10.; 2.; 19.; 24.; 6.; 23.; 47.; 24.; 54.; 77.;]
    |> Seq.mean
28.6
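
To make the formula above concrete, the following is a minimal sketch in plain F# that does not use FSharp.Stats. The names values and manualMean are illustrative only and are not part of the library.

```fsharp
// Plain F# sketch of the arithmetic mean; no FSharp.Stats required.
let values = [10.; 2.; 19.; 24.; 6.; 23.; 47.; 24.; 54.; 77.]

// Sum of the values divided by the number of values.
let manualMean = List.sum values / float (List.length values)

printfn "%.1f" manualMean
```

The sum of the ten values is 286, so the result agrees with the 28.6 shown above.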

Truncated mean

Computes the truncated (trimmed) mean, where a given percentage of the highest values and the same percentage of the lowest values are discarded. In total, twice the given percentage of the values is discarded.

meanTruncated is available as a Sequence (and other collections) extension, as well as meanTruncatedBy, which takes an additional converter function:

let truncMean1 = 
    [10.; 2.; 19.; 24.; 6.; 23.; 47.; 24.; 54.; 77.;]
    |> Seq.meanTruncated 0.2
24.5
let truncMean2 = 
    [10; 2; 19; 24; 6; 23; 47; 24; 54; 77;]
    |> Seq.meanTruncatedBy float 0.2
34.75
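
The trimming procedure described above can be sketched by hand in plain F#. This is not the library's implementation: the helper truncatedMean is hypothetical, and the rule for how many values to drop at each end (the integer part of percent times the number of values) is an assumption made for this sketch.

```fsharp
// Hypothetical helper, written without FSharp.Stats.
let truncatedMean (percent: float) (values: float list) =
    let sorted = List.sort values
    let n = List.length sorted
    // Assumption: drop the integer part of (percent * n) values at each end.
    let k = int (percent * float n)
    // Keep the middle part and average it; percent must stay below 0.5,
    // otherwise nothing is left to average.
    sorted
    |> List.skip k
    |> List.take (n - 2 * k)
    |> List.average

let manualTruncMean =
    truncatedMean 0.2 [10.; 2.; 19.; 24.; 6.; 23.; 47.; 24.; 54.; 77.]

printfn "%.2f" manualTruncMean
```

With ten values and a percentage of 0.2, the two lowest values (2 and 6) and the two highest values (54 and 77) are dropped, and the remaining six values sum to 147, which gives 24.5.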

Median

The median is a value separating the higher half from the lower half of a data sample, a population, or a probability distribution. For a data set, it may be thought of as "the middle" value: if you sort the values of a collection by size, the median is the value in the central position. Therefore, as many values in the collection are larger than the median as are smaller. If there is an even number of observations, then there is no single middle value; the median is then usually defined to be the mean of the two middle values.

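median is available as a sequence (and other collections) extension. Note that the result has the same type as the input: for the integer list below, the two middle values are 23 and 24, and their mean is truncated to 23 by integer division.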

let median1 = 
    [10; 2; 19; 24; 6; 23; 47; 24; 54; 77;]
    |> Seq.median
23
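
For comparison, here is a minimal plain F# sketch of the definition given above, applied to the same numbers as floats. The helper medianOf is hypothetical and is not the FSharp.Stats implementation.

```fsharp
// Hypothetical helper, written without FSharp.Stats.
let medianOf (values: float list) =
    let sorted = values |> List.sort |> Array.ofList
    let n = sorted.Length
    if n = 0 then nan
    // Odd count: the single middle value.
    elif n % 2 = 1 then sorted.[n / 2]
    // Even count: the mean of the two middle values.
    else (sorted.[n / 2 - 1] + sorted.[n / 2]) / 2.0

let manualMedian =
    medianOf [10.; 2.; 19.; 24.; 6.; 23.; 47.; 24.; 54.; 77.]

printfn "%.1f" manualMedian
```

The sorted data has 23 and 24 in the two central positions, so the floating point median is 23.5.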

Harmonic mean

The harmonic mean can be expressed as the reciprocal of the arithmetic mean of the reciprocals of the given set of observations. It is typically appropriate for situations when the average of rates is desired.

\(H = \frac{n}{\frac1{x_1} + \frac1{x_2} + \cdots + \frac1{x_n}} = \frac{n}{\sum\limits_{i=1}^n \frac1{x_i}} = \left(\frac{\sum\limits_{i=1}^n x_i^{-1}}{n}\right)^{-1}.\)

meanHarmonic is available as a sequence (and other collections) extension, as well as meanHarmonicBy, which takes an additional converter function:

let harmonicMean1 = 
    [10.; 2.; 19.; 24.; 6.; 23.; 47.; 24.; 54.; 77.;]
    |> Seq.meanHarmonic
10.01109262
let harmonicMean2 = 
    [10; 2; 19; 24; 6; 23; 47; 24; 54; 77;]
    |> Seq.meanHarmonicBy float
10.01109262
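
The formula above translates almost directly into plain F#. The sketch below does not use FSharp.Stats, and the name manualHarmonic is illustrative only.

```fsharp
// Plain F# sketch of the harmonic mean; no FSharp.Stats required.
let values = [10.; 2.; 19.; 24.; 6.; 23.; 47.; 24.; 54.; 77.]

// Number of values divided by the sum of their reciprocals.
let manualHarmonic =
    float (List.length values) / (values |> List.sumBy (fun x -> 1.0 / x))

printfn "%.6f" manualHarmonic
```

Because small values have large reciprocals, the value 2 pulls the result well below the arithmetic mean of 28.6.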

Geometric mean

The geometric mean indicates the central tendency or typical value of a set of numbers by using the product of their values (as opposed to the arithmetic mean which uses their sum). The geometric mean is defined as the nth root of the product of n numbers:

\(\left(\prod_{i=1}^n x_i\right)^\frac{1}{n} = \sqrt[n]{x_1 x_2 \cdots x_n}\)

meanGeometric is available as a sequence (and other collections) extension, as well as meanGeometricBy, which takes an additional converter function:

let geometricMean1 = 
    [10.; 2.; 19.; 24.; 6.; 23.; 47.; 24.; 54.; 77.;]
    |> Seq.meanGeometric
18.9280882
let geometricMean2 = 
    [10; 2; 19; 24; 6; 23; 47; 24; 54; 77;]
    |> Seq.meanGeometricBy float
18.9280882
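
As a rough illustration of the formula above, the geometric mean can be computed in plain F# as the exponential of the mean of the logarithms, which is mathematically equivalent to the nth root of the product and avoids very large intermediate products. Whether FSharp.Stats computes it this way internally is an assumption; the name manualGeometric is illustrative only.

```fsharp
// Plain F# sketch of the geometric mean; no FSharp.Stats required.
// Only valid for strictly positive values.
let values = [10.; 2.; 19.; 24.; 6.; 23.; 47.; 24.; 54.; 77.]

// exp(mean(log x)) equals the nth root of the product of the values.
let manualGeometric =
    values |> List.averageBy log |> exp

printfn "%.6f" manualGeometric
```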

Dispersion

Dispersion (also called variability, scatter, or spread) is the extent to which a distribution is stretched or squeezed.

Range

The range of a set of data is the difference between the largest and smallest values.

range is available as a sequence (and other collections) extension, as well as rangeBy, which takes an additional converter function:

Note: instead of returning the absolute difference between the maximum and minimum values, these functions return an interval with these values as boundaries.

let range1 = 
    [10.; 2.; 19.; 24.; 6.; 23.; 47.; 24.; 54.; 77.;]
    |> Seq.range
Closed (2.0, 77.0)
let range2 = 
    [10; 2; 19; 24; 6; 23; 47; 24; 54; 77;]
    |> Seq.rangeBy float
Closed (2.0, 77.0)
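
If the scalar range from the definition above is needed rather than the interval, it is the difference between the two boundaries. The following plain F# sketch does not use FSharp.Stats or its Interval type; the names lower, upper and width are illustrative only.

```fsharp
// Plain F# sketch of the range as a single number; no FSharp.Stats required.
let values = [10.; 2.; 19.; 24.; 6.; 23.; 47.; 24.; 54.; 77.]

let lower = List.min values
let upper = List.max values

// Difference between the largest and the smallest value.
let width = upper - lower

printfn "min = %.1f, max = %.1f, range = %.1f" lower upper width
```

For this data the boundaries are 2 and 77, so the scalar range is 75.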

Variance and Standard Deviation

The variance

\(s_N^2 = \frac{1}{N} \sum_{i=1}^N \left(x_i - \bar{x}\right)^2\)

and the standard deviation

\(s_N = \sqrt{\frac{1}{N} \sum_{i=1}^N \left(x_i - \bar{x}\right)^2}\)

are measures of how far the values of a collection spread around their mean. The standard deviation has the same unit as the values of the collection, whereas the variance has the squared unit.

varPopulation and stDevPopulation are available as sequence (and other collections) extensions, as well as varPopulationBy and stDevPopulationBy, which take an additional converter function:

let data = [|1.;3.;5.;4.;2.;8.|]

let varPopulation = Seq.varPopulation data
5.138888889
let stdPopulation = Seq.stDevPopulation data
2.266911751

If the data are only a sample rather than the full population, one degree of freedom is lost because the mean is estimated from the same data, so the Bessel-corrected version of the calculation has to be used (it results in higher values):

\(s^2 = \frac{1}{N - 1} \sum_{i=1}^N \left(x_i - \bar{x}\right)^2\) for the unbiased variance estimation, and

\(s = \sqrt{\frac{1}{N-1} \sum_{i=1}^N \left(x_i - \bar{x}\right)^2}\) for the corrected standard deviation.

var and stDev are available as sequence (and other collections) extensions, as well as varBy and stDevBy, which take an additional converter function:

let varSample = Seq.var data
6.166666667
let stdSample = Seq.stDev data
2.483277404
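
The four formulas above differ only in the denominator and the final square root. The following plain F# sketch, which does not use FSharp.Stats, computes them side by side for the same data; all names in it are illustrative only.

```fsharp
// Plain F# sketch of population and sample variance; no FSharp.Stats required.
let data = [| 1.; 3.; 5.; 4.; 2.; 8. |]

let n = float data.Length
let xBar = Array.sum data / n

// Sum of squared deviations from the mean, shared by all four measures.
let sumSq = data |> Array.sumBy (fun x -> (x - xBar) * (x - xBar))

let varPop = sumSq / n            // denominator N
let varSamp = sumSq / (n - 1.0)   // Bessel's correction, denominator N - 1
let sdPop = sqrt varPop
let sdSamp = sqrt varSamp

printfn "population: var = %.6f, sd = %.6f" varPop sdPop
printfn "sample:     var = %.6f, sd = %.6f" varSamp sdSamp
```

The results agree with the values listed above, and the sample versions are larger because the same sum of squares is divided by 5 instead of 6.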

Coefficient of variation

The coefficient of variation is the mean-normalized standard deviation:

\(\widehat{c_{\rm v}} = \frac{s}{\bar{x}}\)

It describes the ratio of the standard deviation to the mean. It helps to compare the variability of measurements whose magnitudes differ. Use it only if the data are measured on a ratio scale (a meaningful zero and meaningful intervals).

cv is available as a sequence (and other collections) extension, as well as cvBy, which takes an additional converter function:

let sample1 = [1.;4.;2.;6.;5.;3.;2.;]
let sample2 = [13.;41.;29.;8.;52.;34.;25.;]

let cvSample1 = Seq.cv sample1
0.5476650327
let cvSample2 = Seq.cv sample2
0.5313890073
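
As a cross-check of the formula above, the coefficient of variation can be computed by hand in plain F# from the sample standard deviation (denominator N - 1) and the mean. The helper coefficientOfVariation is hypothetical and is not the FSharp.Stats implementation.

```fsharp
// Hypothetical helper, written without FSharp.Stats.
let coefficientOfVariation (values: float list) =
    let n = float (List.length values)
    let xBar = List.sum values / n
    // Sample standard deviation with Bessel's correction (denominator N - 1).
    let sd =
        values
        |> List.sumBy (fun x -> (x - xBar) * (x - xBar))
        |> fun sumSq -> sqrt (sumSq / (n - 1.0))
    sd / xBar

let manualCv1 = coefficientOfVariation [1.; 4.; 2.; 6.; 5.; 3.; 2.]
let manualCv2 = coefficientOfVariation [13.; 41.; 29.; 8.; 52.; 34.; 25.]

printfn "%.6f %.6f" manualCv1 manualCv2
```

Although the values in the second sample are roughly ten times larger, both samples have a similar coefficient of variation, which is the kind of comparison this measure is meant for.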