Deedle


Calculating frame and series statistics

The Stats type contains functions for fast calculation of statistics over series and frames, as well as over moving and expanding windows in a series. The standard statistical functions in the Stats type are overloaded and can be applied to both data frames and series. More advanced functionality is available only for series (but can easily be applied to frame columns using the Frame.getNumericCols function).

Series and frame statistics

In this section, we look at calculating simple statistics over data frames and series. An important aspect is the handling of missing values, so we demonstrate it using an air quality data set that contains missing values. The following snippet loads airquality.csv and shows the values in the Ozone column:

open Deedle
// root is a string prefix pointing at the sample data (defined elsewhere)
let air = Frame.ReadCsv(root + "airquality.csv", separators=";")
let ozone = air?Ozone

Keys     0    1   2   3   4    ...  150  151  152
Values   N/A  36  12  18  N/A  ...  14   18   20

Series statistics

Given the series ozone, we can use a number of Stats functions to calculate statistics. The following example creates a series (indexed by strings) that stores the mean, extremes and median of the input series:

series [
  "Mean" => round (Stats.mean ozone)
  "Max" => Stats.max ozone
  "Min" => Stats.min ozone
  "Median" => Stats.median ozone ]

Keys     Mean  Max  Min  Median
Values   42    168  1    31

To make the output simpler, we round the value of the mean (although the result is a floating point number). Note that the value is calculated from the available values in the series. All of the statistical functions skip over missing values in the input series.

As the example above shows, the results of Stats.max and Stats.min on a series are used directly as float values, next to the mean and the median. Statistical functions such as Stats.mean return nan when no values are available.
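The skipping of missing values can be checked on a tiny series. The following is a minimal sketch, assuming an F# script that references the Deedle NuGet package; the name ozoneSample and its values are hypothetical (they mimic the first few Ozone readings), and nan is used to represent a missing observation:

```fsharp
#r "nuget: Deedle"
open Deedle

// Hypothetical miniature of the Ozone column; nan is stored as a missing value.
let ozoneSample = series [ 0 => nan; 1 => 36.0; 2 => 12.0; 3 => 18.0; 4 => nan ]

// Only the three available values take part in the calculations.
let available = Stats.count ozoneSample     // number of non-missing values
let sampleMean = Stats.mean ozoneSample     // (36 + 12 + 18) / 3

// With no available values at all, the mean is nan.
let emptyMean = Stats.mean (series [ 0 => nan; 1 => nan ])

printfn "count = %d, mean = %g, mean of no values = %g" available sampleMean emptyMean
```

The count is 3 rather than 5, and the mean is 22 because the two missing readings do not contribute to either the sum or the number of observations.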

Frame statistics

Functions such as Stats.mean can be called on series, but also on entire data frames. In that case, they calculate the statistics for each column of a data frame and return Series<'C, float> where 'C is the column key of the original frame.

In the following snippet, we calculate the minimum, maximum, mean and standard deviation of all columns of the air data set and build a frame that shows the resulting series in four columns:

let info = 
  [ "Min" => Stats.min air
    "Max" => Stats.max air
    "Mean" => Stats.mean air
    "+/-" => Stats.stdDev air ] |> frame

          Min   Max    Mean     +/-
Ozone     1     168    42.14    33.13
Solar.R   7     334    185.93   90.06
Wind      1.7   20.7   9.96     3.52
Temp      56    97     77.88    9.47
Month     5     9      6.99     1.42
Day       1     31     15.8     8.86

Missing values are handled in the same way as when calculating statistics of a series and are skipped. If this is not desirable, you can use functions from the Series module for working with missing values to treat missing values in different ways.
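As an illustration of treating missing values differently, the sketch below compares the default behavior with two functions from the Series module, Series.fillMissingWith and Series.fillMissing. It assumes an F# script that references the Deedle NuGet package; the series gappy and its values are hypothetical:

```fsharp
#r "nuget: Deedle"
open Deedle

// Hypothetical sample with two missing observations (nan is treated as missing).
let gappy = series [ 0 => nan; 1 => 36.0; 2 => 12.0; 3 => 18.0; 4 => nan ]

// Default behavior: missing values are skipped, so the mean is 66 / 3.
let skipped = Stats.mean gappy

// Treat missing observations as zero: the mean becomes 66 / 5.
let asZero = Stats.mean (gappy |> Series.fillMissingWith 0.0)

// Carry the previous observation forward. The leading gap has no
// predecessor, so it stays missing and is still skipped: 84 / 4.
let carried = Stats.mean (gappy |> Series.fillMissing Direction.Forward)

printfn "skipped = %g, as zero = %g, carried forward = %g" skipped asZero carried
```

Which treatment is appropriate depends on what a missing value means in the data; filling with zero is rarely right for measurements such as ozone levels, and is shown here only to make the difference visible.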

The Stats type provides basic statistical functionality such as mean, standard deviation and variance, but also more advanced functions including skewness and kurtosis. You can find a complete list in the Series statistics and Frame statistics sections of the API reference.

Moving window statistics

The Stats type provides an efficient implementation of moving window statistics. The implementation uses an online algorithm so that it does not have to re-calculate the statistics for each window separately, but instead updates the value as it iterates over the input (and so this is faster than using Series.window).

The moving window function names are prefixed with the word moving and calculate moving statistics over a window of a fixed length. The following example calculates means over a moving window of length 3:

ozone
|> Stats.movingMean 3

Keys     0    1    2   3   4   ...  150  151  152
Values   N/A  N/A  24  22  15  ...  22   16   17.3333

The keys of the resulting series are the same as the keys of the input series. Statistical moving functions (count, sum, mean, variance, standard deviation, skewness and kurtosis) over a window of size n always mark the first n-1 values as missing (i.e. they only perform the calculation over complete windows). This explains why the value associated with the key 1 is N/A. For the key 2, the mean is calculated from all available values in the window, which is: (36+12)/2.
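To make the idea of an online update concrete, here is a minimal sketch in plain F# that does not use Deedle. It is not the library's implementation; the function name movingMeanSketch and the use of float option to model missing values are assumptions made for illustration. It keeps a running sum and count, adds the value entering the window, removes the value leaving it, and reports a result only for complete windows:

```fsharp
// Moving mean over a fixed-size window, updated incrementally.
// Missing values are modeled as None (an assumption of this sketch).
let movingMeanSketch (size: int) (values: float option []) : float option [] =
    let result : float option [] = Array.create values.Length None
    let mutable sum = 0.0     // running sum of available values in the window
    let mutable count = 0     // number of available values in the window
    for i in 0 .. values.Length - 1 do
        // Add the value entering the window on the right.
        match values.[i] with
        | Some v ->
            sum <- sum + v
            count <- count + 1
        | None -> ()
        // Remove the value leaving the window on the left.
        if i >= size then
            match values.[i - size] with
            | Some v ->
                sum <- sum - v
                count <- count - 1
            | None -> ()
        // Only complete windows with at least one available value give a result.
        if i >= size - 1 && count > 0 then
            result.[i] <- Some (sum / float count)
    result

let firstOzone = [| None; Some 36.0; Some 12.0; Some 18.0; None |]
let sketchMeans = movingMeanSketch 3 firstOzone
printfn "%A" sketchMeans
```

Applied to the first five Ozone values with a window of 3, the sketch yields missing, missing, 24, 22 and 15, which matches the output shown above. Each step does a constant amount of work regardless of the window size; a production implementation would also need to guard against the floating-point error that repeated addition and subtraction can accumulate.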

The boundary behavior of the functions that calculate minimum and maximum over a moving window differs. Rather than returning N/A for the first n-1 values, they return the extreme value over a smaller window:

ozone
|> Stats.movingMin 3

Keys     0    1   2   3   4   ...  150  151  152
Values   N/A  36  12  12  12  ...  14   14   14

Here, the first value is missing, because the one-element window containing just the first value contains only missing values. However, the value for the key 1 is available, because the two-element window (starting from the beginning of the series) contains one available value, 36.

Remarks

The windowing functions in the Stats type support efficient calculations over fixed-size windows specified by the size of the window. They also provide a single, fixed boundary behavior. If you need more complex windowing behavior (such as windows based on the distance between keys), different handling of boundaries, or chunking (calculation over adjacent chunks), you can use the chunking and windowing functions from the Series module, such as Series.windowSizeInto or Series.chunkSizeInto. For more information, see the Grouping, windowing and chunking section in the API reference.
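The following sketch shows how the two Series functions named above might be used to compute means over sliding windows and over adjacent chunks. It assumes an F# script that references the Deedle NuGet package; the series numbers is hypothetical, and Boundary.Skip is chosen here so that incomplete windows and chunks are dropped:

```fsharp
#r "nuget: Deedle"
open Deedle

// Hypothetical input: keys 0 .. 5 with values 1.0 .. 6.0.
let numbers = series [ for i in 0 .. 5 -> i => float (i + 1) ]

// Sliding windows of size 3; incomplete windows at the boundary are skipped.
let windowMeans =
    numbers
    |> Series.windowSizeInto (3, Boundary.Skip) (fun seg -> Stats.mean seg.Data)

// Adjacent, non-overlapping chunks of size 2.
let chunkMeans =
    numbers
    |> Series.chunkSizeInto (2, Boundary.Skip) (fun seg -> Stats.mean seg.Data)

printfn "%A" (List.ofSeq windowMeans.Values)
printfn "%A" (List.ofSeq chunkMeans.Values)
```

Because the aggregation function receives each window as a data segment, it can run any calculation on seg.Data, at the cost of recomputing the statistic for every window rather than updating it incrementally.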

Expanding windows

An expanding window starts as a single-element window at the beginning of a series and expands as it moves over the series. For time-series data ordered by time, this gives you statistics calculated over all previously known observations. In other words, the statistic is calculated for all values up to the current key and the result is attached to the key at the end of the window. The expanding window functions are prefixed with expanding.
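The definition can be illustrated with a small sketch in plain F# that does not use Deedle. It is only an illustration of the concept, not the library's code; the name expandingMeanSketch and the use of float option for missing values are assumptions. The result at each position is the mean of all available values seen so far:

```fsharp
// Expanding mean: the value at position i summarizes positions 0 .. i.
// Missing values are modeled as None (an assumption of this sketch).
let expandingMeanSketch (values: float option []) : float option [] =
    let result : float option [] = Array.create values.Length None
    let mutable sum = 0.0     // sum of all available values so far
    let mutable count = 0     // number of available values so far
    for i in 0 .. values.Length - 1 do
        match values.[i] with
        | Some v ->
            sum <- sum + v
            count <- count + 1
        | None -> ()
        // A result exists as soon as at least one value has been seen.
        if count > 0 then
            result.[i] <- Some (sum / float count)
    result

let leadingOzone = [| None; Some 36.0; Some 12.0; Some 18.0; None; Some 28.0 |]
let sketchExpanding = expandingMeanSketch leadingOzone
printfn "%A" sketchExpanding
```

For these six hypothetical inputs, which mimic the start of the Ozone column, the results are missing, 36, 24, 22, 22 and 23.5. A missing input does not change the running statistic, so the previous value is repeated.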

The following example demonstrates how to calculate expanding mean and expanding standard deviation over the Ozone series. The resulting series has the same keys as the input series. Here, we align the two series using a frame, so that we can easily see the results aligned:

let exp =
  [ "Ozone" => ozone 
    "Mean" => Stats.expandingMean(ozone)
    "+/-" => Stats.expandingStdDev(ozone) ] |> frame

       Ozone   Mean    +/-
0      N/A     N/A     N/A
1      36      36      N/A
2      12      24      16.97
3      18      22      12.49
4      N/A     22      12.49
5      28      23.5    10.63
6      23      23.4    9.21
7      19      22.67   8.43
...    ...     ...     ...
149    N/A     42.8    33.32
150    14      42.55   33.28
151    18      42.33   33.21
152    20      42.14   33.13

As the example illustrates, expanding window statistics typically return a series that starts with some missing values. Here, the first mean is missing (because the one-element window contains no values) and the first two standard deviations are missing (stdDev is defined only for two or more values). The only exception is expandingSum, because the sum of no elements is zero.

Multi-level indexed statistics

For a series with a multi-level (hierarchical) index, the functions prefixed with level provide a way to apply a statistical operation to a single level of the index. A series with a multi-level index can be created directly by using a tuple (such as 'K1 * 'K2) as the key, or it can be produced by a grouping operation such as Frame.groupRowsBy.
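A tuple-keyed series is the simplest way to try this out. The sketch below assumes an F# script that references the Deedle NuGet package; the series readings and its values are hypothetical. It builds a series keyed by (month, day) and aggregates over the first element of the key:

```fsharp
#r "nuget: Deedle"
open Deedle

// Hypothetical readings keyed by (month, day); the tuple key forms a two-level index.
let readings =
    series [ ("May", 1) => 10.0
             ("May", 2) => 20.0
             ("June", 1) => 30.0
             ("June", 2) => 50.0 ]

// Group by the first level of the key (the month) and take the mean of each group.
let monthly = readings |> Stats.levelMean fst

printfn "May = %g, June = %g" monthly.["May"] monthly.["June"]
```

The function passed to Stats.levelMean projects each key to the level of interest, so passing snd instead would aggregate by day across months.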

For example, you can create a two-level index that represents time-series data with the month as the first part of the key and the day as the second part of the key. Then you can use multi-level statistical functions to calculate means (and other statistics) for each month separately.

The following example demonstrates the idea: the air data set contains data for each day between May and September. We can create a frame with a two-level row key using Frame.indexRowsUsing and returning a tuple as the index:

open System.Globalization

let dateFormat = CultureInfo.CurrentCulture.DateTimeFormat
let byMonth = air |> Frame.indexRowsUsing (fun r ->
    dateFormat.GetMonthName(r.GetAs("Month")), r.GetAs<int>("Day"))

The type of the byMonth value is Frame<string * int, string> meaning that the row index has two levels. To make the output a little nicer, we use the GetMonthName function to turn the first level of the index into a string representing the month name.

We can now access individual columns and calculate statistics over the first level (individual months) using functions prefixed with level:

byMonth?Ozone
|> Stats.levelMean fst

Keys     May    June     July     August   September
Values   22.92  29.4444  59.1154  59.9615  31.4483

Currently, the Stats type does not include a function that would let you apply multi-level statistical functions to entire data frames, but this can easily be implemented using the Frame.getNumericCols function and Series.mapValues:

byMonth
|> Frame.sliceCols ["Ozone";"Solar.R";"Wind";"Temp"]
|> Frame.getNumericCols
|> Series.mapValues (Stats.levelMean fst)
|> Frame.ofRows

          May        June       July       August     September
Ozone     22.92      29.4444    59.1154    59.9615    31.4483
Solar.R   181.2963   190.1667   216.4839   171.8571   167.4333
Wind      11.6226    10.2667    8.9419     8.7935     10.18
Temp      65.5484    79.1       83.9032    83.9677    76.9

If we used Frame.getNumericCols directly, we would also calculate the mean of "Day" and "Month" columns, which does not make much sense in this example. For that reason, the snippet first calls sliceCols to get only relevant columns.
