Calculating frame and series statistics
The Stats type contains functions for fast calculation of statistics over series and frames as well as over a moving and an expanding window in a series.
The standard statistical functions that are available in the Stats type are overloaded and can be applied to both data frames and series. More advanced functionality is available only for series (but can be applied to frame columns easily using the Frame.getNumericCols function).
Series and frame statistics
In this section, we look at calculating simple statistics over data frames and series. An important aspect is the handling of missing values, so we demonstrate that using a data set about air quality that contains missing values. The following snippet loads airquality.csv and shows the values in the Ozone column:
Keys    0    1    2    3    4    ...  150  151  152
Values  N/A  36   12   18   N/A  ...  14   18   20
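The loading step described above can be sketched as follows. This is a minimal sketch rather than the original snippet: it assumes the Deedle NuGet package is referenced from an F# script, and it writes a hypothetical three-row stand-in for airquality.csv to a temporary file so that the example is self-contained. The file name, the comma separator and the use of "NA" to mark missing values are assumptions.

```fsharp
#r "nuget: Deedle"
open System.IO
open Deedle

// Hypothetical three-row stand-in for airquality.csv ("NA" marks a missing value).
let path = Path.Combine(Path.GetTempPath(), "airquality-sample.csv")
File.WriteAllLines(path, [| "Ozone,Wind,Month,Day"; "NA,7.4,5,1"; "36,8.0,5,2"; "12,12.6,5,3" |])

// Read the file into a frame with integer row keys and string column keys.
let air : Frame<int, string> = Frame.ReadCsv(path)

// The ? operator extracts a column as a series of floats.
let ozone : Series<int, float> = air?Ozone

// Print all values, including the missing ones (shown as None).
printfn "%A" (ozone |> Series.valuesAll |> List.ofSeq)
```

With the real data set, the path would point at airquality.csv instead of the temporary file.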
Series statistics
Given a series ozone, we can use a number of Stats functions to calculate statistics. The following example creates a series (indexed by strings) that stores the mean, extremes and median of the input series:
Keys    Mean  Max  Min  Median
Values  42    168  1    31
To make the output simpler, we round the value of the mean (although the result is a floating point number). Note that the value is calculated from the available values in the series. All of the statistical functions skip over missing values in the input series.
As the above example demonstrates, Stats.max and Stats.min return option<float> rather than just float. The result value is None when the series contains no values. This makes it possible to use the functions not just on floating point numbers, but also on series of integers and other types. Other statistical functions such as Stats.mean return nan when no values are available.
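To make the handling of missing values concrete, here is a minimal sketch that builds a small series in memory instead of loading the data set. It assumes the Deedle NuGet package is referenced from an F# script; the five values are the first five Ozone observations from the output above, and the names count, mean, median and emptyMean are illustrative only.

```fsharp
#r "nuget: Deedle"
open Deedle

// First five Ozone observations from the output above; None marks a missing value.
let ozone =
  Series.ofOptionalObservations
    [ 0, None; 1, Some 36.0; 2, Some 12.0; 3, Some 18.0; 4, None ]

let count  = Stats.count ozone    // number of available (non-missing) values
let mean   = Stats.mean ozone     // averaged over the available values only
let median = Stats.median ozone   // middle of the available values

// A series with no available values: Stats.mean yields nan rather than failing.
let empty : Series<int, float> = Series.ofOptionalObservations [ 0, None; 1, None ]
let emptyMean = Stats.mean empty

printfn "count=%d mean=%g median=%g emptyMean=%g" count mean median emptyMean
```

The two missing observations do not contribute to the count or to the mean, which is why the mean is taken over three values rather than five.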
Frame statistics
Functions such as Stats.mean can be called on series, but also on entire data frames. In that case, they calculate the statistics for each column of a data frame and return Series<'C, float> where 'C is the column key of the original frame.
In the following snippet, we calculate the minimum, maximum, mean and standard deviation of all columns of the air data set and build a frame that shows the resulting series in four columns:
         Min  Max  Mean    +/-
Ozone    1    168  42.14   33.13
Solar.R  7    334  185.93  90.06
Wind     1.7  20.7 9.96    3.52
Temp     56   97   77.88   9.47
Month    5    9    6.99    1.42
Day      1    31   15.8    8.86
Missing values are handled in the same way as when calculating statistics of a series and are skipped. If this is not desirable, you can use functions from the Series module for working with missing values to treat missing values in different ways.
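The per-column behavior, and one alternative treatment of missing values, can be sketched as follows. This is a minimal sketch assuming the Deedle NuGet package is referenced from an F# script. The Ozone values are the first five observations shown earlier, the Wind values are hypothetical, and filling with 0.0 via Series.fillMissingWith is just one possible choice, not a recommendation.

```fsharp
#r "nuget: Deedle"
open Deedle

// Ozone: first five observations shown earlier. Wind: hypothetical values.
let ozone = Series.ofOptionalObservations [ 0, None; 1, Some 36.0; 2, Some 12.0; 3, Some 18.0; 4, None ]
let wind  = Series.ofOptionalObservations [ 0, Some 8.0; 1, Some 10.0; 2, Some 12.0; 3, Some 10.0; 4, Some 10.0 ]

let air = frame [ "Ozone" => ozone; "Wind" => wind ]

// One statistic per column; the result is keyed by column name.
// Missing Ozone values are skipped.
let means = Stats.mean air

// Alternative treatment: replace missing values with 0.0 before averaging.
let filledMean = ozone |> Series.fillMissingWith 0.0 |> Stats.mean

printfn "Ozone=%g Wind=%g Ozone (filled)=%g"
  (Series.get "Ozone" means) (Series.get "Wind" means) filledMean
```

Filling with a constant changes the denominator from three to five, so the filled mean is lower than the mean over the available values.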
The Stats module provides basic statistical functionality such as mean, standard deviation and variance, but also more advanced functions including skewness and kurtosis. You can find a complete list in the Series statistics and Frame statistics sections of the API reference.
Moving window statistics
The Stats type provides an efficient implementation of moving window statistics. The implementation uses an online algorithm so that it does not have to recalculate the statistics for each window separately, but instead updates the value as it iterates over the input (and so this is faster than using Series.window).
The names of the moving window functions are prefixed with the word moving; they calculate statistics over a window of a fixed length. The following example calculates means over a moving window of length 3:
Keys    0    1    2    3    4    ...  150  151  152
Values  N/A  N/A  24   22   15   ...  22   16   17.3333
The keys of the resulting series are the same as the keys of the input series. Statistical moving functions (count, sum, mean, variance, standard deviation, skewness and kurtosis) over a window of size n always mark the first n-1 values as missing (i.e. they only perform the calculation over complete windows). This explains why the value associated with the key 1 is N/A. For the key 2, the mean is calculated from all available values in the window, which is: (36+12)/2.
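The boundary rule can be checked on the first five Ozone observations. The following is a minimal sketch assuming the Deedle NuGet package is referenced from an F# script; the in-memory series stands in for the full Ozone column.

```fsharp
#r "nuget: Deedle"
open Deedle

// First five Ozone observations; None marks a missing value.
let ozone =
  Series.ofOptionalObservations
    [ 0, None; 1, Some 36.0; 2, Some 12.0; 3, Some 18.0; 4, None ]

// Mean over a moving window of length 3.
// Keys 0 and 1 have no complete window, so they are missing in the result.
// Key 2 covers (N/A, 36, 12) and averages the two available values.
let moving = ozone |> Stats.movingMean 3

printfn "%A" (moving |> Series.valuesAll |> List.ofSeq)
```

Series.valuesAll is used here only to print missing values explicitly as None alongside the available ones.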
The boundary behavior of the functions that calculate minimum and maximum over a moving window differs. Rather than returning N/A for the first n-1 values, they return the extreme value over a smaller window:
Keys    0    1    2    3    4    ...  150  151  152
Values  N/A  36   12   12   12   ...  14   14   14
Here, the first value is missing, because the one-element window containing just the first value contains only missing values. However, a value is available for the key 1, because the two-element window (starting from the beginning of the series) contains two elements, one of which (36) is not missing.
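A minimal sketch of this boundary behavior, again on the first five Ozone observations and assuming the Deedle NuGet package is referenced from an F# script. Stats.movingMax is included on the assumption that it follows the same boundary rule as Stats.movingMin.

```fsharp
#r "nuget: Deedle"
open Deedle

// First five Ozone observations; None marks a missing value.
let ozone =
  Series.ofOptionalObservations
    [ 0, None; 1, Some 36.0; 2, Some 12.0; 3, Some 18.0; 4, None ]

// Extremes over a moving window of length 3. Incomplete windows at the
// beginning are evaluated over the elements that are available so far.
let movingMin = ozone |> Stats.movingMin 3
let movingMax = ozone |> Stats.movingMax 3

printfn "min: %A" (movingMin |> Series.valuesAll |> List.ofSeq)
printfn "max: %A" (movingMax |> Series.valuesAll |> List.ofSeq)
```

Only the key 0 is missing in both results, because its window contains no available value at all.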
Remarks
The windowing functions in the Stats type support efficient calculations over fixed-size windows specified by the size of the window. They also provide one fixed boundary behavior. If you need more complex windowing behavior (such as windows based on the distance between keys), different handling of boundaries, or chunking (calculation over adjacent chunks), you can use the chunking and windowing functions from the Series module, such as Series.windowSizeInto or Series.chunkSizeInto. For more information, see the Grouping, windowing and chunking section in the API reference.
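As a rough illustration of the more general functions, the sketch below computes means over windows and over chunks with an explicit boundary behavior. It assumes the Deedle NuGet package is referenced from an F# script. The choice of Boundary.AtBeginning and Boundary.AtEnding is only one possibility, and the aggregation function receives a data segment whose Data property holds the window as a series.

```fsharp
#r "nuget: Deedle"
open Deedle

// First five Ozone observations; None marks a missing value.
let ozone =
  Series.ofOptionalObservations
    [ 0, None; 1, Some 36.0; 2, Some 12.0; 3, Some 18.0; 4, None ]

// Windows of size 3. Boundary.AtBeginning also produces the shorter
// windows at the start of the series instead of skipping them.
let windowMeans =
  ozone |> Series.windowSizeInto (3, Boundary.AtBeginning) (fun seg -> Stats.mean seg.Data)

// Adjacent, non-overlapping chunks of size 2. With Boundary.AtEnding,
// the shorter leftover chunk is the one at the end of the series.
let chunkMeans =
  ozone |> Series.chunkSizeInto (2, Boundary.AtEnding) (fun seg -> Stats.mean seg.Data)

printfn "windows: %d, chunks: %d" windowMeans.KeyCount chunkMeans.KeyCount
```

Unlike the moving functions in Stats, this recalculates the statistic for every window, so it trades speed for flexibility.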
Expanding windows
An expanding window starts as a single-element window at the beginning of a series and expands as it moves over the series. For time-series data ordered by time, this gives you statistics calculated over all previously known observations. In other words, the statistic is calculated for all values up to the current key and the result is attached to the key at the end of the window. The expanding window functions are prefixed with expanding.
The following example demonstrates how to calculate expanding mean and expanding standard deviation over the Ozone series. The resulting series has the same keys as the input series. Here, we align the two series using a frame, so that we can easily see the results aligned:
     Ozone  Mean   +/-
0    N/A    N/A    N/A
1    36     36     N/A
2    12     24     16.97
3    18     22     12.49
4    N/A    22     12.49
5    28     23.5   10.63
6    23     23.4   9.21
7    19     22.67  8.43
...  ...    ...    ...
149  N/A    42.8   33.32
150  14     42.55  33.28
151  18     42.33  33.21
152  20     42.14  33.13
As the example illustrates, expanding window statistics typically return a series that starts with some missing values. Here, the first mean is missing (because the one-element window contains no values) and the first two standard deviations are missing (stdDev is defined only for two or more values). The only exception is expandingSum, because the sum of no elements is zero.
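The first rows of the table above can be reproduced with a small in-memory series. This is a minimal sketch assuming the Deedle NuGet package is referenced from an F# script; the six values are the first six Ozone observations from the table, and the frame is built only to align the three series.

```fsharp
#r "nuget: Deedle"
open Deedle

// First six Ozone observations from the table above; None marks a missing value.
let ozone =
  Series.ofOptionalObservations
    [ 0, None; 1, Some 36.0; 2, Some 12.0; 3, Some 18.0; 4, None; 5, Some 28.0 ]

// Statistics over all values from the start of the series up to each key.
let expMean = Stats.expandingMean ozone
let expStd  = Stats.expandingStdDev ozone

// Align the input and both results by key so they can be compared row by row.
let aligned = frame [ "Ozone" => ozone; "Mean" => expMean; "+/-" => expStd ]

printfn "%A" (expMean |> Series.valuesAll |> List.ofSeq)
printfn "%A" (expStd |> Series.valuesAll |> List.ofSeq)
```

At key 4 the input is missing, so both statistics simply carry over the values from key 3.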
Multi-level indexed statistics
For a series with a multi-level (hierarchical) index, the functions prefixed with level provide a way to apply a statistical operation on a single level of the index. A series with a multi-level index can be created directly by using a tuple (such as 'K1 * 'K2) as the key, or it can be produced by a grouping operation such as Frame.groupRowsBy.
For example, you can create a two-level index that represents time-series data with the month as the first part of the key and the day as the second part of the key. Then you can use multi-level statistical functions to calculate means (and other statistics) for each month separately.
The following example demonstrates the idea. The air data set contains data for each day between May and September. We can create a frame with a two-level row key using Frame.indexRowsUsing and returning a tuple as the index:
The type of the byMonth value is Frame<string * int, string> meaning that the row index has two levels. To make the output a little nicer, we use the GetMonthName function to turn the first level of the index into a string representing the month name.
We can now access individual columns and calculate statistics over the first level (individual months) using functions prefixed with level:
Keys    May    June     July     August   September
Values  22.92  29.4444  59.1154  59.9615  31.4483
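The same idea can be tried without the data set by creating a series with tuple keys directly. This is a minimal sketch assuming the Deedle NuGet package is referenced from an F# script; the readings and the name ozoneByMonth are hypothetical, and fst selects the first (month) level of the key.

```fsharp
#r "nuget: Deedle"
open Deedle

// Hypothetical readings keyed by (month name, day of month).
let ozoneByMonth =
  series [ ("May", 1) => 36.0
           ("May", 2) => 12.0
           ("June", 1) => 20.0
           ("June", 2) => 30.0 ]

// Group by the first level of the key and compute the mean of each group.
let monthlyMeans = ozoneByMonth |> Stats.levelMean fst

printfn "May=%g June=%g" (Series.get "May" monthlyMeans) (Series.get "June" monthlyMeans)
```

Passing snd instead of fst would group the same observations by day of month.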
Currently, the Stats type does not include a function that would let you apply multi-level statistical functions to entire data frames, but this can easily be implemented using the Frame.getNumericCols function and Series.mapValues:
         May       June      July      August    September
Ozone    22.92     29.4444   59.1154   59.9615   31.4483
Solar.R  181.2963  190.1667  216.4839  171.8571  167.4333
Wind     11.6226   10.2667   8.9419    8.7935    10.18
Temp     65.5484   79.1      83.9032   83.9677   76.9
If we used Frame.getNumericCols directly, we would also calculate the mean of the "Day" and "Month" columns, which does not make much sense in this example. For that reason, the snippet first calls sliceCols to get only the relevant columns.
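The pipeline described above can be sketched on a small hypothetical frame. This assumes the Deedle NuGet package is referenced from an F# script; the frame byMonth is built in memory with made-up values and only three columns, and the helper col is illustrative.

```fsharp
#r "nuget: Deedle"
open Deedle

// Two-level row keys: (month name, day of month).
let keys = [ ("May", 1); ("May", 2); ("June", 1); ("June", 2) ]

// Illustrative helper that pairs the shared keys with a list of values.
let col values = series (List.zip keys values)

// Hypothetical miniature version of the byMonth frame.
let byMonth =
  frame [ "Ozone" => col [ 36.0; 12.0; 20.0; 30.0 ]
          "Wind"  => col [ 8.0; 12.0; 9.0; 11.0 ]
          "Day"   => col [ 1.0; 2.0; 1.0; 2.0 ] ]

let monthly =
  byMonth
  |> Frame.sliceCols [ "Ozone"; "Wind" ]      // drop columns where a mean is not meaningful
  |> Frame.getNumericCols                     // series of numeric columns keyed by column name
  |> Series.mapValues (Stats.levelMean fst)   // per-month mean of each column
  |> Frame.ofRows                             // one row per original column, one column per month

printfn "Ozone in May: %g" (monthly?May |> Series.get "Ozone")
```

Because the per-column results are assembled with Frame.ofRows, the original column names become row keys and the months become columns, matching the layout of the table above.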