Deedle


Creating lazily loaded series

When loading data from an external data source (such as a database), you might want to create a virtual time series that represents the data source, but does not actually load the data until needed. If you apply some range restriction (like slicing) to the data series before using the values, then it is not necessary to load the entire data set into memory.

Deedle supports lazy loading through the DelayedSeries.FromValueLoader method. It returns an ordinary data series of type Series<K, V> which has a delayed internal representation.

Creating lazy series

We will not use a real database in this tutorial, but let's say that you have the following function which loads data for a given day range:

open System
open Deedle

/// Given a time range, generates random values for dates (at 12:00 AM)
/// starting with the day of the first date time and ending with the 
/// day after the second date time (to make sure they are in range)
let generate (low:DateTime) (high:DateTime) =
  let rnd = Random()
  let days = int (high.Date - low.Date).TotalDays + 1
  seq { for d in 0 .. days -> 
          KeyValue.Create(low.Date.AddDays(float d), rnd.Next()) }

Using random numbers as the source in this example is not entirely correct, because it means that we will get different values each time a new sub-range of the series is required - but it will suffice for the demonstration.
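If repeatable values matter, one option is to derive each value from its date. The following is a minimal sketch and not part of Deedle: generateStable is a hypothetical variant of generate, and seeding Random with the number of days since an arbitrary reference date (January 1, 2000) is an assumption made purely for illustration. It builds KeyValuePair values directly, which is the type that KeyValue.Create returns.

```fsharp
open System
open System.Collections.Generic

/// Hypothetical variant of 'generate': each value is derived from its date,
/// so repeated requests for the same day return the same number
let generateStable (low:DateTime) (high:DateTime) =
  let days = int (high.Date - low.Date).TotalDays + 1
  seq { for d in 0 .. days do
          let date = low.Date.AddDays(float d)
          // Seed a fresh generator with the day number (arbitrary reference date)
          let seed = int (date - DateTime(2000, 1, 1)).TotalDays
          yield KeyValuePair(date, Random(seed).Next()) }
```

The rest of this tutorial keeps using the simpler generate function.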

Now, to create a lazily loaded series, we need to open the Indices namespace, specify the minimal and maximal value of the series and use DelayedSeries.FromValueLoader:

open Deedle.Indices

// Minimal and maximal values that can be loaded from the series
let min, max = DateTime(2010, 1, 1), DateTime(2013, 1, 1)

// Create a lazy series for the given range
let ls = DelayedSeries.FromValueLoader(min, max, fun (lo, lob) (hi, hib) -> async { 
    printfn "Query: %A - %A" (lo, lob) (hi, hib)
    return generate lo hi })

To make the diagnostics easier, we print the requested range whenever a request is made. After running this code, you should not see any output yet. The last parameter to DelayedSeries.FromValueLoader is a function that takes four values, passed as two (key, boundary) pairs:

  • lo and hi specify the low and high boundaries of the range. Their type is the type of the key (e.g. DateTime in our example)
  • lob and hib are values of type BoundaryBehavior and can be either Inclusive or Exclusive. They specify whether the boundary value should be included or not.

Our sample function does not handle boundaries correctly: it always includes the boundary (and possibly more values). This is not a problem, because the lazy loader automatically skips over such values. If you want, though, you can use the lob and hib parameters to build a more efficient SQL query.
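A loader that does respect the boundaries might look like the sketch below. It is only an illustration under a few assumptions: inRange is a hypothetical helper, source is an in-memory list standing in for a real data store, and the script is assumed to run in F# Interactive with the Deedle NuGet package available. In a real application the same comparisons would typically be translated into the WHERE clause of the query.

```fsharp
#r "nuget: Deedle"
open System
open Deedle
open Deedle.Indices

/// Hypothetical helper: checks a key against the requested boundaries
let inRange (lo:DateTime, lob:BoundaryBehavior) (hi:DateTime, hib:BoundaryBehavior) (key:DateTime) =
  let aboveLow = if lob = BoundaryBehavior.Inclusive then key >= lo else key > lo
  let belowHigh = if hib = BoundaryBehavior.Inclusive then key <= hi else key < hi
  aboveLow && belowHigh

// In-memory stand-in for a real data store: ten days starting 1/1/2012
let source = [ for d in 0 .. 9 -> (DateTime(2012, 1, 1).AddDays(float d), d) ]

// Loader that returns only the keys inside the requested range
let exact = DelayedSeries.FromValueLoader(DateTime(2012, 1, 1), DateTime(2012, 1, 10), fun (lo, lob) (hi, hib) -> async {
    let rows =
      source
      |> Seq.filter (fun (k, _) -> inRange (lo, lob) (hi, hib) k)
      |> Seq.map (fun (k, v) -> KeyValue.Create(k, v))
    return rows })
```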

Using un-evaluated series

Let's now have a look at the operations that we can perform on un-evaluated series. Any operation that actually accesses values or keys of the series (such as Series.observations or lookup for a specific key) will force the evaluation of the series.

However, we can use range restrictions before accessing the data:

// Get series representing January 2012 (the upper bound, February 1, is inclusive)
let jan12 = ls.[DateTime(2012, 1, 1) .. DateTime(2012, 2, 1)]

// Further restriction - only first half of the month
let janHalf = jan12.[.. DateTime(2012, 1, 15)]

// Get value for a specific date
janHalf.[DateTime(2012, 1, 1)]
 Query: (1/1/2012, Inclusive) - (1/15/2012, Inclusive)
 val it : int = 1127670994

janHalf.[DateTime(2012, 1, 2)]
 val it : int = 560920727

As the Query line in the output shows, the series loaded data only for the 15-day range that we created by restricting the original series. When we requested another value within that range, it was already available and was returned immediately. Note that janHalf is restricted to the specified 15-day range, so we cannot access values outside of it. Also, accessing a single value loads the entire restricted series. The motivation is that you will probably need multiple values, so loading the whole range at once is likely cheaper.
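One way to check this caching behavior yourself is to record every loader call instead of printing it. The sketch below is illustrative only: queries, counted and firstWeek are hypothetical names, the loader returns the day offset as a dummy value, and the script assumes F# Interactive with the Deedle NuGet package available.

```fsharp
#r "nuget: Deedle"
open System
open Deedle

// Records the range of every loader call (for illustration only)
let queries = ResizeArray<DateTime * DateTime>()

let counted = DelayedSeries.FromValueLoader(DateTime(2012, 1, 1), DateTime(2012, 12, 31), fun (lo:DateTime, _) (hi:DateTime, _) -> async {
    queries.Add((lo, hi))
    let days = int (hi.Date - lo.Date).TotalDays
    // Dummy values: the offset of each day from the low boundary
    return seq { for d in 0 .. days -> KeyValue.Create(lo.Date.AddDays(float d), d) } })

// Restricting the range does not call the loader
let firstWeek = counted.[DateTime(2012, 3, 1) .. DateTime(2012, 3, 7)]

// The first lookup forces loading of the restricted range;
// the second one should be served from the data already loaded
let first = firstWeek.[DateTime(2012, 3, 1)]
let second = firstWeek.[DateTime(2012, 3, 2)]
printfn "Loader calls: %d" queries.Count
```

If the series behaves as described above, the loader is called once, for the restricted range only.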

Another operation that can be performed on an unevaluated series is to add it to a data frame with some existing key range:

// Create empty data frame for days of December 2011
let dec11 = Frame.ofRowKeys [ for d in 1 .. 31 -> DateTime(2011, 12, d) ]

// Add series as the 'Values' column to the data frame
dec11?Values <- ls
 Query: (12/1/2011, Inclusive) - (12/31/2011, Inclusive)

When adding a lazy series to a data frame, the series has to be evaluated (so that the values can be properly aligned), but it is first restricted to the range of the data frame. In the above example, only one month of data is loaded.
