Summary: This tutorial demonstrates how to access a public dataset for temperature data with FSharp.Data, how to smooth the data points with the Savitzky-Golay filter from FSharp.Stats and finally how to visualize the results with Plotly.NET.
The Savitzky-Golay filter is a type of low-pass filter that is particularly suited to smoothing noisy data. The main idea is to fit, for each point, a polynomial of fixed order (lower than the window size) by least squares over an odd-sized window centered at that point, and to replace the point with the value of the fitted polynomial. One advantage of the Savitzky-Golay filter is that high-frequency portions of the signal are not simply cut off, but are partly preserved by the polynomial regression. This allows the filter to preserve properties of the distribution such as relative maxima, minima, and dispersion, which conventional methods such as the moving average usually flatten or shift.
This is useful when trying to identify general trends in highly fluctuating data sets, or when smoothing out noise to make the minima and maxima of the data trend easier to find. To showcase this, we will plot a temperature dataset from the "Deutscher Wetterdienst", a German organization for climate data. We will do this for both the original data points and a smoothed version.
Figure: the moving window for polynomial regression used in the Savitzky-Golay filter (source: Wikipedia).
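To make the difference from a moving average concrete, the following minimal sketch applies both filters to a small parabola-shaped peak. It does not use FSharp.Stats. Instead it hard-codes the widely tabulated Savitzky-Golay smoothing weights for a 5-point window and a quadratic fit, (-3, 12, 17, 12, -3)/35. The helper name applyWindow and the test signal are illustrative assumptions, and points near the edges are simply left unchanged.

```fsharp
// Savitzky-Golay smoothing weights for a 5-point window and a quadratic (or cubic) fit.
let sgWeights = [| -3.0; 12.0; 17.0; 12.0; -3.0 |] |> Array.map (fun w -> w / 35.0)
// A 5-point moving average weights every point in the window equally.
let maWeights = Array.create 5 0.2

/// Hypothetical helper: weighted sum over a centered window for every interior point.
/// Points too close to the edges are returned unchanged (a simplification).
let applyWindow (weights: float[]) (signal: float[]) =
    let half = weights.Length / 2
    signal
    |> Array.mapi (fun i x ->
        if i < half || i >= signal.Length - half then
            x
        else
            weights
            |> Array.mapi (fun j w -> w * signal.[i - half + j])
            |> Array.sum)

// Parabola-shaped peak with its maximum of 16 at index 4.
let peak = Array.init 9 (fun i -> 16.0 - float ((i - 4) * (i - 4)))

let sgSmoothed = applyWindow sgWeights peak
let maSmoothed = applyWindow maWeights peak

// The quadratic fit reproduces the peak height (about 16),
// while the moving average flattens it to about 14.
printfn "Savitzky-Golay: %.2f, moving average: %.2f" sgSmoothed.[4] maSmoothed.[4]
```

Because the test signal is itself a quadratic, the local quadratic fit recovers it exactly at interior points. Real data will not be reproduced exactly, but maxima and minima are distorted less than with equal weights.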
#r "nuget: Deedle.Interactive, 3.0.0"
#r "nuget: FSharp.Stats, 0.4.3"
#r "nuget: Plotly.NET.Interactive, 4.0.0"
#r "nuget: FSharp.Data, 4.2.7"
open FSharp.Data
open Deedle
open Plotly.NET
We will start by retrieving the data. This is done with the FSharp.Data package and will return a single string in the original format.
// Get data from Deutscher Wetterdienst
// Explanation for Abbreviations: https://www.dwd.de/DE/leistungen/klimadatendeutschland/beschreibung_tagesmonatswerte.html
let rawData = FSharp.Data.Http.RequestString @"https://raw.githubusercontent.com/fslaborg/datasets/main/data/WeatherDataAachen-Orsbach_daily_1year.txt"
// Print the first 1001 characters (indices 0 to 1000) to the console.
rawData.[..1000]
<pre> Tageswerte der Station 10505 Aachen-Orsbach
STAT JJJJMMDD QN TG TN TM TX RFM FM FX SO NM RR PM
----- -------- -- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
10505 20210728 1 13.5 14.4 17.7 21.5 73.3 3.0 13.9 4.9 6.4 5.5 983.8
10505 20210727 1 14.3 15.2 17.9 23.4 81.6 3.0 10.1 3.8 6.1 1.2 984.8
10505 20210726 1 14.8 16.0 18.5 22.6 79.3 3.0 9.8 5.2 6.7 1.4 983.7
10505 20210725 1 11.5 14.2 18.7 25.2 79.4 2.0 11.5 7.9 6.1 981.7
10505 20210724 1 12.4 13.8 17.9 21.8 83.8 2.0 10.2 0.3 7.5 2.8 982.4
10505 20210723 1 9.1 11.8 18.5 24.8 68.7 2.0 7.4 13.8 1.3 0.0 989.3
10505 20210722 1 10.1 12.9 19.1 24.1 67.9 2.0 5.8 11.0 1.1 0.0 994.2
10505 20210721 1 8.4 11.1
open System
open System.Text.RegularExpressions
/// Tuple of 4 data arrays (dates, mean, maximal and minimal temperature) covering more than a year of measurements.
let processedData =
    // First split the huge string into lines
    rawData.Split([|'\n'|], StringSplitOptions.RemoveEmptyEntries)
    // Skip the first 5 rows until the real data starts, also skip the last row (length-2) to remove a "</pre>" at the end
    |> fun arr -> arr.[5..arr.Length-2]
    |> Array.map (fun data ->
        // Regex pattern that matches runs of whitespace
        let whitespacePattern = @"\s+"
        // Tells the regex to replace every match with a single tab character
        let matchEval = MatchEvaluator(fun _ -> "\t")
        // The original data columns are separated by varying amounts of whitespace.
        // Therefore, we replace any run of whitespace with a single tab,
        // using the pattern above and the .NET namespace "System.Text.RegularExpressions".
        let tabSeparated = Regex.Replace(data, whitespacePattern, matchEval)
        tabSeparated
        // Splitting each row at the tabs yields its values, which we can access by index.
        |> fun dataStr -> dataStr.Split([|"\t"|], StringSplitOptions.RemoveEmptyEntries)
        |> fun dataArr ->
            // Second value is the date of measurement, which we parse to the DateTime type
            DateTime.ParseExact(dataArr.[1], "yyyyMMdd", Globalization.CultureInfo.InvariantCulture),
            // 5th value (TN) is the minimal temperature on that date.
            float dataArr.[4],
            // 6th value (TM) is the average temperature over 24 timepoints on that date.
            float dataArr.[5],
            // 7th value (TX) is the maximal temperature on that date.
            float dataArr.[6]
    )
    // Sort by date
    |> Array.sortBy (fun (day,_,_,_) -> day)
    // Unzip the array of value tuples to make the individual series easier to access.
    // Note the output order: dates, mean (tm), maximum (tx), minimum (tn).
    |> fun arr ->
        arr |> Array.map (fun (day,_,_,_) -> day.ToShortDateString()),
        arr |> Array.map (fun (_,_,tm,_) -> tm),
        arr |> Array.map (fun (_,_,_,tx) -> tx),
        arr |> Array.map (fun (_,tn,_,_) -> tn)
processedData
(System.String[], System.Double[], System.Double[], System.Double[])
Item1 | [ 16.03.20, 17.03.20, 18.03.20, 19.03.20, 20.03.20, 21.03.20, 22.03.20, 23.03.20, 24.03.20, 25.03.20, 26.03.20, 27.03.20, 28.03.20, 29.03.20, 30.03.20, 31.03.20, 01.04.20, 02.04.20, 03.04.20, 04.04.20 ... (480 more) ] |
Item2 | [ 10.3, 9, 10.5, 11.1, 6.6, 4.8, 3.7, 4, 4.8, 4.6, 4.2, 7.2, 7.7, 2.1, 2.2, 4, 4.3, 4.5, 6.6, 7.7 ... (480 more) ] |
Item3 | [ 14.9, 13.8, 15.8, 15.2, 8.7, 8.9, 8.8, 9.6, 10.4, 10.6, 9.4, 13.3, 13.9, 4.6, 6.7, 7.9, 9.4, 9.4, 10.1, 13.3 ... (480 more) ] |
Item4 | [ 6, 4.7, 6.7, 7.1, 3.1, 1.1, -0.3, -1.1, -2, -2.5, -2.2, 0.1, 2.4, -1.3, -3.9, 0.1, -1.6, -1.6, 2.6, 0.1 ... (480 more) ] |
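The whitespace handling above can be tried on a single row in isolation. The sketch below uses Regex.Split as a compact alternative to replacing whitespace with tabs and then splitting. The sample row is copied from the raw output shown earlier, the irregular spacing between its values is an assumption made to mimic the original column layout, and parseRow is a hypothetical helper name.

```fsharp
open System
open System.Globalization
open System.Text.RegularExpressions

/// Hypothetical helper: parses one data row into (date, min, mean, max) temperature.
let parseRow (row: string) =
    // Split at any run of whitespace; Trim avoids empty leading or trailing entries.
    let values = Regex.Split(row.Trim(), @"\s+")
    let day = DateTime.ParseExact(values.[1], "yyyyMMdd", CultureInfo.InvariantCulture)
    // Values 5 to 7 (TN, TM, TX) hold the minimal, mean and maximal temperature.
    day, float values.[4], float values.[5], float values.[6]

// Sample row taken from the raw data; the irregular spacing is an assumption.
let sampleRow =
    "10505 20210728  1   13.5   14.4   17.7   21.5   73.3    3.0   13.9    4.9    6.4    5.5  983.8"

let sampleDay, sampleTn, sampleTm, sampleTx = parseRow sampleRow

printfn "%s: min %.1f, mean %.1f, max %.1f" (sampleDay.ToString("yyyy-MM-dd")) sampleTn sampleTm sampleTx
```

Like the tutorial code, this sketch addresses values by position, so it relies on no measurement being missing in the first seven columns of a row.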
Next, we create a chart function with Plotly.NET to produce a visual representation of our data set.
open Plotly.NET
open Plotly.NET.LayoutObjects
// Because our data set is already rather wide, we move the legend from the right side of the plot
// into the plot area: anchored at the top, with its left edge at the horizontal center.
let legend =
    Legend.init(
        YAnchor = StyleParam.YAnchorPosition.Top,
        Y = 0.99,
        XAnchor = StyleParam.XAnchorPosition.Left,
        X = 0.5
    )
/// This function will take 'processedData' as input and return a range chart with a line for the average temperature
/// and a differently colored area for the range between minimal and maximal temperature on that date.
let createTempChart (days,tm,tmUpper,tmLower) =
    Chart.Range(
        // data arrays
        x = days,
        y = tm,
        upper = tmUpper,
        lower = tmLower,
        mode = StyleParam.Mode.Lines,
        MarkerColor = Color.fromString "#3D1244",
        RangeColor = Color.fromString "#F99BDE",
        // Name for line in legend
        Name = "Average temperature over 24 timepoints each day",
        // Name for lower point when hovering over chart
        LowerName = "Min temp",
        // Name for upper point when hovering over chart
        UpperName = "Max temp"
    )
    // Configure the chart with the legend from above
    |> Chart.withLegend legend
    // Add name to y axis
    |> Chart.withYAxisStyle("daily temperature [°C]")
    |> Chart.withSize (1000.,600.)
/// Chart for original data set
processedData
|> createTempChart
As you can see, the raw data looks chaotic and is difficult to analyze. Trends are hidden in the daily temperature fluctuations, and correlating events with temperature becomes difficult. Next, we smooth the data so that the temperature trends become clearly visible.
We will use the Signal.Filtering.savitzkyGolay function from FSharp.Stats.
Parameters, in the order in which they are passed:
- windowSize (int): the length of the window. Must be an odd integer.
- polynomial order (int): the order of the polynomial used in the filtering. Must be less than windowSize - 1.
- derivative order (int): the order of the derivative to compute (default = 0 means only smoothing).
- scaling factor (int): this factor influences the amplitude when Savitzky-Golay is used for derivation.
- signal (float array): the values of the time history of the signal.
open FSharp.Stats
let smootheTemp ws order (days,tm,tmUpper,tmLower) =
    let tm' = Signal.Filtering.savitzkyGolay ws order 0 1 tm
    let tmUpper' = Signal.Filtering.savitzkyGolay ws order 0 1 tmUpper
    let tmLower' = Signal.Filtering.savitzkyGolay ws order 0 1 tmLower
    days,tm',tmUpper',tmLower'
processedData
|> smootheTemp 31 4
|> createTempChart
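When experimenting with other window sizes and polynomial orders, it can help to check the two documented constraints before calling the filter. The following is a minimal sketch of such a check. validateSavitzkyGolayArgs is a hypothetical helper and not part of FSharp.Stats, and whether the library performs similar checks itself is not covered here.

```fsharp
/// Hypothetical helper (not part of FSharp.Stats): checks the documented constraints
/// on window size and polynomial order and returns them unchanged if they are valid.
let validateSavitzkyGolayArgs (windowSize: int) (order: int) =
    if windowSize <= 0 || windowSize % 2 = 0 then
        Error (sprintf "windowSize must be a positive odd integer, but was %i" windowSize)
    elif order < 0 || order >= windowSize - 1 then
        Error (sprintf "order must be non-negative and less than windowSize - 1, but was %i" order)
    else
        Ok (windowSize, order)

// The settings used in this tutorial satisfy both constraints.
let tutorialSettings = validateSavitzkyGolayArgs 31 4
// An even window size is rejected.
let evenWindow = validateSavitzkyGolayArgs 30 4
// An order that is too high for the window is rejected.
let orderTooHigh = validateSavitzkyGolayArgs 5 4

printfn "%A" tutorialSettings
printfn "%A" evenWindow
printfn "%A" orderTooHigh
```

A larger window averages over more days and yields a smoother curve, while a higher polynomial order lets the fit follow short-term fluctuations more closely.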