Glad to see you here! Now that you have found out about FsLab, this section illustrates how the FsLab packages work together and can be used to tackle practical data science challenges. Note that every package used throughout the tutorial has its own documentation, so if you are interested in Deedle, FSharp.Stats or Plotly.NET, feel free to take a deeper dive.
FsLab is meant to be a project incubation space and can be thought of as a safe haven for both package developers and package users by providing guidelines and tutorials. Packages provided by the community can be used on their own, in combination with other FsLab packages, and also in combination with any other .NET Standard 2.0 compatible package. From F# 5.0 on, packages can be referenced in scripts and notebooks using the following notation:
// Packages hosted by the FsLab community
#r "nuget: Deedle.Interactive, 3.0.0"
#r "nuget: FSharp.Stats"
// Third-party .NET packages
#r "nuget: Plotly.NET.Interactive, 4.0.0"
#r "nuget: FSharpAux"
#r "nuget: FSharp.Data"
After referencing the packages, one can access their namespaces and use the provided functions. In the following example we open the top-level namespace and then use a function provided by the FSharp.Stats package to calculate a factorial:
open FSharp.Stats
SpecialFunctions.Factorial.factorial 3
Equipped with these packages, we are now ready to tackle the promise made in the first paragraph: solving a practical data science problem. We will start by retrieving the data using the FSharp.Data package. Subsequently, we will use Deedle, a powerful data frame library that makes tabular data accessible through data frame programming. (Note that the chosen names give insight into their type. In addition, since F# is a strongly typed language with compiler type inference, we can hover over a value at any time to see its assigned type.)
open FSharp.Data
open Deedle
// Retrieve data using the FSharp.Data package
let rawData = Http.RequestString @"https://raw.githubusercontent.com/dotnet/machinelearning/master/test/data/housing.txt"
// And create a data frame object using the ReadCsvString method provided by Deedle.
// Note: Of course you can directly provide the path to a local source.
let df = Frame.ReadCsvString(rawData,hasHeaders=true,separators="\t")
df
// Note: If you are working outside of a notebook, you may want to print the dataframe using
// df.Print true
Row | | MedianHomeValue (Decimal) | CrimesPerCapita (Decimal) | PercentResidental (Decimal) | PercentNonRetail (Decimal) | CharlesRiver (int) | NitricOxides (Decimal) | RoomsPerDwelling (Decimal) | PercentPre40s (Decimal) | EmploymentDistance (Decimal) | HighwayDistance (int) | TaxRate (Decimal) | TeacherRatio (Decimal) | BlackIndex (Decimal) | PercentLowIncome (Decimal) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -> | 24.00 | 0.00632 | 18.00 | 2.310 | 0 | 0.5380 | 6.5750 | 65.20 | 4.0900 | 1 | 296.0 | 15.30 | 396.90 | 4.98 |
1 | -> | 21.60 | 0.02731 | 0.00 | 7.070 | 0 | 0.4690 | 6.4210 | 78.90 | 4.9671 | 2 | 242.0 | 17.80 | 396.90 | 9.14 |
2 | -> | 34.70 | 0.02729 | 0.00 | 7.070 | 0 | 0.4690 | 7.1850 | 61.10 | 4.9671 | 2 | 242.0 | 17.80 | 392.83 | 4.03 |
3 | -> | 33.40 | 0.03237 | 0.00 | 2.180 | 0 | 0.4580 | 6.9980 | 45.80 | 6.0622 | 3 | 222.0 | 18.70 | 394.63 | 2.94 |
4 | -> | 36.20 | 0.06905 | 0.00 | 2.180 | 0 | 0.4580 | 7.1470 | 54.20 | 6.0622 | 3 | 222.0 | 18.70 | 396.90 | 5.33 |
: | | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
501 | -> | 22.40 | 0.06263 | 0.00 | 11.930 | 0 | 0.5730 | 6.5930 | 69.10 | 2.4786 | 1 | 273.0 | 21.00 | 391.99 | 9.67 |
502 | -> | 20.60 | 0.04527 | 0.00 | 11.930 | 0 | 0.5730 | 6.1200 | 76.70 | 2.2875 | 1 | 273.0 | 21.00 | 396.90 | 9.08 |
503 | -> | 23.90 | 0.06076 | 0.00 | 11.930 | 0 | 0.5730 | 6.9760 | 91.00 | 2.1675 | 1 | 273.0 | 21.00 | 396.90 | 5.64 |
504 | -> | 22.00 | 0.10959 | 0.00 | 11.930 | 0 | 0.5730 | 6.7940 | 89.30 | 2.3889 | 1 | 273.0 | 21.00 | 393.45 | 6.48 |
505 | -> | 11.90 | 0.04741 | 0.00 | 11.930 | 0 | 0.5730 | 6.0300 | 80.80 | 2.5050 | 1 | 273.0 | 21.00 | 396.90 | 7.88 |
506 rows x 14 columns
0 missing values
The data set of choice is the Boston housing data set. As you can see from the printed output, it consists of 506 rows. Each row represents a house in the Boston city area and each column encodes a feature/variable, such as the number of rooms per dwelling (RoomsPerDwelling), the median value of owner-occupied homes in $1000's (MedianHomeValue) and even a variable indicating whether the house borders the Charles River (CharlesRiver, value = 1) or not (CharlesRiver, value = 0).
Let's say that in our analysis we are only interested in the variables just described. Furthermore, we only want to keep rows for houses that do NOT border the river. We can use Deedle to easily create a new frame that fulfills our criteria. In this example, we cast the value of the column "CharlesRiver" to type bool, which illustrates how data frame programming can become type-safe using Deedle.
let housesNotAtRiver =
    df
    |> Frame.sliceCols ["RoomsPerDwelling";"MedianHomeValue";"CharlesRiver"]
    |> Frame.filterRowValues (fun s -> s.GetAs<bool>("CharlesRiver") |> not )
housesNotAtRiver
Row | | RoomsPerDwelling (Decimal) | MedianHomeValue (Decimal) | CharlesRiver (int) |
---|---|---|---|---|
0 | -> | 6.5750 | 24.00 | 0 |
1 | -> | 6.4210 | 21.60 | 0 |
2 | -> | 7.1850 | 34.70 | 0 |
3 | -> | 6.9980 | 33.40 | 0 |
4 | -> | 7.1470 | 36.20 | 0 |
: | | ... | ... | ... |
501 | -> | 6.5930 | 22.40 | 0 |
502 | -> | 6.1200 | 20.60 | 0 |
503 | -> | 6.9760 | 23.90 | 0 |
504 | -> | 6.7940 | 22.00 | 0 |
505 | -> | 6.0300 | 11.90 | 0 |
471 rows x 3 columns
0 missing values
Exploratory data analysis is an approach favored by many. To meet this demand, we strongly recommend the use of Plotly.NET. The following snippet illustrates how we can access a column of a data frame and create an interactive chart in no time. Since we might want an idea of the distribution of the house prices, a histogram can come in handy:
open Plotly.NET
// Note that we explicitly specify that we want to work with the values as floats.
// Since the row identity is not needed anymore when plotting the distribution, we can
// directly convert the collection to an F# sequence.
let pricesNotAtRiver : seq<float> =
    housesNotAtRiver
    |> Frame.getCol "MedianHomeValue"
    |> Series.values
// In a histogram the x axis carries the binned values and the y axis the counts per bin.
Chart.Histogram(pricesNotAtRiver)
|> Chart.withXAxisStyle("median value of owner occupied homes in 1000s")
|> Chart.withYAxisStyle("count")
// Note: If you are working outside of a notebook, you may want to show the chart in browser using
// |> Chart.show
Since Plotly charts are interactive, they invite us to combine multiple charts. Let's repeat the filter step and see if houses that are located at the river show a similar distribution:
let housesAtRiver =
    df
    |> Frame.sliceCols ["RoomsPerDwelling";"MedianHomeValue";"CharlesRiver"]
    |> Frame.filterRowValues (fun s -> s.GetAs<bool>("CharlesRiver"))
let pricesAtRiver : seq<float> =
    housesAtRiver
    |> Frame.getCol "MedianHomeValue"
    |> Series.values
// Overlay both histograms; the x axis carries the binned values, the y axis the counts per bin.
[
    Chart.Histogram(pricesNotAtRiver, Opacity = 0.66, OffsetGroup = "A")
    |> Chart.withTraceInfo "not at river"
    Chart.Histogram(pricesAtRiver, Opacity = 0.66, OffsetGroup = "A")
    |> Chart.withTraceInfo "at river"
]
|> Chart.combine
|> Chart.withXAxisStyle("median value of owner occupied homes in 1000s")
|> Chart.withYAxisStyle("count")
The interactive chart allows us to compare the distributions directly. We can now form our own ideas of the city of Boston, the sampled area, just by looking at the data, e.g.:
Assuming that the sampling process was homogeneous, the observation that many more of the sampled houses are not located on the riverside could indicate that a spot on the river is a scarce commodity. This is also supported by the tails of the distribution: houses located at the river seem to be given a head start in their assigned value, since the distribution of the riverside houses is truncated on the left.
Suppose we have a customer who wants two models, one to predict the price of a house at the riverside and one to predict the price if this is not the case. We can meet this demand by using FSharp.Stats in combination with Deedle. Of course, we need a variable that is indicative of the house price, so let's check if the number of rooms per dwelling correlates with the house value:
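The visual impression from the histograms can be cross-checked with a few summary numbers. The following is a minimal sketch in plain F# without any package references; `summarize` is a hypothetical helper introduced here for illustration and is not part of Deedle, FSharp.Stats or Plotly.NET. The sample input reuses the five MedianHomeValue entries of rows 0 to 4 from the printed frame above.

```fsharp
// Hypothetical helper (not a library function): basic summary of a price sequence.
let summarize (prices: seq<float>) =
    let sorted = prices |> Seq.sort |> Seq.toArray
    let n = sorted.Length
    if n = 0 then invalidArg "prices" "at least one value is required"
    // Median: middle element, or the mean of the two middle elements for even n.
    let median =
        if n % 2 = 1 then sorted.[n / 2]
        else (sorted.[n / 2 - 1] + sorted.[n / 2]) / 2.0
    {| Count = n; Min = sorted.[0]; Median = median; Mean = Array.average sorted |}

// MedianHomeValue of rows 0 to 4 as printed in the frame above.
let firstFive = summarize [ 24.0; 21.6; 34.7; 33.4; 36.2 ]
```

Applied to `pricesNotAtRiver` and `pricesAtRiver`, the `Count` fields would show the imbalance between the two groups, and comparing the `Min` fields would be one way to look at the left tails of the two distributions.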
open FSharp.Stats
open FSharpAux
open FSharp.Stats.Correlation
let pricesAll :Series<int,float> =
    df
    |> Frame.getCol "MedianHomeValue"
let roomsPerDwellingAll :Series<int,float> =
    df
    |> Frame.getCol "RoomsPerDwelling"
let correlation =
    let tmpPrices, tmpRooms =
        Series.zipInner pricesAll roomsPerDwellingAll
        |> Series.values
        |> Seq.unzip
    Seq.pearson tmpPrices tmpRooms
correlation
So indeed, the number of rooms per dwelling shows a positive correlation with the house prices. With a Pearson correlation of ~0.7 it does not explain the house prices completely. This is not surprising, as one of our hypotheses is that the location (riverside or not) also influences the price. However, it should be sufficient to create a linear model.
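To make the number returned by `Seq.pearson` less of a black box, the Pearson coefficient can be written out by hand: the sum of the products of the deviations from the means, divided by the square root of the product of the two sums of squared deviations. The following is a minimal sketch in plain F# with no package references; `pearsonByHand` and the toy inputs are hypothetical and introduced for illustration only, they are not part of FSharp.Stats or the housing data.

```fsharp
// Pearson correlation written out by hand (illustration only; the tutorial
// itself uses Seq.pearson from FSharp.Stats).
let pearsonByHand (xs: float list) (ys: float list) =
    if xs.Length <> ys.Length || xs.IsEmpty then
        invalidArg "ys" "both inputs need the same, non-zero length"
    let meanX = List.average xs
    let meanY = List.average ys
    // Numerator: sum of the products of the deviations from the means.
    let sumXY = List.map2 (fun x y -> (x - meanX) * (y - meanY)) xs ys |> List.sum
    // Denominator terms: sums of squared deviations.
    let sumXX = xs |> List.sumBy (fun x -> (x - meanX) ** 2.0)
    let sumYY = ys |> List.sumBy (fun y -> (y - meanY) ** 2.0)
    sumXY / sqrt (sumXX * sumYY)

// Hypothetical toy data, not taken from the housing data set.
let perfectlyLinear = pearsonByHand [ 1.0; 2.0; 3.0 ] [ 2.0; 4.0; 6.0 ]
let perfectlyInverse = pearsonByHand [ 1.0; 2.0; 3.0 ] [ 6.0; 4.0; 2.0 ]
```

The two toy cases mark the ends of the scale (1 for a perfectly increasing linear relation, -1 for a perfectly decreasing one), which puts the value of ~0.7 reported above into context.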
So now we will use FSharp.Stats to build the two linear models ordered by the hypothetical customer. We start by defining a function that performs the fitting and plots the result:
open Fitting.LinearRegression.OLS
let predictPricesByRooms description data =
    let pricesAll :Series<_,float> =
        data
        |> Frame.getCol "MedianHomeValue"
    let roomsPerDwellingAll :Series<_,float> =
        data
        |> Frame.getCol "RoomsPerDwelling"
    let fit =
        let tmpRooms, tmpPrices =
            Series.zipInner roomsPerDwellingAll pricesAll
            |> Series.sortBy fst
            |> Series.values
            |> Seq.unzip
        let coeffs = Linear.Univariable.fit (vector tmpRooms) (vector tmpPrices)
        let predictedPrices = tmpRooms |> Seq.map (Linear.Univariable.predict coeffs)
        [
            Chart.Point(tmpRooms,tmpPrices)
            |> Chart.withTraceInfo (sprintf "%s: data" description )
            Chart.Line(tmpRooms,predictedPrices)
            |> Chart.withTraceInfo (sprintf "%s: coefficients: intercept:%f, slope:%f" description coeffs.[0] coeffs.[1])
        ]
        |> Chart.combine
        |> Chart.withXAxisStyle("rooms per dwelling")
        |> Chart.withYAxisStyle("median value")
    fit
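For readers who want to see what a univariable least-squares fit computes, the textbook closed form (slope = covariation of x and y divided by the variation of x, intercept = mean of y minus slope times mean of x) can be sketched in plain F#. This is an illustration under stated assumptions, not the FSharp.Stats implementation: `fitLineByHand`, `predictByHand` and the toy data are hypothetical names and values introduced here.

```fsharp
// Closed-form ordinary least squares for a single predictor (illustration only;
// the tutorial itself uses Linear.Univariable.fit from FSharp.Stats).
let fitLineByHand (xs: float list) (ys: float list) =
    if xs.Length <> ys.Length || xs.Length < 2 then
        invalidArg "ys" "both inputs need the same length of at least two"
    let meanX = List.average xs
    let meanY = List.average ys
    let sumXY = List.map2 (fun x y -> (x - meanX) * (y - meanY)) xs ys |> List.sum
    let sumXX = xs |> List.sumBy (fun x -> (x - meanX) ** 2.0)
    let slope = sumXY / sumXX
    let intercept = meanY - slope * meanX
    intercept, slope

// Evaluate the fitted line at x.
let predictByHand (intercept: float, slope: float) (x: float) =
    intercept + slope * x

// Hypothetical toy data lying exactly on y = 1 + 2x.
let toyCoefficients = fitLineByHand [ 1.0; 2.0; 3.0 ] [ 3.0; 5.0; 7.0 ]
let toyPrediction = predictByHand toyCoefficients 4.0
```

The returned pair is ordered as intercept first and slope second, mirroring the order in which the trace label above prints `coeffs.[0]` and `coeffs.[1]`.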
Afterwards, we can apply the function on our prepared datasets and have a look at the model and especially the model coefficients.
[
    predictPricesByRooms "not at river" housesNotAtRiver
    predictPricesByRooms "at river" housesAtRiver
]
|> Chart.combine
|> Chart.withSize(1200.,700.)
Both models approximate the data reasonably well. When we inspect the coefficients, we see that the models differ only slightly in slope but differ in offset by ~7.5. This observation complements the insights gained from the exploratory data analysis using the histogram!