(** [![Binder](https://fslab.org/images/badge-binder.svg)](https://mybinder.org/v2/gh/fslaborg/fslaborg.github.io/gh-pages?filepath=content/tutorials/002_clustering_kMeans.ipynb)  [![Script](https://fslab.org/images/badge-script.svg)](https://fslab.org/content/tutorials/002_clustering_kMeans.fsx)  [![Notebook](https://fslab.org/images/badge-notebook.svg)](https://fslab.org/content/tutorials/002_clustering_kMeans.ipynb)

# Clustering with FSharp.Stats I: k-means

_Summary:_ This tutorial demonstrates k-means clustering with FSharp.Stats and how to visualize the results with Plotly.NET.

## Introduction

Clustering methods can be used to group the elements of a large data set based on their similarity. Elements sharing similar properties cluster together and can be reported as a coherent group. k-means clustering is a frequently used technique that partitions the given data into k clusters, with similar elements grouped within each cluster but high variation between the clusters. The algorithm to cluster an n-dimensional dataset can be fully described in the following 4 steps:

1. Initialize k n-dimensional centroids that are randomly distributed over the data range.
2. Calculate the distance of each point to all centroids and assign it to the nearest one.
3. Reposition all centroids by calculating the average point of each cluster.
4. Repeat steps 2-3 until convergence.

### Centroid initiation

Since the random initiation of centroids may influence the result, a second initiation algorithm (_cvmax_) is available that extracts a set of medians from the dimension with maximum variance to initialize the centroids.

### Distance measure

While several distance metrics can be used (e.g. Manhattan distance or correlation measures), the Euclidean distance is usually preferred. In practice the squared Euclidean distance is recommended: omitting the square root does not change which centroid is nearest but saves computation time.
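The distance computation at the core of step 2 can be sketched in plain F# (an illustrative helper, not part of FSharp.Stats; the library's ready-made metrics in `FSharp.Stats.ML.DistanceMetrics` are used later in this tutorial):

```fsharp
// Squared Euclidean distance between two points of equal dimension.
// Since the square root is strictly monotone, omitting it does not
// change which centroid is nearest.
let euclideanSquared (a: float []) (b: float []) =
    Array.fold2 (fun acc x y -> acc + (x - y) * (x - y)) 0.0 a b

euclideanSquared [|1.0; 2.0|] [|4.0; 6.0|] // 25.0 (the Euclidean distance would be 5.0)
```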
For demonstration of k-means clustering, the classic iris data set is used, which consists of 150 records, each of which contains four measurements and a species identifier.

## Referencing packages

```fsharp
// Packages hosted by the Fslab community
#r "nuget: Deedle"
#r "nuget: FSharp.Stats"
// third party .net packages
#r "nuget: Plotly.NET, 2.0.0-preview.16"
#r "nuget: Plotly.NET.Interactive, 2.0.0-preview.12"
#r "nuget: FSharp.Data"
```

## Loading data
*)

open FSharp.Data
open Deedle

// Retrieve data using the FSharp.Data package and read it as dataframe using the Deedle package
let rawData = Http.RequestString @"https://raw.githubusercontent.com/fslaborg/datasets/main/data/iris.csv"
let df = Frame.ReadCsvString(rawData)

df.Print()
(* output:
       sepal_length sepal_width petal_length petal_width species    
0   -> 5.5          2.4         3.8          1.1         versicolor 
1   -> 4.9          3.1         1.5          0.1         setosa     
2   -> 7.6          3           6.6          2.1         virginica  
3   -> 5.6          2.8         4.9          2           virginica  
4   -> 6.1          3           4.9          1.8         virginica  
5   -> 6.3          3.4         5.6          2.4         virginica  
6   -> 6.2          2.8         4.8          1.8         virginica  
7   -> 7.2          3.2         6            1.8         virginica  
8   -> 6.9          3.2         5.7          2.3         virginica  
9   -> 4.9          3           1.4          0.2         setosa     
10  -> 5.4          3.9         1.7          0.4         setosa     
11  -> 7            3.2         4.7          1.4         versicolor 
12  -> 6.1          3           4.6          1.4         versicolor 
13  -> 5.4          3.7         1.5          0.2         setosa     
14  -> 7.2          3.6         6.1          2.5         virginica  
:      ...          ...         ...          ...         ...        
135 -> 7.7          3.8         6.7          2.2         virginica  
136 -> 5.7          3           4.2          1.2         versicolor 
137 -> 6.4          2.7         5.3          1.9         virginica  
138 -> 5.8          2.7         3.9          1.2         versicolor 
139 -> 5            2.3         3.3          1           versicolor 
140 -> 6.7          3.1         4.4          1.4         versicolor 
141 -> 6.3          3.3         6            2.5         virginica  
142 -> 4.9          2.4         3.3          1           versicolor 
143 -> 5.8          2.8         5.1          2.4         virginica  
144 -> 5            3.3         1.4          0.2         setosa     
145 -> 7.7          2.6         6.9          2.3         virginica  
146 -> 5.7          2.6         3.5          1           versicolor 
147 -> 5.9          3           5.1          1.8         virginica  
148 -> 6.8          3.2         5.9          2.3         virginica  
149 -> 5            3.6         1.4          0.2         setosa*)
(**
Let's take a first look at the data with heatmaps using Plotly.NET. Each of the 150 records consists of four measurements and a species identifier.
Since the species identifiers occur several times (_Iris-virginica_, _Iris-versicolor_, and _Iris-setosa_), we create unique labels by appending the row index to the species identifier.
*)

open Plotly.NET

let colNames = ["sepal_length";"sepal_width";"petal_length";"petal_width"]

// isolate data as float [] []
let data = 
    Frame.dropCol "species" df
    |> Frame.toJaggedArray

// isolate labels as seq<string>
let labels = 
    Frame.getCol "species" df
    |> Series.values
    |> Seq.mapi (fun i s -> sprintf "%s_%i" s i)

let dataChart = 
    Chart.Heatmap(data,colNames=colNames,rowNames=labels)
    // required to fit the species identifier on the left side of the heatmap
    |> Chart.withMarginSize(Left=100.)
    |> Chart.withTitle "raw iris data"
(* output:
*)
(**
## Clustering

The function that performs k-means clustering can be found at `FSharp.Stats.ML.Unsupervised.IterativeClustering.kmeans`. It requires four input parameters:

1. Centroid initiation method
2. Distance measure (from `FSharp.Stats.ML.DistanceMetrics`)
3. Data to cluster as `float [] []`, where each entry of the outer array is a sequence of coordinates
4. _k_, the number of clusters that are desired
*)

open FSharp.Stats
open FSharp.Stats.ML
open FSharp.Stats.ML.Unsupervised

// For random cluster initiation use randomInitFactory:
let rnd = System.Random()
let randomInitFactory : IterativeClustering.CentroidsFactory<float []> = 
    IterativeClustering.randomCentroids rnd

// For assisted cluster initiation use cvmaxFactory:
//let cvmaxFactory : IterativeClustering.CentroidsFactory<float []> = 
//    IterativeClustering.intitCVMAX

let distanceFunction = DistanceMetrics.euclideanNaNSquared

let kmeansResult = 
    IterativeClustering.kmeans distanceFunction randomInitFactory data 4

(**
After all centroids are set, the affiliation of a datapoint to a cluster can be determined by minimizing the distance of the respective point to each of the centroids. A function realizing this mapping is integrated in the `kmeansResult`.
*)

let clusteredIrisData =
    Seq.zip labels data
    |> Seq.map (fun (species,dataPoint) -> 
        let clusterIndex,centroid = kmeansResult.Classifier dataPoint
        clusterIndex,species,dataPoint)

// Each datapoint is now associated with its cluster index, species identifier, and coordinates.
(* output:
2, "versicolor_0", [|5.5; 2.4; 3.8; 1.1|]
4, "setosa_1", [|4.9; 3.1; 1.5; 0.1|]
1, "virginica_2", [|7.6; 3.0; 6.6; 2.1|]
3, "virginica_3", [|5.6; 2.8; 4.9; 2.0|]
3, "virginica_4", [|6.1; 3.0; 4.9; 1.8|]
1, "virginica_5", [|6.3; 3.4; 5.6; 2.4|]
3, "virginica_6", [|6.2; 2.8; 4.8; 1.8|]
...
*)
(**
## Visualization of the clustering result as heatmap

The datapoints are sorted according to their associated cluster index and visualized in a combined heatmap.
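As an aside, the classifier embedded in `kmeansResult` can also assign a cluster to a new, unseen measurement (the values below are made up for illustration and depend on the `kmeansResult` defined above):

```fsharp
// Hypothetical new flower: sepal length, sepal width, petal length, petal width
let newPoint = [|5.8; 2.7; 4.1; 1.0|]
// The classifier returns the index of the nearest centroid together with the centroid itself
let assignedCluster, nearestCentroid = kmeansResult.Classifier newPoint
```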
*)

let clusterChart = 
    clusteredIrisData
    //sort all data points according to their assigned cluster number
    |> Seq.sortBy (fun (clusterIndex,label,dataPoint) -> clusterIndex)
    |> Seq.unzip3
    |> fun (_,labels,d) -> 
        Chart.Heatmap(d,colNames=colNames,rowNames=labels)
        // required to fit the species identifier on the left side of the heatmap
        |> Chart.withMarginSize(Left=100.)
        |> Chart.withTitle "clustered iris data (k-means clustering)"
(* output:
*)
(**
To visualize the result in a three-dimensional chart, three of the four measurements are isolated after clustering and visualized as 3D-scatter plot.
*)

let clusterChart3D = 
    //group clusters
    clusteredIrisData
    |> Seq.groupBy (fun (clusterIndex,label,dataPoint) -> clusterIndex)
    //for each cluster generate a scatter plot
    |> Seq.map (fun (clusterIndex,cluster) -> 
        cluster
        |> Seq.unzip3
        |> fun (clusterIndex,label,data) -> 
            let clusterName = sprintf "cluster %i" (Seq.head clusterIndex)
            //for 3 dimensional representation isolate sepal length, petal length, and petal width
            let truncData = data |> Seq.map (fun x -> x.[0],x.[2],x.[3]) 
            Chart.Scatter3D(truncData,mode=StyleParam.Mode.Markers,Name = clusterName,MultiText=label)
        )
    |> Chart.combine
    |> Chart.withTitle "isolated coordinates of clustered iris data (k-means clustering)"
    |> Chart.withXAxisStyle colNames.[0]
    |> Chart.withYAxisStyle colNames.[2]
    |> Chart.withZAxisStyle colNames.[3]
(* output:
*)
(**
### Optimal cluster number

The identification of the optimal cluster number _k_ in terms of the average squared distance of each point to its centroid can be realized by performing the clustering over a range of _k_'s multiple times and taking the _k_ according to the elbow criterion. Further, more robust and advanced techniques for cluster number determination can be found [here](https://fslab.org/FSharp.Stats/Clustering.html#Determining-the-optimal-number-of-clusters).
*)

let getBestkMeansClustering bootstraps k =
    let dispersions =
        Array.init bootstraps (fun _ -> 
            IterativeClustering.kmeans distanceFunction randomInitFactory data k
            )
        |> Array.map (fun clusteringResult -> IterativeClustering.DispersionOfClusterResult clusteringResult)
    Seq.mean dispersions,Seq.stDev dispersions

let iterations = 10
let maximalK = 10

let bestKChart = 
    [2 .. maximalK]
    |> List.map (fun k -> 
        let mean,stdev = getBestkMeansClustering iterations k
        k,mean,stdev
        )
    |> List.unzip3
    |> fun (ks,means,stdevs) -> 
        Chart.Line(ks,means)
        |> Chart.withYErrorStyle(stdevs)
        |> Chart.withXAxisStyle "k"
        |> Chart.withYAxisStyle "average dispersion"
        |> Chart.withTitle "iris data set average dispersion per k"
(* output:
*)
(**
## Limitations

1. Outliers have a strong influence on the positioning of the centroids.
2. Determining the correct number of clusters in advance is critical. Often it is chosen according to the number of classes present in the dataset, which isn't in the spirit of clustering procedures.

## Notes

- Please note that depending on what data you want to cluster, a column-wise z-score normalization may be required. In the presented example, differences in sepal width have a reduced influence because the absolute variation is low.

## References

- FSharp.Stats documentation, fslaborg, https://fslab.org/FSharp.Stats/Clustering.html
- Shraddha and Saganna, A Review On K-means Data Clustering Approach, International Journal of Information & Computation Technology, Vol:4 No:17, 2014
- Moth'd Belal, A New Algorithm for Cluster Initialization, International Journal of Computer and Information Engineering, Vol:1 No:4, 2007
- Singh et al., K-means with Three different Distance Metrics, International Journal of Computer Applications, 2013, DOI:10.5120/11430-6785
- Kodinariya and Makwana, Review on Determining of Cluster in K-means Clustering, International Journal of Advance Research in Computer Science and Management Studies, 2013

## Further reading

Examples are taken from the [FSharp.Stats documentation](https://fslab.org/FSharp.Stats/Clustering.html), which covers various techniques for an optimal cluster number determination.

The next article in this series covers [hierarchical clustering using FSharp.Stats](003_clustering_hierarchical.html).
*)
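(**
The column-wise z-score normalization mentioned in the notes above can be sketched as follows (an illustrative helper in plain F#, not part of FSharp.Stats; it assumes data of type `float [] []` as used throughout this tutorial):

```fsharp
// Column-wise z-score normalization: subtract each column's mean and
// divide by its (population) standard deviation, so every feature
// contributes on the same scale to the distance computation.
let zScoreNormalize (data: float [] []) =
    let nCols = data.[0].Length
    let means =
        Array.init nCols (fun j -> data |> Array.averageBy (fun row -> row.[j]))
    let stdevs =
        Array.init nCols (fun j ->
            sqrt (data |> Array.averageBy (fun row -> (row.[j] - means.[j]) ** 2.0)))
    data |> Array.map (Array.mapi (fun j v -> (v - means.[j]) / stdevs.[j]))
```

After normalization every column has mean 0 and standard deviation 1, so features measured on larger absolute scales no longer dominate the Euclidean distance.
*)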