In this step-by-step tutorial, I will show you how to perform cluster analysis in ML.NET using the Iris dataset. We will then use the trained model to predict which cluster a new flower belongs to.
Prerequisites
Visual Studio 2017. You can use the free Community edition.
We will cover the following subtopics:
- Understand Your Dataset
- Create a Console Application in Visual Studio
- Create the Classes
- Define the Data and Model Paths
- Create a Machine Learning Context
- Load the Data
- Create a Learning Pipeline
- Train and Save the Model
- Make Prediction
- Next Steps
1. Understand Your Dataset
The dataset is a collection of measurements of iris flowers. There are 5 columns and 150 rows. You can download it from here. Save it in a local folder.
The five columns of the dataset represent, in order:
- sepal length
- sepal width
- petal length
- petal width
- class of the iris flower (there are three classes: Iris-setosa, Iris-versicolor and Iris-virginica)
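To make the column layout concrete, here is a minimal sketch of reading one row by hand. ML.NET does this for us later in the tutorial, so the IrisLineParser class and its Parse method are hypothetical helpers for illustration only. The sample row mirrors the setosa measurements used in the prediction step.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of ML.NET: splits one row of iris.data
// into its four numeric features and its class label.
public static class IrisLineParser
{
    public static float[] Parse(string line, out string label)
    {
        string[] parts = line.Split(',');
        if (parts.Length != 5)
        {
            throw new FormatException("Expected 5 columns but found " + parts.Length + ".");
        }

        var features = new float[4];
        for (int i = 0; i < 4; i++)
        {
            // InvariantCulture makes "5.1" parse the same way on every machine.
            features[i] = float.Parse(parts[i], CultureInfo.InvariantCulture);
        }

        // The fifth column is the class name, for example "Iris-setosa".
        label = parts[4].Trim();
        return features;
    }
}
```

Calling IrisLineParser.Parse("5.1,3.5,1.4,0.2,Iris-setosa", out label) yields the four measurements in the order listed above, with the class name in label. Clustering uses only the four numeric columns.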
2. Create a Console Application in Visual Studio
Follow the steps below to create the application:
Step 1: Create a console application in Visual Studio.
Step 2: Add a folder to the application and name it ‘Data’.
Step 3: Copy the iris.data file into this folder.
Step 4: Set the ‘Copy to Output Directory’ property of the file to ‘Copy if newer’.
Step 5: Migrate the references from ‘packages.config’ to ‘PackageReference’ (right-click on References and choose ‘Migrate packages.config to PackageReference’). See procedure here.
Step 6: Add the Microsoft.ML NuGet package to your project (right-click on your project and choose ‘Manage NuGet Packages’).
3. Create the Classes
Now we will add two classes to our project. One class represents the features (sepal length, sepal width, petal length, petal width). The second class represents the prediction.
Follow the steps below.
Step 1: Right-click on the project > Add > Class. Name it IrisData. The content of the class is shown below:
public class IrisData
{
    // Each field maps to a column index in iris.data.
    [LoadColumn(0)]
    public float SepalLength;

    [LoadColumn(1)]
    public float SepalWidth;

    [LoadColumn(2)]
    public float PetalLength;

    [LoadColumn(3)]
    public float PetalWidth;
}
Step 2: Add a second class. Name it ClusterPrediction. The content is shown below.
public class ClusterPrediction
{
    [ColumnName("PredictedLabel")]
    public uint PredictedClusterId;

    [ColumnName("Score")]
    public float[] Distances;
}
The ClusterPrediction class is the output of the model when given a single instance of an iris. The two columns are explained as follows:
- PredictedLabel: This field contains the ID of the predicted cluster
- Score: This field contains an array with distances from the three cluster centers.
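The relationship between the two fields can be sketched in plain C#: the predicted cluster is the one whose center is closest to the flower. The NearestCluster class below is a hypothetical illustration, not ML.NET code. It assumes Euclidean distance and cluster IDs that start at 1; the exact distance measure reported by ML.NET in Score may differ (for example, it may be squared).

```csharp
using System;

// Hypothetical illustration of how a Score array relates to a predicted cluster ID.
public static class NearestCluster
{
    // Computes the Euclidean distance from the point to each cluster center.
    public static float[] Distances(float[] point, float[][] centroids)
    {
        var distances = new float[centroids.Length];
        for (int c = 0; c < centroids.Length; c++)
        {
            double sum = 0;
            for (int i = 0; i < point.Length; i++)
            {
                double diff = point[i] - centroids[c][i];
                sum += diff * diff;
            }
            distances[c] = (float)Math.Sqrt(sum);
        }
        return distances;
    }

    // Returns the ID of the closest cluster. IDs start at 1 (an assumption of this sketch).
    public static uint NearestId(float[] distances)
    {
        int best = 0;
        for (int c = 1; c < distances.Length; c++)
        {
            if (distances[c] < distances[best])
            {
                best = c;
            }
        }
        return (uint)(best + 1);
    }
}
```

With three cluster centers, Distances returns an array of three values and NearestId picks the position of the smallest one, which is how the Distances and PredictedClusterId fields relate to each other.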
Step 3: Ensure the following namespace is imported in both class files
using Microsoft.ML.Data;
4. Define the Data and Model Paths
This should be done in the Program.cs file.
We will add two static string fields that hold the path to the dataset and the path where the model will be saved.
Step 1: Open the Program.cs file
Step 2: Add the following code above the Main method
static readonly string _dataPath = Path.Combine(Environment.CurrentDirectory, "Data", "iris.data");
static readonly string _modelPath = Path.Combine(Environment.CurrentDirectory, "Data", "IrisClusteringModel.zip");
Step 3: Add the following using statements
using System;
using System.IO;
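Because the data path depends on the file being copied to the output directory, it can help to fail early with a clear message when the file is missing. The following is a minimal sketch; DataFileCheck and RequireFile are hypothetical names, not part of ML.NET or of the original project.

```csharp
using System;
using System.IO;

// Hypothetical helper: verifies that a data file exists before it is loaded.
public static class DataFileCheck
{
    // Returns the full path if the file exists; otherwise throws with a hint.
    public static string RequireFile(string path)
    {
        if (!File.Exists(path))
        {
            throw new FileNotFoundException(
                "Data file not found. Check the 'Copy to Output Directory' property of the file.",
                path);
        }
        return Path.GetFullPath(path);
    }
}
```

In this project you could call DataFileCheck.RequireFile(_dataPath) at the start of the Main method, before any data is loaded.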
5. Create a Machine Learning Context
The MLContext represents the environment in which the model is developed and trained. It is loosely comparable to a TensorFlow or base environment in Anaconda, or to DbContext in Entity Framework. It provides logging as well as the entry points for data loading, model training and prediction.
Follow the steps below:
Step 1: Ensure that the following namespaces are added to the Program.cs file.
using Microsoft.ML;
using Microsoft.ML.Data;
Step 2: In the Main method, write the following line:
var mlContext = new MLContext(seed: 0);
6. Load the Data
Now we load the data using the LoadFromTextFile() method, available through the Data property of the MLContext. In this case, we load the data from the iris.data file. The LoadFromTextFile() method returns an IDataView object.
Use the code below
IDataView dataView = mlContext.Data.LoadFromTextFile<IrisData>(_dataPath, hasHeader: false, separatorChar: ',');
7. Create a Learning Pipeline
The learning pipeline is a chain of steps that the data flows through, somewhat like a neural network made up of layers from the input layer through the hidden layers to the output layer. There are two steps in creating the learning pipeline:
- concatenate the feature set (of 4 columns) into one single column to be used by a clustering trainer
- use a KMeansTrainer trainer to train the model
Use the code below to create a learning pipeline
string featuresColumnName = "Features";
var pipeline = mlContext.Transforms
    .Concatenate(featuresColumnName, "SepalLength", "SepalWidth", "PetalLength", "PetalWidth")
    .Append(mlContext.Clustering.Trainers.KMeans(featuresColumnName, numberOfClusters: 3));
As the code shows, the number of clusters is set to 3. The K-Means clustering algorithm requires the number of clusters to be specified up front.
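To see why the number of clusters has to be chosen in advance, here is a minimal sketch of the basic K-Means loop (Lloyd's algorithm) in plain C#. It is not ML.NET's implementation: the SimpleKMeans class is hypothetical, and it naively uses the first k points as starting centers, whereas a production trainer initializes more carefully.

```csharp
using System;

// Hypothetical, simplified K-Means for illustration only.
public static class SimpleKMeans
{
    // Returns the 0-based cluster index of each point; the final centers come back in 'centroids'.
    public static int[] Cluster(float[][] points, int k, int maxIterations, out float[][] centroids)
    {
        int dim = points[0].Length;

        // Naive initialization (an assumption of this sketch): the first k points.
        centroids = new float[k][];
        for (int c = 0; c < k; c++)
        {
            centroids[c] = (float[])points[c].Clone();
        }

        var assignments = new int[points.Length];
        for (int iter = 0; iter < maxIterations; iter++)
        {
            // Assignment step: attach every point to its nearest center.
            bool changed = false;
            for (int p = 0; p < points.Length; p++)
            {
                int best = 0;
                float bestDist = float.MaxValue;
                for (int c = 0; c < k; c++)
                {
                    float d = SquaredDistance(points[p], centroids[c]);
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                if (assignments[p] != best) { assignments[p] = best; changed = true; }
            }

            // Stop once no point has moved to a different cluster.
            if (!changed && iter > 0) break;

            // Update step: move each center to the mean of its points.
            for (int c = 0; c < k; c++)
            {
                var sum = new float[dim];
                int count = 0;
                for (int p = 0; p < points.Length; p++)
                {
                    if (assignments[p] != c) continue;
                    for (int i = 0; i < dim; i++) sum[i] += points[p][i];
                    count++;
                }
                if (count == 0) continue; // leave the center of an empty cluster unchanged
                for (int i = 0; i < dim; i++) centroids[c][i] = sum[i] / count;
            }
        }
        return assignments;
    }

    private static float SquaredDistance(float[] a, float[] b)
    {
        float sum = 0;
        for (int i = 0; i < a.Length; i++)
        {
            float diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }
}
```

The value of k fixes how many centers exist before the loop starts; the algorithm only moves those centers, it never adds or removes one. That is why numberOfClusters must be supplied to the trainer.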
8. Train and Save the Model
Every machine learning model has to be trained on a training dataset. After training, we save the model so that it is ready for use. To train the model, call the Fit() method of the pipeline object as shown below:
var model = pipeline.Fit(dataView);
This code may take a while to execute, depending on the size of the dataset. When it completes, the model is ready and we can save it. The code below saves the model to the file specified by _modelPath.
using (var fileStream = new FileStream(_modelPath, FileMode.Create, FileAccess.Write, FileShare.Write))
{
    mlContext.Model.Save(model, dataView.Schema, fileStream);
}
9. Make Prediction Using the Model
To make a prediction using the model, we use the PredictionEngine<TSrc, TDst> class. It takes the input type and the output type as type parameters. You use the PredictionEngine to make a prediction on a single input instance. You can also use the PredictionEnginePool, which is thread-safe.
Step 1: Create the PredictionEngine using the code below:
var predictor = mlContext.Model.CreatePredictionEngine<IrisData, ClusterPrediction>(model);
Step 2: Create a file TestIrisData.cs and add the following code:
static class TestIrisData
{
    // A sample flower used to test the trained model.
    internal static readonly IrisData Setosa = new IrisData
    {
        SepalLength = 5.1f,
        SepalWidth = 3.5f,
        PetalLength = 1.4f,
        PetalWidth = 0.2f
    };
}
Step 3: At the end of the Main method, add the following code:
var prediction = predictor.Predict(TestIrisData.Setosa);
Console.WriteLine($"Cluster: {prediction.PredictedClusterId}");
Console.WriteLine($"Distances: {string.Join(" ", prediction.Distances)}");
Console.ReadLine();
Step 4: Run the project. The console displays the predicted cluster ID and the distances from the cluster centers.
Note: You may receive an error about CPU architecture. To solve this, go to Project Properties > Build > Platform target and change it as required.
10. Next Steps
If you have gotten here successfully, then congrats!
I recommend you watch the video lesson and subscribe to the channel so you can get updates as we continue with the following lessons:
- Sentiment Analysis with ML.NET
- Movies Recommender System
- Object Recognition in Images
- Image Classification
- Product Sales Analysis
- Multiclass Classification in ML.NET
Thank you very much for this comprehensive tutorial with the video. More than anything the terminology and practise will get it right. Did you end up uploading the video for object recognition and images. You also mentioned doing a tutorial on image transmission.
Regards,
Herve Bonkondo
Thanks a lot!
I am learning ML.NET from your videos and subscribed to your channel. This video was a great tutorial for the likes of starting new on this Microsoft ML platform.