3 ways to design effective classes in ML Classification Algorithms
In this post we are going to see three ways to effectively design the target classes in classification problems using the properties of the training data alone.
Background
Classification is an area in Machine Learning (ML) where the machine is tasked to learn to categorize a given input. For example, given an image, the machine should return the category the image belongs to.
As shown in Figure 1, the model f is trained with labelled training images of cats and dogs. Here, “dog” is one category and “cat” is another. These categories are also referred to as “classes” in classification problems. The goodness of f largely depends on the quality of the data it is trained on: if we feed it images of monkeys labelled as “cat”, that is what it will learn!
The quality of training data covers the quality of the inputs, the classes, and the mapping between them. In this post we will focus on some ideas for designing “classes” in some non-trivial situations. In an earlier post, I shared some ideas for handling high-cardinality input data with hashing, character one-hot encoding and embeddings.
Class imbalance is one of the first things to look out for when generating training data for classification problems. If our training data has 98 examples of dogs and 2 examples of cats, f will have a tough time learning what a cat actually looks like, as cats are underrepresented. Can you guess another thing to be watchful of here? Here’s a hint: what is the accuracy of the f below on the above dataset?
def f(image):
    # Ignores the input and always predicts the majority class.
    return "dog"
It is 98%, since f correctly labels 98 of the 100 training images as “dog”, yet it is absolutely useless because it can never recognize a cat. It just got lucky because of the imbalance in the training data. We should always remember that:
ML is a daemon which tries to solve the problem in the worst possible way that you didn’t prohibit
This statement by John Mount really stuck with me. So it is our responsibility to either balance the classes or use an evaluation metric other than plain accuracy when designing a classification algorithm. All the techniques here ensure that this problem is addressed.
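To make the point concrete, here is a minimal sketch of how the constant f scores on the 98 dog / 2 cat dataset under plain accuracy versus a metric that treats every class equally. The helper names (accuracy, balanced_accuracy, inverse_frequency_weights) and the placeholder dataset are assumptions made for illustration, not part of any library. Balanced accuracy is taken here as the mean of per-class recall, and the weights follow the common inverse-frequency heuristic n_samples / (n_classes * class_count), which is one possible way to "balance" classes, not the only one.

```python
from collections import Counter


def f(image):
    # The constant classifier from the text: it ignores its input.
    return "dog"


def accuracy(y_true, y_pred):
    # Fraction of examples whose predicted label matches the true label.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    # Mean of per-class recall, so a rare class counts as much as a common one.
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def inverse_frequency_weights(y_true):
    # Hypothetical helper: weight = n_samples / (n_classes * class_count),
    # so rare classes get proportionally larger weights.
    totals = Counter(y_true)
    n_samples = len(y_true)
    return {c: n_samples / (len(totals) * k) for c, k in totals.items()}


# Placeholder dataset: only the labels matter, since f ignores the image.
labels = ["dog"] * 98 + ["cat"] * 2
predictions = [f(None) for _ in labels]

print("accuracy:", accuracy(labels, predictions))
print("balanced accuracy:", balanced_accuracy(labels, predictions))
print("class weights:", inverse_frequency_weights(labels))
```

Plain accuracy rewards f for the 98 dogs, while balanced accuracy averages a perfect recall on “dog” with a zero recall on “cat”, which exposes that f is no better than guessing. The class weights show how much more a single cat example would need to count during training to offset the imbalance.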