ML: all you need to know without any overcomplicated math
What is Machine Learning?
Sure, the actual theory behind models like ChatGPT is admittedly very difficult, but the underlying intuition behind Machine Learning (ML) is, well, intuitive! So, what is ML?
Machine Learning allows computers to learn using data.
But what does this mean? How do computers use data? What does it mean for a computer to learn? And first of all, who cares? Let’s start with the last question.
Nowadays, data is all around us. So it’s increasingly important to use tools like ML, as it can help find meaningful patterns in data without ever being explicitly programmed to do so! In other words, by utilizing ML we are able to apply generic algorithms to a wide variety of problems successfully.
There are a few main categories of Machine Learning: supervised learning (SL), unsupervised learning (UL), and reinforcement learning (RL). Today I’ll just be describing supervised learning, though in subsequent posts I hope to elaborate on unsupervised learning and reinforcement learning.
1 Minute SL Speedrun
Look, I get that you might not want to read this whole article. In this section I’ll teach you the very basics (which for a lot of people is all they need to know!) before going into more depth in the later sections.
Supervised learning involves learning how to predict some label using different features.
Imagine you are trying to figure out a way to predict the price of diamonds using features like carat, cut, clarity, and more. Here, the goal is to learn a function that takes as input the features of a specific diamond and outputs the associated price.
Just as humans learn by example, computers will do the same here. To learn a prediction rule, the ML agent needs “labeled examples” of diamonds, which include both their features and their price. The “supervision” comes from being given the label (price). In practice, it’s important to check that your labeled examples are actually correct, since supervised learning assumes that the labels are “ground truth”.
Ok, now that we’ve gone over the most fundamental basics, we can get a bit more in depth about the whole data science/ML pipeline.
Problem Setup
Let’s use an extremely relatable example, which is inspired by this textbook. Imagine you’re stranded on an island, where the only food is a rare fruit known as “Justin-Melon”. Even though you’ve never eaten Justin-Melon in particular, you’ve eaten plenty of other fruits, and you know you don’t want to eat fruit that has gone bad. You also know that usually you can tell if a fruit has gone bad by looking at the color and firmness of the fruit, so you extrapolate and assume this holds for Justin-Melon as well.
In ML terms, you used prior domain knowledge to pick two features (color, firmness) that you think will accurately predict the label (whether or not the Justin-Melon has gone bad).
But how will you know what color and what firmness correspond to the fruit being bad? Who knows? You just need to try it out. In ML terms, we need data. More specifically, we need a labeled dataset consisting of real Justin-Melons and their associated label.
Data Collection/Processing
So you spend the next couple of days eating melons and recording the color, firmness, and whether or not the melon was bad. After a few painful days of constantly eating melons that have gone bad, you have the following labeled dataset:
Each row is a specific melon, and each column is the value of the feature/label for the corresponding melon. But notice we have words, since the features are categorical rather than numerical.
Really we need numbers for our computer to process. There are a number of techniques to convert categorical features to numerical features, ranging from one hot encoding to embeddings and beyond.
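As a rough illustration of one-hot encoding, here is a minimal Python sketch. The color names and the `one_hot` helper are made up for this example and are not taken from the melon dataset.

```python
def one_hot(value, categories):
    """Return a list with a 1 in the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]

# Hypothetical color categories, chosen only for illustration.
COLORS = ["green", "yellow", "brown"]

# Three hypothetical melons, described only by their color.
encoded = [one_hot(color, COLORS) for color in ["yellow", "green", "brown"]]
# Each color is now a vector of numbers, for example [0, 1, 0] for "yellow".
```

The upside of this encoding is that it does not pretend the colors have an order; the downside is that it adds one column per category.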
The simplest thing we can do is turn the column “Label” into a column “Good”, which is 1 if the melon is good and 0 if it’s bad. For now, assume there is some methodology to turn color and firmness to a scale from -10 to 10, in such a way that is sensible. For bonus points, think about the assumptions of putting a categorical feature like color on such a scale. After this preprocessing, our dataset might look something like this:
We now have a labeled dataset, which means we can employ a supervised learning algorithm. Our algorithm needs to be a classification algorithm, as we are predicting a category: good (1) or bad (0). This is in contrast to regression algorithms, which predict a continuous value like the price of a diamond.
Exploratory Data Analysis
But what algorithm? There are a number of supervised classification algorithms, ranging in complexity from basic logistic regression to some hardcore deep learning algorithms. Well, let’s first take a look at our data by doing some exploratory data analysis (EDA):
The above image is a plot of the feature space; we have two features, and we are simply putting each example onto a plot with the two axes being the two features. Additionally, we make the point purple if the associated melon was good, and we make it yellow if it was bad. Clearly, with just a little bit of EDA, there’s an obvious answer!
We should probably classify all points inside the red circle as good melons, while ones outside of the circle should be classified as bad melons. Intuitively, this makes sense! For example, you don’t want a melon that’s rock solid, but you also don’t want it to be absurdly squishy. Rather, you want something in between, and the same is probably true about color as well.
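Written as code, this hand-drawn rule might look like the sketch below. The center and radius are hypothetical values, assuming both features sit on the -10 to 10 scale; in practice you would read them off the plot.

```python
def inside_circle(color, firmness, center=(0.0, 0.0), radius=5.0):
    """Predict 1 (good) if the point lies inside or on the circle, else 0 (bad)."""
    # center and radius are hypothetical; they are not fitted to any data.
    dx = color - center[0]
    dy = firmness - center[1]
    return 1 if dx * dx + dy * dy <= radius * radius else 0

moderate_melon = inside_circle(color=1, firmness=-2)  # close to the middle of both scales
extreme_melon = inside_circle(color=9, firmness=-8)   # far from the middle
```

Here `moderate_melon` comes out as 1 and `extreme_melon` as 0. Nothing is learned in this sketch: the rule is fixed by hand.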
We determined we would want a decision boundary that is a circle, but this was just based on preliminary data visualization. How would we systematically determine this? This is especially relevant in larger problems, where the answer is not so simple. Imagine a problem with 100 features: there’s no reasonable way to visualize a 100-dimensional feature space.
What are we learning?
The first step is to define your model. There are tons of classification models. Since each comes with its own set of assumptions, it’s important to try to make a good choice. To emphasize this, I’ll start by making a really bad choice.
One intuitive idea is to make a prediction by weighting each of the features:
Good = w1 * Color + w2 * Firmness
For example, suppose our parameters w1 and w2 are 2 and 1, respectively. Also assume our input Justin-Melon is one with Color = 4, Firmness = 6. Then our prediction is Good = (2 x 4) + (1 x 6) = 14.
Our classification (14) is not even one of the valid options (0 or 1). That’s because this model is actually a regression algorithm. In fact, it’s a special case of the simplest regression algorithm: linear regression.
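The arithmetic above is easy to check in a few lines of Python. The function name `linear_prediction` is just a label for this sketch.

```python
def linear_prediction(color, firmness, w1, w2):
    """Weighted sum of the two features (no bias term yet)."""
    return w1 * color + w2 * firmness

good = linear_prediction(color=4, firmness=6, w1=2, w2=1)
print(good)  # 14, which is neither 0 nor 1
```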
So, let’s turn this into a classification algorithm. One simple way would be this: use linear regression and classify as 1 if the output is at least some threshold. In fact, we can simplify by adding a constant bias term b to our model, so that we classify as 1 if the output is at least 0.
In math, let PRED = w1 * Color + w2 * Firmness + b. Then we get:
Good = 1 if PRED ≥ 0, and Good = 0 otherwise.
This is certainly better, as we are at least performing a classification, but let’s make a plot of PRED on the x axis and our classification on the y axis:
This is a bit extreme. A slight change in PRED could change the classification entirely. One solution is to have the output of our model represent the probability that the Justin-Melon is good, which we can do by smoothing out the curve:
This is a sigmoid curve (or a logistic curve). So, instead of taking PRED and applying this piecewise activation (Good if PRED ≥ 0), we can apply the sigmoid activation function to get a smoothed-out curve like the one above. Overall, our logistic model looks like this:
P(Good) = σ(PRED) = σ(w1 * Color + w2 * Firmness + b)
Here, σ (sigma) represents the sigmoid activation function, σ(z) = 1 / (1 + e^(-z)), which squashes any number into the range between 0 and 1. Great, so we have our model, and we just need to figure out what weights and biases are best! This process is known as training.
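To make the difference between the hard threshold and the sigmoid concrete, here is a minimal sketch of both. The weights w1 = 2 and w2 = 1 reuse the earlier example; the bias b = -10 is a hypothetical value picked only for illustration.

```python
import math

def pred(color, firmness, w1, w2, b):
    """The linear part of the model: PRED = w1 * Color + w2 * Firmness + b."""
    return w1 * color + w2 * firmness + b

def step(z):
    """Hard threshold: classify as good (1) if z >= 0, else bad (0)."""
    return 1 if z >= 0 else 0

def sigmoid(z):
    """Smooth alternative to the step: squashes z into the range 0 to 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)  # this form avoids overflow for very negative z
    return exp_z / (1.0 + exp_z)

# w1 and w2 come from the earlier example; b = -10 is a hypothetical bias.
z = pred(color=4, firmness=6, w1=2, w2=1, b=-10)  # 2*4 + 1*6 - 10 = 4
hard_label = step(z)            # 1
probability_good = sigmoid(z)   # close to, but below, 1

# A tiny change in PRED flips the step output but barely moves the sigmoid.
step_jump = step(0.01) - step(-0.01)            # 1
sigmoid_shift = sigmoid(0.01) - sigmoid(-0.01)  # roughly 0.005
```

To turn the probability back into a class, a common convention is to predict good when the probability is at least 0.5, which is the same as PRED ≥ 0.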
Training the Model
Finding the best weights and biases is much easier said than done. There are infinitely many possibilities, and what does “best” even mean?
We begin with the latter question: what is best? Here’s one simple, yet powerful way: the best weights are the ones that get the highest accuracy on our training set.
So, we just need to figure out an algorithm that maximizes accuracy. However, mathematically it’s easier to minimize something. In other words, rather than defining a value function, where higher value is “better”, we prefer to define a loss function, where lower loss is better. Although people typically use something like binary cross entropy for (binary) classification loss, we will just use a simple example: minimize the number of points classified incorrectly.
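For a feel of how the two losses mentioned above differ, here is a minimal sketch of both. The labels and model outputs are invented numbers, and the function names are just labels for this example.

```python
import math

def misclassified_count(labels, probabilities, threshold=0.5):
    """Crude loss from the text: how many points end up with the wrong class."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    return sum(1 for y, y_hat in zip(labels, predictions) if y != y_hat)

def binary_cross_entropy(labels, probabilities, eps=1e-12):
    """Average binary cross entropy; confident wrong answers cost the most."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

labels = [1, 0, 1]            # hypothetical true labels
confident = [0.9, 0.2, 0.6]   # hypothetical outputs of one model
hesitant = [0.6, 0.4, 0.55]   # hypothetical outputs of a less sure model

count_confident = misclassified_count(labels, confident)
count_hesitant = misclassified_count(labels, hesitant)
bce_confident = binary_cross_entropy(labels, confident)
bce_hesitant = binary_cross_entropy(labels, hesitant)
```

Both hypothetical models classify every point correctly, so the crude count cannot tell them apart, while binary cross entropy gives the more confident model a lower loss.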
To do this, we use an algorithm known as gradient descent. At a very high level, gradient descent works like a nearsighted skier trying to get down a mountain. If you were to plot our parameter space (parameter values and the associated loss on the same plot), the plot would look like a mountain. For the skier to make progress, the mountain needs to be smooth, which is an important property of a good loss function (and one that our crude loss function actually lacks, which is part of why losses like binary cross entropy are preferred in practice).
So, we first start with random parameters, and therefore we likely start with a high loss. Like a skier trying to go down the mountain as fast as possible, the algorithm looks in every direction, trying to find the steepest way down (i.e. how to change the parameters in order to lower the loss the most). But the skier is nearsighted, so they only look a little in each direction and take a small step. We iterate this process until we end up at the bottom (keen-eyed readers may notice we might actually end up at a local minimum instead). At this point, the parameters we end up with are our trained parameters.
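Here is a minimal sketch of that loop for our logistic model. It minimizes binary cross entropy rather than the crude count, since the count is not smooth. The six data points, the learning rate, the number of steps, and the random seed are all assumptions made for this example, and the toy data is deliberately separable by a straight line.

```python
import math
import random

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)

def train_logistic(points, labels, learning_rate=0.1, steps=2000, seed=0):
    """Gradient descent on the average binary cross entropy."""
    rng = random.Random(seed)
    # Start from random parameters, as described above.
    w1, w2, b = (rng.uniform(-1.0, 1.0) for _ in range(3))
    n = len(points)
    for _ in range(steps):
        grad_w1 = grad_w2 = grad_b = 0.0
        for (color, firmness), y in zip(points, labels):
            # For this loss, the gradient is (predicted probability - label) * input.
            error = sigmoid(w1 * color + w2 * firmness + b) - y
            grad_w1 += error * color / n
            grad_w2 += error * firmness / n
            grad_b += error / n
        # Take one small step downhill.
        w1 -= learning_rate * grad_w1
        w2 -= learning_rate * grad_w2
        b -= learning_rate * grad_b
    return w1, w2, b

def accuracy(points, labels, w1, w2, b):
    correct = 0
    for (color, firmness), y in zip(points, labels):
        prediction = 1 if w1 * color + w2 * firmness + b >= 0 else 0
        if prediction == y:
            correct += 1
    return correct / len(points)

# Made-up toy data that a straight line can separate.
points = [(-3, -2), (-2, -4), (-4, -1), (2, 3), (3, 1), (4, 4)]
labels = [0, 0, 0, 1, 1, 1]

w1, w2, b = train_logistic(points, labels)
trained_accuracy = accuracy(points, labels, w1, w2, b)
```

The learning rate plays the role of the skier’s step size: too large and the skier overshoots the valley, too small and the descent takes forever.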
Once you train your logistic regression model, you realize your performance is still really bad: your accuracy is only around 60% (barely better than guessing!). This is because we are violating one of the model’s assumptions. With our two features as inputs, logistic regression can only produce a linear decision boundary (a straight line), but we knew from our EDA that the decision boundary should be circular!
With this in mind, you try different, more complex models, and you get one that gets 95% accuracy! You now have a fully trained classifier capable of differentiating between good Justin-Melons and bad Justin-Melons, and you can finally eat all the tasty fruit you want!
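To see why a straight line struggles with a circular pattern, and how a slightly more flexible model can help, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration: the grid of melons, the circular labeling rule, and the choice of feeding squared features into the same logistic model. It is one possible “more complex model”, not necessarily the one behind the accuracy figure above.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)

def train_logistic(rows, labels, learning_rate=0.1, steps=3000):
    """Gradient descent on average binary cross entropy, starting from zeros."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(steps):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for row, y in zip(rows, labels):
            z = sum(w * x for w, x in zip(weights, row)) + bias
            error = sigmoid(z) - y
            for i, x in enumerate(row):
                grad_w[i] += error * x / n
            grad_b += error / n
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias

def accuracy(rows, labels, weights, bias):
    hits = 0
    for row, y in zip(rows, labels):
        z = sum(w * x for w, x in zip(weights, row)) + bias
        if (1 if z >= 0 else 0) == y:
            hits += 1
    return hits / len(rows)

# Synthetic melons on a small grid; "good" means inside a circle (made-up rule).
grid = [(c, f) for c in range(-3, 4) for f in range(-3, 4)]
labels = [1 if c * c + f * f <= 5 else 0 for c, f in grid]

raw_rows = [[c, f] for c, f in grid]              # the original two features
squared_rows = [[c * c, f * f] for c, f in grid]  # squared versions of them

raw_acc = accuracy(raw_rows, labels, *train_logistic(raw_rows, labels))
squared_acc = accuracy(squared_rows, labels, *train_logistic(squared_rows, labels))
```

On this synthetic grid, the model trained on squared features should score noticeably higher than the one trained on raw features. The reason is that a straight line in the squared features, such as Color² + Firmness² = constant, is a circle in the original ones.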
Conclusion
Let’s take a step back. In around 10 minutes, you learned a lot about machine learning, including what is essentially the whole supervised learning pipeline. So, what’s next?
Well, that’s for you to decide! For some, this article was enough to get a high level picture of what ML actually is. For others, this article may leave a lot of questions unanswered. That’s great! Perhaps this curiosity will allow you to further explore this topic.
For example, in the data collection step we assumed that you would just eat a ton of melons for a few days, without really taking into account any specific features. This makes no sense. If you ate a green mushy Justin-Melon and it made you violently ill, you would probably stay away from those melons. In reality, you would learn through experience, updating your beliefs as you go. This framework is more similar to reinforcement learning.
And what if you knew that one bad Justin-Melon could kill you instantly, and that it was too risky to ever try one without being sure? Without these labels, you couldn’t perform supervised learning. But maybe there’s still a way to gain insight without labels. This framework is more similar to unsupervised learning.
In upcoming blog posts, I hope to expand on reinforcement learning and unsupervised learning in a similar way.