PYTHON | DATA | MACHINE LEARNING
A guide to why, how, and what
Clustering has always been one of those topics that grabbed my attention. When I was first getting into machine learning, unsupervised clustering held a particular allure for me.
To put it simply, clustering is the unsung hero of machine learning. This form of unsupervised learning aims to group similar data points together.
Picture yourself at a social gathering where everyone is a stranger.
How would you make sense of the crowd?
Perhaps by grouping individuals based on shared traits: those laughing at a joke, the football aficionados deep in conversation, or the group captivated by a literary discussion. That’s clustering in a nutshell!
You may wonder, “Why is it relevant?”
Clustering has numerous applications:
- Customer segmentation: categorising customers according to their buying patterns so that businesses can tailor their marketing approaches.
- Anomaly detection: identifying peculiar data points, such as suspicious transactions in banking.
- Optimised resource utilisation: grouping similar workloads so that computing resources can be allocated more efficiently.
However, there’s a caveat.
How do we make sure that our clustering effort is successful?
How can we efficiently evaluate a clustering solution?
This is where robust evaluation methods come in.
Without one, we could end up with a model that looks promising on paper but drastically underperforms in practice.
In this article, we’ll examine two widely used clustering evaluation methods: the Silhouette score and Density-Based Clustering Validation (DBCV). We’ll dive into their strengths, limitations, and ideal use cases.
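As a quick preview of why evaluation matters, here is a minimal sketch that assumes scikit-learn is installed. The datasets, the choice of k-means, and the helper name `kmeans_silhouette` are illustrative assumptions rather than anything prescribed by the methods themselves. The point is that k-means will happily return three clusters even when the data has no structure at all, and only an evaluation score reveals the difference.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Data with real structure: three tight, well-separated blobs
# (centres and spread are arbitrary illustrative values).
X_blobs, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [5, 5], [-5, 5]],
    cluster_std=0.5,
    random_state=0,
)

# Data with no structure: points spread uniformly over a square.
X_noise = rng.uniform(-5, 5, size=(300, 2))


def kmeans_silhouette(X, n_clusters=3):
    """Fit k-means and return the mean Silhouette score of its labelling.

    Hypothetical helper written for this example only.
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    return silhouette_score(X, labels)


score_blobs = kmeans_silhouette(X_blobs)
score_noise = kmeans_silhouette(X_noise)

print(f"Silhouette on blobs: {score_blobs:.2f}")
print(f"Silhouette on noise: {score_noise:.2f}")
```

Both runs produce a labelling with three clusters, yet the score should be noticeably higher for the blobs than for the uniform noise. The Silhouette score ranges from -1 to 1, with higher values indicating tighter, better-separated clusters.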