Clustering: A simple way to group similar rows and prevent unnecessary data processing
In my previous article, I explained how to optimise SQL queries using partitioning:
Now, I’m writing the sequel! (Dad joke, anyone?)
This article will look at clustering: another powerful optimisation technique you can use in BigQuery. Like partitioning, clustering can help you write more performant queries that are quicker and cheaper to run. If you want to develop your SQL toolkit and build those higher-level Data Science skills, this is a great place to start.
In BigQuery, a clustered table is a table that keeps similar rows grouped together in physical “blocks”.
For example, picture a table called user_signups
that keeps track of all the people registering an account on a fictitious website. It’s got four columns:
registration_date
: the date on which the user created an accountcountry
: the country where the user is basedtier
: the user’s plan (“Free” or “Paid”)username
: the user’s username
If we wanted, we could cluster the table by country
so that users from the same country are stored nearby each other in the table: