Add One Line of SQL to Optimise Your BigQuery Tables | by Matt Chapman | Dec, 2023

Clustering: A simple way to group similar rows and prevent unnecessary

Matt Chapman
Towards Data Science

In my previous article, I explained how to optimise SQL queries using partitioning:

Now, I’m writing the sequel! (Dad joke, ?)

This article will look at clustering: another powerful optimisation technique you can use in BigQuery. Like partitioning, clustering can help you write more performant queries that are quicker and cheaper to run. If you want to develop your SQL toolkit and those higher-level Data , this is a great place to start.

In BigQuery, a clustered table is a table that keeps similar rows grouped together in physical “blocks”.

For example, picture a table called user_signups that keeps track of all the people registering an on a fictitious . It’s got four columns:

  • registration_date: the date on which the user created an account
  • country: the country where the user is based
  • tier: the user’s plan (“Free” or “Paid”)
  • username: the user’s username

If we wanted, we could cluster the table by country so that from the same country are stored nearby each other in the table:

Source link