As you can probably guess, this article details obstacles I ran into at work when I found myself in a similar situation.
Ultimately, the solution was in a somewhat hack-y, but remarkably comprehensible, workaround using the native MySQL JSON data type.
The JSON data type in MySQL was added in version 5.7.8, and provides a lot of useful utility for both storage and modeling.
Under the JSON data type umbrella (officially called “JSON documents”) are two different data structures: JSON arrays and JSON objects.
A JSON array can simply be thought of as an array (a list, if you’re a Pythonista): values enclosed by square brackets [ ]
and separated by commas.
- An example MySQL JSON array value:
[“foo”, “bar”, 1, 2]
A JSON object can be thought of as a hash table (or, again in Python terms, a dictionary): key-value pairs, separated by commas, and enclosed by curly brackets { }
.
- An example of a MySQL JSON object value:
{“foo”: “bar”, 1: 2}
MySQL has a number of functions that can be used to deal with both of these formats—almost none of which perform any sort of aggregation.
Thankfully, though, there are two that do. And they both return JSON documents, which means we can use MySQL’s built-in functions to access the values therein.
The MySQL function JSON_ARRAYAGG
acts a lot like GROUP_CONCAT
. The biggest difference is that it returns a JSON array, which, again, comes with several helpful built-in functions linked above.
The JSON array data type solves one of our two problems with astounding simplicity: the problem of reliably counting the number of subscriptions in a combination. This is accomplished using the JSON_LENGTH
function. The syntax is wonderfully straightforward:
SELECT JSON_LENGTH(JSON_ARRAY("foo", "bar", "hello", "world"));
-- JSON_ARRAY function used here just to quickly create an example array
The result of this statement is 4, since there are 4 values in the generated JSON array.
But let’s return to the combination of subscriptions. Unfortunately, JSON_ARRAYAGG
doesn’t come with the ordering functionality that GROUP_CONCAT
has. Ordering the subscription
values, even in a CTE before the base query, doesn’t return the desired results:
WITH
subscriptions_ordered AS (
SELECT
customer_id,
subscription
FROM subscriptions
ORDER BY subscription
)
, subscriptions_grouped AS (
SELECT
customer_id,
JSON_ARRAYAGG(subscription) AS subscriptions,
JSON_LENGTH(JSON_ARRAYAGG(subscription)) AS num_subscriptions
FROM
subscriptions_ordered
GROUP BY customer_id
)
SELECT
subscriptions,
COUNT(*) AS num_accounts
num_subscriptions
FROM subscriptions_grouped
GROUP BY subscriptions
;
The number of subscriptions in each combination is there, thanks to the JSON_LENGTH
function—but combinations that are effectively the same are once again mischaracterized as distinct because of their order.
Using ROW_NUMBER to force the ordering of values
ROW_NUMBER
is a window function that creates an index. The index has to be defined; that is, you have to tell it where to start, how to increment (directionally), and where to end.
We can see a quick example of this by applying the ROW_NUMBER
function and telling it to order by the subscription
field:
SELECT
customer_id,
subscription,
ROW_NUMBER() OVER(ORDER BY subscription) AS alphabetical_row_num
FROM subscriptions
;
Look closely at the results. Even though we didn’t use an ORDER BY
statement at the end of our query, the data is nonetheless ordered according to the ORDER BY
in the OVER
clause.
But of course this still isn’t exactly what we want. What we need to do next is add a PARTITION BY
clause to our window function, so that the ordering of the results are related to (and in fact bounded by) each customer ID. Like so:
SELECT
customer_id,
subscription,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY subscription) AS alphabetical_order
FROM subscriptions
;
You can probably see where this is going.
If we execute the JSON_ARRAYAGG
function against these results in a CTE, we see that the duplicate combinations now look exactly the same, thanks to the subscriptions being forced into an alphabetical order by the ROW_NUMBER
function:
WITH
subscriptions_ordered AS (
SELECT
customer_id,
subscription,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY subscription) AS alphabetical_order
FROM subscriptions
)
SELECT
customer_id,
JSON_ARRAYAGG(subscription) AS subscriptions
FROM subscriptions_ordered
GROUP BY 1
ORDER BY 2
;
Now all we need to do is add in the grouping CTE following the one executing ROW_NUMBER, and alter the base query:
WITH
subscriptions_ordered AS (
SELECT
customer_id,
subscription,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY subscription) AS alphabetical_order
FROM subscriptions
)
, subscriptions_grouped AS (
SELECT
customer_id,
JSON_ARRAYAGG(subscription) AS subscriptions,
JSON_LENGTH(JSON_ARRAYAGG(subscription)) AS num_subscriptions
FROM subscriptions_ordered
GROUP BY customer_id
)
SELECT
subscriptions,
COUNT(*) AS num_customers,
num_subscriptions
FROM subscriptions_grouped
GROUP BY subscriptions
ORDER BY num_customers DESC
;
This gives not only accurately distinct combinations of subscriptions, but also the number of customers who’ve purchased those combinations, and how many subscriptions comprise each of them:
Voila!
- We wanted to know how many customers purchased different combination of subscriptions, and how many subscriptions were in each of those combinations. This presented two problems: how best to obtain the latter, and how to generate accurately distinct subscription combinations.
- To obtain the number of subscriptions in each combination, we chose to go with one of MySQL’s JSON functions,
JSON_ARRAYAGG
. The resulting aggregation was returned to us as a JSON data type, allowing us to use theJSON_LENGTH
function. - We then needed to force the ordering of values inside the JSON array so that duplicate combinations didn’t mistakenly appear distinct. To do this, we used the window function
ROW_NUMBER
in a CTE prior to the base query, partitioning by customer ID and ordering the subscriptions alphabetically (in ascending order). - This ultimately allowed us to aggregate up to accurately distinct combinations of subscriptions; and with this we were able to use a simple
COUNT
function to see how many customers had purchased each combination.