DATA ENGINEERING
A go-to guide for data engineers wading through the backfilling maze
Imagine starting a new data pipeline and getting data from a source you’ve never parsed before (e.g., pulling info from an API or an existing Hive table). Now, you’re on a mission to make it seem like you collected this data ages ago. That’s one example of what we call data backfilling in data engineering.
But it’s not just about starting a new data pipeline or table. You could have a table that’s been gathering data for a while, and suddenly you need to change the data (for example, because a metric definition changed) or toss in more data from a new data source. Or maybe there’s an awkward gap in your data, and you just want to patch it up. All these situations are examples of data backfilling. The common thread is turning “back” in time and “filling” up your table with some historical data.
The following figure (Figure 1) shows a straightforward backfilling scenario. In this instance, a daily job retrieves data from two upstream sources (one for platform A and another for platform B). The dataset is partitioned first by ‘ds’ and then by platform (the sub-partition). Unfortunately, data for the period from 2023-10-03 to 2023-10-05 is absent due to certain issues. To address this gap, a backfilling operation was initiated (the backfilling job started on 2023-10-08).
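To make the scenario concrete, here is a minimal Python sketch of how you might work out which partitions need backfilling. It is an illustration only, not the tooling behind the figure: the function names, the `PLATFORMS` tuple, the checked date window, and the assumption that ‘ds’ is an ISO date string are all hypothetical, and in a real pipeline the list of existing partitions would come from your metastore or scheduler rather than a hard-coded list.

```python
from datetime import date, timedelta

# Hypothetical sub-partition values, mirroring platforms A and B in the scenario.
PLATFORMS = ("A", "B")


def expected_partitions(start, end, platforms=PLATFORMS):
    """List every (ds, platform) pair a complete daily table would hold.

    Both start and end dates are inclusive.
    """
    days = (end - start).days + 1
    return [
        ((start + timedelta(days=i)).isoformat(), platform)
        for i in range(days)
        for platform in platforms
    ]


def missing_partitions(existing, start, end, platforms=PLATFORMS):
    """Return the expected partitions that are not in `existing`, in date order."""
    present = set(existing)
    return [
        part
        for part in expected_partitions(start, end, platforms)
        if part not in present
    ]


# Assumed state of the table: 2023-10-03 through 2023-10-05 never landed.
existing = [
    (ds, platform)
    for ds in ("2023-10-01", "2023-10-02", "2023-10-06", "2023-10-07")
    for platform in PLATFORMS
]

to_backfill = missing_partitions(existing, date(2023, 10, 1), date(2023, 10, 7))

for ds, platform in to_backfill:
    # A real job would trigger the load for this partition here.
    print(f"backfill ds={ds} platform={platform}")
```

With the assumed table state, the loop lists six partitions: the three missing days for each of the two platforms. Working from an explicit list of missing partitions, rather than rerunning the whole date range, keeps the backfill limited to the gap and leaves healthy partitions untouched.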
A brief heads-up before proceeding further: within the domain of data engineering, we normally encounter two scenarios: “backfilling” a table or “restating” a table. These processes, while sharing some similarities, have some subtle differences. Backfilling, as a practice, is about populating missing or incomplete data in a dataset. Its application is commonly directed towards updating historical data or rectifying gaps. Conversely, restating a table involves effecting substantial…