Backfilling Mastery: Elevating Data Engineering Expertise | by Naser Tamimi | Nov, 2023

A go-to for data engineers wading through the backfilling maze

Naser Tamimi
Towards Data Science

Imagine starting a new data pipeline and getting data from a source you’ve never parsed before (e.g. pulling info from an API or an existing table). Now, you’re on a mission to make it seem like you collected this data ages ago. That’s one example of what we call data backfilling in data engineering.

But it’s not just about starting a new data pipeline or table. You could have a table that’s been gathering data for a while, and suddenly you need to change the data (for example, due to a new metric definition) or toss in more data from a new data source. Or maybe there’s an awkward gap in your data, and you just want to fill it up. All these situations are examples of data backfilling. The common thread is turning “back” in time and “filling” up your table with some historical data.

The following figure (Figure 1) shows a straightforward backfilling scenario. In this instance, a daily job retrieves data from two upstream sources (one for platform A and another for platform B). The table is structured with the first partition being ‘ds’, and the second partition (or sub-partition) representing the platform. Unfortunately, data for the period from 2023-10-03 to 2023-10-05 is absent due to certain issues. To address this gap, a backfilling operation was initiated (the backfilling job started on 2023-10-08).

Figure 1) A simple backfilling scenario
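
To make the scenario in Figure 1 a bit more concrete, here is a minimal Python sketch of how the gap could be detected and filled. It is only an illustration under a few assumptions: partitions are modeled as plain (ds, platform) tuples, the table is assumed to cover 2023-10-01 through 2023-10-07 (the article only states the missing range and the backfill start date), and `load_partition` is a hypothetical callback standing in for whatever job actually reads the upstream source and writes the partition.

```python
from datetime import date, timedelta

PLATFORMS = ("A", "B")  # the two upstream platforms in the scenario


def expected_partitions(start, end, platforms=PLATFORMS):
    """Yield every (ds, platform) pair a complete table would hold for start..end inclusive."""
    day = start
    while day <= end:
        for platform in platforms:
            yield (day.isoformat(), platform)
        day += timedelta(days=1)


def missing_partitions(existing, start, end, platforms=PLATFORMS):
    """Return the expected partitions that are absent from `existing`, in date order."""
    existing = set(existing)
    return [p for p in expected_partitions(start, end, platforms) if p not in existing]


def backfill(existing, start, end, load_partition, platforms=PLATFORMS):
    """Call `load_partition` for each missing partition and return what was filled."""
    filled = []
    for ds, platform in missing_partitions(existing, start, end, platforms):
        load_partition(ds, platform)  # hypothetical loader, not a real library call
        filled.append((ds, platform))
    return filled


# Assumed state when the backfill starts: 10-03 to 10-05 are missing for both platforms.
present = [(f"2023-10-{d:02d}", p) for d in (1, 2, 6, 7) for p in PLATFORMS]
table = {}  # stand-in for the real table, keyed by partition
filled = backfill(
    present,
    date(2023, 10, 1),
    date(2023, 10, 7),
    lambda ds, platform: table.setdefault((ds, platform), "loaded"),
)
print(filled)
```

Computing the missing partitions from the expected set, rather than hard-coding the dates, means the same routine can be rerun safely: once the gap is filled, a second run finds nothing left to load.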

A brief heads-up before proceeding further: in data engineering, we normally encounter two scenarios: “backfilling” a table or “restating” a table. These processes, while sharing some similarities, have subtle differences. Backfilling is about populating missing or incomplete data in a dataset. It is commonly used to update historical data or fix gaps. Conversely, restating a table involves effecting substantial…
