Backfilling Mastery: Elevating Data Engineering Expertise | by Naser Tamimi | Nov, 2023

DATA

A go-to for data engineers wading through the backfilling maze

Naser Tamimi
Towards Data Science
Photo by Towfiqu barbhuiya on Unsplash

Imagine starting a new data pipeline and getting data from a source you’ve never parsed before (e.g., pulling info from an API or an existing table). Now you’re on a mission to make it seem like you collected this data ages ago. That’s one example of what we call data backfilling in data engineering.

But it’s not just about starting a new data pipeline or table. You could have a table that’s been gathering data for a while, and suddenly you need to change the data (for example, due to a new metric definition) or toss in more data from a new data source. Or maybe there’s an awkward gap in your data, and you just want to patch it up. All these situations are examples of data backfilling. The common thread is turning “back” in time and “filling” up your table with some historical data.

The following figure (Figure 1) shows a straightforward backfilling scenario. In this instance, a daily job retrieves data from two upstream sources (one for platform A and another for platform B). The dataset is structured with ‘ds’ as the first partition and the platform as the second partition (or sub-partition). Unfortunately, data for the period from 2023-10-03 to 2023-10-05 is missing due to certain issues. To close this gap, a backfilling operation was initiated (the backfilling job started on 2023-10-08).

Figure 1: A simple backfilling scenario
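To make the scenario concrete, here is a minimal Python sketch of how one might work out which partitions need backfilling. It is only an illustration under stated assumptions: the helper names (`date_range`, `missing_partitions`), the in-memory set of existing partitions, and the table window of 2023-10-01 through 2023-10-07 are all hypothetical and not taken from any particular tool. In a real pipeline, the list of existing partitions would come from your warehouse or metastore.

```python
from datetime import date, timedelta


def date_range(start: date, end: date) -> list[date]:
    """Return every calendar day from start to end, inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


def missing_partitions(
    existing: set[tuple[str, str]],
    start: date,
    end: date,
    platforms: list[str],
) -> list[tuple[str, str]]:
    """List the (ds, platform) partitions a daily job should have written
    between start and end but that are absent from `existing`."""
    expected = [
        (day.isoformat(), platform)
        for day in date_range(start, end)
        for platform in platforms
    ]
    return [partition for partition in expected if partition not in existing]


# Hypothetical partition inventory (an assumption for illustration):
# the table should cover 2023-10-01..2023-10-07 for platforms A and B,
# but nothing landed for 2023-10-03..2023-10-05.
platforms = ["A", "B"]
loaded_days = ["2023-10-01", "2023-10-02", "2023-10-06", "2023-10-07"]
existing = {(ds, platform) for ds in loaded_days for platform in platforms}

# Partitions that a backfill job would need to (re)build.
to_backfill = missing_partitions(
    existing, date(2023, 10, 1), date(2023, 10, 7), platforms
)
```

Here `to_backfill` holds six (ds, platform) pairs, one per platform for each of the three missing days. A backfill job could then loop over them and rerun the daily logic for each partition.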

A brief heads-up before proceeding further: within the domain of data engineering, we normally encounter two scenarios: “backfilling” a table or “restating” a table. These processes, while sharing some similarities, have some subtle differences. Backfilling, as a practice, is about populating missing or incomplete data in a dataset. Its application is commonly directed towards updating historical data or rectifying gaps. Conversely, restating a table involves effecting substantial…
