Building a Batch Data Pipeline with Athena and MySQL | by 💡Mike Shakhomirov | Oct, 2023

An End-To-End Tutorial for

💡Mike Shakhomirov
Towards Data Science

In this story, I will talk about one of the most popular ways to run tasks: batch data processing. This data pipeline design pattern is especially useful when we need to process data in chunks, which makes it efficient for ETL jobs that run on a schedule. I will demonstrate how it can be achieved by building a data transformation pipeline using MySQL and Athena. We will use infrastructure as code to deploy it in the cloud.

Imagine that you have just joined a company as a Data Engineer. Their data stack is modern, event-driven, cost-effective, flexible, and can scale easily to meet the growing data resources you have. External data sources and data pipelines in your data platform are managed by the data engineering team using a flexible environment with CI/CD on GitHub.

As a data engineer, you need to create a business intelligence dashboard that displays the geography of company revenue streams, as shown below. Raw payment data is stored in the server database (MySQL). You want to build a batch pipeline that extracts data from that database, then uses AWS S3 to store the data files and Athena to process them.

Revenue dashboard. Image by author.

Batch data pipeline

A data pipeline can be considered a sequence of data processing steps. Because the stages are connected by a logical data flow, each stage generates an output that serves as the input for the following stage.

There is a data pipeline whenever there is data processing between points A and B.
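
To make the idea of chained stages concrete, here is a minimal sketch in plain Python. It is not the pipeline built in this tutorial: the function names (`extract`, `transform`, `load`), the record fields, and the sample payments are all hypothetical, and an in-memory list stands in for the MySQL source. It only illustrates how each stage consumes the output of the previous one, one batch at a time.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def extract(rows: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Stage A: yield the source rows in fixed-size batches."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def transform(batches: Iterable[List[dict]]) -> Iterator[Dict[str, int]]:
    """Stage B: sum the payment amounts per country within each batch."""
    for batch in batches:
        totals: Dict[str, int] = {}
        for row in batch:
            totals[row["country"]] = totals.get(row["country"], 0) + row["amount"]
        yield totals


def load(partials: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Stage C: merge the per-batch totals into one final result."""
    result: Dict[str, int] = {}
    for totals in partials:
        for country, amount in totals.items():
            result[country] = result.get(country, 0) + amount
    return result


# Made-up sample data, used only for illustration.
payments = [
    {"country": "GB", "amount": 10},
    {"country": "US", "amount": 25},
    {"country": "GB", "amount": 5},
    {"country": "DE", "amount": 7},
    {"country": "US", "amount": 3},
]

# Each stage's output is the next stage's input.
revenue_by_country = load(transform(extract(payments, batch_size=2)))
print(revenue_by_country)
```

Because the stages are generators, only one batch is held in memory at a time, which is the property that makes batch processing attractive for large tables.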

Data pipelines can differ in their conceptual and logical nature. I previously wrote about this in [1].
