An End-To-End Tutorial for Beginners
In this story I will talk about one of the most popular ways to run data transformation tasks: batch data processing. This data pipeline design pattern is especially useful when data needs to be processed in chunks, which makes it a good fit for ETL jobs that run on a schedule. I will demonstrate how to build such a data transformation pipeline using MySQL and Athena, and we will use infrastructure as code to deploy it in the cloud.
Imagine that you have just joined a company as a Data Engineer. Their data stack is modern, event-driven, cost-effective, flexible, and scales easily to meet growing data volumes. External data sources and data pipelines in your data platform are managed by the data engineering team using a flexible environment setup with CI/CD GitHub integration.
As a data engineer, you need to create a business intelligence dashboard that displays the geography of company revenue streams, as shown below. Raw payment data is stored in a server database (MySQL). You want to build a batch pipeline that extracts data from that database daily, uses AWS S3 to store the data files, and uses Athena to process them.
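To make the idea concrete, here is a minimal sketch of what the daily extract-and-load step could look like. It assumes a hypothetical MySQL table called `payments` and a hypothetical S3 bucket called `my-datalake-bucket`; in the tutorial the actual resources are created with infrastructure as code, so this is only an illustration of the batch extract pattern, not the final implementation.

```python
# A minimal sketch of the daily extract step: pull yesterday's payments from
# MySQL and upload them to S3 as a date-partitioned CSV file that Athena can
# query later. Table, host and bucket names are hypothetical placeholders.
import csv
import io
from datetime import date, timedelta

import boto3
import pymysql


def extract_payments_to_s3(run_date: date) -> str:
    """Extract one day of payments from MySQL and upload the batch to S3."""
    connection = pymysql.connect(
        host="mysql.example.internal",  # hypothetical host
        user="etl_user",
        password="...",                 # use a secrets manager in practice
        database="sales",
    )
    try:
        with connection.cursor() as cursor:
            cursor.execute(
                "SELECT payment_id, country, amount, created_at "
                "FROM payments WHERE DATE(created_at) = %s",
                (run_date.isoformat(),),
            )
            rows = cursor.fetchall()
    finally:
        connection.close()

    # Write the rows to an in-memory CSV buffer.
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["payment_id", "country", "amount", "created_at"])
    writer.writerows(rows)

    # Upload the daily batch, partitioned by date, so Athena can scan only
    # the partitions it needs.
    key = f"payments/dt={run_date.isoformat()}/payments.csv"
    boto3.client("s3").put_object(
        Bucket="my-datalake-bucket",
        Key=key,
        Body=buffer.getvalue().encode("utf-8"),
    )
    return key


if __name__ == "__main__":
    # Process yesterday's batch, as a daily schedule typically would.
    extract_payments_to_s3(date.today() - timedelta(days=1))
```

In a scheduled setup this script would be triggered once a day (for example by a cron-like scheduler or an orchestration service), which is exactly the batch pattern this tutorial builds out.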
Batch data pipeline
A data pipeline can be thought of as a sequence of data processing steps. Because the stages are connected by a logical data flow, each stage produces an output that serves as the input to the following stage.
There is a data pipeline whenever there is data processing between points A and B.
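As a toy illustration of that "output of one stage feeds the next" idea, consider the sketch below. The stage functions and data are made up for demonstration; in a real pipeline they would be replaced by actual extract, transform, and load steps.

```python
# A toy pipeline with three chained stages; each stage consumes the output
# of the previous one. All data here is invented for illustration only.
def extract():
    # Stage 1: produce raw payment records.
    return [{"country": "US", "amount": 120.0}, {"country": "DE", "amount": 80.0}]


def transform(rows):
    # Stage 2: aggregate revenue per country.
    totals = {}
    for row in rows:
        totals[row["country"]] = totals.get(row["country"], 0.0) + row["amount"]
    return totals


def load(totals):
    # Stage 3: deliver the result (here, simply print it).
    for country, revenue in totals.items():
        print(country, revenue)


# The pipeline: data flows from point A (extract) to point B (load).
load(transform(extract()))
```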
Data pipelines differ in their conceptual and logical nature. I previously wrote about this here [1]: