Here at Source Allies, several of our teammates have worked on projects utilizing AWS Glue over the past few years. What all of these projects had in common was the need to overcome the complexities of Glue and Apache Spark in order to turn the jobs into high-quality, maintainable software applications.
The jobs that our teams build require a suite of automated tests to ensure that changes do not introduce bugs into the data. These tests cannot exercise deployed jobs and move real data within any reasonable timeframe, so they need to run in a unit-testing context with mock data. At the same time, these unit tests cannot mock too much of the Spark runtime, or we won't get a realistic test of how Spark operates.
Many Glue jobs share common data-processing patterns; they often filter and merge data in very similar ways. Reimplementing these patterns in each job creates redundant boilerplate that obscures the real business logic and makes the jobs hard to maintain.
To address these challenges, we have spent the last year writing a reference implementation of how we feel Glue jobs should be structured to achieve these goals. It also features a "Cookbook" of design patterns and considerations for common data-processing use cases. We are releasing this project, which includes both a reference guide and a Python library.