Sharing Our Passion for Technology

& Continuous Learning

Database Migrations at Scale with Fargate and Step Functions

By Paul Rowe | April 15, 2024 | Cloud

Data flowing from a datacenter to the cloud.

As holders of the Migration Consulting Competency from AWS, we are often deeply involved in cloud migration projects. We recently tackled a seemingly straight forward migration of an on-premise Postgres database to AWS Aurora. This database was a data warehouse that was shared by many teams and contained more than 60 schemas with over 8TB of data.

Since this system was used for business operations, we wanted a migration strategy that would limit or eliminate downtime. The Database Migration Service advertises itself as a serverless solution for migrating data from an on-premise database to an RDS database without downtime by leveraging Change Data Capture events. While proving out this service, we encountered two insurmountable issues; Firstly, DMS had an issue with the extensive use of Postgres table inheritance. It would see records in the child tables as appearing in both the parent table as well as the child. Secondly, this particular database included many large tables that contained TEXT columns full of JSON. In our testing, these columns would regularly fail to load into the Aurora RDS database or be extremely slow to load.

We then considered using a Snowball to move the data. This would allow us to create a local database dump within the data center, save it on the Snowball device, then AWS can load that dump into S3 and finally we restore the S3 file into the target database. We ruled out this option because it would prevent us from making data changes between the dump and load steps, which could be several days to a week.

Finally, we landed on using the native pg_dump and pg_restore. This is the simplest way to move data between two Postgres databases. There are many ways to execute these commands, but a simple way that pipes directly between two database is to do something like the following:

pg_dump --host=source_server --dbname=source_db --format=custom | pg_restore --host=target_server --dbname=target_db --clean

For a small database, this could be executed on a workstation that has access to both the source and target servers. In our situation though, we were moving a large amount of data that would take hours to transfer. We also had a short amount of downtime and couldn't afford to be reliant on manual steps.

In order to address the first concern, we decided to execute the dump/restore within AWS instead of locally. This shortens the network path between the databases and removes the reliability of a workstation from the mix. We could have spun up an EC2 instance. But, instead, we decided to create a Fargate Task Definition. Since we are just running native Postgres commands we can actually launch the public Postgres image. Here is an example Cloudformation snippet:

ECSCluster:
  Type: AWS::ECS::Cluster

MigrationTask:
  Type: AWS::ECS::TaskDefinition
  Properties:
    TaskRoleArn: !GetAtt TaskRole.Arn
    ExecutionRoleArn : !GetAtt TaskRole.Arn
    NetworkMode: awsvpc
    RequiresCompatibilities: ["FARGATE"]
    Cpu: 1
    Memory: 2048
    ContainerDefinitions:
      - Name: postgres
        Image: public.ecr.aws/docker/library/postgres:alpine
        Essential: true
        Environment:
          - Name: SOURCE_HOST
            Value: "source_host"
          - Name: TARGET_HOST
            Value: !GetAtt DatabaseCluster.Endpoint.Address
        LogConfiguration:
          LogDriver: awslogs
          Options:
            "awslogs-group": !Ref LogGroup
            "awslogs-region": !Ref AWS::Region
            "awslogs-stream-prefix": "db-migration"
        Secrets:
          - Name: SOURCE_PASSWORD
            ValueFrom: !Sub "${SourceSecret.Arn}:password::"
          - Name: TARGET_PASSWORD
            ValueFrom: !Sub "${Database.MasterUserSecret.SecretArn}:password::"
        Entrypoint: ['bash', '-c']
        Command: |
          set -e
          echo "${SOURCE_HOST}:5432:source_db:postgres:${SOURCE_PASSWORD}" >> ~/.pgpass
          echo "${TARGET_HOST}:5432:target_db:postgres:${TARGET_PASSWORD}" >> ~/.pgpass
          chmod 0600 ~/.pgpass
          pg_dump --host=${SOURCE_HOST} --dbname=source_db --no-password --format=custom \
            | pg_restore --host=${TARGET_HOST} --dbname=target_db --no-password --clean

We added the above configuration to the CloudFormation template responsible for deploying our Aurora Database. In order to execute a migration, all we had to do was start a task using this definition and provide the subnet to launch on. This worked well in testing but wasn't quite what we were looking for: It was still a manual effort to kick this off, and it was very slow. We only had around 24 hours of downtime and this process took longer than that even in lower environments. It was also fragile as a network hiccup or any error would stop the entire process. What we needed was the ability to break down the work into smaller, atomic units that could be retried and run them in parallel.

AWS Step Functions is a service that can coordinate work between other AWS services. Since it Supports AWS EC/Fargate we can use it to launch the migration tasks for each schema. These can be run in parallel, with a concurrency limit. In addition, each "state" within the step function can be configured with automatic retries. In cloudformation, a state machine looks like this:

MigrationStateMachine:
  Type: AWS::Serverless::StateMachine
  Properties:
    Policies:
      - Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - iam:PassRole
            Resource: !GetAtt TaskRole.Arn
          - Effect: Allow
            Action:
              - ecs:RunTask
            Resource: !Ref MigrationTask
          - Effect: Allow
            Action:
              - ecs:StopTask
              - ecs:DescribeTasks
            Resource: "*"
          - Effect: Allow
            Action:
              - events:PutTargets
              - events:PutRule
              - events:DescribeRule
            Resource: !Sub "arn:aws:events:${AWS::Region}:${AWS::AccountId}:rule/StepFunctionsGetEventsForECSTaskRule"
    Definition:
      StartAt: SetSchemas
      States:
        SetSchemas:
          Type: Pass
          Result:
            schemas:
              - accounting
              - finance
              - risk
              - agent_management
              - sales
              - products
              - history
          ResultPath: "$"
          Next: LoadSchemas
        LoadSchemas:
          Type: Map
          ItemsPath: "$.schemas"
          MaxConcurrency: 5
          ItemProcessor:
            StartAt: LoadSchema
            States:
              LoadSchema:
                Type: Task
                Resource: "arn:aws:states:::ecs:runTask.sync"
                Parameters:
                  Cluster: !Ref ECSCluster
                  TaskDefinition: !Ref MigrationTask
                  LaunchType: FARGATE
                  NetworkConfiguration:
                    awsvpcConfiguration:
                      subnets: ["sn-23451", "sn-5483671"]
                  Overrides:
                    ContainerOverrides:
                      - Name: postgres
                        Environment:
                          - Name: SCHEMA_NAME
                            "Value.$": "$"
                Retry:
                  - ErrorEquals: [ "States.ALL" ]
                    MaxAttempts: 5
                End: true

          End: true

In order to solidify this solution, we executed our migration over and over in our development environment. Then, once that was working we ran the migration in production every weekend leading up to our go-live date. This process gave us the confidence on the reliability of the process and the timing we needed within the larger deployment timeline. We were ultimately able to migrate all 8TB of data on-premise to AWS Aurora within our 24 hour window.

& Continuous Learning

Database Migrations at Scale with Fargate and Step Functions

Recent Posts

Fine-Tuning Can Wait: What Enterprise GenAI Really Needs

Tales from the Trenches: A Go Optimization Example

How We Saved $10,000,000 Dollars A Year

Agentic AI Analytics: How to use an LLM to make sense of your data

Seeking experts to accelerate delivery & elevate your team?