Sharing Our Passion for Technology
& Continuous Learning
<   Back to Blog

Database Migrations at Scale with Fargate and Step Functions

Data flowing from a datacenter to the cloud.

As holders of the Migration Consulting Competency from AWS, we are often deeply involved in cloud migration projects. We recently tackled a seemingly straight forward migration of an on-premise Postgres database to AWS Aurora. This database was a data warehouse that was shared by many teams and contained more than 60 schemas with over 8TB of data.

Since this system was used for business operations, we wanted a migration strategy that would limit or eliminate downtime. The Database Migration Service advertises itself as a serverless solution for migrating data from an on-premise database to an RDS database without downtime by leveraging Change Data Capture events. While proving out this service, we encountered two insurmountable issues; Firstly, DMS had an issue with the extensive use of Postgres table inheritance. It would see records in the child tables as appearing in both the parent table as well as the child. Secondly, this particular database included many large tables that contained TEXT columns full of JSON. In our testing, these columns would regularly fail to load into the Aurora RDS database or be extremely slow to load.

We then considered using a Snowball to move the data. This would allow us to create a local database dump within the data center, save it on the Snowball device, then AWS can load that dump into S3 and finally we restore the S3 file into the target database. We ruled out this option because it would prevent us from making data changes between the dump and load steps, which could be several days to a week.

Finally, we landed on using the native pg_dump and pg_restore. This is the simplest way to move data between two Postgres databases. There are many ways to execute these commands, but a simple way that pipes directly between two database is to do something like the following:

pg_dump --host=source_server --dbname=source_db --format=custom | pg_restore --host=target_server --dbname=target_db --clean

For a small database, this could be executed on a workstation that has access to both the source and target servers. In our situation though, we were moving a large amount of data that would take hours to transfer. We also had a short amount of downtime and couldn’t afford to be reliant on manual steps.

In order to address the first concern, we decided to execute the dump/restore within AWS instead of locally. This shortens the network path between the databases and removes the reliability of a workstation from the mix. We could have spun up an EC2 instance. But, instead, we decided to create a Fargate Task Definition. Since we are just running native Postgres commands we can actually launch the public Postgres image. Here is an example Cloudformation snippet:

  Type: AWS::ECS::Cluster

  Type: AWS::ECS::TaskDefinition
    TaskRoleArn: !GetAtt TaskRole.Arn
    ExecutionRoleArn : !GetAtt TaskRole.Arn
    NetworkMode: awsvpc
    RequiresCompatibilities: ["FARGATE"]
    Cpu: 1
    Memory: 2048
      - Name: postgres
        Essential: true
          - Name: SOURCE_HOST
            Value: "source_host"
          - Name: TARGET_HOST
            Value: !GetAtt DatabaseCluster.Endpoint.Address
          LogDriver: awslogs
            "awslogs-group": !Ref LogGroup
            "awslogs-region": !Ref AWS::Region
            "awslogs-stream-prefix": "db-migration"
          - Name: SOURCE_PASSWORD
            ValueFrom: !Sub "${SourceSecret.Arn}:password::"
          - Name: TARGET_PASSWORD
            ValueFrom: !Sub "${Database.MasterUserSecret.SecretArn}:password::"
        Entrypoint: ['bash', '-c']
        Command: |
          set -e
          echo "${SOURCE_HOST}:5432:source_db:postgres:${SOURCE_PASSWORD}" >> ~/.pgpass
          echo "${TARGET_HOST}:5432:target_db:postgres:${TARGET_PASSWORD}" >> ~/.pgpass
          chmod 0600 ~/.pgpass
          pg_dump --host=${SOURCE_HOST} --dbname=source_db --no-password --format=custom \
            | pg_restore --host=${TARGET_HOST} --dbname=target_db --no-password --clean

We added the above configuration to the CloudFormation template responsible for deploying our Aurora Database. In order to execute a migration, all we had to do was start a task using this definition and provide the subnet to launch on. This worked well in testing but wasn’t quite what we were looking for: It was still a manual effort to kick this off, and it was very slow. We only had around 24 hours of downtime and this process took longer than that even in lower environments. It was also fragile as a network hiccup or any error would stop the entire process. What we needed was the ability to break down the work into smaller, atomic units that could be retried and run them in parallel.

AWS Step Functions is a service that can coordinate work between other AWS services. Since it Supports AWS EC/Fargate we can use it to launch the migration tasks for each schema. These can be run in parallel, with a concurrency limit. In addition, each “state” within the step function can be configured with automatic retries. In cloudformation, a state machine looks like this:

  Type: AWS::Serverless::StateMachine
      - Version: '2012-10-17'
          - Effect: Allow
              - iam:PassRole
            Resource: !GetAtt TaskRole.Arn
          - Effect: Allow
              - ecs:RunTask
            Resource: !Ref MigrationTask
          - Effect: Allow
              - ecs:StopTask
              - ecs:DescribeTasks
            Resource: "*"
          - Effect: Allow
              - events:PutTargets
              - events:PutRule
              - events:DescribeRule
            Resource: !Sub "arn:aws:events:${AWS::Region}:${AWS::AccountId}:rule/StepFunctionsGetEventsForECSTaskRule"
      StartAt: SetSchemas
          Type: Pass
              - accounting
              - finance
              - risk
              - agent_management
              - sales
              - products
              - history
          ResultPath: "$"
          Next: LoadSchemas
          Type: Map
          ItemsPath: "$.schemas"
          MaxConcurrency: 5
            StartAt: LoadSchema
                Type: Task
                Resource: "arn:aws:states:::ecs:runTask.sync"
                  Cluster: !Ref ECSCluster
                  TaskDefinition: !Ref MigrationTask
                  LaunchType: FARGATE
                      subnets: ["sn-23451", "sn-5483671"]
                      - Name: postgres
                          - Name: SCHEMA_NAME
                            "Value.$": "$"
                  - ErrorEquals: [ "States.ALL" ]
                    MaxAttempts: 5
                End: true

          End: true

In order to solidify this solution, we executed our migration over and over in our development environment. Then, once that was working we ran the migration in production every weekend leading up to our go-live date. This process gave us the confidence on the reliability of the process and the timing we needed within the larger deployment timeline. We were ultimately able to migrate all 8TB of data on-premise to AWS Aurora within our 24 hour window.