As holders of the Migration Consulting Competency from AWS, we are often deeply involved in cloud migration projects. We recently tackled a seemingly straight forward migration of an on-premise Postgres database to AWS Aurora. This database was a data warehouse that was shared by many teams and contained more than 60 schemas with over 8TB of data.
Since this system was used for business operations, we wanted a migration strategy that would limit or eliminate downtime.
The Database Migration Service advertises itself as a serverless solution for migrating data from an on-premise database to an RDS database without downtime by leveraging Change Data Capture events.
While proving out this service, we encountered two insurmountable issues; Firstly, DMS had an issue with the extensive use of Postgres table inheritance.
It would see records in the child tables as appearing in both the parent table as well as the child.
Secondly, this particular database included many large tables that contained TEXT
columns full of JSON.
In our testing, these columns would regularly fail to load into the Aurora RDS database or be extremely slow to load.
We then considered using a Snowball to move the data. This would allow us to create a local database dump within the data center, save it on the Snowball device, then AWS can load that dump into S3 and finally we restore the S3 file into the target database. We ruled out this option because it would prevent us from making data changes between the dump and load steps, which could be several days to a week.
Finally, we landed on using the native pg_dump and pg_restore. This is the simplest way to move data between two Postgres databases. There are many ways to execute these commands, but a simple way that pipes directly between two database is to do something like the following:
pg_dump --host=source_server --dbname=source_db --format=custom | pg_restore --host=target_server --dbname=target_db --clean
For a small database, this could be executed on a workstation that has access to both the source and target servers. In our situation though, we were moving a large amount of data that would take hours to transfer. We also had a short amount of downtime and couldn’t afford to be reliant on manual steps.
In order to address the first concern, we decided to execute the dump/restore within AWS instead of locally. This shortens the network path between the databases and removes the reliability of a workstation from the mix. We could have spun up an EC2 instance. But, instead, we decided to create a Fargate Task Definition. Since we are just running native Postgres commands we can actually launch the public Postgres image. Here is an example Cloudformation snippet:
ECSCluster:
Type: AWS::ECS::Cluster
MigrationTask:
Type: AWS::ECS::TaskDefinition
Properties:
TaskRoleArn: !GetAtt TaskRole.Arn
ExecutionRoleArn : !GetAtt TaskRole.Arn
NetworkMode: awsvpc
RequiresCompatibilities: ["FARGATE"]
Cpu: 1
Memory: 2048
ContainerDefinitions:
- Name: postgres
Image: public.ecr.aws/docker/library/postgres:alpine
Essential: true
Environment:
- Name: SOURCE_HOST
Value: "source_host"
- Name: TARGET_HOST
Value: !GetAtt DatabaseCluster.Endpoint.Address
LogConfiguration:
LogDriver: awslogs
Options:
"awslogs-group": !Ref LogGroup
"awslogs-region": !Ref AWS::Region
"awslogs-stream-prefix": "db-migration"
Secrets:
- Name: SOURCE_PASSWORD
ValueFrom: !Sub "${SourceSecret.Arn}:password::"
- Name: TARGET_PASSWORD
ValueFrom: !Sub "${Database.MasterUserSecret.SecretArn}:password::"
Entrypoint: ['bash', '-c']
Command: |
set -e
echo "${SOURCE_HOST}:5432:source_db:postgres:${SOURCE_PASSWORD}" >> ~/.pgpass
echo "${TARGET_HOST}:5432:target_db:postgres:${TARGET_PASSWORD}" >> ~/.pgpass
chmod 0600 ~/.pgpass
pg_dump --host=${SOURCE_HOST} --dbname=source_db --no-password --format=custom \
| pg_restore --host=${TARGET_HOST} --dbname=target_db --no-password --clean
We added the above configuration to the CloudFormation template responsible for deploying our Aurora Database. In order to execute a migration, all we had to do was start a task using this definition and provide the subnet to launch on. This worked well in testing but wasn’t quite what we were looking for: It was still a manual effort to kick this off, and it was very slow. We only had around 24 hours of downtime and this process took longer than that even in lower environments. It was also fragile as a network hiccup or any error would stop the entire process. What we needed was the ability to break down the work into smaller, atomic units that could be retried and run them in parallel.
AWS Step Functions is a service that can coordinate work between other AWS services. Since it Supports AWS EC/Fargate we can use it to launch the migration tasks for each schema. These can be run in parallel, with a concurrency limit. In addition, each “state” within the step function can be configured with automatic retries. In cloudformation, a state machine looks like this:
MigrationStateMachine:
Type: AWS::Serverless::StateMachine
Properties:
Policies:
- Version: '2012-10-17'
Statement:
- Effect: Allow
Action:
- iam:PassRole
Resource: !GetAtt TaskRole.Arn
- Effect: Allow
Action:
- ecs:RunTask
Resource: !Ref MigrationTask
- Effect: Allow
Action:
- ecs:StopTask
- ecs:DescribeTasks
Resource: "*"
- Effect: Allow
Action:
- events:PutTargets
- events:PutRule
- events:DescribeRule
Resource: !Sub "arn:aws:events:${AWS::Region}:${AWS::AccountId}:rule/StepFunctionsGetEventsForECSTaskRule"
Definition:
StartAt: SetSchemas
States:
SetSchemas:
Type: Pass
Result:
schemas:
- accounting
- finance
- risk
- agent_management
- sales
- products
- history
ResultPath: "$"
Next: LoadSchemas
LoadSchemas:
Type: Map
ItemsPath: "$.schemas"
MaxConcurrency: 5
ItemProcessor:
StartAt: LoadSchema
States:
LoadSchema:
Type: Task
Resource: "arn:aws:states:::ecs:runTask.sync"
Parameters:
Cluster: !Ref ECSCluster
TaskDefinition: !Ref MigrationTask
LaunchType: FARGATE
NetworkConfiguration:
awsvpcConfiguration:
subnets: ["sn-23451", "sn-5483671"]
Overrides:
ContainerOverrides:
- Name: postgres
Environment:
- Name: SCHEMA_NAME
"Value.$": "$"
Retry:
- ErrorEquals: [ "States.ALL" ]
MaxAttempts: 5
End: true
End: true
In order to solidify this solution, we executed our migration over and over in our development environment. Then, once that was working we ran the migration in production every weekend leading up to our go-live date. This process gave us the confidence on the reliability of the process and the timing we needed within the larger deployment timeline. We were ultimately able to migrate all 8TB of data on-premise to AWS Aurora within our 24 hour window.