Source Allies, like many organizations, has several AWS environments. In addition to production, we also have a dev and a qual environment. An application deployed to all three environments runs three copies of its infrastructure. If that architecture includes RDS databases, EC2 instances, ECS tasks, or other compute, we are billed for every minute those services are running. Our team only uses the non-production environments while actively testing, so the rest of that time is a wasted expense that could account for two-thirds of a project's overall AWS spend.
Creating a scheduled job to stop these resources during off-hours isn’t a new idea. Generally this involves a Lambda function with a bit of code to make the appropriate AWS calls. Since Step Functions added support for calling almost any AWS API natively, we can instead use a state machine to shut down our database. In CloudFormation it looks like this:
AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31
Parameters:
  ScaleDownOffHours:
    Type: String
    Default: "false"
Conditions:
  ConfigureScaleDownOffHours: !Equals [ "true", !Ref ScaleDownOffHours ]
Resources:
  ...
  ScaleDownOffHoursStateMachine:
    Condition: ConfigureScaleDownOffHours
    Type: AWS::Serverless::StateMachine
    Properties:
      Definition:
        StartAt: ScaleDown
        States:
          ScaleDown:
            Type: Task
            Resource: "arn:aws:states:::aws-sdk:rds:stopDBCluster"
            Parameters:
              DbClusterIdentifier: !Ref DatabaseCluster
            End: true
  ...
We’re using an AWS::Serverless::StateMachine rather than an AWS::StepFunctions::StateMachine.
This configuration leverages the Serverless transform, which lets us inline the additional pieces needed to run the state machine on a schedule.
First, we need to create an IAM Role that gives the State Machine permission to stop the database.
We can do that by adding a Policies property to the resource, and the Serverless transform will expand it into a full Role at deploy time:
  ScaleDownOffHoursStateMachine:
    Condition: ConfigureScaleDownOffHours
    Type: AWS::Serverless::StateMachine
    Properties:
      ...
      Policies:
        - Version: '2012-10-17'
          Statement:
            - Effect: Allow
              Action:
                - rds:StopDBCluster
                - rds:StartDBCluster
              Resource:
                - !GetAtt DatabaseCluster.DBClusterArn
We want to scale down every day at 5 PM Central.
We can add an Events property, and the transform will expand that into additional resources.
Those resources will kick off the state machine on the appropriate schedule.
  ScaleDownOffHoursStateMachine:
    Condition: ConfigureScaleDownOffHours
    Type: AWS::Serverless::StateMachine
    Properties:
      ...
      Events:
        ScaleDown:
          Type: ScheduleV2
          Properties:
            ScheduleExpressionTimezone: America/Chicago
            ScheduleExpression: "cron(0 17 * * ? *)"
If we stop here, we have a single resource we can add to our template that automatically shuts down the database every day at 5 PM.
Additional states can be added to the state machine to stop other resources as well (such as an EC2 instance).
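As a rough sketch of what chaining an extra state could look like, the definition below stops an EC2 instance after the database. AppInstance is a hypothetical AWS::EC2::Instance resource assumed to live in the same template; it is not part of the original example.

```yaml
# Sketch only: AppInstance is a hypothetical AWS::EC2::Instance in this template.
Definition:
  StartAt: ScaleDown
  States:
    ScaleDown:
      Type: Task
      Resource: "arn:aws:states:::aws-sdk:rds:stopDBCluster"
      Parameters:
        DbClusterIdentifier: !Ref DatabaseCluster
      Next: StopInstance   # chain to the next state instead of ending here
    StopInstance:
      Type: Task
      Resource: "arn:aws:states:::aws-sdk:ec2:stopInstances"
      Parameters:
        InstanceIds:
          - !Ref AppInstance   # !Ref on an EC2 instance returns its instance ID
      End: true
```

For this to run, the Policies property would also need to allow ec2:StopInstances on that instance.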
One downside to this approach is that our environment is never started back up; we would have to do that manually.
We can modify the definition of our state machine to actually start resources as well.
Replace the Definition element with:
  ScaleDownOffHoursStateMachine:
    Condition: ConfigureScaleDownOffHours
    Type: AWS::Serverless::StateMachine
    Properties:
      ...
      Definition:
        StartAt: DetermineDirection
        States:
          DetermineDirection:
            Type: Choice
            Choices:
              - Variable: "$$.Execution.Input.source"
                StringEquals: aws.scheduler
                Next: ScaleDown
            Default: ScaleUp
          ScaleUp:
            Type: Task
            Resource: "arn:aws:states:::aws-sdk:rds:startDBCluster"
            Parameters:
              DbClusterIdentifier: !Ref DatabaseCluster
            End: true
          ScaleDown:
            Type: Task
            Resource: "arn:aws:states:::aws-sdk:rds:stopDBCluster"
            Parameters:
              DbClusterIdentifier: !Ref DatabaseCluster
            End: true
This definition will start the database whenever the state machine is triggered by something other than the scheduled event (for example, when it is started manually). Let’s go even further by adding an event to start the database whenever we deploy a new version of our application:
  ScaleDownOffHoursStateMachine:
    Condition: ConfigureScaleDownOffHours
    Type: AWS::Serverless::StateMachine
    Properties:
      ...
      Events:
        ...
        ScaleUp:
          Type: EventBridgeRule
          Properties:
            Pattern:
              source: [ "aws.cloudformation" ]
              account: [ !Ref AWS::AccountId ]
              detail-type: [ "CloudFormation Stack Status Change" ]
              detail:
                stack-id: [ !Ref AWS::StackId ]
                status-details:
                  status: [ "UPDATE_IN_PROGRESS" ]
This event listens for the current stack to enter the “UPDATE_IN_PROGRESS” state and starts the database in response. The start isn’t a synchronous operation, so it will still take a few moments before the application is usable.
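If the database should also come back up on its own each morning, one possible approach is a second schedule that overrides the input it sends. Because the Choice state only routes to ScaleDown when the input's source is aws.scheduler, a custom Input should fall through to ScaleUp. The event name ScaleUpMorning, the 7 AM weekday time, and the custom.morning-start value are all illustrative assumptions, not part of the original template.

```yaml
# Sketch only: the event name, time, and source value below are assumptions.
Events:
  ScaleUpMorning:
    Type: ScheduleV2
    Properties:
      ScheduleExpressionTimezone: America/Chicago
      ScheduleExpression: "cron(0 7 ? * MON-FRI *)"
      # Custom input replaces the default scheduler payload, so
      # $$.Execution.Input.source is not "aws.scheduler" and the
      # DetermineDirection state takes its Default branch (ScaleUp).
      Input: '{"source": "custom.morning-start"}'
```

The existing policy already allows rds:StartDBCluster, so no additional permissions should be needed for this variant.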
This is just a sample of some of the ways to manage your non-production infrastructure. State machines are flexible enough that all sorts of innovative combinations can be supported. You could even set up a Wait state to automatically shut things down a certain amount of time after they are deployed. Take a look at the complete template on our GitHub repository.
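As a minimal sketch of the Wait-state idea, the ScaleUp state could hand off to a Wait state that later triggers ScaleDown. The state name WaitBeforeScaleDown and the four-hour delay are assumptions chosen for illustration.

```yaml
# Sketch only: WaitBeforeScaleDown and the 4-hour delay are assumptions.
States:
  ScaleUp:
    Type: Task
    Resource: "arn:aws:states:::aws-sdk:rds:startDBCluster"
    Parameters:
      DbClusterIdentifier: !Ref DatabaseCluster
    Next: WaitBeforeScaleDown   # instead of End: true
  WaitBeforeScaleDown:
    Type: Wait
    Seconds: 14400              # 4 hours
    Next: ScaleDown
  ScaleDown:
    Type: Task
    Resource: "arn:aws:states:::aws-sdk:rds:stopDBCluster"
    Parameters:
      DbClusterIdentifier: !Ref DatabaseCluster
    End: true
```

With this shape, each deployment would start the database and then stop it again once the wait elapses, independent of the 5 PM schedule.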