
Node Reference - CloudWatch

07/24/2018 By Paul Rowe, Matt Vincent

Monitoring

How do we answer the question: “Is our application performing correctly?” With just one application server, we could remotely log into the server, look at CPU and memory load, run grep on the log files, and then determine that everything is fine. This approach is labor intensive and obviously does not scale well when our service is horizontally scaled to run on many servers, or in many containers, in AWS.

What we need is a system to gather the information we care about (also known as “Metrics”), aggregate it, and present it to us in a digestible way. There are many monitoring solutions on the market. However, to keep things simple, we will be leveraging AWS’s built-in monitoring solution, CloudWatch. CloudWatch provides a good list of features that we will need to leverage in order to build out our monitoring solution:

  • It supports storing and aggregating metrics over time. This allows us to calculate an average or sum over several days, and allows us to look for patterns.
  • It can manage dashboards that allow us to graph metrics so we can visually determine if an application is performing unusually.
  • We can set up alarms that alert us when a metric exceeds a pre-determined threshold.
  • We can manage all of the above via CloudFormation. This allows us to follow the same infrastructure-as-code best practices as we follow with our application components. Essentially, monitoring can be thought of as simply another application whose end users are the development and operations teams.
  • It has close integration with many AWS services that allow us to start reading metrics with little to no additional setup within our infrastructure.

There is another monitoring product within AWS: X-Ray, a service that traces the duration of each request in your application, as well as the duration of the calls it makes to downstream services (e.g. DynamoDB). We are not leveraging X-Ray because its support for NodeJS projects has fallen behind the ecosystem in two main ways:

  1. In order to track a single request through various asynchronous functions, it uses continuation-local-storage to maintain context across callbacks and promises. Unfortunately, this does not play well with async/await (see Issue #12).
  2. The only supported HTTP frameworks are Express and Restify. We could write our own Koa Middleware. However, in combination with the above async/await limitation, we would have to basically write most of the client library.

When the async/await issue is resolved, and X-Ray has better integration with the rest of AWS (specifically CloudWatch), it will be worth a second look. In the meantime, if we want to know how long a specific piece of code (e.g. an HTTP request or calculation) takes, we can log the elapsed time to the console and use a Metric Filter to capture it into a custom metric.
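As a sketch of that fallback (the operation name and log format here are hypothetical — the Metric Filter pattern you configure would need to match whatever format you choose), the idea is to print elapsed time in a consistent, space-delimited line that a CloudWatch Logs Metric Filter can later extract into a custom metric:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(operation):
    """Print elapsed wall-clock time in a consistent format that a
    CloudWatch Logs Metric Filter could extract into a custom metric."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        # e.g. "METRIC operation=listProducts duration_ms=12.345"
        print(f"METRIC operation={operation} duration_ms={elapsed_ms:.3f}")

with timed("listProducts"):
    time.sleep(0.01)  # stand-in for the HTTP request or calculation being measured
```

Because every line shares the same shape, a single filter pattern can pull `duration_ms` out of the log stream for all operations.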

The first thing we need to do is look back at the question we are trying to answer: “Is our application performing correctly?” Before we get into setting up a bunch of snazzy dashboards and alarms, we should sit down with our team and agree on some definitions. We need to define what performing “correctly” means in terms of our specific “application”. “Application” is probably pretty easy: it is everything deployed within the CloudFormation stack that we set up. “Correctly” is another matter; the exact definition will depend on what your service is responsible for and how important it is within the overall organization. It might take some form of:

  1. The service should be fast enough.
  2. The service shouldn’t be producing errors and exceptions.

This list results in more terms that have to be defined and agreed upon. For our service, we will be using this more specific list:

  1. The 95th percentile of HTTP latency from the load balancer should be less than 200ms.
  2. There should be 0 requests that result in a 5xx HTTP status code returned to the client.
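To make the first definition concrete: a 95th percentile is the latency that 95% of requests beat, which (unlike an average) surfaces a slow minority of requests. A minimal nearest-rank sketch, with made-up sample latencies (CloudWatch's exact percentile method may differ slightly, e.g. by interpolating between samples):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p% of samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

# Nine fast requests and one slow outlier: the p95 latches onto the outlier,
# while the plain average (140ms) would make everything look fine.
latencies_ms = [120, 80, 95, 500, 110, 105, 90, 85, 100, 115]
```

Here `percentile(latencies_ms, 95)` returns the 500ms outlier, which is exactly why the team's definition uses p95 rather than an average.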

It turns out that both “TargetResponseTime” and “HTTPCode_Target_5XX_Count” are existing metrics sent by our load balancer to CloudWatch by default. Therefore, all we need to do is create a dashboard to graph these metrics.
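As a sanity check that these metrics are flowing, they can also be read back with the AWS SDK. This sketch only builds the request parameters for boto3's `cloudwatch.get_metric_statistics`; actually calling it requires AWS credentials and a real load balancer, and the load balancer name in the usage comment is hypothetical:

```python
from datetime import datetime, timedelta, timezone

def response_time_query(load_balancer_full_name, hours=1):
    """Parameters for cloudwatch.get_metric_statistics reading the p95 of
    TargetResponseTime for one Application Load Balancer."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "TargetResponseTime",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer_full_name}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 60,
        "ExtendedStatistics": ["p95"],  # percentile stats go here, not in "Statistics"
    }

# Usage (not run here):
#   boto3.client("cloudwatch").get_metric_statistics(
#       **response_time_query("app/my-lb/0123456789abcdef"))
```

Note the dimension is the same "LoadBalancer" / LoadBalancerFullName pair the dashboard below depends on.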

We have to remember that over time, the team will be responsible for monitoring several, if not dozens, of applications. Because of this, we don’t want to create a dashboard specific to this one application. If we did so, then we would have to individually check each application to determine if it was healthy. Instead, we are going to create a new CloudFormation stack with its own template to hold all of the dashboards and alarms that our team is concerned with. We recommend checking this template into its own source control repository and deploying it independently. This allows dashboards and other monitoring assets to be updated without triggering a deployment of a specific application.

In order to graph our load balancer from our monitoring stack, we need to know its “LoadBalancerFullName” value. CloudFormation auto-generates resource names if they are not specified, so we need to export this value from our stack template so we can in turn import it into our monitoring stack. Add the following to the “Outputs” section of our existing cloudformation.template.yml, then commit and push this change:

LoadBalancerFullName:
    Value: !GetAtt LoadBalancer.LoadBalancerFullName
    Export:
        Name: !Sub "${AWS::StackName}:LoadBalancerFullName"
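Cross-stack references are matched purely by export name. A toy model of how the Fn::Sub pattern above composes that name (real CloudFormation resolves pseudo parameters like AWS::StackName server-side; this is only to illustrate the string mechanics):

```python
import re

def fn_sub(template, variables):
    """Toy model of CloudFormation's Fn::Sub: replace each ${Name}
    with its value from `variables`."""
    return re.sub(r"\$\{([^}]+)\}", lambda m: variables[m.group(1)], template)

# The export created by the product service stack...
export_name = fn_sub("${AWS::StackName}:LoadBalancerFullName",
                     {"AWS::StackName": "ProductService-DEV"})
# ...is exactly the name the monitoring stack's Fn::ImportValue must ask for.
```

This is why the monitoring stack only needs the product service's stack name as a parameter: every other export name can be derived from it.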

Next, create a new CloudFormation template file (we named it monitoring.template.yml) and add this content to it to create a simple dashboard that will graph out Load Balancer response times, errors and also any Throttled read or write requests to our DynamoDB table:

AWSTemplateFormatVersion: "2010-09-09"
Description: Monitoring dashboards
Parameters:
  ProductServiceStackName:
    Type: String
    Description: Name of the product service cloudformation stack
Resources:
  Dashboard:
    Type: "AWS::CloudWatch::Dashboard"
    Properties:
      DashboardName: "My_Dashboard"
      DashboardBody:
        Fn::Sub:
          - |
            {
              "widgets": [
                {
                  "type": "metric",
                  "width": 24,
                  "properties": {
                    "title": "Average Response Time",
                    "period": 60,
                    "stat": "p95",
                    "region": "${AWS::Region}",
                    "metrics": [
                      ["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "${ProductServiceLoadBalancerFullName}", {"label": "Product Service"}]
                    ]
                  }
                },
                {
                  "type": "metric",
                  "width": 24,
                  "properties": {
                    "title": "Request Counts",
                    "period": 60,
                    "stat": "Sum",
                    "region": "${AWS::Region}",
                    "metrics": [
                      ["AWS/ApplicationELB", "HTTPCode_Target_5XX_Count", "LoadBalancer", "${ProductServiceLoadBalancerFullName}", {"label": "Product Service 5xx"}]
                    ]
                  }
                },
                {
                  "type": "metric",
                  "width": 24,
                  "properties": {
                    "title": "Throttled Requests",
                    "period": 60,
                    "stat": "Sum",
                    "region": "${AWS::Region}",
                    "metrics": [
                      ["AWS/DynamoDB", "ThrottledRequests", "TableName", "${ProductServiceTableName}", {"label": "Table Throttled Requests"}]
                    ]
                  }
                }
              ]
            }
          - ProductServiceLoadBalancerFullName:
              Fn::ImportValue: !Sub "${ProductServiceStackName}:LoadBalancerFullName"
            ProductServiceTableName:
              Fn::ImportValue: !Sub "${ProductServiceStackName}:ProductsTable::Id"

Note how in the template above we create a CloudFormation parameter to hold the name of the CloudFormation stack for our product service. This is used to dynamically construct the export name so that we can reference stack resources.

We then create a CloudWatch Dashboard resource with whatever name we choose, and a body that specifies the metrics we want to graph. The DashboardBody property must be a string, not an object, so we use YAML multiline support in combination with the Fn::Sub intrinsic function to build this JSON structure. We separate our metrics into separate graphs because they use different statistics.
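The same string-not-object constraint applies outside CloudFormation too. As an illustration (in Python, not the template language), this builds a structure equivalent to the first widget above and serializes it the way CloudWatch expects:

```python
import json

def dashboard_body(region, lb_full_name):
    """Build a dashboard body equivalent to the first widget above, then
    serialize it: CloudWatch accepts DashboardBody only as a JSON string."""
    body = {
        "widgets": [
            {
                "type": "metric",
                "width": 24,
                "properties": {
                    "title": "Average Response Time",
                    "period": 60,
                    "stat": "p95",
                    "region": region,
                    "metrics": [
                        ["AWS/ApplicationELB", "TargetResponseTime",
                         "LoadBalancer", lb_full_name, {"label": "Product Service"}],
                    ],
                },
            },
        ]
    }
    return json.dumps(body)
```

Building the body programmatically and serializing at the last moment avoids the escaping mistakes that are easy to make when hand-editing JSON inside a YAML string.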

As we add new services, we simply need to add a new “metrics” element to our dashboard to start graphing that service’s performance.

Deploy the template (monitoring.template.yml) with the following command:

aws cloudformation deploy \
    --stack-name=TeamA-Monitoring \
    --template-file=monitoring.template.yml \
    --parameter-overrides \
        ProductServiceStackName="ProductService-DEV"

Log into the CloudWatch console and select “Dashboards” to see our new dashboard.

Alarms

Having one place to quickly answer the question of correctness for all of our applications is great. However, it is only valuable if we are asking the question — that is, if we are actually looking at our dashboards. What we need now is the ability to set a threshold that will alert us if a service is experiencing issues so that we can take action. For that we need a CloudWatch Alarm.

CloudWatch Alarms monitor a metric for when it passes a specified threshold over a certain amount of time (all of this is configurable). When the threshold is crossed, the alarm triggers one or more Actions. These actions could trigger autoscaling, start a Simple Workflow Service task, or send a message to an SNS Topic.

We are going to send alarm notifications to an SNS Topic because a topic can forward messages to email addresses, phones (via text message), or a Lambda function (if we need something custom).

Let’s add a topic resource to the monitoring template (monitoring.template.yml):

  AlarmTopic: 
    Type: "AWS::SNS::Topic"
    Properties: {}

Next we add Alarms to monitoring.template.yml. We leverage the same exported value as our dashboard to reference our LoadBalancerFullName.

ProductServiceResponseTimeAlarm:
  Type: "AWS::CloudWatch::Alarm"
  Properties:
    AlarmDescription: Product Service response time over 100ms  
    Namespace: "AWS/ApplicationELB"
    MetricName: "TargetResponseTime"
    Dimensions:
      - Name: "LoadBalancer"
        Value: 
          Fn::ImportValue: !Sub "${ProductServiceStackName}:LoadBalancerFullName"
    ExtendedStatistic: p95
    ComparisonOperator: GreaterThanOrEqualToThreshold
    Threshold: 100
    Period: 60
    EvaluationPeriods: 1
    ActionsEnabled: true #This can be set to false in non-prod environments if you don't want to be alerted
    AlarmActions:
      - !Ref AlarmTopic
ProductServiceErrorAlarm:
  Type: "AWS::CloudWatch::Alarm"
  Properties:
    AlarmDescription: Product Service producing 5xx responses
    Namespace: "AWS/ApplicationELB"
    MetricName: "HTTPCode_Target_5XX_Count"
    Dimensions:
      - Name: "LoadBalancer"
        Value: 
          Fn::ImportValue: !Sub "${ProductServiceStackName}:LoadBalancerFullName"
    Statistic: Sum
    ComparisonOperator: GreaterThanThreshold
    Threshold: 0
    Period: 3600 # 1 Hour
    EvaluationPeriods: 1
    ActionsEnabled: true #This can be set to false in non-prod environments if you don't want to be alerted
    AlarmActions:
      - !Ref AlarmTopic
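Conceptually, each alarm computes its statistic once per Period, compares it against Threshold using ComparisonOperator, and moves to ALARM once EvaluationPeriods consecutive datapoints breach. A toy model of that evaluation (the real service has more states and configurable missing-data handling):

```python
def alarm_state(datapoints, threshold, comparison, evaluation_periods):
    """Toy CloudWatch alarm evaluation: ALARM when the most recent
    `evaluation_periods` datapoints all breach the threshold."""
    breach = {
        "GreaterThanThreshold": lambda v: v > threshold,
        "GreaterThanOrEqualToThreshold": lambda v: v >= threshold,
        "LessThanThreshold": lambda v: v < threshold,
        "LessThanOrEqualToThreshold": lambda v: v <= threshold,
    }[comparison]
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    return "ALARM" if all(breach(v) for v in recent) else "OK"

# A single 5xx in the latest one-hour period trips ProductServiceErrorAlarm
# (Threshold: 0, GreaterThanThreshold, EvaluationPeriods: 1):
state = alarm_state([0, 0, 1], threshold=0,
                    comparison="GreaterThanThreshold", evaluation_periods=1)
```

This is also why the error alarm uses an hour-long Period with a Sum statistic: any non-zero sum in that window is a breach.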

We also leverage the exported value of our Products DynamoDB table name in order to monitor it as well. Add the following Alarms to monitoring.template.yml to monitor ReadThrottleEvents, WriteThrottleEvents, ThrottledRequests, UserErrors and SystemErrors (see the DynamoDB metrics documentation: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/dynamo-metricscollected.html):

ReadThrottleEventsAlarm:
  Type: 'AWS::CloudWatch::Alarm'
  Properties:
    AlarmDescription: 'Reads are throttled. Lower ReadCapacityUnitsUtilizationTarget or increase MaxReadCapacityUnits.'
    Namespace: 'AWS/DynamoDB'
    MetricName: ReadThrottleEvents
    Dimensions:
    - Name: TableName
      Value:
        Fn::ImportValue: !Sub "${ProductServiceStackName}:ProductsTable::Id"
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 1
    ComparisonOperator: GreaterThanOrEqualToThreshold
    AlarmActions:
    - !Ref AlarmTopic
    OKActions:
    - !Ref AlarmTopic
WriteThrottleEventsAlarm:
  Type: 'AWS::CloudWatch::Alarm'
  Properties:
    AlarmDescription: 'Writes are throttled. Lower WriteCapacityUnitsUtilizationTarget or increase MaxWriteCapacityUnits.'
    Namespace: 'AWS/DynamoDB'
    MetricName: WriteThrottleEvents
    Dimensions:
    - Name: TableName
      Value:
        Fn::ImportValue: !Sub "${ProductServiceStackName}:ProductsTable::Id"
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 1
    ComparisonOperator: GreaterThanOrEqualToThreshold
    AlarmActions:
    - !Ref AlarmTopic
    OKActions:
    - !Ref AlarmTopic
ThrottledRequestsEventsAlarm:
  Type: 'AWS::CloudWatch::Alarm'
  Properties:
    AlarmDescription: 'Batch requests are throttled. Lower {Read/Write}CapacityUnitsUtilizationTarget or increase Max{Read/Write}CapacityUnits.'
    Namespace: 'AWS/DynamoDB'
    MetricName: ThrottledRequests
    Dimensions:
    - Name: TableName
      Value:
        Fn::ImportValue: !Sub "${ProductServiceStackName}:ProductsTable::Id"
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 1
    ComparisonOperator: GreaterThanOrEqualToThreshold
    AlarmActions:
    - !Ref AlarmTopic
    OKActions:
    - !Ref AlarmTopic
UserErrorsAlarm:
  Type: 'AWS::CloudWatch::Alarm'
  Properties:
    AlarmDescription: 'User errors'
    Namespace: 'AWS/DynamoDB'
    MetricName: UserErrors
    Dimensions:
    - Name: TableName
      Value:
        Fn::ImportValue: !Sub "${ProductServiceStackName}:ProductsTable::Id"
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 1
    ComparisonOperator: GreaterThanOrEqualToThreshold
    AlarmActions:
    - !Ref AlarmTopic
    OKActions:
    - !Ref AlarmTopic
SystemErrorsAlarm:
  Type: 'AWS::CloudWatch::Alarm'
  Properties:
    AlarmDescription: 'System errors'
    Namespace: 'AWS/DynamoDB'
    MetricName: SystemErrors
    Dimensions:
    - Name: TableName
      Value:
        Fn::ImportValue: !Sub "${ProductServiceStackName}:ProductsTable::Id"
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 1
    ComparisonOperator: GreaterThanOrEqualToThreshold
    AlarmActions:
    - !Ref AlarmTopic
    OKActions:
    - !Ref AlarmTopic

Next, all we need to do is subscribe to our topic. We can add a parameter to our CloudFormation template with the email address of an account that should receive alerts, then add a subscription to notify us. Add the following to monitoring.template.yml (the AlarmEmail parameter belongs under Parameters, and the subscription under Resources):

AlarmEmail:
  Type: String
  Description: Email address that should be alerted of Alarms
...
EmailAlarmSubscription:
  Type: "AWS::SNS::Subscription"
  Properties:
    TopicArn: !Ref AlarmTopic
    Protocol: email
    Endpoint: !Ref AlarmEmail

Commit and push your changes, then deploy the monitoring stack, and you now have production-level monitoring in place to alert you of issues:

aws cloudformation deploy \
    --stack-name=TeamA-Monitoring \
    --template-file=monitoring.template.yml \
    --parameter-overrides \
        AlarmEmail="nodereference@sourceallies.com"

Active Monitoring

The monitoring elements above are considered passive checks, as opposed to active checks. Our passive checks monitor real requests as they come in. However, this means that we will not know about a problem with our service until a client is affected. We still need to actively check our service to verify it is healthy, whether or not users/clients are invoking it.

With an active health check, we periodically “probe” upstream resources comprising our service.

We are already using /hello as our health check endpoint in our Load Balancer TargetGroup, but we are currently not notified if the endpoint is no longer accessible (except through our passive checks). We can add an active check on the same endpoint so that it is polled periodically, and we are notified if it no longer responds with an HTTP 200.
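Route53 will do this probing for us, but the logic is easy to model: the endpoint is polled on an interval and declared unhealthy only after a number of consecutive failed probes (so a single network blip does not page anyone). A toy version of that evaluation, mirroring the FailureThreshold setting used below:

```python
def evaluate_health(status_codes, failure_threshold=3):
    """Toy Route53-style health check: Unhealthy after `failure_threshold`
    consecutive non-200 probe results; any successful probe resets the count."""
    consecutive_failures = 0
    for code in status_codes:
        consecutive_failures = 0 if code == 200 else consecutive_failures + 1
        if consecutive_failures >= failure_threshold:
            return "Unhealthy"
    return "Healthy"
```

For example, three 500s in a row mark the endpoint unhealthy, while isolated failures separated by successes do not.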

To implement this active check, add one more export to the Outputs section in cloudformation.template.yml and then commit and push your changes:

  FullyQualifiedDomainName:
    Value: !Sub "${SubDomain}.${BaseDomain}"
    Export:
      Name: !Sub "${AWS::StackName}:FQDN"

Next, add this new parameter to your Parameters section in monitoring.template.yml in order to specify the route we want to use for our health check:

HealthCheckRoute:
  Type: String
  Description: An unauthenticated endpoint for health check purposes.  Returns 200 if OK.  Example is "/health" or "/hello".

Then add a Route53 HealthCheck and a CloudWatch Alarm resource to the Resources section in monitoring.template.yml:

  DNSHealthCheck:
    Type: "AWS::Route53::HealthCheck"
    Properties:
      HealthCheckConfig:
        EnableSNI: true
        FailureThreshold: 3
        FullyQualifiedDomainName:
          Fn::ImportValue:
            Fn::Sub: ${ProductServiceStackName}:FQDN
        Inverted: false
        Port: 443
        RequestInterval: 30
        ResourcePath: !Ref HealthCheckRoute
        Type: "HTTPS"
  HealthCheckAlarm:
    Type: "AWS::CloudWatch::Alarm"
    Properties:
      AlarmActions:
        - !Ref AlarmTopic
      ComparisonOperator: "LessThanThreshold"
      Dimensions:
      - Name: HealthCheckId
        Value: !Ref DNSHealthCheck
      EvaluationPeriods: 1
      MetricName: "HealthCheckStatus"
      Namespace: "AWS/Route53"
      Period: 60
      Statistic: "Minimum"
      Threshold: 1.0

Finally, you can deploy the changes to your monitoring stack with this command:

aws cloudformation deploy \
    --stack-name=TeamA-Monitoring \
    --template-file=monitoring.template.yml \
    --parameter-overrides \
        HealthCheckRoute="/hello"

Once the stack is deployed, you’ll receive an email from AWS. You’ll need to click on a verification link in order to receive emails from the SNS Topic subscriptions.

You can verify your new active check via the AWS Management Console. Navigate to Route53 and then to Health Checks to find your new Health Check.

You can see our template changes here.

Table of Contents

If you have questions or feedback on this series, contact the authors at nodereference@sourceallies.com.
