Node Reference - CloudWatch
Monitoring
How do we answer the question: "Is our application performing correctly?" With just one application server, we could remotely log into the server, look at CPU and memory load, run grep{:target="_blank"} on the log files, and then determine that everything is fine. This approach is manually intensive and obviously does not scale well when our service is horizontally scaled to run on many servers, or in many containers, in AWS.
What we need is a system to gather the information we care about (also known as "metrics"), aggregate it, and present it to us in a digestible way. There are many monitoring solutions on the market. However, to keep things simple, we will be leveraging AWS's built-in monitoring solution, CloudWatch{:target="_blank"}. CloudWatch provides a good list of features that we will need to leverage in order to build out our monitoring solution:
- It supports storing and aggregating metrics{:target="_blank"} over time. This allows us to calculate an average or sum over several days, and allows us to look for patterns.
- It can manage dashboards{:target="_blank"} that allow us to graph metrics so we can visually determine if an application is performing unusually.
- We can set up alarms{:target="_blank"} that alert us when a metric exceeds a pre-determined threshold.
- We can manage all of the above via CloudFormation{:target="_blank"}. This allows us to follow the same infrastructure-as-code{:target="_blank"} best practices as we follow with our application components. Essentially, monitoring can be thought of as simply another application whose end users are the development and operation teams.
- It has close integration with many AWS services{:target="_blank"} that allow us to start reading metrics with little to no additional setup within our infrastructure.
There is another monitoring product within AWS: X-Ray{:target="_blank"}, a service that traces the duration of each request in your application as well as the duration of the calls it makes to downstream services (e.g. DynamoDB). We are not leveraging X-Ray because its support for NodeJS projects has fallen behind the ecosystem in two main ways:
- In order to track a single request through various asynchronous functions, it uses continuation-local-storage to maintain context across callbacks and promises. Unfortunately, this does not play well with async/await (see Issue #12{:target="_blank"}).
- The only supported HTTP frameworks are Express and Restify{:target="_blank"}. We could write our own Koa Middleware{:target="_blank"}. However, in combination with the above async/await limitation, we would basically have to write most of the client library ourselves.
When the async/await issue is resolved, and X-Ray has better integration with the rest of AWS (specifically CloudWatch), it will be worth a second look. In the meantime, if we want to know how long a specific piece of code (e.g. an HTTP request or a calculation) takes, we can log the time taken to the console and use a Metric Filter{:target="_blank"} to capture it into a custom metric.
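As a sketch of that approach, a Metric Filter can be declared in CloudFormation. Everything below is hypothetical: the log group name, the log line format, and the metric names are assumptions for illustration, not part of our actual product service. It assumes the application writes space-delimited lines like `timing lookupProduct 123` (duration in milliseconds) to its log group:

```yaml
# Hypothetical sketch; log group, log format, and metric names are assumptions.
DurationMetricFilter:
  Type: 'AWS::Logs::MetricFilter'
  Properties:
    LogGroupName: '/ecs/product-service'
    # Matches space-delimited log lines such as: "timing lookupProduct 123"
    FilterPattern: '[label=timing, name=lookupProduct, millis]'
    MetricTransformations:
      - MetricNamespace: 'ProductService'
        MetricName: 'LookupProductDuration'
        MetricValue: '$millis'
```

Once captured this way, the custom metric can be graphed and alarmed on exactly like the built-in metrics we use below.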
The first thing we need to do is look back at the question we are trying to answer: "Is our application performing correctly?" Before we get into setting up a bunch of snazzy dashboards and alarms, we should sit down with our team and agree on some definitions. We need to define what performing "correctly" means in terms of our specific "application". Application is probably pretty easy: it is everything deployed within the CloudFormation stack that we set up. Correctly is another matter. The exact definition will depend on what your service is responsible for and how important it is within the overall organization. It might take some form of:
- The service should be fast enough.
- The service shouldn't be producing errors and exceptions.
This list results in more terms that have to be defined and agreed upon. For our service, we will be using this more specific list:
- The 95th percentile of HTTP latency from the load balancer should be less than 200ms.
- There should be 0 requests that result in a 5xx HTTP status code being returned to the client.
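To make the percentile target concrete, here is a small nearest-rank p95 calculation (illustrative only; CloudWatch computes percentiles from its own stored samples):

```javascript
// Illustration of a nearest-rank 95th percentile, not CloudWatch's exact algorithm.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based nearest-rank index
  return sorted[rank - 1];
}

// 100 latency samples: 95 fast requests (50ms) and 5 slow outliers (500ms).
const latencies = [...Array(95).fill(50), ...Array(5).fill(500)];
console.log(percentile(latencies, 95)); // → 50
```

Note that the p95 here is 50ms even though 5% of requests took 500ms, while the plain average (72.5ms) blurs the outliers into the rest of the traffic. This is why a percentile is a better statistic for a latency target than the mean.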
It turns out that both "TargetResponseTime" and "HTTPCode_Target_5XX_Count" are existing metrics sent by our load balancer to CloudWatch by default. Therefore, all we need to do is create a dashboard to graph these metrics.
We have to remember that over time, the team will be responsible for monitoring several, if not dozens, of applications. Because of this, we don't want to create a dashboard specific to this one application. If we did so, then we would have to individually check each application to determine if it was healthy. Instead, we are going to create a new CloudFormation{:target="_blank"} stack with its own template to hold all of the dashboards and alarms that our team is concerned with. We recommend checking this template into its own source control repository and deploying it independently. This allows dashboards and other monitoring assets to be updated without triggering a deployment of a specific application.
In order to graph our load balancer from our monitoring stack we need to know its "LoadBalancerFullName" value.
CloudFormation auto-generates resource names if they are not specified, so we need to export this value{:target="_blank"} from our stack template so we can in turn import the value{:target="_blank"} into our monitoring stack. Add the following to the "Outputs" section of our existing `cloudformation.template.yml` and commit & push this change:
```yaml
LoadBalancerFullName:
  Value: !GetAtt LoadBalancer.LoadBalancerFullName
  Export:
    Name: !Sub '${AWS::StackName}:LoadBalancerFullName'
```
Next, create a new CloudFormation template file (we named it `monitoring.template.yml`) and add this content to it to create a simple dashboard that will graph our Load Balancer response times, errors, and also any throttled read or write requests to our DynamoDB table:
```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: Monitoring dashboards
Parameters:
  ProductServiceStackName:
    Type: String
    Description: Name of the product service CloudFormation stack
Resources:
  Dashboard:
    Type: 'AWS::CloudWatch::Dashboard'
    Properties:
      DashboardName: 'My_Dashboard'
      DashboardBody:
        Fn::Sub:
          - |
            {
              "widgets": [
                {
                  "type": "metric",
                  "width": 24,
                  "properties": {
                    "title": "Response Time (p95)",
                    "period": 60,
                    "stat": "p95",
                    "region": "${AWS::Region}",
                    "metrics": [
                      ["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "${ProductServiceLoadBalancerFullName}", {"label": "Product Service"}]
                    ]
                  }
                },
                {
                  "type": "metric",
                  "width": 24,
                  "properties": {
                    "title": "Request Counts",
                    "period": 60,
                    "stat": "Sum",
                    "region": "${AWS::Region}",
                    "metrics": [
                      ["AWS/ApplicationELB", "HTTPCode_Target_5XX_Count", "LoadBalancer", "${ProductServiceLoadBalancerFullName}", {"label": "Product Service 5xx"}]
                    ]
                  }
                },
                {
                  "type": "metric",
                  "width": 24,
                  "properties": {
                    "title": "Throttled Requests",
                    "period": 60,
                    "stat": "Sum",
                    "region": "${AWS::Region}",
                    "metrics": [
                      ["AWS/DynamoDB", "ThrottledRequests", "TableName", "${ProductServiceTableName}", {"label": "Table Throttled Requests"}]
                    ]
                  }
                }
              ]
            }
          - ProductServiceLoadBalancerFullName:
              Fn::ImportValue: !Sub '${ProductServiceStackName}:LoadBalancerFullName'
            ProductServiceTableName:
              Fn::ImportValue: !Sub '${ProductServiceStackName}:ProductsTable::Id'
```
Note how in the template above we create a CloudFormation parameter to hold the name of the CloudFormation stack for our product service. This is used to dynamically construct the export name so that we can reference stack resources.
We then create a CloudWatch Dashboard{:target="_blank"} resource with whatever name we choose and a body that specifies the metrics we want to graph. The DashboardBody property must be a string and not an object, so we use YAML multiline support in combination with the Fn::Sub intrinsic function{:target="_blank"} to build this JSON structure. We split our metrics across separate graphs because they use different statistics and units.
As we add new services, we simply need to add a new line to the "metrics" array of a widget to start graphing that service's performance.
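For example, if a hypothetical second service exported its load balancer name the same way, its line could be appended to the response-time widget (the `OrderService...` names below are placeholders, not resources from this series):

```json
"metrics": [
  ["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "${ProductServiceLoadBalancerFullName}", {"label": "Product Service"}],
  ["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "${OrderServiceLoadBalancerFullName}", {"label": "Order Service"}]
]
```

Each additional line renders as another series on the same graph, so one widget can answer the latency question for the whole team's services at a glance.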
Deploy the template (`monitoring.template.yml`) with the following command:
```shell
aws cloudformation deploy \
  --stack-name=TeamA-Monitoring \
  --template-file=monitoring.template.yml \
  --parameter-overrides \
    ProductServiceStackName="ProductService-DEV"
```
Log into the CloudWatch console{:target="_blank"} and select "Dashboards" to see our new dashboard.
Alarms
Having one place to quickly answer the question of correctness for all of our applications is great. However, it is only valuable if we are actually asking the question, that is, if we are looking at our dashboards. What we need now is the ability to set a threshold that will alert us if a service is experiencing issues so that we can take action. For that we need a CloudWatch Alarm{:target="_blank"}.
CloudWatch Alarms monitor a metric for when it passes a specified threshold over a certain amount of time (all of which is configurable). When a threshold is crossed, the alarm triggers one or more Actions{:target="_blank"}. These actions can trigger autoscaling, start a Simple Workflow Service{:target="_blank"} workflow, or send a message to an SNS Topic{:target="_blank"}.
We are going to be sending Alarms to an SNS topic because a topic can forward messages to email addresses, phones (via text message), or a Lambda function (if we need something custom).
Let's add a topic element to the monitoring template (`monitoring.template.yml`):
```yaml
AlarmTopic:
  Type: "AWS::SNS::Topic"
  Properties: {}
```
Next we add Alarms to `monitoring.template.yml`. We are leveraging the same exported value as our dashboard to reference our LoadBalancerFullName:
```yaml
ProductServiceResponseTimeAlarm:
  Type: "AWS::CloudWatch::Alarm"
  Properties:
    AlarmDescription: Product Service response time over 100ms
    Namespace: "AWS/ApplicationELB"
    MetricName: "TargetResponseTime"
    Dimensions:
      - Name: "LoadBalancer"
        Value:
          Fn::ImportValue: !Sub "${ProductServiceStackName}:LoadBalancerFullName"
    ExtendedStatistic: p95
    ComparisonOperator: GreaterThanOrEqualToThreshold
    Threshold: 0.1 # TargetResponseTime is reported in seconds, so 0.1 = 100ms
    Period: 60
    EvaluationPeriods: 1
    ActionsEnabled: true # This can be set to false in non-prod environments if you don't want to be alerted
    AlarmActions:
      - !Ref AlarmTopic
ProductServiceErrorAlarm:
  Type: "AWS::CloudWatch::Alarm"
  Properties:
    AlarmDescription: Product Service producing 5xx responses
    Namespace: "AWS/ApplicationELB"
    MetricName: "HTTPCode_Target_5XX_Count"
    Dimensions:
      - Name: "LoadBalancer"
        Value:
          Fn::ImportValue: !Sub "${ProductServiceStackName}:LoadBalancerFullName"
    Statistic: Sum
    ComparisonOperator: GreaterThanThreshold
    Threshold: 0
    Period: 3600 # 1 hour
    EvaluationPeriods: 1
    ActionsEnabled: true # This can be set to false in non-prod environments if you don't want to be alerted
    AlarmActions:
      - !Ref AlarmTopic
```
We also leverage the exported value of our Products DynamoDB table name in order to monitor it as well. Add the following Alarms to `monitoring.template.yml` to monitor [ReadThrottleEvents, WriteThrottleEvents, ThrottledRequests, UserErrors and SystemErrors](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/dynamo-metricscollected.html){:target="_blank"}:
```yaml
ReadThrottleEventsAlarm:
  Type: 'AWS::CloudWatch::Alarm'
  Properties:
    AlarmDescription: 'Reads are throttled. Lower ReadCapacityUnitsUtilizationTarget or increase MaxReadCapacityUnits.'
    Namespace: 'AWS/DynamoDB'
    MetricName: ReadThrottleEvents
    Dimensions:
      - Name: TableName
        Value:
          Fn::ImportValue: !Sub "${ProductServiceStackName}:ProductsTable::Id"
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 1
    ComparisonOperator: GreaterThanOrEqualToThreshold
    AlarmActions:
      - !Ref AlarmTopic
    OKActions:
      - !Ref AlarmTopic
WriteThrottleEventsAlarm:
  Type: 'AWS::CloudWatch::Alarm'
  Properties:
    AlarmDescription: 'Writes are throttled. Lower WriteCapacityUnitsUtilizationTarget or increase MaxWriteCapacityUnits.'
    Namespace: 'AWS/DynamoDB'
    MetricName: WriteThrottleEvents
    Dimensions:
      - Name: TableName
        Value:
          Fn::ImportValue: !Sub "${ProductServiceStackName}:ProductsTable::Id"
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 1
    ComparisonOperator: GreaterThanOrEqualToThreshold
    AlarmActions:
      - !Ref AlarmTopic
    OKActions:
      - !Ref AlarmTopic
ThrottledRequestsEventsAlarm:
  Type: 'AWS::CloudWatch::Alarm'
  Properties:
    AlarmDescription: 'Batch requests are throttled. Lower {Read/Write}CapacityUnitsUtilizationTarget or increase Max{Read/Write}CapacityUnits.'
    Namespace: 'AWS/DynamoDB'
    MetricName: ThrottledRequests
    Dimensions:
      - Name: TableName
        Value:
          Fn::ImportValue: !Sub "${ProductServiceStackName}:ProductsTable::Id"
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 1
    ComparisonOperator: GreaterThanOrEqualToThreshold
    AlarmActions:
      - !Ref AlarmTopic
    OKActions:
      - !Ref AlarmTopic
UserErrorsAlarm:
  Type: 'AWS::CloudWatch::Alarm'
  Properties:
    AlarmDescription: 'User errors'
    Namespace: 'AWS/DynamoDB'
    MetricName: UserErrors
    Dimensions:
      - Name: TableName
        Value:
          Fn::ImportValue: !Sub "${ProductServiceStackName}:ProductsTable::Id"
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 1
    ComparisonOperator: GreaterThanOrEqualToThreshold
    AlarmActions:
      - !Ref AlarmTopic
    OKActions:
      - !Ref AlarmTopic
SystemErrorsAlarm:
  Type: 'AWS::CloudWatch::Alarm'
  Properties:
    AlarmDescription: 'System errors'
    Namespace: 'AWS/DynamoDB'
    MetricName: SystemErrors
    Dimensions:
      - Name: TableName
        Value:
          Fn::ImportValue: !Sub "${ProductServiceStackName}:ProductsTable::Id"
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 1
    ComparisonOperator: GreaterThanOrEqualToThreshold
    AlarmActions:
      - !Ref AlarmTopic
    OKActions:
      - !Ref AlarmTopic
```
Next, all we need to do is subscribe to our topic.
We can add a parameter to our CloudFormation template with the email address of an account that should receive alerts, and then add a subscription to notify us. Add the following to the `Parameters` section of `monitoring.template.yml`:

```yaml
AlarmEmail:
  Type: String
  Description: Email address that should be alerted of Alarms
```

And add this resource to the `Resources` section:

```yaml
EmailAlarmSubscription:
  Type: 'AWS::SNS::Subscription'
  Properties:
    TopicArn: !Ref AlarmTopic
    Protocol: email
    Endpoint: !Ref AlarmEmail
```
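Topics are not limited to email. As one sketch of the text-message option mentioned earlier, an SMS subscription could sit alongside the email one (the phone number below is a placeholder, not a real contact):

```yaml
# Hypothetical additional subscription; the endpoint is a placeholder number.
SmsAlarmSubscription:
  Type: 'AWS::SNS::Subscription'
  Properties:
    TopicArn: !Ref AlarmTopic
    Protocol: sms
    Endpoint: '+15555550100' # E.164 format
```

Because both subscriptions point at the same topic, every alarm notification fans out to all subscribers without any change to the alarms themselves.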
Commit and push this change, then deploy the monitoring stack with the following command; you now have production-level monitoring in place to alert you of issues:
```shell
aws cloudformation deploy \
  --stack-name=TeamA-Monitoring \
  --template-file=monitoring.template.yml \
  --parameter-overrides \
    AlarmEmail="nodereference@sourceallies.com"
```
Active Monitoring
The monitoring elements above are considered passive checks, as opposed to active checks. Our passive checks monitor real requests as they come in. However, this means that we will not know about a problem with our service until a client is affected. We still need to actively check our service to verify it is healthy, whether or not users/clients are invoking it.
With an active health check, we periodically “probe” upstream resources comprising our service.
We are already using `/hello` as our health check endpoint in our Load Balancer TargetGroup{:target="_blank"}, but we are currently not notified if the endpoint is no longer accessible (except through our passive checks). We can add an active check on the same endpoint so that it is polled periodically and we are notified if it no longer responds with an HTTP 200.
To implement this active check, add one more export to the `Outputs` section in `cloudformation.template.yml` and then commit and push your changes:
```yaml
FullyQualifiedDomainName:
  Value: !Sub '${SubDomain}.${BaseDomain}'
  Export:
    Name: !Sub '${AWS::StackName}:FQDN'
```
Next, add this new parameter to the `Parameters` section in `monitoring.template.yml` in order to specify the route we want to use for our health check:
```yaml
HealthCheckRoute:
  Type: String
  Description: An unauthenticated endpoint for health check purposes. Returns 200 if OK. Example is "/health" or "/hello".
```
Then add a Route53 HealthCheck{:target="_blank"} and a CloudWatch Alarm{:target="_blank"} resource to the `Resources` section in `monitoring.template.yml`:
```yaml
DNSHealthCheck:
  Type: "AWS::Route53::HealthCheck"
  Properties:
    HealthCheckConfig:
      EnableSNI: true
      FailureThreshold: 3
      FullyQualifiedDomainName:
        Fn::ImportValue:
          Fn::Sub: ${ProductServiceStackName}:FQDN
      Inverted: false
      Port: 443
      RequestInterval: 30
      ResourcePath: !Ref HealthCheckRoute
      Type: "HTTPS"
HealthCheckAlarm:
  Type: "AWS::CloudWatch::Alarm"
  Properties:
    AlarmActions:
      - !Ref AlarmTopic
    ComparisonOperator: "LessThanThreshold"
    Dimensions:
      - Name: HealthCheckId
        Value: !Ref DNSHealthCheck
    EvaluationPeriods: 1
    MetricName: "HealthCheckStatus"
    Namespace: "AWS/Route53"
    Period: 60
    Statistic: "Minimum"
    Threshold: 1.0
```
Finally, you can deploy the changes to your monitoring stack with this command:
```shell
aws cloudformation deploy \
  --stack-name=TeamA-Monitoring \
  --template-file=monitoring.template.yml \
  --parameter-overrides \
    HealthCheckRoute="/hello"
```
Once the stack is deployed, you'll receive an email from AWS. You'll need to click on a verification link in order to receive emails from the SNS Topic subscriptions.
You can verify your new active check via the AWS Management Console. Navigate to Route53{:target="_blank"} and then to Health Checks{:target="_blank"} to find your new Health Check.
You can see our template changes here{:target="_blank"}.
Table of Contents
- Introduction
- Unit Testing
- Koa
- Docker
- Cloudformation
- CodePipeline
- Fargate
- Application Load Balancer
- HTTPS/DNS
- Cognito
- Authentication
- DynamoDB
- Put Product
- Validation
- Smoke Testing
- Monitoring (this post)
- List Products
- Get Product
- Patch Product
- History Tracking
- Delete
- Change Events
- Conclusion
If you have questions or feedback on this series, contact the authors at nodereference@sourceallies.com.