PolarSPARC |
AWS Logging and Monitoring - Quick Notes
Bhaskar S | 01/23/2024 |
Amazon CloudWatch
The following is the summary of the various features/capabilities of CloudWatch:
Is used for performance monitoring, triggering alarms, log collection, and automated actions
Can create alarms that watch metrics and send notifications or automatically make changes to the resources that are monitored when a threshold is breached
Can also collect metrics from on-prem systems
One can automate responses to operational changes to improve operational performance and resource optimization (Ex: monitor the CPU usage on EC2 instances and use that metric to determine if additional instances need to be launched to handle increased load)
Derive actionable insights from logs
The following are some of the categories under CloudWatch:
CloudWatch Metrics
Provides various metrics for the various AWS services for monitoring purposes
Services send time-sequenced metric data points
EC2 metrics sent every 5 mins by default
One can enable detailed monitoring (incurs cost) for sending EC2 metrics every 1 min
Unified CloudWatch Agent, when installed, can send system-level metrics (memory and disk usage) from EC2 Instances as well as on-prem servers
Unified CloudWatch Agent can be used to send custom metrics from applications using statsd or collectd protocols
Default EC2 metrics do NOT include memory and disk metrics
One can create Dashboards from the metrics
One can also create custom metrics and publish via CLI or API
One can stream metrics using Kinesis Data Firehose (and from there to other destinations) with near real-time delivery and low latency
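As a minimal sketch of publishing a custom metric via the API, the following builds the keyword arguments expected by the CloudWatch PutMetricData operation. The helper name `build_metric_payload`, the namespace `MyApp/Orders`, and the metric name `PendingOrders` are hypothetical and used only for illustration; the actual send is left as a comment since it assumes boto3 is installed and AWS credentials are configured.

```python
from datetime import datetime, timezone


def build_metric_payload(namespace, name, value, unit="None", dimensions=None):
    """Build keyword arguments shaped for the CloudWatch PutMetricData API."""
    datum = {
        "MetricName": name,
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
    }
    if dimensions:
        # Dimensions are sent as a list of Name/Value pairs
        datum["Dimensions"] = [
            {"Name": k, "Value": v} for k, v in sorted(dimensions.items())
        ]
    return {"Namespace": namespace, "MetricData": [datum]}


payload = build_metric_payload(
    "MyApp/Orders",        # hypothetical namespace
    "PendingOrders",       # hypothetical metric name
    42,
    unit="Count",
    dimensions={"Environment": "dev"},
)
print(payload["Namespace"], payload["MetricData"][0]["MetricName"])

# Assuming boto3 and valid credentials, the payload could be sent with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_data(**payload)
```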
CloudWatch Alarms
Used to initiate actions when a metric breaches some threshold
There are two types - Metric Alarm and Composite Alarm
Metric Alarm means perform one or more actions based on a single metric
Composite Alarm uses a rule expression (with AND and OR conditions) and works on multiple other alarms
Has three states - OK (within threshold), INSUFFICIENT_DATA (not enough data), ALARM (threshold breached)
Supports three types of action targets
EC2 action - Stop, Terminate, Reboot, or Recover the Instance
Auto scaling action
Send notification to SNS
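The three alarm states and the difference between a Metric Alarm and a Composite Alarm can be illustrated with a simplified local model. This is only a sketch: the function names, the "M out of N datapoints" parameters, and the alarm names in the rule are hypothetical, and the real CloudWatch evaluation (in particular its handling of missing data) is more involved.

```python
def metric_alarm_state(datapoints, threshold, evaluation_periods=3, datapoints_to_alarm=2):
    """Simplified evaluation over the most recent periods; None marks a missing datapoint."""
    window = datapoints[-evaluation_periods:]
    present = [d for d in window if d is not None]
    if not present:
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for d in present if d > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


def composite_alarm_state(cpu_state, memory_state, errors_state):
    """Mirrors a rule such as: ALARM(cpu-high) AND (ALARM(memory-high) OR ALARM(errors-high))."""
    in_alarm = cpu_state == "ALARM" and (
        memory_state == "ALARM" or errors_state == "ALARM"
    )
    return "ALARM" if in_alarm else "OK"


cpu = metric_alarm_state([55, 91, 93], threshold=80)
memory = metric_alarm_state([None, None, None], threshold=75)
errors = metric_alarm_state([0, 7, 9], threshold=5)
composite = composite_alarm_state(cpu, memory, errors)
print(cpu, memory, errors, composite)
```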
CloudWatch Logs
Centralized place to collect and store system and application logs
Can define log expiration policies (never expire, 1 day to 10 years, etc)
Are encrypted by default
Can be sent to various destinations - S3 (via export), Kinesis Data Streams, Kinesis Data Firehose, Lambda, OpenSearch
Unified CloudWatch Agent, when installed on an EC2 Instance or on-prem server, can send logs to CloudWatch
Ability to query multiple Log Groups across AWS services and AWS accounts
Subscription Filters allow one to stream the log events in real-time which can be sent to Kinesis Data Streams, Kinesis Data Firehose, or Lambda for further processing
Can be used when we want to move Exabytes of data
Ability to stream log events from across AWS accounts and different Regions
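When a Subscription Filter targets Lambda, the function receives the log events as a base64-encoded, gzip-compressed JSON document under `event["awslogs"]["data"]`. The following sketch decodes such a payload; the sample event is hand-built here for illustration (the log group name and message are made up), not captured from a real account.

```python
import base64
import gzip
import json


def decode_subscription_event(event):
    """Decode the payload that CloudWatch Logs delivers to a Lambda subscriber."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


# Hand-built sample mimicking the delivery format (values are hypothetical)
sample = {
    "messageType": "DATA_MESSAGE",
    "logGroup": "/demo/app",
    "logStream": "stream-1",
    "logEvents": [
        {"id": "1", "timestamp": 1706000000000, "message": "ERROR timeout"},
        {"id": "2", "timestamp": 1706000001000, "message": "INFO ok"},
    ],
}
encoded = base64.b64encode(gzip.compress(json.dumps(sample).encode("utf-8")))
event = {"awslogs": {"data": encoded.decode("ascii")}}

decoded = decode_subscription_event(event)
error_messages = [e["message"] for e in decoded["logEvents"] if "ERROR" in e["message"]]
print(decoded["logGroup"], error_messages)
```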
CloudWatch Events (now Amazon EventBridge)
Stream of system events describing changes to AWS resources which can be used to trigger actions
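Rules select events from the stream using an event pattern. The sketch below is a deliberately simplified local matcher (exact-value matching only), meant to show the idea of a pattern rather than reproduce the service's full matching rules; the `matches` helper and the instance id are hypothetical.

```python
def matches(pattern, event):
    """Simplified matcher: each pattern key must exist; leaf lists hold allowed values."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested event object
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped", "terminated"]},
}

stopped_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}
running_event = {**stopped_event, "detail": {"instance-id": "i-0123456789abcdef0", "state": "running"}}

print(matches(pattern, stopped_event), matches(pattern, running_event))
```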
CloudWatch Insights
Container Insights - Collect, aggregate, summarize metrics and logs from containers (ECS, EKS, K8S on EC2, Fargate)
Lambda Insights - Monitoring and troubleshooting for serverless applications running on AWS Lambda by collecting, aggregating, summarizing system-level metrics as well as diagnostic information on cold starts and shutdowns
Contributor Insights - Analyze log data for creating a time series, which can be used to identify the top-N contributors (services or hosts) impacting the system performance
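The top-N idea behind Contributor Insights can be sketched locally by counting how often each contributor appears in a set of parsed log records. This is only an illustration of the concept; the `top_contributors` helper, the `host` field, and the sample records are hypothetical.

```python
from collections import Counter


def top_contributors(records, key, n=3):
    """Return the n most frequent values of `key` across the records."""
    counts = Counter(r[key] for r in records if key in r)
    return counts.most_common(n)


# Hypothetical parsed log records (e.g. one per slow request)
records = [
    {"host": "web-1", "latency_ms": 900},
    {"host": "web-2", "latency_ms": 850},
    {"host": "web-1", "latency_ms": 1200},
    {"host": "web-3", "latency_ms": 700},
    {"host": "web-1", "latency_ms": 950},
    {"host": "web-2", "latency_ms": 1100},
]

top = top_contributors(records, "host", n=2)
print(top)  # [('web-1', 3), ('web-2', 2)]
```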
AWS CloudTrail
The following is the summary of the various features/capabilities of CloudTrail:
Enables one to perform operational and risk auditing, governance, and compliance of their AWS account
Any API actions taken by a user, role, AWS service, SDK, CLI, or Console are recorded as events
Allows one to figure out who did what, when, and on what resources
It is enabled by default
One can view a history of the management events recorded in an AWS account for the past 90 days
One can move the logs into CloudWatch Logs or S3 bucket if one wants a retention period greater than 90 days
A Trail can be applied to all Regions (default) or a single Region
Events can be triggered based on API calls
Events can also be streamed to CloudWatch Logs
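The "who did what, when, and on what resources" question maps onto a handful of fields in each recorded event. The following sketch summarizes one record from a log file; the record is a hand-written, abbreviated sample (the account id, user name, and instance id are made up), and the `summarize` helper is hypothetical.

```python
import json


def summarize(record):
    """Reduce a CloudTrail record to who / what / when / target."""
    identity = record.get("userIdentity", {})
    return {
        "who": identity.get("arn", identity.get("type", "unknown")),
        "what": f'{record["eventSource"]}:{record["eventName"]}',
        "when": record["eventTime"],
        "target": record.get("requestParameters"),
    }


# Abbreviated, hand-written sample in the shape of a delivered log file
log_file = json.loads("""
{
  "Records": [
    {
      "eventTime": "2024-01-23T10:15:00Z",
      "eventSource": "ec2.amazonaws.com",
      "eventName": "StopInstances",
      "awsRegion": "us-east-1",
      "userIdentity": {"type": "IAMUser", "arn": "arn:aws:iam::111122223333:user/alice"},
      "requestParameters": {"instancesSet": {"items": [{"instanceId": "i-0123456789abcdef0"}]}}
    }
  ]
}
""")

summary = summarize(log_file["Records"][0])
print(summary["who"], summary["what"], summary["when"])
```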
The following are three types of Events:
Management Events
Any management operations (read or write) performed on the resources in an AWS account are captured as events
Examples - launching or terminating EC2 Instances, configuring security, creating subnet, configuring rules for routing in a subnet, etc
Data Events
By default are NOT logged
Examples - S3 object level operations, Lambda function executions, etc
Insights Events
Are NOT enabled by default and one has to pay for the capability
Enables one to detect unusual activity in an AWS account
Examples - inaccurate resource provisioning, hitting service limits, burst of IAM actions, gaps in periodic maintenance activity, etc
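To see how the event types show up in practice, the sketch below groups a few records by their `eventCategory` field and picks out the write-type management events using `readOnly`. The records are hand-written and heavily abbreviated, and the helper names are hypothetical; this assumes records recent enough to carry the `eventCategory` field.

```python
from collections import Counter


def count_by_category(records):
    """Count records per event category (Management, Data, Insight, ...)."""
    return Counter(r.get("eventCategory", "Unknown") for r in records)


def management_writes(records):
    """Names of management events that changed something (readOnly is false)."""
    return [
        r["eventName"]
        for r in records
        if r.get("eventCategory") == "Management" and not r.get("readOnly", False)
    ]


# Hand-written, abbreviated sample records
records = [
    {"eventName": "RunInstances", "eventCategory": "Management", "readOnly": False},
    {"eventName": "DescribeInstances", "eventCategory": "Management", "readOnly": True},
    {"eventName": "CreateSubnet", "eventCategory": "Management", "readOnly": False},
    {"eventName": "GetObject", "eventCategory": "Data", "readOnly": True},
]

counts = count_by_category(records)
writes = management_writes(records)
print(dict(counts), writes)
```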