AWS Logging and Monitoring

The following is the summary of the various features/capabilities of CloudWatch:

Is used for performance monitoring, triggering alarms, log collection, and automated actions
Can create alarms that watch metrics and send notifications or automatically make changes to the resources that are monitored when a threshold is breached
Can also collect metrics from on-prem systems
One can automate responses to operational changes to improve operational performance and resource optimization (Ex: monitor the CPU usage on EC2 instances and use that metric to determine if additional instances need to be launched to handle increased load)
Derive actionable insights from logs
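
As a rough illustration of the CPU example above, the following sketch decides whether to launch more instances from recent CPU readings. The function name, the 70% threshold, and the step size are assumptions made for illustration; in practice this decision is usually delegated to an alarm plus an Auto Scaling policy rather than hand-written code.

```python
from statistics import mean


def instances_to_add(cpu_samples, threshold=70.0, step=1):
    """Return how many instances to launch for the given CPU samples.

    cpu_samples: recent CPU utilization percentages, one per period.
    threshold and step are illustrative values, not AWS defaults.
    """
    if not cpu_samples:
        return 0  # no data points: take no action
    return step if mean(cpu_samples) > threshold else 0


print(instances_to_add([82.0, 91.5, 77.3]))  # average is above the threshold
print(instances_to_add([35.0, 42.1, 50.8]))  # average is below the threshold
```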
The following are some of the categories under CloudWatch:

CloudWatch Metrics
- Provides various metrics for the various AWS services for monitoring purposes
- Services send time sequenced metrics data points
- EC2 metrics sent every 5 mins by default
- One can enable detailed monitoring (incurs cost) for sending EC2 metrics every 1 min
- Unified CloudWatch Agent can send system level metrics (memory and disk usage) for EC2 Instances as well as on-prem servers when installed
- Unified CloudWatch Agent can be used to send custom metrics from applications using statsd or collectd protocols
- Default EC2 metrics do NOT include memory and disk metrics
- One can create Dashboards from the metrics
- One can also create custom metrics and publish via CLI or API
- One can stream metrics using Kinesis Data Firehose (and from there to other destinations) with near real-time delivery and low latency
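
To make the custom metric bullet concrete, here is a minimal sketch that only builds the request payload for publishing one data point. The helper name, the `MyApp` namespace, and the `QueueDepth` metric are hypothetical; the dictionary keys follow the parameters of the `put_metric_data` API.

```python
from datetime import datetime, timezone


def build_put_metric_data_kwargs(namespace, name, value, unit="Count", dimensions=None):
    """Build keyword arguments for a single custom metric data point."""
    dims = [{"Name": k, "Value": v} for k, v in sorted((dimensions or {}).items())]
    return {
        "Namespace": namespace,
        "MetricData": [
            {
                "MetricName": name,
                "Dimensions": dims,
                "Timestamp": datetime.now(timezone.utc),
                "Value": float(value),
                "Unit": unit,
            }
        ],
    }


# Hypothetical namespace, metric name, and dimension, for illustration only.
kwargs = build_put_metric_data_kwargs(
    "MyApp", "QueueDepth", 42, dimensions={"Service": "orders"}
)
print(kwargs["MetricData"][0]["MetricName"])
```

Assuming boto3 is installed and credentials are configured, the payload could then be sent with `boto3.client("cloudwatch").put_metric_data(**kwargs)`.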
CloudWatch Alarms
- Used to initiate actions when a metric breaches some threshold
- There are two types - Metric Alarm and Composite Alarm
- Metric Alarm means perform one or more actions based on a single metric
- Composite Alarm uses a rule expression (with AND and OR conditions) and works on multiple other alarms
- Has three states - OK (within threshold), INSUFFICIENT_DATA (not enough data), ALARM (threshold breached)
- Supports three types of actions
  1. EC2 action (Stop, Terminate, Reboot, or Recover the instance)
  2. Auto scaling action
  3. Send notification to SNS
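
The three alarm states and the composite rule expression can be sketched as follows. This is a simplified model, not how CloudWatch evaluates alarms internally: it assumes a "greater than threshold" comparison, requires every one of the last `periods` data points to breach, and ignores missing-data handling. The function and alarm names are hypothetical.

```python
def evaluate_alarm_state(datapoints, threshold, periods=3):
    """Classify the most recent `periods` data points into an alarm state."""
    recent = datapoints[-periods:]
    if len(recent) < periods:
        return "INSUFFICIENT_DATA"  # not enough data to decide
    return "ALARM" if all(v > threshold for v in recent) else "OK"


def composite_rule(*alarm_names, operator="AND"):
    """Build a composite alarm rule expression from other alarm names."""
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


print(evaluate_alarm_state([85, 90, 95], threshold=80))
print(evaluate_alarm_state([85, 60, 95], threshold=80))
print(evaluate_alarm_state([85], threshold=80))
print(composite_rule("HighCPU", "HighLatency"))
```

The string returned by `composite_rule` has the shape expected by the `AlarmRule` parameter of the `put_composite_alarm` API.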
CloudWatch Logs
- Centralized place to collect and store system and application logs
- Can define log expiration policies (never expire, 1 day to 10 years, etc)
- Are encrypted by default
- Can be sent to various destinations - S3 (via export), Kinesis Data Streams, Kinesis Data Firehose, Lambda, OpenSearch
- Unified CloudWatch Agent when installed on EC2 Instance or on-prem server, can send logs to CloudWatch
- Ability to query multiple Log Groups across AWS services and AWS accounts (via CloudWatch Logs Insights)
- Subscription Filters allow one to stream the log events in real-time which can be sent to Kinesis Data Streams, Kinesis Data Firehose, or Lambda for further processing
- Can be used when we want to move Exabytes of data
- Ability to stream log events from across AWS accounts and different Regions
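
As a sketch of how a Subscription Filter might be set up, the helper below builds the request payload only. The helper name, log group, and ARNs are hypothetical; the keys follow the parameters of the `put_subscription_filter` API, where an empty filter pattern matches every log event and a role ARN is needed for Kinesis destinations but not for Lambda.

```python
def build_subscription_filter_kwargs(
    log_group, filter_name, destination_arn, pattern="", role_arn=None
):
    """Build keyword arguments for a log subscription filter."""
    kwargs = {
        "logGroupName": log_group,
        "filterName": filter_name,
        "filterPattern": pattern,  # "" matches all log events
        "destinationArn": destination_arn,
    }
    if role_arn:
        # Lets CloudWatch Logs write to a Kinesis stream or Firehose
        kwargs["roleArn"] = role_arn
    return kwargs


# Hypothetical log group, account ID, and resource names.
kwargs = build_subscription_filter_kwargs(
    "/app/orders",
    "errors-to-kinesis",
    "arn:aws:kinesis:us-east-1:123456789012:stream/app-logs",
    pattern="ERROR",
    role_arn="arn:aws:iam::123456789012:role/cwl-to-kinesis",
)
print(sorted(kwargs))
```

Assuming boto3 and credentials are available, this could be applied with `boto3.client("logs").put_subscription_filter(**kwargs)`.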
CloudWatch Events (now Amazon EventBridge)
- Stream of system events describing changes to AWS resources which can be used to trigger actions
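
For illustration, an event pattern that selects EC2 instance state changes might look like the one below. The `matches` function is a deliberately simplified stand-in written for this note: it handles only exact-value matching and none of the richer operators (prefix, numeric ranges, and so on) the real service supports. The sample event is abbreviated and its instance ID is made up.

```python
EC2_STOPPED_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped", "terminated"]},
}


def matches(pattern, event):
    """Simplified matcher: every pattern leaf is a list of allowed exact values."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested event object
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


# Abbreviated sample event with a made-up instance ID.
event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}
print(matches(EC2_STOPPED_PATTERN, event))
```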
CloudWatch Insights
- Container Insights - Collect, aggregate, summarize metrics and logs from containers (ECS, EKS, K8S on EC2, Fargate)
- Lambda Insights - Monitoring and troubleshooting for serverless applications running on AWS Lambda by collecting, aggregating, summarizing system-level metrics as well as diagnostic information on cold starts and shutdowns
- Contributor Insights - Analyze log data for creating a time series, which can be used to identify the top-N contributors (services or hosts) impacting the system performance
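
The top-N idea behind Contributor Insights can be sketched in a few lines. This is only a conceptual illustration over an in-memory list of parsed log records, not the service's rule format; the `host` field and record contents are assumptions.

```python
from collections import Counter


def top_contributors(records, key, n=3):
    """Count records per contributor and return the n most frequent."""
    counts = Counter(r[key] for r in records if key in r)
    return counts.most_common(n)


# Hypothetical parsed log records.
records = [
    {"host": "web-1", "status": 500},
    {"host": "web-2", "status": 500},
    {"host": "web-1", "status": 502},
    {"host": "web-3", "status": 500},
    {"host": "web-1", "status": 500},
    {"host": "web-2", "status": 503},
]
print(top_contributors(records, "host", n=2))
```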

The following is the summary of the various features/capabilities of CloudTrail:

Enables one to perform operational and risk auditing, governance, and compliance of their AWS account
Any API actions taken by a user, role, AWS service, SDK, CLI, or Console are recorded as events
Allows one to figure out who did what, when, and on what resources
It is enabled by default
One can view a history of the management events recorded in an AWS account for the past 90 days
One can move the logs into CloudWatch Logs or S3 bucket if one wants a retention period greater than 90 days
A Trail can be applied to all Regions (default) or a single Region
Events can be triggered based on API calls
Events can also be streamed to CloudWatch Logs
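
To show how the "who did what, when, and on what resources" question maps onto an event record, here is a small sketch that pulls those fields out of a parsed record. The field names (`userIdentity`, `eventName`, `eventTime`, `eventSource`, `resources`) come from the CloudTrail record format; the helper name and the sample record are made up and heavily abbreviated.

```python
def summarize_event(record):
    """Extract who / what / when / which resources from one event record."""
    identity = record.get("userIdentity", {})
    return {
        "who": identity.get("arn") or identity.get("type", "unknown"),
        "what": record.get("eventName"),
        "when": record.get("eventTime"),
        "service": record.get("eventSource"),
        "resources": [r.get("ARN") for r in record.get("resources", [])],
    }


# Abbreviated, made-up record for illustration.
record = {
    "eventTime": "2024-01-15T10:30:00Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "TerminateInstances",
    "userIdentity": {
        "type": "IAMUser",
        "arn": "arn:aws:iam::123456789012:user/alice",
    },
}
summary = summarize_event(record)
print(summary["who"], summary["what"], summary["when"])
```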
The following are the three types of Events:

Management Events
- Any management operations (read or write) performed on the resources in an AWS account are captured as events
- Examples - launching or terminating EC2 Instances, configuring security, creating subnet, configuring rules for routing in a subnet, etc
Data Events
- By default are NOT logged
- Examples - S3 object level operations, Lambda function executions, etc
Insights Events
- Are NOT enabled by default and one has to pay for the capability
- Enables one to detect unusual activity in an AWS account
- Examples - inaccurate resource provisioning, hitting service limits, burst of IAM actions, gaps in periodic maintenance activity, etc
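
Since Data Events are not logged by default, they have to be switched on per trail. The sketch below builds an event selector that adds S3 object-level logging for a set of buckets. The helper name and bucket ARN are hypothetical; the keys follow the `EventSelectors` parameter of the `put_event_selectors` API.

```python
def build_event_selectors(bucket_arns, include_management=True, read_write="All"):
    """Build event selectors that log S3 object-level data events."""
    return [
        {
            "ReadWriteType": read_write,  # "ReadOnly", "WriteOnly", or "All"
            "IncludeManagementEvents": include_management,
            "DataResources": [
                {
                    "Type": "AWS::S3::Object",
                    # A trailing slash covers all objects in the bucket
                    "Values": [arn.rstrip("/") + "/" for arn in bucket_arns],
                }
            ],
        }
    ]


# Hypothetical bucket name.
selectors = build_event_selectors(["arn:aws:s3:::my-audit-bucket"])
print(selectors[0]["DataResources"][0]["Values"])
```

Assuming boto3 and credentials are available, and a trail with the hypothetical name `my-trail` exists, this could be applied with `boto3.client("cloudtrail").put_event_selectors(TrailName="my-trail", EventSelectors=selectors)`.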