Monitoring¶

This module includes built-in monitoring designed to meet ISO 27001 compliance requirements.

Lambda Monitoring¶

All Lambda functions use the terraform-aws-lambda-monitored module, which provides:

Error alerting via SNS
Throttle monitoring
CloudWatch log retention (configurable, default 365 days)
Configurable alert strategies

Required Configuration¶

module "actions-runner" {
  # ... other config ...

  # Required: at least one email for alerts
  alarm_emails = ["oncall@example.com", "team@example.com"]
}

Email Confirmation Required

AWS sends confirmation emails to each address. Recipients must click the confirmation link to receive alerts.

Alert Configuration¶

module "actions-runner" {
  # ... other config ...

  alarm_emails = ["oncall@example.com"]

  # Alert when error rate exceeds 10% (default)
  error_rate_threshold = 10.0
}

CloudWatch Metrics¶

Custom Metrics¶

The record_metric Lambda publishes, under the GitHubRunners namespace with an asg_name dimension:

Metric	Description
`BusyRunners`	Number of runners currently executing a job
`IdleRunners`	Number of registered runners waiting for work

AWS Metrics¶

Standard CloudWatch metrics for:

ASG: GroupInServiceInstances, GroupDesiredCapacity
Lambda: Invocations, Errors, Duration, Throttles
EC2: CPUUtilization, StatusCheckFailed

CloudWatch Alarms¶

Autoscaling Alarms¶

Created automatically:

Alarm	Condition	Action
`idle_runners_low`	Idle < target	Scale out
`idle_runners_high`	Idle > target	Scale in

EC2 and ASG Alarms¶

Created automatically, all routed to the module-owned SNS topic (and to any ARNs in alarm_topic_arns):

Alarm	Condition	Why
`CPU Alarm on ASG <name>`	Average CPU > 90% for 1 minute	Runner under sustained load
`ASGAtMax-<name>`	`GroupInServiceInstances >= asg_max_size` for 5 minutes	Saturation — ceiling may need raising
`ASGZeroInService-<name>`	`GroupInServiceInstances == 0` while `GroupDesiredCapacity > 0` for 10 minutes	Catastrophic — ASG wants instances but has none
`WarmPoolEmpty-<name>`	`WarmPoolWarmedCapacity == 0` while `WarmPoolDesiredCapacity > 0` for 10 minutes (when warm pool enabled)	Latency-hiding optimization is off
`ASGLaunchStuck-<name>`	`GroupPendingInstances > 0` sustained >20 minutes	Likely launch failure (LT, capacity, IAM). 20-min threshold sits above Puppet's ~15-min provisioning window.
`RunnerRegistrationGap-<name>`	`GroupInServiceInstances - (BusyRunners + IdleRunners) > 0` for >5 minutes	EC2 is InService but runner never registered with GitHub
`ASGSaturatedAtMax-<name>`	At max size with every runner busy for 10 minutes	Scale-out cannot help; jobs are queueing

Lambda Alarms¶

Each Lambda has:

Alarm	Condition	Action
`*_errors`	Any error (immediate) or error rate > threshold	SNS notification
`*_throttles`	Any throttle	SNS notification
`*_memory`	Memory utilization > 80%	SNS notification

The memory alarm is backed by the LambdaInsights/memory_utilization metric. Lambda Insights is enabled on every function in this module so the alarm can fire before a function runs out of memory — the originally reported incident was a silent OOM in the runner_deregistration sweep that left stale runners in GitHub. If the alarm ever fires, check the function's recent invocations in CloudWatch Logs Insights and either reduce memory pressure in the handler or increase the function's memory_size.

Log Retention¶

All logs are retained in CloudWatch with configurable retention:

module "actions-runner" {
  # ... other config ...

  # Retain logs for 1 year (default: 365)
  cloudwatch_log_group_retention = 365
}

Log Groups Created¶

/aws/lambda/{asg-name}_registration
/aws/lambda/{asg-name}_deregistration
/aws/lambda/{asg-name}_record_metric

Compliance Considerations¶

ISO 27001¶

This module addresses several ISO 27001 controls:

Control	How It's Addressed
A.12.4.1 Event logging	CloudWatch Logs with retention
A.12.4.3 Administrator logs	Lambda execution logs
A.16.1.2 Reporting security events	SNS alerting on errors

SOC 2¶

Relevant for:

CC7.2: Monitoring system components
CC7.3: Evaluating security events

Vanta Integration¶

The module's monitoring setup satisfies Vanta's AWS Lambda checks:

✅ CloudWatch alarms on Lambda errors
✅ Log retention policies configured
✅ Encryption at rest (via AWS-managed keys)

The module always creates its own SNS topic for alarm notifications and subscribes every address in alarm_emails to it. This is the required, load-bearing notification path — every operator gets at least one working alert channel.

Email Alerts¶

module "actions-runner" {
  alarm_emails = [
    "oncall@example.com",
    "team-leads@example.com"
  ]
}

Email Confirmation Required

AWS sends each subscriber a confirmation email. Recipients must click the confirmation link before they will receive alerts.

Fan Out to PagerDuty / Slack / Shared Topics¶

For additional routing, pass one or more existing SNS topic ARNs via alarm_topic_arns. Every alarm this module creates will send to the module-owned topic and to each ARN in the list:

module "actions-runner" {
  alarm_emails     = ["oncall@example.com"]
  alarm_topic_arns = [
    aws_sns_topic.pagerduty_bridge.arn,
    aws_sns_topic.shared_org_alerts.arn,
  ]
}

Subscribing Additional Endpoints to the Module-Owned Topic¶

The module's topic ARN is exposed as an output:

resource "aws_sns_topic_subscription" "slack" {
  topic_arn = module.actions-runner.alarm_topic_arn
  protocol  = "https"
  endpoint  = "https://hooks.slack.com/services/..."
}

CloudWatch Dashboard¶

The module creates a CloudWatch dashboard named after the ASG. It surfaces, in order:

Alarm state for every alarm this module owns.
BusyRunners / IdleRunners and derived utilization.
Fleet size (desired / in-service / min / max) and transient states (pending, terminating, standby).
Warm pool capacity (when warm pool is enabled).
IdleRunners with scale-out/scale-in thresholds annotated, plus autoscaling alarm state.
EC2 CPU (average + p95) and status-check failures.
Lambda lifecycle — invocations / errors / throttles / p95 duration for registration, deregistration, and record_metric.

# URL available as an output
output "runner_dashboard" {
  value = module.actions-runner.dashboard_url
}

Host-Level Alarms (Disk, Memory)¶

Not shipped yet. The runner AMI does not run the CloudWatch agent today, so disk and memory metrics are not available. Tracked in infrahouse/puppet-code#270; once the agent is wired into role::github_runner, a follow-up release of this module will add disk and memory alarms unconditionally.

Debugging¶

Check Lambda Logs¶

# View registration Lambda logs
aws logs tail /aws/lambda/actions-runner-xyz_registration --follow

# View deregistration Lambda logs
aws logs tail /aws/lambda/actions-runner-xyz_deregistration --follow

# View metric Lambda logs
aws logs tail /aws/lambda/actions-runner-xyz_record_metric --follow

Check Runner Status¶

# List runners via GitHub CLI
gh api orgs/{org}/actions/runners --jq '.runners[] | {name, status, busy}'

Check ASG Status¶

# Get ASG instances
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names "$(terraform output -raw autoscaling_group_name)" \
  --query 'AutoScalingGroups[0].Instances'

# Get warm pool instances
aws autoscaling describe-warm-pool \
  --auto-scaling-group-name "$(terraform output -raw autoscaling_group_name)"

Outputs¶

The module provides these monitoring-related outputs:

Output	Description
`autoscaling_group_name`	ASG name for CloudWatch queries
`alarm_topic_arn`	Module-owned SNS topic ARN; subscribe additional endpoints here
`dashboard_name`	Name of the CloudWatch dashboard
`dashboard_url`	Deep link to the CloudWatch dashboard
`deregistration_log_group`	CloudWatch log group for deregistration Lambda