Troubleshooting¶

Common issues and their solutions.

Runner Not Registering¶

Symptoms¶

Instance launches but doesn't appear in GitHub runners list
ASG shows instances in Pending state for a long time

Check 1: Registration Lambda Logs¶

aws logs tail \
  /aws/lambda/$(terraform output -raw autoscaling_group_name)_registration \
  --follow

Look for: - Authentication errors (invalid token/App credentials) - Rate limiting from GitHub API - Network errors reaching GitHub

Check 2: Instance Cloud-init Logs¶

SSH to the instance or use SSM:

# Cloud-init output
cat /var/log/cloud-init-output.log

# Runner service status
systemctl status actions-runner

# Runner logs
journalctl -u actions-runner -f

Check 3: Registration Token¶

# Check if token was stored
aws secretsmanager list-secrets --filter Key="name",Values="actions-runner"

# Verify token content (will be deleted after use)

Common Fixes¶

Invalid GitHub credentials:

# For token auth - verify token has admin:org scope
# For App auth - verify App has Self-hosted runners permission

Network issues:

# Ensure Lambda has internet access
lambda_subnet_ids = data.aws_subnets.private.ids  # Must have NAT

Runner Stuck Terminating¶

Symptoms¶

Instance in Terminating:Wait state
Lifecycle hook not completing

Check: Deregistration Lambda Logs¶

aws logs tail \
  /aws/lambda/$(terraform output -raw autoscaling_group_name)_deregistration \
  --follow

Common Fixes¶

SSM command failing:

# Check SSM agent status on instance
aws ssm describe-instance-information --filters Key=InstanceIds,Values=i-xxx

Job still running: The Lambda waits for allowed_drain_time (default 900s) for jobs to complete.

Force complete lifecycle hook:

aws autoscaling complete-lifecycle-action \
  --lifecycle-action-result CONTINUE \
  --lifecycle-hook-name deregistration \
  --auto-scaling-group-name $(terraform output -raw autoscaling_group_name) \
  --instance-id i-xxx

Scaling Issues¶

Not Scaling Up¶

Check CloudWatch alarm:

aws cloudwatch describe-alarms --alarm-names "idle_runners_low-*"

Check record_metric Lambda:

aws logs tail \
  /aws/lambda/$(terraform output -raw autoscaling_group_name)_record_metric

Verify metric is being published:

aws cloudwatch get-metric-statistics \
  --namespace "InfraHouse/ActionsRunner" \
  --metric-name "IdleRunnersCount" \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 300 \
  --statistics Average

Not Scaling Down¶

Check for orphaned runners:

# List GitHub runners
gh api orgs/{org}/actions/runners --jq '.runners[] | {id, name, status, busy}'

# Compare with ASG instances
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names $(terraform output -raw autoscaling_group_name) \
  --query 'AutoScalingGroups[0].Instances[*].InstanceId'

The deregistration Lambda cleans up orphaned runners on schedule.

Warm Pool Issues¶

Instances Not Entering Warm Pool¶

Check ASG configuration:

aws autoscaling describe-warm-pool \
  --auto-scaling-group-name $(terraform output -raw autoscaling_group_name)

Verify warm pool is enabled:

# Warm pool is disabled when using spot
on_demand_base_capacity = null  # Must be null for warm pool

Instances Not Waking from Warm Pool¶

Check instance state:

aws ec2 describe-instances --instance-ids i-xxx \
  --query 'Reservations[*].Instances[*].[InstanceId,State.Name]'

Hibernated instances should show stopped. If they show running, they're not properly hibernated.

Lambda Errors¶

Throttling¶

Check CloudWatch alarm:

aws cloudwatch describe-alarms --alarm-name-prefix "throttles"

Increase concurrency: The Lambdas use default concurrency. Consider requesting a limit increase if you see frequent throttling.

Timeout¶

Check Lambda configuration:

aws lambda get-function --function-name $(terraform output -raw autoscaling_group_name)_registration \
  --query 'Configuration.Timeout'

Registration and deregistration Lambdas have default timeouts. If GitHub API is slow, you may see timeouts.

Authentication Errors¶

"Bad credentials" from GitHub API¶

For token auth:

# Test token
curl -H "Authorization: token $(aws secretsmanager get-secret-value --secret-id your-secret --query SecretString --output text)" \
  https://api.github.com/orgs/{org}/actions/runners

For App auth:

# Check App installation
gh api /orgs/{org}/installations --jq '.installations[] | {app_id, app_slug}'

Rate Limiting¶

GitHub API limits: - Personal token: 5,000 requests/hour - GitHub App: 5,000 requests/hour per installation

Monitor rate limit:

curl -H "Authorization: token xxx" https://api.github.com/rate_limit

Puppet Failures¶

Check Puppet Logs¶

# On the instance
cat /var/log/puppet.log
journalctl -u puppet

Common Issues¶

Hiera config not found:

# Verify path exists
puppet_hiera_config_path = "/opt/puppet-code/environments/production/hiera.yaml"

Module not found:

# Ensure package is installed
packages = ["infrahouse-puppet-data"]

Jobs Failing with `SetInstanceProtection` Error¶

Symptoms¶

A workflow job fails immediately, before any user code runs, with output like:

A job started hook has been configured by the self-hosted runner administrator
Run '/usr/local/bin/gha_prerun.sh'
2026-01-30 10:31:07,388: ERROR: ... An error occurred (ValidationError) when
calling the SetInstanceProtection operation: The instance i-0f3066deda420ab46
is not in InService or EnteringStandby or Standby.
Error: Process completed with exit code 1.

By the time you investigate, the instance ID in the error message usually no longer exists (describe-instances returns InvalidInstanceID.NotFound).

Root cause¶

Scale-in race. When the ASG picks an idle runner for termination, it sets the instance to Terminating:Wait and fires the deregistration lifecycle hook. The hook's Lambda needs ~10s (SSM round-trip) to actually stop the runner service. During that window, GitHub can dispatch a queued job to the runner's open long-poll. gha_prerun.sh then calls SetInstanceProtection(protect=true) on an already-Terminating:Wait instance, AWS rejects it, and the job exits 1.

The "instance ID doesn't exist" is a timing artifact — the instance terminated cleanly seconds after the error, and EC2 retains terminated records for only ~1 hour.

Fix¶

Upgrade this module to 3.5.0+ and puppet-code to the version that includes PR #265 (actions-runner graceful scale-in). Together they:

Tolerate the Terminating:Wait state in gha_prerun.sh — the job proceeds instead of failing.
Use systemd's graceful-stop to let the in-flight job finish, including gha_postrun.sh.
Complete the lifecycle hook via ExecStopPost once the runner exits.
Heartbeat the hook every 10min so long jobs don't time out.

How to confirm you're affected¶

Look for failed workflow jobs whose logs contain the error above — that's the user-visible symptom.

As a cross-check, CloudTrail records every failed SetInstanceProtection call. Each failure in CloudTrail that has a corresponding failed job at the same timestamp is a race hit:

aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=SetInstanceProtection \
  --query 'Events[?ErrorCode!=`null`].[EventTime,CloudTrailEvent]' \
  --output text

Or via Athena over CloudTrail logs:

SELECT eventtime, requestparameters, errormessage
FROM cloudtrail_logs.events
WHERE eventname = 'SetInstanceProtection'
  AND errorcode IS NOT NULL
ORDER BY eventtime DESC;

How to verify the fix is working¶

The fix does not stop SetInstanceProtection from failing — AWS will still reject the call on a Terminating:Wait instance, and those rejections still appear in CloudTrail at roughly the same rate as before. What changes is that the failure no longer kills the job.

After upgrading, expect:

Failed jobs with this error go to zero. GitHub workflow runs succeed even when they land in the race window.
CloudTrail failures continue. Don't be alarmed — they confirm the race is still occurring, just that we're surviving it.

On a runner that accepts a job during a scale-in window, the host-side flow is:

gha_prerun.sh tries SetInstanceProtection(protect=true) — AWS rejects — script logs skipping protect, job will proceed and exits 0.
Job runs to completion and gha_postrun.sh fires normally.
actions-runner.service exits via SIGTERM (graceful).
ExecStopPost=/usr/local/bin/gha-on-runner-exit.sh calls ih-aws autoscaling complete --hook deregistration --result CONTINUE.
Instance terminates.

Finding jobs a specific runner executed¶

If you observed a runner get scaled in (say ip-10-1-0-5) and want to confirm the jobs it executed completed successfully rather than failed, query GitHub for every workflow job with that runner_name. There's no org-wide "jobs for runner X" endpoint, so you iterate repos / runs. Scope with created=>=... to avoid walking historical runs.

RUNNER=ip-10-1-0-5
ORG=your-org
SINCE=2026-04-01

gh repo list "$ORG" --limit 1000 --no-archived \
  --json nameWithOwner -q '.[].nameWithOwner' |
while read repo; do
  gh api --paginate \
    "/repos/${repo}/actions/runs?per_page=100&created=>=${SINCE}" \
    --jq '.workflow_runs[].id' 2>/dev/null |
  while read run_id; do
    gh api "/repos/${repo}/actions/runs/${run_id}/jobs" \
      --jq ".jobs[] | select(.runner_name==\"${RUNNER}\") |
            {repo:\"${repo}\", id, name, conclusion, started_at, html_url}" \
      2>/dev/null
  done
done

Each line of output is a JSON object with the job's conclusion (success / failure / cancelled) and URL. The jobs survive in GitHub's data even after the runner's registration is removed from the org.

For a single-repo scope, drop the outer gh repo list loop and set repo directly.

See issue #81 for the full history.

Getting Help¶

Collect Debug Information¶

# ASG status
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names $(terraform output -raw autoscaling_group_name)

# Recent Lambda logs
aws logs filter-log-events \
  --log-group-name /aws/lambda/$(terraform output -raw autoscaling_group_name)_registration \
  --start-time $(date -u -d '1 hour ago' +%s000)

# GitHub runner status
gh api orgs/{org}/actions/runners

Open an Issue ¶

Include:

Module version
Terraform plan/apply output
Lambda logs
Instance cloud-init logs (if applicable)

Troubleshooting¶

Runner Not Registering¶

Symptoms¶

Check 1: Registration Lambda Logs¶

Check 2: Instance Cloud-init Logs¶

Check 3: Registration Token¶

Common Fixes¶

Runner Stuck Terminating¶

Symptoms¶

Check: Deregistration Lambda Logs¶

Common Fixes¶

Scaling Issues¶

Not Scaling Up¶

Not Scaling Down¶

Warm Pool Issues¶

Instances Not Entering Warm Pool¶

Instances Not Waking from Warm Pool¶

Lambda Errors¶

Throttling¶

Timeout¶

Authentication Errors¶

"Bad credentials" from GitHub API¶

Rate Limiting¶

Puppet Failures¶

Check Puppet Logs¶

Common Issues¶

Jobs Failing with SetInstanceProtection Error¶

Symptoms¶

Root cause¶

Fix¶

How to confirm you're affected¶

How to verify the fix is working¶

Finding jobs a specific runner executed¶

Getting Help¶

Collect Debug Information¶

Open an Issue¶

Jobs Failing with `SetInstanceProtection` Error¶

Open an Issue ¶