Troubleshooting¶
This page covers common issues, failure scenarios, and restore procedures.
Alarms and Their Meaning¶
backup_failure fires¶
Symptom: SNS email says "One or more GitHub repositories failed to back up."
What it means: The task ran but the backup script published a BackupFailure metric — at least one repo failed to clone, bundle, or upload.
Diagnosis:
LOG_GROUP="$(terraform output -raw log_group_name)"
aws logs tail "$LOG_GROUP" --since 1d --filter-pattern 'ERROR'
Common causes:
- GitHub App permissions were reduced after install (the task can see the repo but can't read Contents)
- A repo exceeded task_ephemeral_storage_gb (mirror + bundle didn't fit on disk)
- Transient GitHub API rate limit or outage
- S3 bucket permissions changed out-of-band
Fixes:
- For disk space: raise task_ephemeral_storage_gb.
- For permissions: verify the GitHub App has Contents: Read-only and Metadata: Read-only.
- For transient failures: the next scheduled run usually recovers without intervention.
task_not_running fires¶
Symptom: SNS email says "GitHub backup task has not run in the last 24 hours."
What it means: No BackupSuccess metric was published during the evaluation period. Either EventBridge didn't fire, ECS failed to start the task, or the container crashed before emitting metrics.
This alarm uses treat_missing_data = breaching, so a task that can't even start will trigger it.
Diagnosis:
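# Did EventBridge actually invoke RunTask?
# Note: date -v-2d is BSD/macOS syntax. On GNU/Linux, use
# date -u -d '2 days ago' +%Y-%m-%dT%H:%M:%SZ for --start-time instead.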
aws cloudwatch get-metric-statistics \
--namespace AWS/Events \
--metric-name Invocations \
--dimensions Name=RuleName,Value="$(terraform output -raw service_name)-schedule" \
--start-time "$(date -u -v-2d +%Y-%m-%dT%H:%M:%SZ)" \
--end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
--period 86400 --statistics Sum
# Did any tasks actually start?
CLUSTER="$(terraform output -raw ecs_cluster_name)"
aws ecs list-tasks --cluster "$CLUSTER" --desired-status STOPPED --max-results 10
Common causes:
- The GitHub App PEM was never written to Secrets Manager (first-deploy mistake)
- The private key in Secrets Manager is malformed / not a valid PEM
- Subnets lost internet access (NAT gateway removed, route table changed)
- The image tag moved and introduced a regression (pin to a SHA in image_uri)
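Most of the causes above show up in the stop reason that ECS records for the task. The following is a minimal sketch for reading it, not part of the module itself: the function name show_stopped_tasks is hypothetical, and it assumes the AWS CLI is configured for the account that hosts the cluster. ECS keeps stopped tasks visible only for a limited time, so run it soon after the alarm fires.

```shell
# Print stop code, stop reason, and container exit code for stopped tasks.
# $1: ECS cluster name (for example, from `terraform output -raw ecs_cluster_name`)
show_stopped_tasks() (
  cluster="$1"
  arns="$(aws ecs list-tasks --cluster "$cluster" --desired-status STOPPED \
    --query 'taskArns' --output text)"
  if [ -z "$arns" ] || [ "$arns" = "None" ]; then
    echo "No stopped tasks found in $cluster"
  else
    # $arns is intentionally unquoted: describe-tasks takes one argument per ARN.
    aws ecs describe-tasks --cluster "$cluster" --tasks $arns \
      --query 'tasks[].{task:taskArn,stopCode:stopCode,reason:stoppedReason,exit:containers[0].exitCode}' \
      --output table
  fi
)

# Usage:
#   show_stopped_tasks "$(terraform output -raw ecs_cluster_name)"
```

A stop reason that mentions the secret usually points at the PEM causes, while an image pull or network timeout usually points at the subnet cause.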
Restore Procedures¶
Git bundles are portable and self-contained. Each bundle is a full mirror at the time of backup — all branches, tags, and history.
Restore a single repository¶
# Download the bundle from S3
aws s3 cp s3://BUCKET/github-backup/2026-04-16/your-org/repo.bundle repo.bundle
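# Verify the bundle. Recent Git versions refuse to verify outside a repository;
# if you see "need a repository to verify a bundle", run it from any Git repo,
# e.g. git init -q scratch && git -C scratch bundle verify ../repo.bundle
git bundle verify repo.bundle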
# Clone from the bundle
git clone repo.bundle repo-restored
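# Push back to GitHub (if recreating the repo). Use a mirror clone for this:
# a plain clone keeps only one local branch, so push --mirror from it would
# not recreate the other branches.
git clone --mirror repo.bundle repo-mirror.git
cd repo-mirror.git
git remote set-url origin git@github.com:your-org/repo.git
git push --mirror origin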
Restore from the replica region¶
If the primary region is unavailable:
aws s3 cp s3://BUCKET-replica/github-backup/2026-04-16/your-org/repo.bundle repo.bundle \
--region REPLICA_REGION
# Continue with the standard restore steps above
Restore every repo from a specific date¶
# List available backup dates
aws s3 ls s3://BUCKET/github-backup/
# Download every bundle from one date
aws s3 cp s3://BUCKET/github-backup/2026-04-16/ ./restore/ --recursive
# Inspect the manifest
cat restore/manifest.json
# Clone each bundle
for bundle in restore/your-org/*.bundle; do
  repo_name="$(basename "$bundle" .bundle)"
  git clone "$bundle" "restored/$repo_name"
done
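Before cloning a whole day's backup, it can help to verify every bundle first so that a corrupt one is caught early. The sketch below assumes the restore/ORG/REPO.bundle layout used above; the function name verify_bundles is hypothetical. It runs the check from a scratch repository because recent Git versions need one for git bundle verify.

```shell
# Verify every bundle under DIR/<org>/; exits non-zero if any bundle fails.
# $1: directory that holds the downloaded backup (for example, ./restore)
verify_bundles() (
  dir="$1"
  failed=0
  scratch="$(mktemp -d)"
  git init -q "$scratch"
  for bundle in "$dir"/*/*.bundle; do
    [ -e "$bundle" ] || continue   # glob matched nothing
    # -C changes directory, so hand Git an absolute path to the bundle.
    abs="$(cd "$(dirname "$bundle")" && pwd)/$(basename "$bundle")"
    if git -C "$scratch" bundle verify "$abs" >/dev/null 2>&1; then
      echo "OK   $bundle"
    else
      echo "FAIL $bundle"
      failed=1
    fi
  done
  rm -rf "$scratch"
  [ "$failed" -eq 0 ]
)

# Usage:
#   verify_bundles ./restore || echo "at least one bundle is corrupt"
```

Any bundle reported as FAIL can be replaced with the copy from an earlier date prefix.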
Attach a bundle as a remote on an existing clone¶
Useful when you just want to pull objects from the backup without re-cloning:
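A bundle file can be used as a remote URL, so the existing clone can fetch from it directly. This is a minimal sketch: the remote name backup and the function name attach_bundle_remote are illustrative choices, not names used by the backup tooling.

```shell
# Add the bundle as a remote named "backup" and fetch its branches and tags.
# $1: path to the existing clone
# $2: absolute path to the downloaded bundle file
attach_bundle_remote() {
  git -C "$1" remote add backup "$2" &&
    git -C "$1" fetch backup
}

# Usage:
#   attach_bundle_remote ~/src/repo "$PWD/repo.bundle"
#   git -C ~/src/repo branch -r --list 'backup/*'
```

After the fetch, the backed-up branches appear as backup/BRANCH and can be inspected, diffed, or checked out. Run git remote remove backup when finished.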
Failure Scenarios Reference¶
| Scenario | Detection | Recovery |
|---|---|---|
| Single repo fails to back up | backup_failure alarm | Check logs; fix disk/permissions; rerun task manually. |
| Task does not run at all | task_not_running alarm | Verify EventBridge rule + Secrets Manager has a valid PEM. |
| Primary region outage | AWS status / client-side 5xx | Restore from replica bucket in replica_region. |
| GitHub App key compromised | Out-of-band | Revoke in App settings, rotate PEM, put-secret-value the new one. |
| Accidental S3 delete | S3 object missing | Use S3 versioning: remove the delete marker, or copy the prior version back into place. |
| Backup corruption | git bundle verify fails | Restore from an earlier day's backup (daily prefixes + versioning). |
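For the accidental-delete row, a versioned bucket records the delete as a delete marker, and removing that marker makes the previous version current again. The following is a sketch under stated assumptions: the function name undelete_s3_object is hypothetical, the caller needs permission to list and delete object versions, and the key must not contain a single quote because it is embedded in the query.

```shell
# Remove the current delete marker from one object in a versioned bucket.
# $1: bucket name   $2: full object key
undelete_s3_object() (
  bucket="$1"
  key="$2"
  marker="$(aws s3api list-object-versions --bucket "$bucket" --prefix "$key" \
    --query "DeleteMarkers[?IsLatest && Key=='${key}'].VersionId | [0]" \
    --output text)"
  if [ -z "$marker" ] || [ "$marker" = "None" ]; then
    echo "No current delete marker on s3://$bucket/$key" >&2
    exit 1
  fi
  # Deleting the marker itself (by version ID) restores the prior version.
  aws s3api delete-object --bucket "$bucket" --key "$key" --version-id "$marker"
)

# Usage:
#   undelete_s3_object BUCKET github-backup/2026-04-16/your-org/repo.bundle
```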
Running a Backup On Demand¶
Useful for verifying a fix without waiting for the next scheduled run:
CLUSTER="$(terraform output -raw ecs_cluster_name)"
TASK_DEF="$(terraform output -raw task_definition_arn)"
SG="$(terraform output -raw security_group_id)"
SUBNETS='["subnet-abc123","subnet-def456"]'
aws ecs run-task \
--cluster "$CLUSTER" \
--task-definition "$TASK_DEF" \
--launch-type FARGATE \
--network-configuration "awsvpcConfiguration={subnets=${SUBNETS},securityGroups=[\"$SG\"],assignPublicIp=DISABLED}"
Tail the logs:
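A minimal sketch that follows the task's log group while the on-demand run is in progress. It reuses the log_group_name output from the diagnosis step earlier on this page; the function name tail_backup_logs and the 15-minute default window are illustrative.

```shell
# Stream new log events until interrupted with Ctrl-C.
# $1: CloudWatch log group name
# $2: how far back to start (optional, defaults to 15m)
tail_backup_logs() {
  aws logs tail "$1" --follow --since "${2:-15m}"
}

# Usage:
#   tail_backup_logs "$(terraform output -raw log_group_name)"
```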
Re-Populating a Wiped Secret¶
If the Secrets Manager secret is emptied or the secret value is deleted:
aws secretsmanager put-secret-value \
--secret-id "$(terraform output -raw github_app_key_secret_arn)" \
--secret-string file://github-app.pem
The role performing this call must be listed in github_app_key_secret_writers (or be the Terraform admin). Until the secret is populated, the task fails immediately on startup.
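Because a malformed key is one of the listed causes of startup failure, it can be worth checking the file locally before uploading it. This sketch assumes openssl is installed and relies on GitHub App private keys being RSA keys in PEM format; the function name check_pem is hypothetical.

```shell
# Succeeds only if the file parses as a consistent RSA private key.
# $1: path to the PEM file
check_pem() {
  # -passin pass: supplies an empty passphrase so openssl never prompts.
  openssl rsa -in "$1" -check -noout -passin pass: >/dev/null 2>&1
}

# Usage:
#   check_pem github-app.pem && echo "PEM looks valid" || echo "PEM is not a valid RSA key"
```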