Architecture¶
How It Works¶
On every EventBridge tick the module runs a one-shot Fargate task that backs up every GitHub repo the App can see, then exits.
Runtime flow (per invocation)¶
- EventBridge fires per
schedule_expressionand invokesRunTaskon the ECS cluster. - Fargate task starts in the customer VPC and reads the GitHub App PEM from Secrets Manager.
- The container mints a signed JWT, exchanges it for a short-lived GitHub installation token, and refreshes that token as it ages (
TokenManager). - It lists every repository the App has access to via the GitHub REST API.
- For each repo it runs
git clone --mirrorinto a temporary directory on the task's ephemeral storage (credentials supplied viaGIT_ASKPASSso the token never appears in the process table or shell history). The mirror is intermediate — it is discarded after step 6. - It runs
git bundle create <repo>.bundle --allagainst the mirror to produce the single self-contained file that actually gets uploaded. - It uploads each bundle to the S3 primary bucket under
github-backup/<YYYY-MM-DD>/<org>/. - It writes a
manifest.jsonwith repo metadata (name, default branch, SHA, size). - It emits
BackupSuccess/BackupFailureCloudWatch metrics under namespaceGitHubBackup. - S3 Cross-Region Replication asynchronously copies the new objects to the replica bucket.
- Any failure crashes the container (non-zero exit). Alarms fire on failure or on missing success metrics.
Data layout in S3¶
Bundles are immutable once uploaded. S3 versioning + lifecycle (backup_retention_days) controls how long history is retained.
Components¶
| Component | Purpose |
|---|---|
ECS Cluster (aws_ecs_cluster.backup) | Fargate cluster, Container Insights enabled. |
ECS Task Definition (aws_ecs_task_definition.backup) | Runs container/backup.py on Fargate. |
EventBridge Rule (aws_cloudwatch_event_rule.backup) | Scheduler that triggers the task. |
S3 Primary Bucket (via infrahouse/s3-bucket/aws) | Stores bundles + manifest. Versioned. |
S3 Replica Bucket (aws_s3_bucket.replica) | CRR target in replica_region. |
| Replication IAM role | Assumed by S3 service for cross-region replication. |
Secrets Manager Secret (via infrahouse/secret/aws) | Holds the GitHub App PEM. |
| Task IAM role | Read secret, write S3, publish CloudWatch metrics. |
| Execution IAM role | Pull image from ECR, write logs. |
| EventBridge IAM role | Allows EventBridge to call ecs:RunTask + pass roles. |
| Security Group | Egress-only; locks the task down to outbound traffic. |
| CloudWatch Log Groups | /ecs/<service> (task stdout) + /aws/ecs/containerinsights/<cluster>/performance. |
| CloudWatch Alarms | backup_failure, task_not_running (treat_missing_data=breaching). |
| SNS Topic + Subscriptions | Email delivery for alarms. |
Key Design Decisions¶
- Fargate, not Lambda — no 15-minute timeout, no always-on EC2. You pay for the backup window only. Large organizations with many repos don't hit wall-clock limits.
- Customer-owned GitHub App — no shared credentials from InfraHouse. Installation tokens expire in an hour, limiting blast radius if a bundle is leaked with the token still in memory.
- InfraHouse-published container image on public ECR — no per-customer build pipeline needed, but
image_urican be pinned to a SHA for reproducibility. - S3 with versioning + lifecycle — object versioning protects against deletion/overwrite; lifecycle handles long-term retention.
- Cross-region replication via AWS provider v6 per-resource
region— deliberately avoids aliased providers (provider = aws.replica) because consumers don't want to pass a second provider. Each replica resource setsregion = var.replica_regiondirectly. - Module-managed secret (
infrahouse/secret/aws) — fine-grained resource policy separates admin / writers / readers, audit-friendly via CloudTrail. git bundleover tarballs — bundles are a native git format; restore isgit clone repo.bundlewith no extra tooling, and they preserve full history, branches, and tags.- GIT_ASKPASS for credentials — the installation token never lands on the command line or in the shell history, which tarball-based approaches often leak.
Disaster Recovery¶
RPO and RTO¶
| Metric | Value | Notes |
|---|---|---|
| RPO | Up to the schedule interval (default: 24h) | Worst case is one full schedule_expression interval. |
| RTO | Minutes per repository | Single-repo restore is seconds. Full-org scales with repo count/size. |
Backup Storage¶
Backups live in two buckets:
- Primary — in the region where the module is deployed (where the Fargate task runs).
- Replica — in
replica_region, synchronized via S3 Cross-Region Replication.
Both are versioned, so even if a backup is overwritten or deleted, previous versions are retained per the lifecycle policy.
Failure Modes the Alarms Catch¶
| Alarm | Metric | Condition | Catches |
|---|---|---|---|
backup_failure | GitHubBackup/BackupFailure | sum > 0 / 24h | Repo failed to clone/bundle/upload. |
task_not_running | GitHubBackup/BackupSuccess | count < 1 / 24h, missing=breach | Task never ran. |
See Troubleshooting for restore procedures and recovery steps.