Session 4: Infrastructure & Operations
Goal: Confirm infrastructure, deployment, observability, and operational runbooks are production-ready. Estimated time: 60 min
Reading Order
Core Architecture
| # | File | Lines | What It Covers |
|---|---|---|---|
| 1 | infrastructure.md | 179 | Full deployment architecture: Terraform, Docker Swarm, backup/DR. |
| 2 | observability.md | 300 | LGTM stack, instrumentation, PII redaction, SLIs/SLOs. |
| 3 | infrastructure/README.md | ~60 | Directory layout for docker/, terraform/, scripts/. |
| 4 | ssh-access-guide.md | ~40 | Server access instructions. |
CI/CD & Alerting
| # | File | Lines | What It Covers |
|---|---|---|---|
| 5 | ci-cd.md | 213 | Full CI/CD pipeline spec, security tooling, quality gates. |
| 6 | observability-alerts.md | 143 | Alert definitions and triage procedures. |
Runbooks
| # | File | Lines | What It Covers |
|---|---|---|---|
| 7 | backup-verify-runbook.md | ~50 | Backup integrity verification procedure. |
| 8 | secrets-rotation-runbook.md | ~60 | Docker Swarm secret rotation procedure. |
| 9 | manager-failover-runbook.md | ~80 | Swarm manager node failover procedure. |
| 10 | postgres-cutover.md | 129 | Ubicloud PostgreSQL migration procedure. |
| 11 | legal-hold-runbook.md | ~50 | GDPR legal hold procedure. |
| 12 | observability-terraform-migration.md | ~100 | Observability stack Terraform migration. |
What to Validate
- [ ] `infrastructure.md` has a duplicated line about "Edge Routing" (lines 20–21). Confirm and fix.
- [ ] The Ubicloud cutover section is extremely detailed. Is this still the plan, or has anything changed?
- [ ] `observability.md` references `promtail` in the preview section but the `Loki Docker logging driver` in the retention section. Which is actually deployed?
- [ ] Runbooks: are they executable as written? Could you follow each one step-by-step during a 3 AM incident?
- [ ] `ci-cd.md` §3 lists several security mitigations as "⚠️ TODO". Are any resolved?
- [ ] `observability.md` SLO definitions: are these still the right targets?
- [ ] Production container resource limits: `infrastructure.md` §5 flags this as mandatory but TODOS.md says none are set. Still true?
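Two of the checks above (the duplicated "Edge Routing" line and the open "TODO" mitigations) are mechanical enough to script. The following is a minimal sketch, not existing project tooling: the function names `adjacent_duplicates` and `todo_lines` are made up for illustration, and it assumes the docs are plain UTF-8 Markdown files passed on the command line.

```python
import re
from typing import List, Tuple


def adjacent_duplicates(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) for each non-blank line that repeats the line directly above it."""
    lines = text.splitlines()
    hits = []
    for i in range(1, len(lines)):
        current = lines[i].strip()
        # Blank lines are skipped so consecutive empty lines are not reported.
        if current and current == lines[i - 1].strip():
            hits.append((i + 1, current))  # 1-based number of the second copy
    return hits


def todo_lines(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) for each line containing the word TODO."""
    pattern = re.compile(r"\bTODO\b")
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if pattern.search(line)
    ]


if __name__ == "__main__":
    import sys
    from pathlib import Path

    # Hypothetical usage: pass the Markdown files to scan as arguments.
    for name in sys.argv[1:]:
        body = Path(name).read_text(encoding="utf-8")
        for number, line in adjacent_duplicates(body):
            print(f"{name}:{number}: duplicate of previous line: {line}")
        for number, line in todo_lines(body):
            print(f"{name}:{number}: open TODO: {line}")
```

Saved under any name you like, it could be pointed at `infrastructure.md` and `ci-cd.md`. It only narrows down where to look; whether a flagged line is a real duplicate or a resolved TODO still needs a human read.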
🗺️ Mindmap & Path Optimization
Grab your pen and paper:
- [ ] Map the Incident Response Path: A Grafana alert fires for "High API Latency". Trace the documentation path.
  - Observation: Where do you go first? `observability-alerts.md` → `observability.md` → `infrastructure.md`?
  - Optimization: The incident response flow is highly fragmented across 6 runbooks and 2 reference docs. Can we create a single "Incident Root Diagram" that points directly to the right runbook?
- [ ] Map the Deployment Pipeline: How does code go from a PR to a live preview environment?
  - Observation: `ci-cd.md` is 213 lines long.
  - Optimization: Would a visual GitHub Actions dependency diagram save newcomers hours of reading?
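If the "Incident Root Diagram" idea gets traction, its data could start as a plain lookup from alert name to an ordered reading path. The sketch below is purely illustrative: `INCIDENT_INDEX`, `DEFAULT_PATH`, `docs_for_alert`, and `render_path` are hypothetical names. The single entry is the path floated in the exercise above, which is a guess to confirm, and the fallback to `observability-alerts.md` is an assumption.

```python
from typing import Dict, List

# Placeholder index. The one entry is the path guessed at in the exercise,
# not an agreed procedure; real entries would come out of the mapping session.
INCIDENT_INDEX: Dict[str, List[str]] = {
    "High API Latency": [
        "observability-alerts.md",
        "observability.md",
        "infrastructure.md",
    ],
}

# Assumed fallback: start from the alert definitions and triage document.
DEFAULT_PATH: List[str] = ["observability-alerts.md"]


def docs_for_alert(alert_name: str) -> List[str]:
    """Return the ordered reading path for an alert, or the fallback path."""
    return list(INCIDENT_INDEX.get(alert_name, DEFAULT_PATH))


def render_path(alert_name: str) -> str:
    """Format the reading path as a single line for a diagram or chat message."""
    return f"{alert_name}: " + " -> ".join(docs_for_alert(alert_name))


if __name__ == "__main__":
    print(render_path("High API Latency"))
```

Keeping the index as data makes it cheap to review in a PR and to render into whatever diagram format the team picks later.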
Findings & Actions
| Severity | File / Topic | Issue & Optimization Potential | Action Required |
|---|---|---|---|