CI/CD
A robust CI/CD pipeline is the most reliable way to keep a self-hosted Docker Swarm architecture from becoming an operational nightmare. Automating deployments and infrastructure changes removes the human error that comes with SSHing into servers to manually pull images or update routing rules.
1. IaaS CI/CD Pipeline (Terraform)
Your physical and managed infrastructure should be treated as code. You will use a dedicated GitHub repository (or a heavily isolated folder in your monorepo) for your Terraform configurations.
Setup Requirements:
- Remote State Storage: Remote S3-compatible backend (Hetzner Object Storage) for state management with CI/CD support. Requires S3_ACCESS_KEY and S3_SECRET_KEY in GitHub Secrets.
- GitHub Secrets: You need to securely store your `HETZNER_API_TOKEN` in GitHub.
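The secret names above are not the names Terraform looks for: the S3 backend reads `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, and the hcloud provider reads `HCLOUD_TOKEN`. A minimal sketch of the mapping, assuming the workflow exposes each GitHub Secret to the step as an environment variable of the same name (the function name is hypothetical):

```shell
# Map the GitHub Secrets named in this document onto the environment
# variables that Terraform's S3 backend and the hcloud provider read.
# Assumption: each secret is exposed to this step under its own name.
export_terraform_credentials() {
  # Abort with a clear message if a secret was not injected.
  : "${S3_ACCESS_KEY:?S3_ACCESS_KEY is not set}"
  : "${S3_SECRET_KEY:?S3_SECRET_KEY is not set}"
  : "${HETZNER_API_TOKEN:?HETZNER_API_TOKEN is not set}"

  export AWS_ACCESS_KEY_ID="$S3_ACCESS_KEY"
  export AWS_SECRET_ACCESS_KEY="$S3_SECRET_KEY"
  export HCLOUD_TOKEN="$HETZNER_API_TOKEN"
}
```

A step would typically call `export_terraform_credentials` and then `terraform init -input=false`, so that the backend can authenticate before any plan or apply runs.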
Planned Automations:
- On Pull Request: GitHub Actions runs `terraform fmt` (formatting), `terraform validate` (syntax checking), and `terraform plan`. The action comments the execution plan directly on the PR, showing exactly which servers or firewalls will be created, modified, or destroyed.
- On Merge to Main: GitHub Actions runs `terraform apply -auto-approve`, executing the changes on Hetzner and AWS immediately.
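The two triggers can be sketched as one dispatcher that maps an event name to its Terraform commands. This is an illustration rather than the actual workflow: the function names are hypothetical, routing on a plain string argument is an assumption, and the `-check`, `-recursive`, `-input=false`, and `-no-color` flags are additions that suit non-interactive CI runs.

```shell
# Print the Terraform commands for a given trigger, one per line.
# "pull_request" and "push" mirror GitHub's event names.
terraform_steps() {
  case "$1" in
    pull_request)
      echo "terraform fmt -check -recursive"
      echo "terraform validate"
      echo "terraform plan -input=false -no-color"
      ;;
    push)
      echo "terraform apply -input=false -auto-approve"
      ;;
    *)
      echo "unsupported event: $1" >&2
      return 1
      ;;
  esac
}

# Run each command in order, stopping at the first failure.
run_terraform_steps() {
  steps=$(terraform_steps "$1") || return 1
  printf '%s\n' "$steps" | while IFS= read -r step; do
    echo "+ $step"
    # Unquoted on purpose: split the line into command and arguments.
    $step || exit 1
  done
}
```

Keeping the command list separate from the runner makes it possible to print the steps without Terraform installed.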
Continuous Manual Tasks:
- Reviewing the Plan: A human must always review the `terraform plan` output on the PR before merging. Terraform is powerful; a typo could accidentally instruct it to delete your production managed database.
- Major Architecture Shifts: Moving from Hetzner to AWS, or adding a completely new managed service cluster, will require manual research and writing new Terraform modules.
2. Application CI/CD Pipeline (GitHub Actions + Docker Swarm)
This pipeline handles your code (Vue.js, Nest.js), your GraphQL engine configurations (Hasura), and the routing (Traefik).
Setup Requirements:
- Container Registry: You will use GitHub Container Registry (GHCR) to store your compiled Docker images.
- Swarm Access: You must generate a dedicated SSH key pair. The private key goes into GitHub Secrets, allowing the GitHub Actions runner to securely SSH into your Hetzner Swarm Manager node to execute deploy commands.
- Hasura CLI: Your pipeline needs the Hasura CLI installed to automate database migrations and metadata syncing.
Planned Automations (Per Service):
| Service | CI/CD Automation Flow |
|---|---|
| Vue.js (Frontend) | PR: Run ESLint, run unit tests. Build test image. Merge: Build production static files, bake them into an Nginx Docker image. Push to GHCR with the Git SHA as the tag. Update the Swarm service (`docker service update --image ghcr.io/org/vue-app:SHA frontend`). |
| Nest.js (API & Worker) | PR: Run ESLint, Jest unit tests, and e2e tests. Merge: Build the Node.js Docker image. Push to GHCR. Update both the api and api-worker Swarm services sequentially to ensure zero downtime. |
| Hasura & Postgres | PR: Spin up a temporary Postgres/Hasura container in the GitHub runner, apply migrations, and run schema tests. Merge: Before updating the frontend or backend, the action uses the Hasura CLI to apply SQL migrations to the Hetzner Managed Postgres and applies the Hasura Metadata (roles, permissions, event triggers). |
| Nhost (Auth/Storage) | These are pre-built images. Your automation only triggers if you change their configuration files or bump the version tag in your docker-compose.yml. |
| Preview Environments | PR Opened: Action builds images, SSHs into the Swarm Manager, and runs `docker stack deploy -c preview.yml pr-123`. Traefik dynamically routes a URL (e.g., pr-123.preview.yoursaas.com). Action posts the link to the PR. PR Closed: Action runs `docker stack rm pr-123` to free up Hetzner resources. |
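The merge-time ordering in the table (migrations and metadata first, then services one at a time) could look roughly like the sketch below when run on the Swarm manager. Several names are assumptions: the `nest-api` image name, the `default` database name, the `HASURA_ENDPOINT` and `HASURA_ADMIN_SECRET` variables, and the `REGISTRY_PREFIX` override. `ghcr.io/org` is the placeholder owner already used in the table.

```shell
# Build the image reference for a service at a given commit.
image_ref() {
  echo "${REGISTRY_PREFIX:-ghcr.io/org}/$1:$2"
}

# Roll Swarm service $1 to image $2 built from commit $3.
# --with-registry-auth forwards the manager's registry login to the nodes.
update_service() {
  docker service update --with-registry-auth \
    --image "$(image_ref "$2" "$3")" "$1"
}

# Apply database changes first, then update services sequentially.
# Each step runs only if the previous one succeeded.
deploy_release() {
  sha="$1"
  hasura migrate apply --database-name default \
    --endpoint "$HASURA_ENDPOINT" --admin-secret "$HASURA_ADMIN_SECRET" &&
  hasura metadata apply \
    --endpoint "$HASURA_ENDPOINT" --admin-secret "$HASURA_ADMIN_SECRET" &&
  update_service api nest-api "$sha" &&
  update_service api-worker nest-api "$sha" &&
  update_service frontend vue-app "$sha"
}
```

Chaining with `&&` means a failed migration stops the release before any service picks up code that expects the new schema.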
Continuous Manual Tasks:
- Secret Rotation: Because Swarm secrets are immutable, if an external API key (like OpenAI or AWS SES) is compromised, you must manually create a new secret version in the Swarm, update your `docker-compose.yml` to point to the new version, and trigger a deployment.
- Destructive Database Migrations: While Hasura handles standard migrations beautifully via CI/CD, complex destructive changes (e.g., dropping a widely used column or splitting a massive table) should often be run manually during a planned maintenance window to prevent table-locking issues under high load.
- Node Maintenance: If a Hetzner server needs a hardware replacement, you will manually SSH in, run `docker node update --availability drain <node-id>` to safely move containers to other servers, and then replace the node via Terraform.
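For the secret rotation task, one possible shape is a versioned naming scheme combined with `docker service update`. The `_v<N>` suffix and both function names are assumptions of this sketch; Swarm only requires that the new secret has a different name from the old one.

```shell
# Swarm secrets cannot be edited in place, so each rotation creates a new
# secret whose name carries a version suffix.
versioned_secret_name() {
  echo "${1}_v${2}"
}

# Create version $3 of secret $2 from stdin and swap it into service $1.
# The target stays constant, so the container keeps reading
# /run/secrets/<name> regardless of the version.
rotate_secret() {
  service="$1" name="$2" new="$3"
  old=$((new - 1))
  docker secret create "$(versioned_secret_name "$name" "$new")" - &&
  docker service update \
    --secret-rm "$(versioned_secret_name "$name" "$old")" \
    --secret-add "source=$(versioned_secret_name "$name" "$new"),target=$name" \
    "$service"
}
```

The new value would be piped in, for example `printf '%s' "$NEW_KEY" | rotate_secret api openai_api_key 2`. The `docker-compose.yml` still has to reference the new version afterwards, otherwise the next stack deploy would point the service back at the old secret.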
Hasura Metadata Drift Detection
deploy.yml applies Hasura metadata forward-only via hasura metadata apply, which silently overwrites any manual console changes. Without a drift check, a developer who modifies metadata via hasura console (SSH tunnel) and forgets to export it loses those changes on the next deploy — or worse, causes schema conflicts.
- Post-deploy verification: after `hasura metadata apply` succeeds, a follow-up step runs `hasura metadata diff` and `hasura migrate status` against the repo state. Any un-exported console change surfaces as a CI warning (not a hard failure: console use during debugging is legitimate, but the diff must be acknowledged).
- Weekly cron: a scheduled variant runs the same diff check between deploys to catch drift that accumulates while no deploys occur.
- Implementation: tracked in TODOS.md.
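Until the tracked implementation lands, the warning behaviour could be prototyped with a small filter that turns a non-empty diff into a GitHub Actions warning annotation. The function name is hypothetical, and the sketch assumes that `hasura metadata diff` prints nothing on stdout when server and repository metadata match. Since `hasura metadata apply` replaces the server's metadata with the repository copy, it may be worth running the diff before the apply step as well, while console edits are still visible.

```shell
# Read a metadata diff on stdin. If it is non-empty, print it and emit a
# GitHub Actions warning annotation. Always succeeds, so the job stays
# green and the drift shows up as a warning rather than a failure.
report_metadata_drift() {
  diff_text=$(cat)
  if [ -n "$diff_text" ]; then
    printf '%s\n' "$diff_text"
    echo "::warning title=Hasura metadata drift::Server metadata differs from the repository; export or discard the console changes."
  fi
  return 0
}

# Intended use (needs the Hasura CLI and a reachable endpoint):
#   hasura metadata diff | report_metadata_drift
```

The same function can back the weekly cron variant, since it only depends on the diff text it receives.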
3. Security Tooling (GitHub Actions)
Automated security scanning is integrated directly into the CI/CD pipeline and the GitHub Pull Request process.
| Tool | Purpose | Trigger |
|---|---|---|
| GitHub Advanced Security (CodeQL) | Static Application Security Testing (SAST). Scans for security vulnerabilities, injection flaws, and logic errors in TypeScript/JavaScript. | On every PR and weekly cron. |
| Dependabot | Automated dependency updates. Opens PRs when new versions of npm packages are available. Flags known CVEs in transitive dependencies. | Continuous monitoring. |
| Snyk | Deep dependency vulnerability scanning. Provides fix recommendations and license compliance checks. Used alongside Dependabot for layered coverage. | On every PR and continuous monitoring. |
| license-checker | Scans all npm dependency licenses against a configured allowlist. Fails the build if a dependency uses a copyleft or unknown license (e.g., GPL, AGPL). Generates a full license report for legal review. | On every PR. |
License Allowlist Configuration: The license-checker step is configured with --onlyAllow to explicitly permit a set of known-safe licenses: MIT; ISC; BSD-2-Clause; BSD-3-Clause; Apache-2.0; 0BSD; CC0-1.0. Any dependency that falls outside this allowlist causes the pipeline to fail, requiring manual review before merge. The generated license summary is uploaded as a build artifact for auditing.
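A minimal sketch of that step, with the allowlist held in one variable so the CI command and any local check share it. The variable and function names are hypothetical; `--onlyAllow` and `--summary` are license-checker options.

```shell
# Allowlist from the paragraph above, in the semicolon-separated form
# that license-checker's --onlyAllow expects.
ALLOWED_LICENSES="MIT;ISC;BSD-2-Clause;BSD-3-Clause;Apache-2.0;0BSD;CC0-1.0"

# Succeeds if license identifier $1 is on the allowlist (exact match).
is_allowed_license() {
  case ";$ALLOWED_LICENSES;" in
    *";$1;"*) return 0 ;;
    *) return 1 ;;
  esac
}

# CI step (needs Node.js): exits non-zero when any dependency falls
# outside the allowlist.
run_license_check() {
  npx license-checker --onlyAllow "$ALLOWED_LICENSES" --summary
}
```

`is_allowed_license` is only a convenience for spot checks, for example `is_allowed_license GPL-3.0 || echo "needs review"`. The pipeline gate itself is the `license-checker` exit code.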
CI/CD Secret Security
Threat: GitHub Actions secrets (Hetzner API token, SSH deploy keys, database passwords) are available to workflows triggered by PRs from same-repo branches. A collaborator with push access can modify a workflow file in their branch to exfiltrate secrets — for example by encoding the value (base64, hex) to bypass GitHub's automatic log masking, then sending it to an external server.
GitHub's Built-In Protections:
- Log masking: GitHub replaces literal secret values with `***` in workflow logs. This stops accidental exposure but does NOT stop intentional encoding-based exfiltration.
- Fork PR isolation: workflows triggered by PRs from forks do NOT receive repository secrets by default. This only protects against external contributors, not same-repo collaborators.
Required Mitigations (Mandatory):
| Mitigation | Status | How |
|---|---|---|
| Environment protection gates | ⚠️ TODO | Create production, studio, and preview environments in GitHub → Settings → Environments. Enable "Required reviewers" on each. Deploy jobs reference these environments — secrets only inject after manual approval. |
| Preview build/deploy separation | ⚠️ TODO | Split preview.yml so the build job (uses only GITHUB_TOKEN for GHCR push) runs automatically, but the deploy job (uses DEPLOY_SSH_KEY) requires environment approval. |
| Least-privilege SSH keys | ⚠️ TODO | The DEPLOY_SSH_KEY should only authorize docker stack deploy commands, not full root access. Create a dedicated deploy user on each Swarm node with a restricted shell or sudoers entry. |
| Audit collaborator access | Ongoing | Limit repository write access to trusted team members. Review the collaborator list before onboarding external contributors. |
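One way to approach the least-privilege SSH key is an OpenSSH forced command: the deploy key is bound to a filter script, and sshd passes the command the client asked for in `SSH_ORIGINAL_COMMAND`. The sketch below shows only the validation logic. The function name, the permitted character set, and the decision to allow `docker stack rm` alongside `docker stack deploy` (for preview teardown) are assumptions.

```shell
# Succeeds only for "docker stack deploy ..." and "docker stack rm ..."
# commands made of a conservative set of characters.
is_allowed_deploy_command() {
  cmd="$1"
  # Strip every permitted character; anything left over (such as ; | & $
  # or quotes) means the command is rejected before it can reach a shell.
  leftover=$(printf '%s' "$cmd" | tr -d 'A-Za-z0-9 ._/:=-')
  [ -z "$leftover" ] || return 1
  case "$cmd" in
    "docker stack deploy "*|"docker stack rm "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Intended use inside the filter script referenced from authorized_keys:
#   if is_allowed_deploy_command "$SSH_ORIGINAL_COMMAND"; then
#     exec $SSH_ORIGINAL_COMMAND   # unquoted: split into argv
#   fi
#   echo "command not permitted" >&2; exit 1
```

The key entry on the node would then carry options along the lines of `command="<path-to-filter-script>",restrict` in front of the public key, so the key cannot open an interactive shell, forward ports, or run anything the filter rejects.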
If External Contributors Are Added (Future):
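- Switch PR-triggered workflows that need secrets from `pull_request` to `pull_request_target`, which runs the workflow file from the base branch so that fork PRs cannot execute modified workflow files. Because `pull_request_target` runs with access to repository secrets, these workflows must never check out and execute code from the PR head.
- Require first-time contributor approval for all workflow runs.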
4. Scheduled E2E Testing (Cron)
In addition to post-deployment E2E runs, Playwright tests are executed on a cron schedule via GitHub Actions against the staging environment. This catches issues that post-deployment runs cannot: environment drift, expired credentials, time-dependent bugs, and external dependency failures.
- Nightly: Critical-path E2E suite (core booking flow, login, dispatch).
- Weekly: Full E2E suite covering all user journeys.
- Alerting: Failures post to a dedicated Slack channel for immediate triage.
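If both schedules live in one workflow, the job can pick the suite from the cron expression that fired it, which GitHub exposes as `github.event.schedule`. In this sketch the two cron strings, the `@critical` test tag, and the function name are assumptions; `--grep` is Playwright's title filter.

```shell
# Cron expressions for the two schedules (placeholders).
NIGHTLY_CRON="0 2 * * *"
WEEKLY_CRON="0 3 * * 0"

# Print the Playwright arguments for the schedule that triggered the run:
# the tagged critical-path subset nightly, no filter (full suite) weekly.
e2e_args_for_schedule() {
  case "$1" in
    "$NIGHTLY_CRON") echo "--grep @critical" ;;
    "$WEEKLY_CRON")  echo "" ;;
    *) echo "unknown schedule: $1" >&2; return 1 ;;
  esac
}

# Intended use, with SCHEDULE set from github.event.schedule:
#   npx playwright test $(e2e_args_for_schedule "$SCHEDULE")
```

Failing on an unknown schedule keeps a mistyped cron string from silently running the wrong suite.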
5. Monorepo-Aware Pipeline (Turborepo)
All CI tasks are executed through Turborepo to leverage caching and affected-package filtering. See monorepo.md for pipeline definitions and caching strategy.
Key CI Behaviors:
- Affected-only execution: PR pipelines use `turbo run build test lint --filter=...[origin/main]` to skip unchanged packages, keeping feedback times under 10 minutes.
- Remote caching: Turborepo remote cache is shared between CI runners, so repeated builds of unchanged packages are instant cache hits.
- Path-filter matrix: GitHub Actions uses `dorny/paths-filter` to conditionally trigger expensive jobs (e.g., E2E tests only when app code changes, not just docs or configs).
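When the affected-only command is run from a script, two details are easy to miss: the filter must be quoted, because square brackets are glob characters to the shell, and the base ref has to exist in the checkout. A small sketch with hypothetical function names:

```shell
# Build the Turborepo filter that selects packages changed since a base
# ref, plus the packages that depend on them. Defaults to origin/main.
turbo_affected_filter() {
  printf '%s\n' "--filter=...[${1:-origin/main}]"
}

# Run the PR tasks for affected packages only.
# Assumption: the checkout has enough history for the base ref to resolve
# (a shallow clone of depth 1 would not contain origin/main).
run_affected() {
  turbo run build test lint "$(turbo_affected_filter "$1")"
}
```

Calling `run_affected` with no argument reproduces the command from the list above; passing another ref, such as a release branch, changes only the comparison base.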
Deployment Observability:
- Deploy markers: Each deployment annotates Grafana dashboards with the Git SHA and deploy timestamp for visual correlation of performance changes.
- Post-deploy health checks: GitHub Actions waits for OTel metrics (error rate, p99 latency) to stabilize within SLO thresholds before marking the deployment as successful.
- Automatic rollback: If health checks fail within 5 minutes post-deploy, the previous Docker image tag is re-deployed automatically.
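The health gate and rollback can be sketched as a polling loop around a threshold check. This is a simplification under stated assumptions: `fetch_metrics` is a hypothetical helper that prints `error_rate p99_ms` for a service, the thresholds `0.01` and `500` are placeholders, and the loop rolls back on the first breach instead of waiting for metrics to stabilize. `docker service rollback` restores the service's previous spec, which includes the previous image tag.

```shell
# Succeeds when both metrics are at or below their SLO limits.
# Arguments: error_rate p99_ms max_error_rate max_p99_ms (decimals allowed).
within_slo() {
  awk -v e="$1" -v p="$2" -v me="$3" -v mp="$4" \
    'BEGIN { exit !((e + 0 <= me + 0) && (p + 0 <= mp + 0)) }'
}

# Sample every 30 seconds, 10 times (five minutes in total).
gate_deploy() {
  service="$1"
  checks=0
  while [ "$checks" -lt 10 ]; do
    # If metrics are unavailable, fail the gate without rolling back.
    metrics=$(fetch_metrics "$service") || return 1
    if ! within_slo "${metrics% *}" "${metrics#* }" 0.01 500; then
      docker service rollback "$service"
      return 1
    fi
    checks=$((checks + 1))
    sleep 30
  done
}
```

`awk` handles the comparison because POSIX shell arithmetic is integer-only and error rates are fractional.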
6. Quality Gates
Formalized blocking criteria that prevent merges and deployments. GitHub branch protection enforces all gates as required checks.
Merge Gates (Pull Request)
| Gate | Blocking Criteria | Enforcement |
|---|---|---|
| Lint & Format | Zero errors from Biome and ESLint (Vue + boundary rules) | turbo run lint |
| Type Safety | Zero errors from vue-tsc across all packages | turbo run typecheck |
| Unit Tests | All pass; no coverage decrease on diff | turbo run test with coverage |
| Integration Tests | All pass (ephemeral Testcontainers) | Parallel CI job |
| E2E Smoke | Critical-path Playwright subset passes against preview env | Post-build CI job |
| Security Scan | Zero critical/high CVEs (CodeQL + Snyk); license allowlist pass | Continuous + PR trigger |
| Build | Production bundle builds without errors for affected packages | turbo run build --filter=...[origin/main] |
| Bundle Size | Total bundle size within configured threshold | bundlesize (planned) |
Deploy Gates (Pre-Production)
| Gate | Blocking Criteria | Enforcement |
|---|---|---|
| Full E2E Suite | All Playwright specs pass against staging | Deployment pipeline |
| Health Check | OTel metrics (error rate, p99 latency) within SLO thresholds for 5 min post-deploy | Automated rollback trigger |
Coverage Targets
While chasing 100% coverage often becomes an anti-pattern, we maintain baseline expectations:
- Business Logic & Shared Packages (`packages/*`): High strictness. 80%+ coverage for core utilities, business logic, and UI components.
- Apps (`apps/*`): Focus on E2E critical paths and unit testing complex business logic (Stores, Composables).
- Coverage ratchet: Coverage percentage on a PR diff must not decrease. This is enforced via a Codecov / Coveralls PR check.
Required Checks Matrix
Maps each gate to its GitHub branch protection configuration:
| GitHub Check Name | Required | Context |
|---|---|---|
| `lint` | ✅ | Biome + ESLint |
| `typecheck` | ✅ | vue-tsc |
| `test` | ✅ | Vitest unit + integration |
| `build` | ✅ | Production bundle |
| `e2e-smoke` | ✅ | Playwright critical paths |
| `security / codeql` | ✅ | SAST |
| `security / snyk` | ✅ | Dependency CVEs |
| `security / license-check` | ✅ | License allowlist |
| `coverage` | ⚠️ Advisory | No decrease on diff |
| `bundle-size` | ⚠️ Advisory (planned) | Threshold check |
7. Preview Deployments
Purpose
Provide a live, isolated environment per Pull Request for visual QA and stakeholder review before merging to main.
Prerequisites
- Docker registry access (GitHub Container Registry)
- Traefik dynamic routing configuration on the Swarm cluster
- DNS wildcard entry for `*.preview.busflow.de`
- Swarm node availability with defined resource limits
- Secrets injection for preview environments (managed via GitHub Actions / Docker Secrets)
- Database seeds for bootstrapping isolated preview databases
How It Works
- Trigger: A developer opens or updates a PR against `main`. This triggers the GitHub Actions preview workflow.
- Build: GitHub Actions builds container images for all affected services and pushes them to the registry.
- Deploy: The pipeline deploys a scoped Docker Swarm stack. Traefik dynamic routing maps `pr-<number>.preview.busflow.de` to the stack.
- Database Strategy:
  - Frontend-only changes (no Hasura migrations detected): The preview environment connects to the shared staging database to save resources.
  - Changes containing migrations: The pipeline provisions and seeds an isolated PostgreSQL instance for the preview, preventing migration conflicts with other environments.
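The naming and the database decision above reduce to a few small helpers. In this sketch the `hasura/migrations/` path prefix is an assumption about the repository layout, and the function names and the `isolated` / `shared-staging` labels are hypothetical.

```shell
# Derive the stack name and hostname for a PR number.
preview_stack_name() { printf 'pr-%s\n' "$1"; }
preview_host() { printf 'pr-%s.preview.busflow.de\n' "$1"; }

# Read changed file paths on stdin, one per line. Print "isolated" when
# any of them is a Hasura migration, otherwise "shared-staging".
preview_db_strategy() {
  if grep -q '^hasura/migrations/'; then
    echo "isolated"
  else
    echo "shared-staging"
  fi
}
```

The changed-file list could come from `git diff --name-only origin/main...HEAD | preview_db_strategy`, and the stack name feeds the deploy step, for example `docker stack deploy -c preview.yml "$(preview_stack_name "$PR_NUMBER")"`.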
Environment Lifecycle
- Created on PR open or first push.
- Updated on subsequent commits to the PR branch (rolling redeployment of the preview stack).
- Torn down automatically on PR close or merge. A configurable TTL (e.g., 24 hours) acts as a safety net for orphaned environments.
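The TTL safety net could be a scheduled job that lists preview stacks and removes the expired ones. The sketch below covers the two reusable pieces. How a stack's creation time is recorded is left open here and would have to be supplied by the caller; the `pr-<number>` name pattern follows the stack naming used elsewhere in this document, and the function names are hypothetical.

```shell
# Succeeds when an environment created at $1 (epoch seconds) is older
# than $3 hours at time $2 (epoch seconds).
preview_expired() {
  [ $(( $2 - $1 )) -gt $(( $3 * 3600 )) ]
}

# Print the names of stacks that look like previews, e.g. pr-123.
list_preview_stacks() {
  docker stack ls --format '{{.Name}}' | grep -E '^pr-[0-9]+$'
}

# Intended use, given a creation timestamp for each stack:
#   preview_expired "$created" "$(date +%s)" 24 && docker stack rm "$stack"
```

Matching on the full `pr-<number>` pattern keeps the reaper away from long-lived stacks such as production or staging.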
Limitations & Open Questions
- Resource limits: Define maximum concurrent preview environments and per-environment resource caps on the Hetzner cluster.
- Secret management: Determine whether preview envs use production-equivalent secrets or a dedicated preview secret set.
- Seed data freshness: Define a strategy for keeping database seeds up to date (e.g., nightly snapshots from staging, version-controlled seed scripts).
- External service dependencies: Clarify how third-party integrations (payment providers, email services) behave in preview environments (sandbox mode, mocks, or disabled).