CI/CD

A robust CI/CD pipeline keeps a self-hosted Docker Swarm architecture from becoming an operational nightmare. Automating deployments and infrastructure changes removes the human error that comes with SSHing into servers to manually pull images or update routing rules.

Production Service Map

Single Source of Truth: The live operational status and topology of the Busflow platform is the Platform Topology Dashboard in Grafana (grafana.busflow.app). The table below serves only as a static reference for ingress-routable endpoints and their CI/CD pipelines.

Quick reference for publicly routable production endpoints. Internal Swarm services (e.g., Redis, MinIO chunk storage, BullMQ workers) are explicitly omitted.

| Service | Endpoint | Workflow | Trigger |
| --- | --- | --- | --- |
| Workspace | `busflow.app` | `.github/workflows/deploy.yml` | `release` published |
| API | `api.busflow.app` | `.github/workflows/deploy.yml` | `release` published |
| Hasura | `hasura.busflow.app` | `.github/workflows/deploy.yml` | `release` published |
| Auth | `auth.busflow.app` | `.github/workflows/deploy.yml` | `release` published |
| Storage | `storage.busflow.app` | `.github/workflows/deploy.yml` | `release` published |
| Docs Hub | `docs.busflow.app` | `.github/workflows/deploy-studio.yml` | push to `main` (path-filtered) |
| Landing | `busflow.de` / `getbusflow.com` | `.github/workflows/deploy-studio.yml` | push to `main` (path-filtered) |
| Context Engine | `context.busflow.app` | `.github/workflows/deploy-studio.yml` | push to `main` (path-filtered) |
| Status Page | `status.busflow.app` | `.github/workflows/deploy-studio.yml` | push to `main` (path-filtered) |
| Grafana | `grafana.busflow.app` | `.github/workflows/deploy-observability.yml` | push to `main` (path-filtered), weekly cron, `workflow_dispatch` |
| Telemetry (OTel) | `telemetry.busflow.app` | `.github/workflows/deploy-observability.yml` | push to `main` (path-filtered), weekly cron, `workflow_dispatch` |
| Traefik (ingress) | `*.busflow.app` | `.github/workflows/deploy-infra.yml` | push to `main` (path-filtered), weekly cron, `workflow_dispatch` |

When debugging a production service, check the deploy workflow run history: `gh run list --workflow=<workflow-file> --limit 5`

1. IaaS CI/CD Pipeline (Terraform)

Your physical and managed infrastructure should be treated as code. You will use a dedicated GitHub repository (or a heavily isolated folder in your monorepo) for your Terraform configurations.

Setup Requirements:

Remote State Storage: Terraform state lives in a remote S3-compatible backend (Hetzner Object Storage) so that CI/CD runs share the same state. Requires `S3_ACCESS_KEY` and `S3_SECRET_KEY` in GitHub Secrets.
GitHub Secrets: Store your `HETZNER_API_TOKEN` securely in GitHub Secrets.

Planned Automations:

On Pull Request: GitHub Actions runs `terraform fmt -check` (formatting), `terraform validate` (syntax and internal consistency), and `terraform plan`. The action comments the execution plan directly on the PR, showing exactly which servers or firewalls will be created, modified, or destroyed.
On Merge to Main: GitHub Actions runs `terraform apply -auto-approve`, applying the changes to Hetzner and AWS without an interactive confirmation prompt.

Continuous Manual Tasks:

Reviewing the Plan: A human must always review the terraform plan output on the PR before merging. Terraform is powerful; a typo could accidentally instruct it to delete your production managed database.
Major Architecture Shifts: Moving from Hetzner to AWS, or adding a completely new managed service cluster, will require manual research and writing new Terraform modules.
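
To support the plan review described above, a CI step could flag destructive actions before a human reads the full plan. The sketch below is an assumption, not part of the existing pipeline: it reads the JSON document produced by `terraform show -json <planfile>` and lists every resource whose planned actions include `delete`. The function name and the sample resource addresses are hypothetical.

```python
import json
from typing import List


def destructive_changes(plan_json: str) -> List[str]:
    """Return addresses of resources the plan would delete or replace."""
    plan = json.loads(plan_json)
    flagged = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        # A replacement appears as a delete combined with a create.
        if "delete" in actions:
            flagged.append(change["address"])
    return flagged


# Hypothetical plan excerpt, trimmed to the fields the function reads.
sample_plan = json.dumps({
    "resource_changes": [
        {"address": "hcloud_server.worker", "change": {"actions": ["update"]}},
        {"address": "hcloud_firewall.swarm", "change": {"actions": ["delete", "create"]}},
        {"address": "hcloud_volume.backups", "change": {"actions": ["delete"]}},
    ]
})

if __name__ == "__main__":
    for address in destructive_changes(sample_plan):
        print(f"WARNING: plan destroys {address}")
```

A check like this only highlights risky lines in the PR comment. It does not replace the human review.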

2. Application CI/CD Pipeline (GitHub Actions + Docker Swarm)

This pipeline handles your code (Vue.js, Nest.js), your GraphQL engine configurations (Hasura), and the routing (Traefik).

Setup Requirements:

Container Registry: You will use GitHub Container Registry (GHCR) to store your compiled Docker images.
Swarm Access: You must generate a dedicated SSH key pair. The private key goes into GitHub Secrets, allowing the GitHub Actions runner to securely SSH into your Hetzner Swarm Manager node to execute deploy commands.
Hasura CLI: Your pipeline needs the Hasura CLI installed to automate database migrations and metadata syncing.

Planned Automations (Per Service):

| Service | CI/CD Automation Flow |
| --- | --- |
| Vue.js (Frontend) | PR: Run ESLint, run unit tests. Build test image. Merge: Build production static files, bake them into an Nginx Docker image. Push to GHCR with the Git SHA as the tag. Update the Swarm service (`docker service update --image ghcr.io/org/vue-app:SHA frontend`). |
| Nest.js (API & Worker) | PR: Run ESLint, Jest unit tests, and e2e tests. Merge: Build the Node.js Docker image. Push to GHCR. Update both the `api` and `api-worker` Swarm services sequentially to ensure zero downtime. |
| Hasura & Postgres | PR: Spin up a temporary Postgres/Hasura container in the GitHub runner, apply migrations, and run schema tests. Merge: Before updating the frontend or backend, the action uses the Hasura CLI to apply SQL migrations to the Hetzner Managed Postgres and applies the Hasura Metadata (roles, permissions, event triggers). |
| Nhost (Auth/Storage) | These are pre-built images. Your automation only triggers if you change their configuration files or bump the version tag in your `docker-compose.yml`. |
| Preview Environments | PR Opened: Action builds images, SSHs into the Swarm Manager, and runs `docker stack deploy -c preview.yml pr-123`. Traefik dynamically routes a URL (e.g., `pr-123.preview.busflow.de`). Action posts the link to the PR. PR Closed: Action runs `docker stack rm pr-123` to free up Hetzner resources. |
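
The ordering in the table (schema first, then services) can be made explicit as a list of commands. The following is a minimal sketch under stated assumptions: the `ghcr.io/org/api` image name, the `default` database name, and the choice to update the frontend last are illustrative and not confirmed by the pipeline. Only `ghcr.io/org/vue-app` and the service names `api`, `api-worker`, and `frontend` come from the table.

```python
import shlex
from typing import List


def deploy_commands(sha: str, registry: str = "ghcr.io/org") -> List[List[str]]:
    """Build the ordered command list for one release, tagged by Git SHA."""
    api_image = f"{registry}/api:{sha}"           # assumed image name
    frontend_image = f"{registry}/vue-app:{sha}"  # image name used in the table
    return [
        # 1. Schema first: SQL migrations, then Hasura metadata.
        ["hasura", "migrate", "apply", "--database-name", "default"],
        ["hasura", "metadata", "apply"],
        # 2. Backend services, one after the other.
        ["docker", "service", "update", "--image", api_image, "api"],
        ["docker", "service", "update", "--image", api_image, "api-worker"],
        # 3. Frontend last (an assumption of this sketch).
        ["docker", "service", "update", "--image", frontend_image, "frontend"],
    ]


if __name__ == "__main__":
    for command in deploy_commands("abc1234"):
        print(shlex.join(command))
```

Keeping the commands as argument lists rather than shell strings avoids quoting problems if they are later passed to `subprocess.run`.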

Continuous Manual Tasks:

Secret Rotation: Because Swarm secrets are immutable, if an external API key (like OpenAI or AWS SES) is compromised, you must manually create a new secret version in the Swarm, update your docker-compose.yml to point to the new version, and trigger a deployment.
Destructive Database Migrations: While Hasura handles standard migrations beautifully via CI/CD, complex destructive changes (e.g., dropping a widely used column or splitting a massive table) should often be run manually during a planned maintenance window to prevent table-locking issues under high load.
Node Maintenance: If a Hetzner server needs a hardware replacement, you will manually SSH in, run `docker node update --availability drain <node-id>` to safely move containers to other servers, and then replace the node via Terraform.

Hasura Metadata Drift Detection

`deploy.yml` applies Hasura metadata forward-only via `hasura metadata apply`, which silently overwrites any manual console changes. Without a drift check, a developer who modifies metadata via `hasura console` (SSH tunnel) and forgets to export it loses those changes on the next deploy or, worse, causes schema conflicts.

Post-deploy verification: after `hasura metadata apply` succeeds, a follow-up step runs `hasura metadata diff` and `hasura migrate status` against the repo state. Any un-exported console change surfaces as a CI warning rather than a hard failure: console use during debugging is legitimate, but the diff must be acknowledged.
Weekly cron: a scheduled variant runs the same diff check between deploys to catch drift that accumulates while no deploys occur.
Implementation: tracked in TODOS.md.
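
Since the implementation is still open, the following is only a sketch of how the "warning, not hard failure" policy could be expressed. It assumes the CI step captures the text printed by `hasura metadata diff` and that this text is empty when server and repo agree. Both points should be verified against the CLI version in use. The function name is hypothetical.

```python
from typing import Optional


def drift_warning(diff_output: str) -> Optional[str]:
    """Return a GitHub Actions warning annotation if the diff is non-empty."""
    changed = [line for line in diff_output.splitlines() if line.strip()]
    if not changed:
        return None
    # "::warning" is a GitHub Actions workflow command. It annotates the run
    # without failing the job.
    return (
        "::warning title=Hasura metadata drift::"
        f"{len(changed)} line(s) differ between the server and the repo; "
        "export the console changes or acknowledge the diff"
    )


if __name__ == "__main__":
    message = drift_warning("-  name: bookings\n+  name: trips\n")
    if message:
        print(message)
```

Printing the returned string to stdout inside a workflow step is enough for GitHub to render it as an annotation on the run.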

3. Security Tooling (GitHub Actions)

Automated security scanning is integrated directly into the CI/CD pipeline and the GitHub Pull Request process.

| Tool | Purpose | Trigger |
| --- | --- | --- |
| GitHub Advanced Security (CodeQL) | Static Application Security Testing (SAST). Scans for security vulnerabilities, injection flaws, and logic errors in TypeScript/JavaScript. | On every PR and weekly cron. |
| Dependabot | Automated dependency updates. Opens PRs when new versions of npm packages are available. Flags known CVEs in transitive dependencies. | Continuous monitoring. |
| Snyk | Deep dependency vulnerability scanning. Provides fix recommendations and license compliance checks. Used alongside Dependabot for layered coverage. | On every PR and continuous monitoring. |
| license-checker | Scans all npm dependency licenses against a configured allowlist. Fails the build if a dependency uses a copyleft or unknown license (e.g., GPL, AGPL). Generates a full license report for legal review. | On every PR. |

License Allowlist Configuration: The license-checker step is configured with --onlyAllow to explicitly permit a set of known-safe licenses: MIT; ISC; BSD-2-Clause; BSD-3-Clause; Apache-2.0; 0BSD; CC0-1.0. Any dependency that falls outside this allowlist causes the pipeline to fail, requiring manual review before merge. The generated license summary is uploaded as a build artifact for auditing.
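
The allowlist logic can be illustrated in a few lines. This is a simplified model, not the license-checker implementation: it compares exact license identifiers and does not parse SPDX expressions such as `(MIT OR Apache-2.0)`. The dependency names in the example are hypothetical.

```python
from typing import Dict, List

# The allowlist from the paragraph above.
ALLOWED_LICENSES = [
    "MIT", "ISC", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0", "0BSD", "CC0-1.0",
]


def only_allow_argument(allowed: List[str] = ALLOWED_LICENSES) -> str:
    """Join the allowlist into the semicolon-separated value for --onlyAllow."""
    return ";".join(allowed)


def violations(dependencies: Dict[str, str]) -> List[str]:
    """Return the names of dependencies whose license is not on the allowlist."""
    return sorted(
        name for name, license_id in dependencies.items()
        if license_id not in ALLOWED_LICENSES
    )


if __name__ == "__main__":
    # Hypothetical dependency-to-license mapping.
    deps = {"left-pad": "MIT", "some-gpl-lib": "GPL-3.0", "mystery": "UNKNOWN"}
    print(only_allow_argument())
    print(violations(deps))
```

A non-empty result from `violations` corresponds to the failing build described above.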

CI/CD Secret Security

Threat: GitHub Actions secrets (Hetzner API token, SSH deploy keys, database passwords) are available to workflows triggered by PRs from same-repo branches. A collaborator with push access can modify a workflow file in their branch to exfiltrate secrets — for example by encoding the value (base64, hex) to bypass GitHub's automatic log masking, then sending it to an external server.

GitHub's Built-In Protections:

Log masking — GitHub replaces literal secret values with *** in workflow logs. This stops accidental exposure but does NOT stop intentional encoding-based exfiltration.
Fork PR isolation — workflows triggered by PRs from forks do NOT receive repository secrets by default. This only protects against external contributors, not same-repo collaborators.
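
The limit of log masking is easy to demonstrate. The sketch below is a simplified model of literal-string masking, not GitHub's implementation, and the secret value is made up. It shows that the literal value is replaced while a base64 encoding of the same value passes through untouched and can be decoded afterwards.

```python
import base64
from typing import List


def mask_secrets(log_line: str, secrets: List[str]) -> str:
    """Replace literal secret values with ***, as literal log masking does."""
    for secret in secrets:
        log_line = log_line.replace(secret, "***")
    return log_line


secret = "hetzner-token-123"  # hypothetical value
encoded = base64.b64encode(secret.encode()).decode()

masked_plain = mask_secrets(f"token={secret}", [secret])
masked_encoded = mask_secrets(f"token={encoded}", [secret])

if __name__ == "__main__":
    print(masked_plain)    # the literal value is hidden
    print(masked_encoded)  # the encoded value is not
```

This is why the mitigations below focus on when secrets are injected, not on what appears in logs.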

Required Mitigations (Mandatory):

| Mitigation | Status | How |
| --- | --- | --- |
| Environment protection gates | ⚠️ TODO | Create `production`, `studio`, and `preview` environments in GitHub → Settings → Environments. Enable "Required reviewers" on each. Deploy jobs reference these environments, so secrets only inject after manual approval. Note: Studio deploys use date-based tags (`studio-YYYYMMDD-HHMMSS-<sha>`) on push to `main`, not semver releases. |
| Preview build/deploy separation | ⚠️ TODO | Split `preview.yml` so the build job (uses only `GITHUB_TOKEN` for GHCR push) runs automatically, but the deploy job (uses `DEPLOY_SSH_KEY`) requires environment approval. |
| Least-privilege SSH keys | ⚠️ TODO | The `DEPLOY_SSH_KEY` should only authorize `docker stack deploy` commands, not full root access. Create a dedicated `deploy` user on each Swarm node with a restricted shell or sudoers entry. |
| Audit collaborator access | Ongoing | Limit repository write access to trusted team members. Review the collaborator list before onboarding external contributors. |
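
One possible way to narrow the deploy key is an OpenSSH forced command: a `command="..."` entry in `authorized_keys` runs a wrapper that inspects the `SSH_ORIGINAL_COMMAND` environment variable. This is a different mechanism from the restricted shell or sudoers entry named in the table, shown here only as a sketch. The function name and the set of rejected characters are assumptions.

```python
import shlex

# Characters that could chain or substitute commands in a shell.
FORBIDDEN_CHARS = set(";|&$`<>(){}\n\\")


def is_allowed(original_command: str) -> bool:
    """Accept only a plain `docker stack deploy ...` invocation."""
    if any(ch in FORBIDDEN_CHARS for ch in original_command):
        return False  # reject shell metacharacters outright
    try:
        tokens = shlex.split(original_command)
    except ValueError:
        return False  # unbalanced quotes
    # Require the exact subcommand plus at least one argument.
    return tokens[:3] == ["docker", "stack", "deploy"] and len(tokens) > 3


if __name__ == "__main__":
    print(is_allowed("docker stack deploy -c preview.yml pr-123"))
    print(is_allowed("docker stack deploy -c preview.yml pr-123; cat /etc/shadow"))
```

Access to the Docker daemon is itself highly privileged, so a filter like this narrows the command surface but does not remove the need to trust the deploy user.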

If External Contributors Are Added (Future):

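Switch PR-triggered workflows from `pull_request` to `pull_request_target` so that the workflow definition is taken from the base branch and fork PRs cannot execute modified workflow files. Because `pull_request_target` runs with access to repository secrets, those workflows must never check out or execute code from the PR head.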
Require first-time contributor approval for all workflow runs.

4. Scheduled E2E Testing (Cron)

In addition to post-deployment E2E runs, Playwright tests are executed on a cron schedule via GitHub Actions against the staging environment. This catches issues that post-deployment runs cannot: environment drift, expired credentials, time-dependent bugs, and external dependency failures.

Nightly: Critical-path E2E suite (core booking flow, login, dispatch).
Weekly: Full E2E suite covering all user journeys.
Alerting: Failures post to a dedicated Slack channel for immediate triage.

5. Monorepo-Aware Pipeline (Turborepo)

All CI tasks are executed through Turborepo to leverage caching and affected-package filtering. See monorepo.md for pipeline definitions and caching strategy.

Key CI Behaviors:

Affected-only execution: PR pipelines use `turbo run build test lint --filter=...[origin/main]` to skip unchanged packages, keeping feedback times under 10 minutes.
Remote caching: Turborepo remote cache is shared between CI runners, so repeated builds of unchanged packages are instant cache hits.
Path-filter matrix: GitHub Actions uses `dorny/paths-filter` to conditionally trigger expensive jobs (e.g., E2E tests only when app code changes, not just docs or configs).
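
The path-filter decision can be modelled as a small predicate. The sketch below is illustrative only: the real rules live in the `dorny/paths-filter` configuration of the workflow, and the patterns used here (`apps/*`, `packages/*`, Markdown files excluded) are assumptions based on the repository layout mentioned elsewhere in this document.

```python
from fnmatch import fnmatchcase
from typing import Iterable

# Assumed patterns. fnmatch's "*" also matches "/" so nested paths are covered.
APP_CODE_PATTERNS = ["apps/*", "packages/*"]


def needs_e2e(changed_files: Iterable[str]) -> bool:
    """Return True if any changed file counts as app code."""
    return any(
        fnmatchcase(path, pattern) and not path.endswith(".md")
        for path in changed_files
        for pattern in APP_CODE_PATTERNS
    )


if __name__ == "__main__":
    print(needs_e2e(["docs/ci-cd.md"]))
    print(needs_e2e(["apps/workspace/src/App.vue"]))
```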

Deployment Observability:

Deploy markers: Each deployment annotates Grafana dashboards with the Git SHA and deploy timestamp for visual correlation of performance changes.
Post-deploy health checks: GitHub Actions waits for OTel metrics (error rate, p99 latency) to stabilize within SLO thresholds before marking the deployment as successful.
Automatic rollback: If health checks fail within 5 minutes post-deploy, the previous Docker image tag is re-deployed automatically.
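
The health gate and rollback decision can be sketched as follows. The thresholds are placeholders, not the platform's actual SLOs, and the sketch is stricter than "waiting for metrics to stabilize": any breach inside the five-minute window triggers a rollback. How samples are fetched from the telemetry backend is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    seconds_after_deploy: int
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float


# Assumed SLO thresholds, for illustration only.
MAX_ERROR_RATE = 0.01
MAX_P99_LATENCY_MS = 800.0
WINDOW_SECONDS = 5 * 60


def deploy_decision(samples: List[Sample]) -> str:
    """Return "rollback" if any sample in the window breaches a threshold."""
    window = [s for s in samples if s.seconds_after_deploy <= WINDOW_SECONDS]
    if not window:
        return "rollback"  # missing data is treated as unhealthy in this sketch
    for s in window:
        if s.error_rate > MAX_ERROR_RATE or s.p99_latency_ms > MAX_P99_LATENCY_MS:
            return "rollback"
    return "success"


if __name__ == "__main__":
    healthy = [Sample(60, 0.002, 420.0), Sample(240, 0.004, 510.0)]
    print(deploy_decision(healthy))
```

On a "rollback" result, the pipeline would re-deploy the previous image tag as described above.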

6. Quality Gates

Formalized blocking criteria that prevent merges and deployments. GitHub branch protection enforces every gate marked as required in the Required Checks Matrix; coverage and bundle size are advisory for now.

Merge Gates (Pull Request)

| Gate | Blocking Criteria | Enforcement |
| --- | --- | --- |
| Lint & Format | Zero errors from Biome and ESLint (Vue + boundary rules) | `turbo run lint` |
| Type Safety | Zero errors from `vue-tsc` across all packages | `turbo run typecheck` |
| Unit Tests | All pass; no coverage decrease on diff | `turbo run test` with coverage |
| Integration Tests | All pass (ephemeral Testcontainers) | Parallel CI job |
| E2E Smoke | Critical-path Playwright subset passes against preview env | Post-build CI job |
| Security Scan | Zero critical/high CVEs (CodeQL + Snyk); license allowlist pass | Continuous + PR trigger |
| Build | Production bundle builds without errors for affected packages | `turbo run build --filter=...[origin/main]` |
| Bundle Size | Total bundle size within configured threshold | `bundlesize` (planned) |

Deploy Gates (Pre-Production)

| Gate | Blocking Criteria | Enforcement |
| --- | --- | --- |
| Full E2E Suite | All Playwright specs pass against staging | Deployment pipeline |
| Health Check | OTel metrics (error rate, p99 latency) within SLO thresholds for 5 min post-deploy | Automated rollback trigger |

Coverage Targets

While chasing 100% coverage often becomes an anti-pattern, we maintain baseline expectations:

Business Logic & Shared Packages (packages/*): High strictness. 80%+ coverage for core utilities, business logic, and UI components.
Apps (apps/*): Focus on E2E critical paths and unit testing complex business logic (Stores, Composables).
Coverage ratchet: Coverage percentage on a PR diff must not decrease — enforced via Codecov / Coveralls PR check.
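
The ratchet and the package baseline reduce to two comparisons. The sketch below only illustrates the rule. In practice the Codecov or Coveralls PR check performs this comparison, and the optional tolerance parameter is an assumption added for illustration.

```python
def coverage_ratchet_ok(base_percent: float, head_percent: float,
                        tolerance: float = 0.0) -> bool:
    """Pass if coverage on the PR has not dropped below the base branch."""
    return head_percent + tolerance >= base_percent


def meets_package_target(percent: float, target: float = 80.0) -> bool:
    """Pass if a shared package reaches the 80%+ baseline."""
    return percent >= target


if __name__ == "__main__":
    print(coverage_ratchet_ok(base_percent=81.5, head_percent=82.0))
    print(meets_package_target(76.3))
```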

Required Checks Matrix

Maps each gate to its GitHub branch protection configuration:

| GitHub Check Name | Required | Context |
| --- | --- | --- |
| `lint` | ✅ | Biome + ESLint |
| `typecheck` | ✅ | vue-tsc |
| `test` | ✅ | Vitest unit + integration |
| `build` | ✅ | Production bundle |
| `e2e-smoke` | ✅ | Playwright critical paths |
| `security / codeql` | ✅ | SAST |
| `security / snyk` | ✅ | Dependency CVEs |
| `security / license-check` | ✅ | License allowlist |
| `coverage` | ⚠️ Advisory | No decrease on diff |
| `bundle-size` | ⚠️ Advisory (planned) | Threshold check |
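
The difference between required and advisory checks can be expressed as a small lookup. This is a sketch of the merge decision only: GitHub branch protection performs the real enforcement, and the `"success"` string mirrors the conclusion value GitHub reports for a passing check. The function name is hypothetical.

```python
from typing import Dict, List

# Check names taken from the matrix above.
REQUIRED_CHECKS = [
    "lint", "typecheck", "test", "build", "e2e-smoke",
    "security / codeql", "security / snyk", "security / license-check",
]
ADVISORY_CHECKS = ["coverage", "bundle-size"]


def merge_blockers(conclusions: Dict[str, str]) -> List[str]:
    """Return required checks that are missing or did not succeed."""
    return [name for name in REQUIRED_CHECKS if conclusions.get(name) != "success"]


if __name__ == "__main__":
    results = {name: "success" for name in REQUIRED_CHECKS}
    results["coverage"] = "failure"  # advisory, so it does not block
    print(merge_blockers(results))
```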

7. Preview Deployments

Purpose

Provide a live, isolated environment per Pull Request for visual QA and stakeholder review before merging to main.

Prerequisites

Docker registry access (GitHub Container Registry)
Traefik dynamic routing configuration on the Swarm cluster
DNS wildcard entry for *.preview.busflow.de
Swarm node availability with defined resource limits
Secrets injection for preview environments (managed via GitHub Actions / Docker Secrets)
Database seeds for bootstrapping isolated preview databases

How It Works

Trigger: A developer opens or updates a PR against main. This triggers the GitHub Actions preview workflow.
Build: GitHub Actions builds container images for all affected services and pushes them to the registry.
Deploy: The pipeline deploys a scoped Docker Swarm stack. Traefik dynamic routing maps pr-<number>.preview.busflow.de to the stack.
Database Strategy:
- Frontend-only changes (no Hasura migrations detected): The preview environment connects to the shared staging database to save resources.
- Changes containing migrations: The pipeline provisions and seeds an isolated PostgreSQL instance for the preview, preventing migration conflicts with other environments.
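
The naming and database choice described above can be sketched as one function. The migrations directory `hasura/migrations/` and the labels `isolated` and `shared-staging` are assumptions for illustration. The stack name and hostname patterns follow this document.

```python
from dataclasses import dataclass
from typing import Iterable

MIGRATIONS_PREFIX = "hasura/migrations/"  # assumed repository layout


@dataclass
class PreviewPlan:
    stack_name: str
    hostname: str
    database: str  # "isolated" or "shared-staging"


def plan_preview(pr_number: int, changed_files: Iterable[str]) -> PreviewPlan:
    """Derive stack name, hostname, and database strategy for one PR."""
    has_migrations = any(p.startswith(MIGRATIONS_PREFIX) for p in changed_files)
    stack = f"pr-{pr_number}"
    return PreviewPlan(
        stack_name=stack,
        hostname=f"{stack}.preview.busflow.de",
        database="isolated" if has_migrations else "shared-staging",
    )


if __name__ == "__main__":
    print(plan_preview(123, ["apps/workspace/src/App.vue"]))
```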

Environment Lifecycle

Created on PR open or first push.
Updated on subsequent commits to the PR branch (rolling redeployment of the preview stack).
Torn down automatically on PR close or merge. A configurable TTL (e.g., 24 hours) acts as a safety net for orphaned environments.
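
The TTL safety net could be a scheduled sweep over the deployed preview stacks. The sketch below assumes the TTL is measured from the last deployment of each stack and that the sweep already knows those timestamps. How they are collected is not specified here. Each returned name would then be passed to `docker stack rm`.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

DEFAULT_TTL = timedelta(hours=24)  # example value from the text


def expired_stacks(last_deployed: Dict[str, datetime], now: datetime,
                   ttl: timedelta = DEFAULT_TTL) -> List[str]:
    """Return stack names whose last deployment is older than the TTL."""
    return sorted(
        stack for stack, deployed_at in last_deployed.items()
        if now - deployed_at > ttl
    )


if __name__ == "__main__":
    now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
    stacks = {
        "pr-101": now - timedelta(hours=30),
        "pr-102": now - timedelta(hours=2),
    }
    print(expired_stacks(stacks, now))
```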

Limitations & Open Questions

Resource limits: Define maximum concurrent preview environments and per-environment resource caps on the Hetzner cluster.
Secret management: Determine whether preview envs use production-equivalent secrets or a dedicated preview secret set.
Seed data freshness: Define a strategy for keeping database seeds up to date (e.g., nightly snapshots from staging, version-controlled seed scripts).
External service dependencies: Clarify how third-party integrations (payment providers, email services) behave in preview environments (sandbox mode, mocks, or disabled).
