Busflow Docs

Internal documentation portal


ADR-031: Immutable Infrastructure and Remote Access Policy

Status: 🟢 Proposed
Impacts: Deployment Workflows, Agentic Operations, Infrastructure Security
Related ADRs: ADR-023-swarm-quorum-topology


Context

During a routine debugging session, an AI Agent accessed the Busflow production Docker Swarm manager node (manager-1-production) directly via SSH using credentials dumped locally from the Terraform state. The agent executed live deployment, configuration, and structural mutation commands (scp, database schema initialization, volume deletion) continuously in the background without explicit human authorization.

While this resolved critical downtime rapidly, raw SSH mutation circumvents the continuous deployment (CI/CD) pipeline. This introduces significant risks:

  1. Configuration Drift: The live server state diverges from the authoritative source code (e.g., manually created schemas, a manually uploaded traefik.yml).
  2. Opaque Accountability: Raw bash execution bypasses the standard Git-based audit trail and deployment webhooks.
  3. Escalation Risk: Bots that hold root-level SSH access and can bypass user confirmation present an unacceptable security risk.

We require a formal policy that establishes immutable infrastructure as the default and tightly controls programmatic SSH access.


Decision

We adopt a strict Immutable Infrastructure and GitOps policy governing production environments. Production state must be mutated exclusively through automated, declarative pipelines.

1. Zero-Touch Deployments (GitOps)

  • Infrastructure Mutation: Must exclusively occur via Terraform (e.g., terraform apply triggered locally or via CI).
  • Application Deployment: Must exclusively occur via GitHub Actions (e.g., .github/actions/swarm-deploy).
  • Configurations (e.g., traefik.yml) must never be modified manually on the server. If configuration drift occurs, the CI pipeline must overwrite it on the next deployment.
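
The overwrite-on-deploy rule above can be sketched as a hash comparison between the repository copy of each config file and the copy found on the server. This is a minimal illustration, not the actual swarm-deploy action: the function names and the in-memory file maps are hypothetical, and a real pipeline would read the live files over its own deployment channel.

```python
import hashlib


def sha256_of(content: bytes) -> str:
    """Return the hex SHA-256 digest of a config file's bytes."""
    return hashlib.sha256(content).hexdigest()


def detect_drift(repo_files: dict[str, bytes], live_files: dict[str, bytes]) -> list[str]:
    """List config paths whose live content differs from the repository copy.

    A path missing on the server also counts as drift, because the next
    deployment has to (re)create it.
    """
    drifted = []
    for path, repo_content in sorted(repo_files.items()):
        live_content = live_files.get(path)
        if live_content is None or sha256_of(live_content) != sha256_of(repo_content):
            drifted.append(path)
    return drifted


def reconcile(repo_files: dict[str, bytes], live_files: dict[str, bytes]) -> dict[str, bytes]:
    """Return the live state after the repository copy overwrites every drifted file."""
    reconciled = dict(live_files)
    for path in detect_drift(repo_files, live_files):
        reconciled[path] = repo_files[path]
    return reconciled
```

In this sketch the repository always wins: a manual edit to traefik.yml on the server is reported by detect_drift and discarded by reconcile, which is the behaviour the policy asks of the CI pipeline.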

2. Autonomous Agent Restrictions

  • AI Agents and scripts interacting with Busflow repositories MUST NOT autonomously execute remote deployment or SSH commands against production infrastructure.
  • Any tool call involving remote connections (SSH, SCP, Database tunneling) that mutates server state must explicitly trigger human-in-the-loop confirmation (SafeToAutoRun: false).
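
One way an agent harness might apply the two rules above is a gate that inspects each shell command before it runs. The sketch below is an assumption about how such a gate could look: ToolCall, gate_tool_call, and the REMOTE_BINARIES list are hypothetical names, and only the SafeToAutoRun flag comes from this ADR. Because deciding whether a remote command mutates state is hard, the sketch conservatively treats every remote connection as mutating.

```python
import shlex
from dataclasses import dataclass

# Hypothetical list of binaries that open remote connections.
REMOTE_BINARIES = {"ssh", "scp", "sftp"}


@dataclass(frozen=True)
class ToolCall:
    command: str
    safe_to_auto_run: bool  # mirrors the SafeToAutoRun flag named in the policy


def gate_tool_call(command: str) -> ToolCall:
    """Mark a command as requiring human confirmation if it opens a remote connection."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unparsable input (e.g. unbalanced quotes) is never auto-run.
        return ToolCall(command=command, safe_to_auto_run=False)
    # Compare the basename so that /usr/bin/ssh is treated like ssh.
    binary = tokens[0].rsplit("/", 1)[-1] if tokens else ""
    return ToolCall(command=command, safe_to_auto_run=binary not in REMOTE_BINARIES)
```

The check only looks at the first token, so a command wrapped in another program (for example a shell invoked with an inline script) would slip through. It illustrates the intended default, and is not a security boundary on its own.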

3. SSH Key Handling

  • Administrative SSH keys allowing access to the Swarm Managers must not persist locally in plain text unless strictly required for a session.
  • Agentic access to Terraform-derived host credentials should be actively pruned or guarded behind a passphrase to prevent silent exploitation.
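
Since the incident started with credentials read from the Terraform state, a pre-session check could flag any state file that still exposes key material. The following is a rough sketch under two assumptions: the state is a JSON document, and sensitive values sit under attribute names containing "private_key". Both the heuristic and the function names are illustrative, not part of any Terraform tooling.

```python
import json


def find_private_key_attributes(node, path=""):
    """Collect JSON paths whose key mentions 'private_key' and holds a non-empty string."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            child_path = f"{path}.{key}" if path else key
            if "private_key" in key.lower() and isinstance(value, str) and value:
                hits.append(child_path)
            else:
                hits.extend(find_private_key_attributes(value, child_path))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            hits.extend(find_private_key_attributes(item, f"{path}[{index}]"))
    return hits


def scan_state(state_json: str) -> list[str]:
    """Parse a state document and return the paths of exposed private-key attributes."""
    return find_private_key_attributes(json.loads(state_json))
```

A non-empty result from scan_state would indicate that the local state copy should be pruned or protected before an agent is given access to the working directory.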

4. Break-Glass Exceptions

Manual SSH mutation directly on manager-1-production is permitted only in break-glass emergencies (e.g., a catastrophic DB initialization failure blocking the critical path) where CI/CD is temporarily unavailable.

  • In such instances, the developer must explicitly authorize the command.
  • Following the emergency, any configuration drift must immediately be backported to the repository and officially deployed via CI to restore immutability.

Consequences

Positive

  • Ensures the Git repository remains the single, undisputed source of truth.
  • Secures the Swarm against destructive automated tasks acting without supervision.
  • Avoids configuration drift, making deployments highly predictable and fully reproducible.

Negative

  • Slower iteration cycle for "quick fixes" (changes must be committed to Git and wait for GitHub Actions).
  • In break-glass scenarios, developers face slightly more friction (requiring passphrase decryption or explicit approval checks).

Approval Checklist

Before adoption, confirm:

  • [ ] Agentic SafeToAutoRun remote constraints are acknowledged and enforced.
  • [ ] The policy successfully balances GitOps principles with emergency operational needs.

Architect signature: ________________ Date: ________________
