Workflow Orchestration โ
Decision โ
Three product-owned layers cover durable workflow and automation needs. Each handles a distinct concern to avoid overlap. n8n is not a default production planning layer; it is an optional early adapter for time-boxed prototypes, demos, or external connector exploration.
| Layer | Tool | Hosting |
|---|---|---|
| Trigger layer (reactive + scheduled) | Hasura Events + Cron Triggers | Existing Hasura instance |
| UI & domain state machines | XState | In-app (client + server) |
| Durable job queues, background work, human-in-the-loop agent tasks | BullMQ + NestJS | Self-hosted (existing Redis + NestJS workers) |
| Optional prototype adapter | n8n | Only if a short-lived validation workflow needs it |
Hasura โ Trigger Layer โ
Purpose โ
Detect when something should happen. Hasura Events + Cron Triggers serve as the unified entry point for all reactive and scheduled workflows. They fire webhooks to NestJS endpoints, which decide whether to handle inline, enqueue a BullMQ job, or call an optional prototype adapter.
Trigger Types โ
- Event Triggers: Fire on Postgres row changes (INSERT/UPDATE/DELETE). Used for reactive workflows (e.g., "created a quote", "changed a tour status").
- Cron Triggers: Fire on a schedule. Used for periodic tasks ("daily compliance scan at 06:00", "T-48h pre-departure checks").
Integration: @golevelup/nestjs-hasura โ
- We declare trigger definitions as TypeScript decorators on NestJS handler methods using
@golevelup/nestjs-hasura. - NestJS is the source of truth for which triggers exist and what they do. The library auto-registers event/cron triggers with Hasura's metadata API on application startup.
- This eliminates manual trigger configuration in the Hasura Console and keeps trigger definitions co-located with their handler logic in the NestJS codebase.
- Triggers are version-controlled, reviewable in PRs, and type-safe โ no drift between Hasura metadata and application code.
Routing Rule โ
When a Hasura trigger fires a NestJS webhook:
- Trivial side effect (cache invalidation, counter update) โ handle inline in the endpoint.
- Heavy / stateful / needs human approval (AI inference, cost calculations, agent tasks) โ enqueue a BullMQ job.
- External comms chain (multi-channel messaging, API integrations) โ enqueue the Communications/BullMQ pipeline by default; use n8n only for a short-lived, explicitly marked prototype.
Why Hasura, not BullMQ cron โ
- Hasura is already in the stack; no extra scheduling infrastructure needed.
- Trigger definitions live in NestJS code (via
@golevelup/nestjs-hasuradecorators), synced to Hasura automatically. - BullMQ's value is in execution (retries, concurrency, state), not in triggering.
XState โ State Machines โ
Purpose โ
Model deterministic, finite-state logic for UI flows and domain entities.
Intended Use Cases โ
- Booking flow lifecycle (
idle โ selecting โ confirming โ paid โ ticketed) - Tour status tracking (
draft โ planned โ active โ completed โ reconciled) - Form wizards and multi-step UI interactions
Why XState โ
- TypeScript-native, runs on client and server
- Visualizer for debugging complex states
- Strong guarantees: no impossible state transitions, exhaustive event handling
- Replaces ad-hoc boolean flags and status enums across the codebase
Optional n8n Adapter โ Prototype & Validation โ
Purpose โ
Time-boxed validation workflows for external integrations when the cost of building a typed service would slow down learning. n8n is allowed to help answer "does this workflow matter to customers?" It is not the planned long-term home for core domain logic, passenger communications, pricing, compliance, or AI agent orchestration.
Intended Use Cases โ
- Demo or concierge-only workflows where the output is reviewed before customer impact.
- External connector spikes where n8n's built-in nodes reduce validation time.
- Temporary email/webhook glue while the production NestJS endpoint is still being designed.
- Disposable sales prototypes (e.g., a mocked Wallet Pass or WhatsApp sandbox flow) that will be deleted or graduated.
Why n8n โ
- Visual workflow editor can compress early discovery work.
- Built-in connectors reduce one-off API boilerplate during validation.
- Self-hosting can keep prototype data on Busflow-controlled infrastructure if we choose to run it.
- The low setup cost is useful only before the workflow becomes a product commitment.
Non-Goals โ
- Do not plan core production workflows around n8n.
- Do not encode cross-context domain decisions in n8n.
- Do not make n8n the default Communications pipeline.
- Do not expose n8n to operators as an automation builder without a separate commercial/license decision.
Graduation Rule โ
Once a workflow is validated, customer-facing, compliance-relevant, or operationally critical, it moves into dedicated NestJS/BullMQ code with typed contracts, tests, observability, and code review. Any remaining n8n workflow must carry an owner, expiry condition, and migration target.
Monorepo Placement (Only If Adopted) โ
- Workflow JSONs are application code (business logic: triggers, API calls, data transformations), not infrastructure config.
- Stored in
apps/automations/as a pnpm workspace member, following the same convention asapps/workspace,apps/passenger, etc. apps/automations/workflows/โ exported workflow JSON files, version-controlled and reviewed in PRs.apps/automations/credentials/โ encrypted credential exports.apps/automations/package.jsonโ convenience scripts forn8n export:workflow/n8n import:workflow.- Authoring model: Edit visually in the n8n UI โ export to JSON โ commit. JSON hand-editing is possible but not the primary workflow.
Deployment (Only If Adopted) โ
- Docker container in the existing Swarm stack (service definition in
terraform/). - Workflow JSON files mounted as a Docker volume or synced via n8n REST API in CI/CD.
- Persistent storage via PostgreSQL (shared instance or dedicated).
- Webhook endpoints are internal-only โ reachable via the
busflow-nestoverlay network from NestJS. n8n is not exposed through Traefik (no public route). Administer via SSH tunnel:ssh -L 5678:localhost:5678 root@<manager-ip>.
Observability & Tracing โ
n8n breaks the W3C Trace Context chain by default. If it is adopted for a prototype that touches observable systems, maintain end-to-end tracing (see observability.md):
- Inbound: NestJS injects the active
traceparentheader when forwarding to n8n webhooks. - Outbound: Configure n8n HTTP Request nodes to forward
traceparentandtracestateheaders to downstream services (NestJS API, external APIs). - Logging: The system ships n8n execution logs to Loki via the Docker logging driver. Workflow execution IDs serve as the correlation key when native trace propagation is not possible.
Error Handling & Dead Letters โ
To ensure resilience against prototype automation unreliability, any adopted n8n workflow must handle edge states procedurally:
- Retry Policy: Idempotent workflows (e.g., sending generic notifications or data syncing) utilize n8n's Node-level setting:
On Error: Retry(max 3 retries, exponential backoff). Non-idempotent workflows (e.g., issuing refunds or generating unique invoice IDs) fail immediately to prevent duplicate execution. - Dead-Letter Queue: For persistent failures, the n8n Error Trigger workflow catches the exception and writes the entire execution state (including incoming payload) to a dedicated
n8n_dead_lettersPostgreSQL table. This allows engineering to manually inspect and re-ingest the payload after patching the workflow. - Alerting Integration: The same n8n Error Trigger workflow executes an HTTP Request to the Grafana Alertmanager webhook, which handles grouping and routes the alert to the designated
#ops-alertsSlack channel. - Fallback Behavior (Circuit Breaker): The NestJS API acts as the protective membrane. If a prototype calls n8n, NestJS wraps the outbound webhook call in an
opossumCircuit Breaker. If n8n becomes unresponsive, the circuit opens and payloads route to a recovery queue or are marked for manual replay, depending on the prototype's risk profile.
Credential & Secret Management โ
- n8n encrypts credentials at rest using a symmetric encryption key.
- The system stores the encryption key as a Docker Secret (
/run/secrets/n8n_encryption_key), consistent with the platform-wide secrets strategy (see gdpr-strategy.md). - The
apps/automations/credentials/directory contains encrypted credential exports only. Plaintext credentials are never committed. - Credential rotation: re-encrypt and re-import via
n8n import:credentialswith the updated encryption key.
CI/CD & Workflow Versioning โ
- Deploy pipeline: If production n8n exists, on merge to
main, a CI job runsn8n import:workflow --separate --input=apps/automations/workflows/against the production n8n instance via its REST API. - PR workflow: Developers edit visually in a staging n8n instance โ
n8n export:workflow --separate --output=apps/automations/workflows/โ commit JSON โ PR review. - Rollback: Git history allows recovery of previous workflow versions. Re-run the import step against the target commit.
Licensing โ
WARNING
n8n Community Edition uses the Sustainable Use License (not MIT/Apache). This permits free self-hosting for internal use but restricts offering n8n as a managed service to third parties. If n8n becomes operator-facing (i.e., tenants author their own automations), you require a commercial license negotiation with n8n GmbH. See n8n.io/pricing.
BullMQ + NestJS โ Durable Job Queues โ
Purpose โ
Background processing, scheduled tasks, and the "virtual employee" agentic pattern requiring human-in-the-loop approval.
Intended Use Cases โ
- Background jobs: PDF generation, email dispatch, heavy AI inference, Legacy Data ETL (Magic Upload)
- Scheduled tasks: Cron-based proactive agent checks (e.g., daily cost anomaly scan)
- Human-in-the-loop agent workflows:
- Event/schedule triggers agent analysis โ agent proposes action
- Workflow pauses; user reviews and approves/rejects proposal
- Agent executes; workflow pauses again for output review
- User accepts result or requests revision
Architecture โ
- Already part of the infrastructure (see infrastructure.md, ยง3)
- Redis as message broker (in-Swarm)
- NestJS workers: same codebase, HTTP disabled, consume jobs from Redis
- Workflow state persisted in PostgreSQL (
WorkflowInstanceentity) for durability across restarts - REST endpoints on the API layer for human review actions (
POST /workflows/:id/review)
Rate Limiting & Error Handling for CPaaS Dispatch โ
The DispatchWorker (BullMQ processor for message dispatch) enforces per-tenant rate limiting with provider-aware error handling:
- Per-tenant rate limiting: Each tenant holds an isolated in-memory token bucket per channel (WhatsApp, Email, SMS). WhatsApp refill rate adjusts by the tenant's
messaging_tier(Tier 1: 1 msg/sec, Tier 4: 50 msg/sec). - Error classification: Each CPaaS adapter (Meta, SES, SNS) classifies errors as transient (BullMQ retries with exponential backoff) or permanent (triggers fallback chain to next channel).
- Circuit breaker: Provider adapters use circuit breaking around external CPaaS APIs. If an optional n8n prototype is in the path, the NestJS โ n8n call also uses opossum circuit breaking and routes failed payloads to a recovery queue or manual replay flow.
For the complete error classification table, adapter implementations, and fallback chain logic, see notification-pipeline-protocol.md ยง4.
Key Design Decisions โ
- State persistence in PostgreSQL, not Redis: Workflow instances may wait days for human review; Redis is ephemeral. BullMQ handles queueing; Postgres owns state.
- No BPMN: We define workflows in TypeScript code, not XML/visual diagrams. Keeps everything in a single language, version-controlled, and unit-testable.
- Bull Board (
@bull-board/nestjs) provides queue monitoring out of the box.
Boundary Rules โ
| Signal | Route to |
|---|---|
| Something changed in the DB / time-based schedule | Hasura Event / Cron Trigger โ NestJS webhook |
| Finite states with defined transitions (UI or domain) | XState |
| External API calls and webhook chains | NestJS provider adapters + BullMQ jobs by default; n8n only for marked prototypes |
| Async background work, retries, scheduled jobs | BullMQ |
| Agent task with human approval gates | BullMQ + PostgreSQL workflow state |
- Hasura triggers are the primary entry point. DB changes and schedules flow through Hasura, not BullMQ cron or application-level polling.
- Prototype n8n workflows may enqueue BullMQ jobs only when the workflow is explicitly marked as temporary. Core NestJS workers should not depend on n8n.
- XState machines may live inside BullMQ processors when a job follows a complex state lifecycle internally.
Reference Examples โ
1. "Group Booking" Negotiator โ
Problem: A school emails asking for a quote for 3 buses, 2-day trip to Berlin. A human currently checks availability, calculates costs, and drafts a PDF quote manually.
| Step | Tool | Why |
|---|---|---|
| Email arrives | NestJS inbound endpoint (or n8n prototype) | External channel ingestion |
| Parse email, extract intent via AI | BullMQ processor | LLM call with typed validation and retry control |
| Enqueue agent job with structured data | BullMQ | Heavy work ahead |
| Check bus + driver availability | BullMQ processor | DB queries, domain logic |
| Calculate cost breakdown | BullMQ processor | Complex domain rules |
| Draft PDF quote | BullMQ processor | Slow, retryable |
| Pause for dispatcher review | BullMQ โ pending_review | Human-in-the-loop |
| Quote lifecycle in UI | XState | draft โ pending_review โ approved โ sent |
| Dispatcher approves โ send quote | Communications pipeline | Outbound email + PDF |
2. "Driver Compliance" Monitor โ
Problem: EU regulation VO (EG) 561/2006 strictly governs driving/resting hours. Tracking is tedious and error-prone.
| Step | Tool | Why |
|---|---|---|
| Daily scan at 06:00 | Hasura Cron Trigger | Scheduled trigger |
| For each driver: calculate remaining legal hours | BullMQ processor | Domain logic, per-driver fan-out |
| Flag violations or upcoming limits | BullMQ processor | Rule evaluation |
| Notify dispatcher of violations | BullMQ โ provider adapter | Outbound alert (Slack/email) if not in-app |
| Dispatcher acknowledges alert | BullMQ โ pending_review | Human-in-the-loop |
| Driver compliance status in UI | XState | compliant โ warning โ violation โ acknowledged |
3. "Pre-Departure Logistics" Coordinator โ
Problem: 48 hours before a multi-bus tour, passengers need reminders and drivers need manifests.
| Step | Tool | Why |
|---|---|---|
Tour status updated to pre_departure | Hasura Event Trigger | Reactive trigger on DB change |
| Enqueue T-48h job (or fire immediately) | NestJS webhook โ BullMQ | Durable scheduling and fan-out |
| Fetch passenger list, generate reminders | BullMQ processor | Typed queries and template context generation |
| Send batch messages | Communications pipeline | External API calls (WhatsApp Business, SES) |
| Generate + send driver manifests (PDF) | BullMQ processor โ Communications pipeline | Durable PDF generation plus outbound dispatch |
| Mark pre-departure checklist completed | BullMQ | Status tracking |
| Tour lifecycle in UI | XState | planned โ pre_departure_sent โ ready โ active |
4. "Legacy Data ETL" Pipeline โ
Problem: A new tenant uploads a 50k-row CSV of past customers from their legacy ERP in Windows-1252 ANSI format, requiring deterministic mapping to Busflow GraphQL schemas without crushing memory or risking an all-or-nothing rollback.
| Step | Tool | Why |
|---|---|---|
| User drops CSV into UI | Direct GraphQL | Saves S3 key to import_jobs |
| Hasura Event Trigger fires | Hasura โ NestJS | Bridges DB insert to queuing system |
Enqueue magic-etl-queue job | NestJS โ BullMQ | Decoupled background processing |
| Stream+Decode CSV from S3 | BullMQ processor | Memory-safe ingestion |
| Map chunks to GraphQL format | BullMQ processor | Domain logic implementation |
| GraphQL Batch Mutations | BullMQ processor | Fast bulk inserts (insert_table) |
| Dead Letter Queue insertion | BullMQ processor | Captures validation errors gracefully |
| Cleanup S3 Blob | BullMQ processor | GDPR hygiene requirement |
Pattern Summary โ
| Use Case | Trigger | Primary Driver | n8n | BullMQ | XState |
|---|---|---|---|---|---|
| Group Booking | Email webhook | BullMQ (agent + approval) | Optional ingestion prototype | Core orchestration | Quote lifecycle |
| Driver Compliance | Hasura Cron | BullMQ (domain logic) | โ | Core logic + scheduling | Compliance status |
| Pre-Departure Coordinator | Hasura Event | BullMQ + Communications | Optional prototype only | Scheduled trigger, status, fan-out | Tour lifecycle |
| Legacy Data ETL | Hasura Event | BullMQ (ETL processor) | โ | Core orchestration | โ |
Rejected Alternatives โ
Camunda โ
- What: Enterprise BPMN process orchestration engine.
- Why considered: Complex business processes (pre/post-tour cost breakdowns, compliance, multi-step agent approval flows).
- Why rejected:
- JVM runtime โ adds a second language ecosystem and deployment target to a TypeScript stack.
- Requires BPMN XML modeling; no team members benefit from visual process design at current scale.
- Massive infrastructure overhead for a 2-node Docker Swarm.
- The workflows in scope (cost calculations, agent approvals) are better expressed as domain logic + durable job queues than as orchestrated BPMN processes.
- Revisit trigger: Enterprise customers demanding ISO-auditable process definitions, or >20 distinct multi-party approval workflows.
Inngest / Trigger.dev โ
- What: Managed TypeScript-native durable workflow platforms.
- Why considered: Elegant
step.waitForEvent()API for human-in-the-loop, zero infrastructure management. - Why rejected:
- Managed cloud conflicts with self-hosted Hetzner strategy (data sovereignty, GDPR, cost control).
- Inngest self-hosted option is less mature than the cloud offering.
- BullMQ + NestJS achieves the same durability guarantees with infrastructure already in the stack (Redis, Postgres, NestJS workers).
- Revisit trigger: If BullMQ workflow boilerplate becomes a maintenance burden across >5 distinct workflow types.
Temporal โ
- What: Open-source durable execution framework (Go/Java/TypeScript SDKs).
- Why not evaluated in depth: Similar trade-off profile to Camunda โ powerful but heavy. Requires dedicated Temporal Server cluster. Overkill for current scale; same problems solved more simply by BullMQ.