Workflow Orchestration Strategy โ
Decision โ
Four complementary layers cover all workflow and automation needs. Each handles a distinct concern to avoid overlap.
| Layer | Tool | Hosting |
|---|---|---|
| Trigger layer (reactive + scheduled) | Hasura Events + Cron Triggers | Existing Hasura instance |
| UI & domain state machines | XState | In-app (client + server) |
| Automation, agent pipelines, API integrations | n8n | Self-hosted (Docker Swarm) |
| Durable job queues, background work, human-in-the-loop agent tasks | BullMQ + NestJS | Self-hosted (existing Redis + NestJS workers) |
Hasura โ Trigger Layer โ
Purpose โ
Detect when something should happen. Hasura Events + Cron Triggers serve as the unified entry point for all reactive and scheduled workflows. They fire webhooks to NestJS endpoints, which decide whether to handle inline, enqueue a BullMQ job, or call an n8n webhook.
Trigger Types โ
- Event Triggers: Fire on Postgres row changes (INSERT/UPDATE/DELETE). Used for reactive workflows (e.g., "created a quote", "changed a tour status").
- Cron Triggers: Fire on a schedule. Used for periodic tasks ("daily compliance scan at 06:00", "T-48h pre-trip checks").
Integration: @golevelup/nestjs-hasura โ
- We declare trigger definitions as TypeScript decorators on NestJS handler methods using
@golevelup/nestjs-hasura. - NestJS is the source of truth for which triggers exist and what they do. The library auto-registers event/cron triggers with Hasura's metadata API on application startup.
- This eliminates manual trigger configuration in the Hasura Console and keeps trigger definitions co-located with their handler logic in the NestJS codebase.
- Triggers are version-controlled, reviewable in PRs, and type-safe โ no drift between Hasura metadata and application code.
Routing Rule โ
When a Hasura trigger fires a NestJS webhook:
- Trivial side effect (cache invalidation, counter update) โ handle inline in the endpoint.
- External comms chain (multi-channel messaging, API integrations) โ forward to n8n webhook.
- Heavy / stateful / needs human approval (AI inference, cost calculations, agent tasks) โ enqueue a BullMQ job.
Why Hasura, not BullMQ cron โ
- Hasura is already in the stack; no extra scheduling infrastructure needed.
- Trigger definitions live in NestJS code (via
@golevelup/nestjs-hasuradecorators), synced to Hasura automatically. - BullMQ's value is in execution (retries, concurrency, state), not in triggering.
XState โ State Machines โ
Purpose โ
Model deterministic, finite-state logic for UI flows and domain entities.
Intended Use Cases โ
- Booking flow lifecycle (
idle โ selecting โ confirming โ paid โ ticketed) - Tour status tracking (
draft โ planned โ active โ completed โ reconciled) - Form wizards and multi-step UI interactions
Why XState โ
- TypeScript-native, runs on client and server
- Visualizer for debugging complex states
- Strong guarantees: no impossible state transitions, exhaustive event handling
- Replaces ad-hoc boolean flags and status enums across the codebase
n8n โ Automation & Agent Orchestration โ
Purpose โ
Event-driven automation workflows and external API integration, particularly for AI agent pipelines.
Intended Use Cases โ
- PDF Parsing: PDF received โ AI parse โ structured tour data โ notify dispatcher
- Automated WhatsApp/email passenger updates triggered by tour status changes
- Compliance monitoring (driving hours) with scheduled checks
- Third-party integrations (payment providers, CRM, accounting)
Why n8n โ
- Visual workflow editor for non-trivial integration chains
- 400+ built-in connectors reduce boilerplate for external APIs
- Self-hosted: data stays on Hetzner infrastructure (GDPR)
- Community edition is free; aligns with cost-conscious early-stage constraints
Prototyping vs. Core Production (Strategic Deprecation) โ
- Prototyping & Validation: In the initial phases, n8n serves as a rapid prototyping layer allowing us to quickly "fake" complex integrations (e.g., Apple Wallet passes, WhatsApp workflows) to validate customer willingness-to-pay without heavy engineering investment.
- Migration Path: The explicit plan is to not rely on n8n too heavily for core, validated execution paths. Once an automated workflow solidifies and becomes critical to the business, it will be migrated from n8n into dedicated, highly observable NestJS + BullMQ services for superior resilience, performance, and deterministic version control.
Monorepo Placement โ
- Workflow JSONs are application code (business logic: triggers, API calls, data transformations), not infrastructure config.
- Stored in
apps/automations/as a pnpm workspace member, following the same convention asapps/workspace,apps/passenger, etc. apps/automations/workflows/โ exported workflow JSON files, version-controlled and reviewed in PRs.apps/automations/credentials/โ encrypted credential exports.apps/automations/package.jsonโ convenience scripts forn8n export:workflow/n8n import:workflow.- Authoring model: Edit visually in the n8n UI โ export to JSON โ commit. JSON hand-editing is possible but not the primary workflow.
Deployment โ
- Docker container in the existing Swarm stack (service definition in
terraform/). - Workflow JSON files mounted as a Docker volume or synced via n8n REST API in CI/CD.
- Persistent storage via PostgreSQL (shared instance or dedicated).
- Webhook endpoints are internal-only โ reachable via the
busflow-nestoverlay network from NestJS. n8n is not exposed through Traefik (no public route). Administer via SSH tunnel:ssh -L 5678:localhost:5678 root@<manager-ip>.
Observability & Tracing โ
n8n breaks the W3C Trace Context chain by default. To maintain end-to-end tracing (see observability.md):
- Inbound: NestJS injects the active
traceparentheader when forwarding to n8n webhooks. - Outbound: Configure n8n HTTP Request nodes to forward
traceparentandtracestateheaders to downstream services (NestJS API, external APIs). - Logging: The system ships n8n execution logs to Loki via the Docker logging driver. Workflow execution IDs serve as the correlation key when native trace propagation is not possible.
Error Handling & Dead Letters โ
To ensure resilience against automation unreliability, the system handles n8n edge states procedurally:
- Retry Policy: Idempotent workflows (e.g., sending generic notifications or data syncing) utilize n8n's Node-level setting:
On Error: Retry(max 3 retries, exponential backoff). Non-idempotent workflows (e.g., issuing refunds or generating unique invoice IDs) fail immediately to prevent duplicate execution. - Dead-Letter Queue: For persistent failures, the n8n Error Trigger workflow catches the exception and writes the entire execution state (including incoming payload) to a dedicated
n8n_dead_lettersPostgreSQL table. This allows engineering to manually inspect and re-ingest the payload after patching the workflow. - Alerting Integration: The same n8n Error Trigger workflow executes an HTTP Request to the Grafana Alertmanager webhook, which handles grouping and routes the alert to the designated
#ops-alertsSlack channel. - Fallback Behavior (Circuit Breaker): The NestJS API acts as the protective membrane. It wraps outbound webhook calls to n8n in an
opossumCircuit Breaker. If n8n becomes unresponsive, the circuit opens, and NestJS redirects incoming payloads to a fallbackn8n_recoveryBullMQ queue for draining when n8n recovers.
Credential & Secret Management โ
- n8n encrypts credentials at rest using a symmetric encryption key.
- The system stores the encryption key as a Docker Secret (
/run/secrets/n8n_encryption_key), consistent with the platform-wide secrets strategy (see gdpr-strategy.md). - The
apps/automations/credentials/directory contains encrypted credential exports only. Plaintext credentials are never committed. - Credential rotation: re-encrypt and re-import via
n8n import:credentialswith the updated encryption key.
CI/CD & Workflow Versioning โ
- Deploy pipeline: On merge to
main, a CI job runsn8n import:workflow --separate --input=apps/automations/workflows/against the production n8n instance via its REST API. - PR workflow: Developers edit visually in a staging n8n instance โ
n8n export:workflow --separate --output=apps/automations/workflows/โ commit JSON โ PR review. - Rollback: Git history allows recovery of previous workflow versions. Re-run the import step against the target commit.
Licensing โ
WARNING
n8n Community Edition uses the Sustainable Use License (not MIT/Apache). This permits free self-hosting for internal use but restricts offering n8n as a managed service to third parties. If n8n becomes operator-facing (i.e., tenants author their own automations), you require a commercial license negotiation with n8n GmbH. See n8n.io/pricing.
BullMQ + NestJS โ Durable Job Queues โ
Purpose โ
Background processing, scheduled tasks, and the "virtual employee" agentic pattern requiring human-in-the-loop approval.
Intended Use Cases โ
- Background jobs: PDF generation, email dispatch, heavy AI inference, Legacy Data ETL (Magic Upload)
- Scheduled tasks: Cron-based proactive agent checks (e.g., daily cost anomaly scan)
- Human-in-the-loop agent workflows:
- Event/schedule triggers agent analysis โ agent proposes action
- Workflow pauses; user reviews and approves/rejects proposal
- Agent executes; workflow pauses again for output review
- User accepts result or requests revision
Architecture โ
- Already part of the infrastructure (see infrastructure.md, ยง3)
- Redis as message broker (in-Swarm)
- NestJS workers: same codebase, HTTP disabled, consume jobs from Redis
- Workflow state persisted in PostgreSQL (
WorkflowInstanceentity) for durability across restarts - REST endpoints on the API layer for human review actions (
POST /workflows/:id/review)
Rate Limiting & Error Handling for CPaaS Dispatch โ
The DispatchWorker (BullMQ processor for message dispatch) enforces per-tenant rate limiting with provider-aware error handling:
- Per-tenant rate limiting: Each tenant holds an isolated in-memory token bucket per channel (WhatsApp, Email, SMS). WhatsApp refill rate adjusts by the tenant's
messaging_tier(Tier 1: 1 msg/sec, Tier 4: 50 msg/sec). - Error classification: Each CPaaS adapter (Meta, SES, SNS) classifies errors as transient (BullMQ retries with exponential backoff) or permanent (triggers fallback chain to next channel).
- Circuit breaker: The NestJS โ n8n webhook call uses opossum circuit breaking. On circuit open, payloads route to a
notifications_recoveryBullMQ queue.
For the complete error classification table, adapter implementations, and fallback chain logic, see notification-pipeline-protocol.md ยง4.
Key Design Decisions โ
- State persistence in PostgreSQL, not Redis: Workflow instances may wait days for human review; Redis is ephemeral. BullMQ handles queueing; Postgres owns state.
- No BPMN: We define workflows in TypeScript code, not XML/visual diagrams. Keeps everything in a single language, version-controlled, and unit-testable.
- Bull Board (
@bull-board/nestjs) provides queue monitoring out of the box.
Boundary Rules โ
| Signal | Route to |
|---|---|
| Something changed in the DB / time-based schedule | Hasura Event / Cron Trigger โ NestJS webhook |
| Finite states with defined transitions (UI or domain) | XState |
| External API calls, webhook chains, visual automation | n8n |
| Async background work, retries, scheduled jobs | BullMQ |
| Agent task with human approval gates | BullMQ + PostgreSQL workflow state |
- Hasura triggers are the primary entry point. DB changes and schedules flow through Hasura, not BullMQ cron or application-level polling.
- n8n may enqueue BullMQ jobs (e.g., n8n webhook triggers a heavy AI task via Redis). The reverse should not happen โ NestJS workers do not call n8n.
- XState machines may live inside BullMQ processors when a job follows a complex state lifecycle internally.
Reference Examples โ
1. "Group Booking" Negotiator โ
Problem: A school emails asking for a quote for 3 buses, 2-day trip to Berlin. A human currently checks availability, calculates costs, and drafts a PDF quote manually.
| Step | Tool | Why |
|---|---|---|
| Email arrives | n8n (webhook) | External channel ingestion |
| Parse email, extract intent via AI | n8n | API integration chain (LLM call) |
| Enqueue agent job with structured data | n8n โ BullMQ | Heavy work ahead |
| Check bus + driver availability | BullMQ processor | DB queries, domain logic |
| Calculate cost breakdown | BullMQ processor | Complex domain rules |
| Draft PDF quote | BullMQ processor | Slow, retryable |
| Pause for dispatcher review | BullMQ โ pending_review | Human-in-the-loop |
| Quote lifecycle in UI | XState | draft โ pending_review โ approved โ sent |
| Dispatcher approves โ send quote | BullMQ โ n8n | Outbound email + PDF |
2. "Driver Compliance" Monitor โ
Problem: EU regulation VO (EG) 561/2006 strictly governs driving/resting hours. Tracking is tedious and error-prone.
| Step | Tool | Why |
|---|---|---|
| Daily scan at 06:00 | Hasura Cron Trigger | Scheduled trigger |
| For each driver: calculate remaining legal hours | BullMQ processor | Domain logic, per-driver fan-out |
| Flag violations or upcoming limits | BullMQ processor | Rule evaluation |
| Notify dispatcher of violations | BullMQ โ n8n (optional) | Outbound alert (Slack/email) if not in-app |
| Dispatcher acknowledges alert | BullMQ โ pending_review | Human-in-the-loop |
| Driver compliance status in UI | XState | compliant โ warning โ violation โ acknowledged |
3. "Pre-Trip Logistics" Coordinator โ
Problem: 48 hours before a multi-bus tour, passengers need reminders and drivers need manifests.
| Step | Tool | Why |
|---|---|---|
Tour status updated to pre_trip | Hasura Event Trigger | Reactive trigger on DB change |
| Enqueue T-48h job (or fire immediately) | NestJS webhook โ BullMQ or n8n | Depends on timing |
| Fetch passenger list, generate reminders | n8n | Multi-channel comms (WhatsApp, email) |
| Send batch messages | n8n | External API calls (WhatsApp Business, SES) |
| Generate + send driver manifests (PDF) | n8n | External comms chain |
| Mark pre-trip checklist completed | BullMQ | Status tracking |
| Tour lifecycle in UI | XState | planned โ pre_trip_sent โ ready โ active |
4. "Legacy Data ETL" Pipeline โ
Problem: A new tenant uploads a 50k-row CSV of past customers from their legacy ERP in Windows-1252 ANSI format, requiring deterministic mapping to Busflow GraphQL schemas without crushing memory or risking an all-or-nothing rollback.
| Step | Tool | Why |
|---|---|---|
| User drops CSV into UI | Direct GraphQL | Saves S3 key to import_jobs |
| Hasura Event Trigger fires | Hasura โ NestJS | Bridges DB insert to queuing system |
Enqueue magic-etl-queue job | NestJS โ BullMQ | Decoupled background processing |
| Stream+Decode CSV from S3 | BullMQ processor | Memory-safe ingestion |
| Map chunks to GraphQL format | BullMQ processor | Domain logic implementation |
| GraphQL Batch Mutations | BullMQ processor | Fast bulk inserts (insert_table) |
| Dead Letter Queue insertion | BullMQ processor | Captures validation errors gracefully |
| Cleanup S3 Blob | BullMQ processor | GDPR hygiene requirement |
Pattern Summary โ
| Use Case | Trigger | Primary Driver | n8n | BullMQ | XState |
|---|---|---|---|---|---|
| Group Booking | n8n (email webhook) | BullMQ (agent + approval) | Email I/O, AI parse | Core orchestration | Quote lifecycle |
| Driver Compliance | Hasura Cron | BullMQ (domain logic) | Alert delivery only | Core logic + scheduling | Compliance status |
| Pre-Trip Coordinator | Hasura Event | n8n (external comms) | Core orchestration | Scheduled trigger, status | Tour lifecycle |
| Legacy Data ETL | Hasura Event | BullMQ (ETL processor) | โ | Core orchestration | โ |
Rejected Alternatives โ
Camunda โ
- What: Enterprise BPMN process orchestration engine.
- Why considered: Complex business processes (pre/post-tour cost breakdowns, compliance, multi-step agent approval flows).
- Why rejected:
- JVM runtime โ adds a second language ecosystem and deployment target to a TypeScript stack.
- Requires BPMN XML modeling; no team members benefit from visual process design at current scale.
- Massive infrastructure overhead for a 2-node Docker Swarm.
- The workflows in scope (cost calculations, agent approvals) are better expressed as domain logic + durable job queues than as orchestrated BPMN processes.
- Revisit trigger: Enterprise customers demanding ISO-auditable process definitions, or >20 distinct multi-party approval workflows.
Inngest / Trigger.dev โ
- What: Managed TypeScript-native durable workflow platforms.
- Why considered: Elegant
step.waitForEvent()API for human-in-the-loop, zero infrastructure management. - Why rejected:
- Managed cloud conflicts with self-hosted Hetzner strategy (data sovereignty, GDPR, cost control).
- Inngest self-hosted option is less mature than the cloud offering.
- BullMQ + NestJS achieves the same durability guarantees with infrastructure already in the stack (Redis, Postgres, NestJS workers).
- Revisit trigger: If BullMQ workflow boilerplate becomes a maintenance burden across >5 distinct workflow types.
Temporal โ
- What: Open-source durable execution framework (Go/Java/TypeScript SDKs).
- Why not evaluated in depth: Similar trade-off profile to Camunda โ powerful but heavy. Requires dedicated Temporal Server cluster. Overkill for current scale; same problems solved more simply by BullMQ.