Busflow Docs

Internal documentation portal

Skip to content
Reviewed 02 May 2026

Workflow Orchestration โ€‹

Decision โ€‹

Three product-owned layers cover durable workflow and automation needs. Each handles a distinct concern to avoid overlap. n8n is not a default production planning layer; it is an optional early adapter for time-boxed prototypes, demos, or external connector exploration.

LayerToolHosting
Trigger layer (reactive + scheduled)Hasura Events + Cron TriggersExisting Hasura instance
UI & domain state machinesXStateIn-app (client + server)
Durable job queues, background work, human-in-the-loop agent tasksBullMQ + NestJSSelf-hosted (existing Redis + NestJS workers)
Optional prototype adaptern8nOnly if a short-lived validation workflow needs it

Hasura โ€” Trigger Layer โ€‹

Purpose โ€‹

Detect when something should happen. Hasura Events + Cron Triggers serve as the unified entry point for all reactive and scheduled workflows. They fire webhooks to NestJS endpoints, which decide whether to handle inline, enqueue a BullMQ job, or call an optional prototype adapter.

Trigger Types โ€‹

  • Event Triggers: Fire on Postgres row changes (INSERT/UPDATE/DELETE). Used for reactive workflows (e.g., "created a quote", "changed a tour status").
  • Cron Triggers: Fire on a schedule. Used for periodic tasks ("daily compliance scan at 06:00", "T-48h pre-departure checks").

Integration: @golevelup/nestjs-hasura โ€‹

  • We declare trigger definitions as TypeScript decorators on NestJS handler methods using @golevelup/nestjs-hasura.
  • NestJS is the source of truth for which triggers exist and what they do. The library auto-registers event/cron triggers with Hasura's metadata API on application startup.
  • This eliminates manual trigger configuration in the Hasura Console and keeps trigger definitions co-located with their handler logic in the NestJS codebase.
  • Triggers are version-controlled, reviewable in PRs, and type-safe โ€” no drift between Hasura metadata and application code.

Routing Rule โ€‹

When a Hasura trigger fires a NestJS webhook:

  • Trivial side effect (cache invalidation, counter update) โ†’ handle inline in the endpoint.
  • Heavy / stateful / needs human approval (AI inference, cost calculations, agent tasks) โ†’ enqueue a BullMQ job.
  • External comms chain (multi-channel messaging, API integrations) โ†’ enqueue the Communications/BullMQ pipeline by default; use n8n only for a short-lived, explicitly marked prototype.

Why Hasura, not BullMQ cron โ€‹

  • Hasura is already in the stack; no extra scheduling infrastructure needed.
  • Trigger definitions live in NestJS code (via @golevelup/nestjs-hasura decorators), synced to Hasura automatically.
  • BullMQ's value is in execution (retries, concurrency, state), not in triggering.

XState โ€” State Machines โ€‹

Purpose โ€‹

Model deterministic, finite-state logic for UI flows and domain entities.

Intended Use Cases โ€‹

  • Booking flow lifecycle (idle โ†’ selecting โ†’ confirming โ†’ paid โ†’ ticketed)
  • Tour status tracking (draft โ†’ planned โ†’ active โ†’ completed โ†’ reconciled)
  • Form wizards and multi-step UI interactions

Why XState โ€‹

  • TypeScript-native, runs on client and server
  • Visualizer for debugging complex states
  • Strong guarantees: no impossible state transitions, exhaustive event handling
  • Replaces ad-hoc boolean flags and status enums across the codebase

Optional n8n Adapter โ€” Prototype & Validation โ€‹

Purpose โ€‹

Time-boxed validation workflows for external integrations when the cost of building a typed service would slow down learning. n8n is allowed to help answer "does this workflow matter to customers?" It is not the planned long-term home for core domain logic, passenger communications, pricing, compliance, or AI agent orchestration.

Intended Use Cases โ€‹

  • Demo or concierge-only workflows where the output is reviewed before customer impact.
  • External connector spikes where n8n's built-in nodes reduce validation time.
  • Temporary email/webhook glue while the production NestJS endpoint is still being designed.
  • Disposable sales prototypes (e.g., a mocked Wallet Pass or WhatsApp sandbox flow) that will be deleted or graduated.

Why n8n โ€‹

  • Visual workflow editor can compress early discovery work.
  • Built-in connectors reduce one-off API boilerplate during validation.
  • Self-hosting can keep prototype data on Busflow-controlled infrastructure if we choose to run it.
  • The low setup cost is useful only before the workflow becomes a product commitment.

Non-Goals โ€‹

  • Do not plan core production workflows around n8n.
  • Do not encode cross-context domain decisions in n8n.
  • Do not make n8n the default Communications pipeline.
  • Do not expose n8n to operators as an automation builder without a separate commercial/license decision.

Graduation Rule โ€‹

Once a workflow is validated, customer-facing, compliance-relevant, or operationally critical, it moves into dedicated NestJS/BullMQ code with typed contracts, tests, observability, and code review. Any remaining n8n workflow must carry an owner, expiry condition, and migration target.

Monorepo Placement (Only If Adopted) โ€‹

  • Workflow JSONs are application code (business logic: triggers, API calls, data transformations), not infrastructure config.
  • Stored in apps/automations/ as a pnpm workspace member, following the same convention as apps/workspace, apps/passenger, etc.
  • apps/automations/workflows/ โ€” exported workflow JSON files, version-controlled and reviewed in PRs.
  • apps/automations/credentials/ โ€” encrypted credential exports.
  • apps/automations/package.json โ€” convenience scripts for n8n export:workflow / n8n import:workflow.
  • Authoring model: Edit visually in the n8n UI โ†’ export to JSON โ†’ commit. JSON hand-editing is possible but not the primary workflow.

Deployment (Only If Adopted) โ€‹

  • Docker container in the existing Swarm stack (service definition in terraform/).
  • Workflow JSON files mounted as a Docker volume or synced via n8n REST API in CI/CD.
  • Persistent storage via PostgreSQL (shared instance or dedicated).
  • Webhook endpoints are internal-only โ€” reachable via the busflow-nest overlay network from NestJS. n8n is not exposed through Traefik (no public route). Administer via SSH tunnel: ssh -L 5678:localhost:5678 root@<manager-ip>.

Observability & Tracing โ€‹

n8n breaks the W3C Trace Context chain by default. If it is adopted for a prototype that touches observable systems, maintain end-to-end tracing (see observability.md):

  • Inbound: NestJS injects the active traceparent header when forwarding to n8n webhooks.
  • Outbound: Configure n8n HTTP Request nodes to forward traceparent and tracestate headers to downstream services (NestJS API, external APIs).
  • Logging: The system ships n8n execution logs to Loki via the Docker logging driver. Workflow execution IDs serve as the correlation key when native trace propagation is not possible.

Error Handling & Dead Letters โ€‹

To ensure resilience against prototype automation unreliability, any adopted n8n workflow must handle edge states procedurally:

  • Retry Policy: Idempotent workflows (e.g., sending generic notifications or data syncing) utilize n8n's Node-level setting: On Error: Retry (max 3 retries, exponential backoff). Non-idempotent workflows (e.g., issuing refunds or generating unique invoice IDs) fail immediately to prevent duplicate execution.
  • Dead-Letter Queue: For persistent failures, the n8n Error Trigger workflow catches the exception and writes the entire execution state (including incoming payload) to a dedicated n8n_dead_letters PostgreSQL table. This allows engineering to manually inspect and re-ingest the payload after patching the workflow.
  • Alerting Integration: The same n8n Error Trigger workflow executes an HTTP Request to the Grafana Alertmanager webhook, which handles grouping and routes the alert to the designated #ops-alerts Slack channel.
  • Fallback Behavior (Circuit Breaker): The NestJS API acts as the protective membrane. If a prototype calls n8n, NestJS wraps the outbound webhook call in an opossum Circuit Breaker. If n8n becomes unresponsive, the circuit opens and payloads route to a recovery queue or are marked for manual replay, depending on the prototype's risk profile.

Credential & Secret Management โ€‹

  • n8n encrypts credentials at rest using a symmetric encryption key.
  • The system stores the encryption key as a Docker Secret (/run/secrets/n8n_encryption_key), consistent with the platform-wide secrets strategy (see gdpr-strategy.md).
  • The apps/automations/credentials/ directory contains encrypted credential exports only. Plaintext credentials are never committed.
  • Credential rotation: re-encrypt and re-import via n8n import:credentials with the updated encryption key.

CI/CD & Workflow Versioning โ€‹

  • Deploy pipeline: If production n8n exists, on merge to main, a CI job runs n8n import:workflow --separate --input=apps/automations/workflows/ against the production n8n instance via its REST API.
  • PR workflow: Developers edit visually in a staging n8n instance โ†’ n8n export:workflow --separate --output=apps/automations/workflows/ โ†’ commit JSON โ†’ PR review.
  • Rollback: Git history allows recovery of previous workflow versions. Re-run the import step against the target commit.

Licensing โ€‹

WARNING

n8n Community Edition uses the Sustainable Use License (not MIT/Apache). This permits free self-hosting for internal use but restricts offering n8n as a managed service to third parties. If n8n becomes operator-facing (i.e., tenants author their own automations), you require a commercial license negotiation with n8n GmbH. See n8n.io/pricing.


BullMQ + NestJS โ€” Durable Job Queues โ€‹

Purpose โ€‹

Background processing, scheduled tasks, and the "virtual employee" agentic pattern requiring human-in-the-loop approval.

Intended Use Cases โ€‹

  • Background jobs: PDF generation, email dispatch, heavy AI inference, Legacy Data ETL (Magic Upload)
  • Scheduled tasks: Cron-based proactive agent checks (e.g., daily cost anomaly scan)
  • Human-in-the-loop agent workflows:
    • Event/schedule triggers agent analysis โ†’ agent proposes action
    • Workflow pauses; user reviews and approves/rejects proposal
    • Agent executes; workflow pauses again for output review
    • User accepts result or requests revision

Architecture โ€‹

  • Already part of the infrastructure (see infrastructure.md, ยง3)
  • Redis as message broker (in-Swarm)
  • NestJS workers: same codebase, HTTP disabled, consume jobs from Redis
  • Workflow state persisted in PostgreSQL (WorkflowInstance entity) for durability across restarts
  • REST endpoints on the API layer for human review actions (POST /workflows/:id/review)

Rate Limiting & Error Handling for CPaaS Dispatch โ€‹

The DispatchWorker (BullMQ processor for message dispatch) enforces per-tenant rate limiting with provider-aware error handling:

  • Per-tenant rate limiting: Each tenant holds an isolated in-memory token bucket per channel (WhatsApp, Email, SMS). WhatsApp refill rate adjusts by the tenant's messaging_tier (Tier 1: 1 msg/sec, Tier 4: 50 msg/sec).
  • Error classification: Each CPaaS adapter (Meta, SES, SNS) classifies errors as transient (BullMQ retries with exponential backoff) or permanent (triggers fallback chain to next channel).
  • Circuit breaker: Provider adapters use circuit breaking around external CPaaS APIs. If an optional n8n prototype is in the path, the NestJS โ†’ n8n call also uses opossum circuit breaking and routes failed payloads to a recovery queue or manual replay flow.

For the complete error classification table, adapter implementations, and fallback chain logic, see notification-pipeline-protocol.md ยง4.

Key Design Decisions โ€‹

  • State persistence in PostgreSQL, not Redis: Workflow instances may wait days for human review; Redis is ephemeral. BullMQ handles queueing; Postgres owns state.
  • No BPMN: We define workflows in TypeScript code, not XML/visual diagrams. Keeps everything in a single language, version-controlled, and unit-testable.
  • Bull Board (@bull-board/nestjs) provides queue monitoring out of the box.

Boundary Rules โ€‹

SignalRoute to
Something changed in the DB / time-based scheduleHasura Event / Cron Trigger โ†’ NestJS webhook
Finite states with defined transitions (UI or domain)XState
External API calls and webhook chainsNestJS provider adapters + BullMQ jobs by default; n8n only for marked prototypes
Async background work, retries, scheduled jobsBullMQ
Agent task with human approval gatesBullMQ + PostgreSQL workflow state
  • Hasura triggers are the primary entry point. DB changes and schedules flow through Hasura, not BullMQ cron or application-level polling.
  • Prototype n8n workflows may enqueue BullMQ jobs only when the workflow is explicitly marked as temporary. Core NestJS workers should not depend on n8n.
  • XState machines may live inside BullMQ processors when a job follows a complex state lifecycle internally.

Reference Examples โ€‹

1. "Group Booking" Negotiator โ€‹

Problem: A school emails asking for a quote for 3 buses, 2-day trip to Berlin. A human currently checks availability, calculates costs, and drafts a PDF quote manually.

StepToolWhy
Email arrivesNestJS inbound endpoint (or n8n prototype)External channel ingestion
Parse email, extract intent via AIBullMQ processorLLM call with typed validation and retry control
Enqueue agent job with structured dataBullMQHeavy work ahead
Check bus + driver availabilityBullMQ processorDB queries, domain logic
Calculate cost breakdownBullMQ processorComplex domain rules
Draft PDF quoteBullMQ processorSlow, retryable
Pause for dispatcher reviewBullMQ โ†’ pending_reviewHuman-in-the-loop
Quote lifecycle in UIXStatedraft โ†’ pending_review โ†’ approved โ†’ sent
Dispatcher approves โ†’ send quoteCommunications pipelineOutbound email + PDF

2. "Driver Compliance" Monitor โ€‹

Problem: EU regulation VO (EG) 561/2006 strictly governs driving/resting hours. Tracking is tedious and error-prone.

StepToolWhy
Daily scan at 06:00Hasura Cron TriggerScheduled trigger
For each driver: calculate remaining legal hoursBullMQ processorDomain logic, per-driver fan-out
Flag violations or upcoming limitsBullMQ processorRule evaluation
Notify dispatcher of violationsBullMQ โ†’ provider adapterOutbound alert (Slack/email) if not in-app
Dispatcher acknowledges alertBullMQ โ†’ pending_reviewHuman-in-the-loop
Driver compliance status in UIXStatecompliant โ†’ warning โ†’ violation โ†’ acknowledged

3. "Pre-Departure Logistics" Coordinator โ€‹

Problem: 48 hours before a multi-bus tour, passengers need reminders and drivers need manifests.

StepToolWhy
Tour status updated to pre_departureHasura Event TriggerReactive trigger on DB change
Enqueue T-48h job (or fire immediately)NestJS webhook โ†’ BullMQDurable scheduling and fan-out
Fetch passenger list, generate remindersBullMQ processorTyped queries and template context generation
Send batch messagesCommunications pipelineExternal API calls (WhatsApp Business, SES)
Generate + send driver manifests (PDF)BullMQ processor โ†’ Communications pipelineDurable PDF generation plus outbound dispatch
Mark pre-departure checklist completedBullMQStatus tracking
Tour lifecycle in UIXStateplanned โ†’ pre_departure_sent โ†’ ready โ†’ active

4. "Legacy Data ETL" Pipeline โ€‹

Problem: A new tenant uploads a 50k-row CSV of past customers from their legacy ERP in Windows-1252 ANSI format, requiring deterministic mapping to Busflow GraphQL schemas without crushing memory or risking an all-or-nothing rollback.

StepToolWhy
User drops CSV into UIDirect GraphQLSaves S3 key to import_jobs
Hasura Event Trigger firesHasura โ†’ NestJSBridges DB insert to queuing system
Enqueue magic-etl-queue jobNestJS โ†’ BullMQDecoupled background processing
Stream+Decode CSV from S3BullMQ processorMemory-safe ingestion
Map chunks to GraphQL formatBullMQ processorDomain logic implementation
GraphQL Batch MutationsBullMQ processorFast bulk inserts (insert_table)
Dead Letter Queue insertionBullMQ processorCaptures validation errors gracefully
Cleanup S3 BlobBullMQ processorGDPR hygiene requirement

Pattern Summary โ€‹

Use CaseTriggerPrimary Drivern8nBullMQXState
Group BookingEmail webhookBullMQ (agent + approval)Optional ingestion prototypeCore orchestrationQuote lifecycle
Driver ComplianceHasura CronBullMQ (domain logic)โ€”Core logic + schedulingCompliance status
Pre-Departure CoordinatorHasura EventBullMQ + CommunicationsOptional prototype onlyScheduled trigger, status, fan-outTour lifecycle
Legacy Data ETLHasura EventBullMQ (ETL processor)โ€”Core orchestrationโ€”

Rejected Alternatives โ€‹

Camunda โ€‹

  • What: Enterprise BPMN process orchestration engine.
  • Why considered: Complex business processes (pre/post-tour cost breakdowns, compliance, multi-step agent approval flows).
  • Why rejected:
    • JVM runtime โ€” adds a second language ecosystem and deployment target to a TypeScript stack.
    • Requires BPMN XML modeling; no team members benefit from visual process design at current scale.
    • Massive infrastructure overhead for a 2-node Docker Swarm.
    • The workflows in scope (cost calculations, agent approvals) are better expressed as domain logic + durable job queues than as orchestrated BPMN processes.
    • Revisit trigger: Enterprise customers demanding ISO-auditable process definitions, or >20 distinct multi-party approval workflows.

Inngest / Trigger.dev โ€‹

  • What: Managed TypeScript-native durable workflow platforms.
  • Why considered: Elegant step.waitForEvent() API for human-in-the-loop, zero infrastructure management.
  • Why rejected:
    • Managed cloud conflicts with self-hosted Hetzner strategy (data sovereignty, GDPR, cost control).
    • Inngest self-hosted option is less mature than the cloud offering.
    • BullMQ + NestJS achieves the same durability guarantees with infrastructure already in the stack (Redis, Postgres, NestJS workers).
    • Revisit trigger: If BullMQ workflow boilerplate becomes a maintenance burden across >5 distinct workflow types.

Temporal โ€‹

  • What: Open-source durable execution framework (Go/Java/TypeScript SDKs).
  • Why not evaluated in depth: Similar trade-off profile to Camunda โ€” powerful but heavy. Requires dedicated Temporal Server cluster. Overkill for current scale; same problems solved more simply by BullMQ.

Internal documentation โ€” Busflow