Busflow Docs

Internal documentation portal

Skip to content

Workflow Orchestration Strategy โ€‹

Decision โ€‹

Four complementary layers cover all workflow and automation needs. Each handles a distinct concern to avoid overlap.

LayerToolHosting
Trigger layer (reactive + scheduled)Hasura Events + Cron TriggersExisting Hasura instance
UI & domain state machinesXStateIn-app (client + server)
Automation, agent pipelines, API integrationsn8nSelf-hosted (Docker Swarm)
Durable job queues, background work, human-in-the-loop agent tasksBullMQ + NestJSSelf-hosted (existing Redis + NestJS workers)

Hasura โ€” Trigger Layer โ€‹

Purpose โ€‹

Detect when something should happen. Hasura Events + Cron Triggers serve as the unified entry point for all reactive and scheduled workflows. They fire webhooks to NestJS endpoints, which decide whether to handle inline, enqueue a BullMQ job, or call an n8n webhook.

Trigger Types โ€‹

  • Event Triggers: Fire on Postgres row changes (INSERT/UPDATE/DELETE). Used for reactive workflows (e.g., "created a quote", "changed a tour status").
  • Cron Triggers: Fire on a schedule. Used for periodic tasks ("daily compliance scan at 06:00", "T-48h pre-trip checks").

Integration: @golevelup/nestjs-hasura โ€‹

  • We declare trigger definitions as TypeScript decorators on NestJS handler methods using @golevelup/nestjs-hasura.
  • NestJS is the source of truth for which triggers exist and what they do. The library auto-registers event/cron triggers with Hasura's metadata API on application startup.
  • This eliminates manual trigger configuration in the Hasura Console and keeps trigger definitions co-located with their handler logic in the NestJS codebase.
  • Triggers are version-controlled, reviewable in PRs, and type-safe โ€” no drift between Hasura metadata and application code.

Routing Rule โ€‹

When a Hasura trigger fires a NestJS webhook:

  • Trivial side effect (cache invalidation, counter update) โ†’ handle inline in the endpoint.
  • External comms chain (multi-channel messaging, API integrations) โ†’ forward to n8n webhook.
  • Heavy / stateful / needs human approval (AI inference, cost calculations, agent tasks) โ†’ enqueue a BullMQ job.

Why Hasura, not BullMQ cron โ€‹

  • Hasura is already in the stack; no extra scheduling infrastructure needed.
  • Trigger definitions live in NestJS code (via @golevelup/nestjs-hasura decorators), synced to Hasura automatically.
  • BullMQ's value is in execution (retries, concurrency, state), not in triggering.

XState โ€” State Machines โ€‹

Purpose โ€‹

Model deterministic, finite-state logic for UI flows and domain entities.

Intended Use Cases โ€‹

  • Booking flow lifecycle (idle โ†’ selecting โ†’ confirming โ†’ paid โ†’ ticketed)
  • Tour status tracking (draft โ†’ planned โ†’ active โ†’ completed โ†’ reconciled)
  • Form wizards and multi-step UI interactions

Why XState โ€‹

  • TypeScript-native, runs on client and server
  • Visualizer for debugging complex states
  • Strong guarantees: no impossible state transitions, exhaustive event handling
  • Replaces ad-hoc boolean flags and status enums across the codebase

n8n โ€” Automation & Agent Orchestration โ€‹

Purpose โ€‹

Event-driven automation workflows and external API integration, particularly for AI agent pipelines.

Intended Use Cases โ€‹

  • PDF Parsing: PDF received โ†’ AI parse โ†’ structured tour data โ†’ notify dispatcher
  • Automated WhatsApp/email passenger updates triggered by tour status changes
  • Compliance monitoring (driving hours) with scheduled checks
  • Third-party integrations (payment providers, CRM, accounting)

Why n8n โ€‹

  • Visual workflow editor for non-trivial integration chains
  • 400+ built-in connectors reduce boilerplate for external APIs
  • Self-hosted: data stays on Hetzner infrastructure (GDPR)
  • Community edition is free; aligns with cost-conscious early-stage constraints

Prototyping vs. Core Production (Strategic Deprecation) โ€‹

  • Prototyping & Validation: In the initial phases, n8n serves as a rapid prototyping layer allowing us to quickly "fake" complex integrations (e.g., Apple Wallet passes, WhatsApp workflows) to validate customer willingness-to-pay without heavy engineering investment.
  • Migration Path: The explicit plan is to not rely on n8n too heavily for core, validated execution paths. Once an automated workflow solidifies and becomes critical to the business, it will be migrated from n8n into dedicated, highly observable NestJS + BullMQ services for superior resilience, performance, and deterministic version control.

Monorepo Placement โ€‹

  • Workflow JSONs are application code (business logic: triggers, API calls, data transformations), not infrastructure config.
  • Stored in apps/automations/ as a pnpm workspace member, following the same convention as apps/workspace, apps/passenger, etc.
  • apps/automations/workflows/ โ€” exported workflow JSON files, version-controlled and reviewed in PRs.
  • apps/automations/credentials/ โ€” encrypted credential exports.
  • apps/automations/package.json โ€” convenience scripts for n8n export:workflow / n8n import:workflow.
  • Authoring model: Edit visually in the n8n UI โ†’ export to JSON โ†’ commit. JSON hand-editing is possible but not the primary workflow.

Deployment โ€‹

  • Docker container in the existing Swarm stack (service definition in terraform/).
  • Workflow JSON files mounted as a Docker volume or synced via n8n REST API in CI/CD.
  • Persistent storage via PostgreSQL (shared instance or dedicated).
  • Webhook endpoints are internal-only โ€” reachable via the busflow-nest overlay network from NestJS. n8n is not exposed through Traefik (no public route). Administer via SSH tunnel: ssh -L 5678:localhost:5678 root@<manager-ip>.

Observability & Tracing โ€‹

n8n breaks the W3C Trace Context chain by default. To maintain end-to-end tracing (see observability.md):

  • Inbound: NestJS injects the active traceparent header when forwarding to n8n webhooks.
  • Outbound: Configure n8n HTTP Request nodes to forward traceparent and tracestate headers to downstream services (NestJS API, external APIs).
  • Logging: The system ships n8n execution logs to Loki via the Docker logging driver. Workflow execution IDs serve as the correlation key when native trace propagation is not possible.

Error Handling & Dead Letters โ€‹

To ensure resilience against automation unreliability, the system handles n8n edge states procedurally:

  • Retry Policy: Idempotent workflows (e.g., sending generic notifications or data syncing) utilize n8n's Node-level setting: On Error: Retry (max 3 retries, exponential backoff). Non-idempotent workflows (e.g., issuing refunds or generating unique invoice IDs) fail immediately to prevent duplicate execution.
  • Dead-Letter Queue: For persistent failures, the n8n Error Trigger workflow catches the exception and writes the entire execution state (including incoming payload) to a dedicated n8n_dead_letters PostgreSQL table. This allows engineering to manually inspect and re-ingest the payload after patching the workflow.
  • Alerting Integration: The same n8n Error Trigger workflow executes an HTTP Request to the Grafana Alertmanager webhook, which handles grouping and routes the alert to the designated #ops-alerts Slack channel.
  • Fallback Behavior (Circuit Breaker): The NestJS API acts as the protective membrane. It wraps outbound webhook calls to n8n in an opossum Circuit Breaker. If n8n becomes unresponsive, the circuit opens, and NestJS redirects incoming payloads to a fallback n8n_recovery BullMQ queue for draining when n8n recovers.

Credential & Secret Management โ€‹

  • n8n encrypts credentials at rest using a symmetric encryption key.
  • The system stores the encryption key as a Docker Secret (/run/secrets/n8n_encryption_key), consistent with the platform-wide secrets strategy (see gdpr-strategy.md).
  • The apps/automations/credentials/ directory contains encrypted credential exports only. Plaintext credentials are never committed.
  • Credential rotation: re-encrypt and re-import via n8n import:credentials with the updated encryption key.

CI/CD & Workflow Versioning โ€‹

  • Deploy pipeline: On merge to main, a CI job runs n8n import:workflow --separate --input=apps/automations/workflows/ against the production n8n instance via its REST API.
  • PR workflow: Developers edit visually in a staging n8n instance โ†’ n8n export:workflow --separate --output=apps/automations/workflows/ โ†’ commit JSON โ†’ PR review.
  • Rollback: Git history allows recovery of previous workflow versions. Re-run the import step against the target commit.

Licensing โ€‹

WARNING

n8n Community Edition uses the Sustainable Use License (not MIT/Apache). This permits free self-hosting for internal use but restricts offering n8n as a managed service to third parties. If n8n becomes operator-facing (i.e., tenants author their own automations), you require a commercial license negotiation with n8n GmbH. See n8n.io/pricing.


BullMQ + NestJS โ€” Durable Job Queues โ€‹

Purpose โ€‹

Background processing, scheduled tasks, and the "virtual employee" agentic pattern requiring human-in-the-loop approval.

Intended Use Cases โ€‹

  • Background jobs: PDF generation, email dispatch, heavy AI inference, Legacy Data ETL (Magic Upload)
  • Scheduled tasks: Cron-based proactive agent checks (e.g., daily cost anomaly scan)
  • Human-in-the-loop agent workflows:
    • Event/schedule triggers agent analysis โ†’ agent proposes action
    • Workflow pauses; user reviews and approves/rejects proposal
    • Agent executes; workflow pauses again for output review
    • User accepts result or requests revision

Architecture โ€‹

  • Already part of the infrastructure (see infrastructure.md, ยง3)
  • Redis as message broker (in-Swarm)
  • NestJS workers: same codebase, HTTP disabled, consume jobs from Redis
  • Workflow state persisted in PostgreSQL (WorkflowInstance entity) for durability across restarts
  • REST endpoints on the API layer for human review actions (POST /workflows/:id/review)

Rate Limiting & Error Handling for CPaaS Dispatch โ€‹

The DispatchWorker (BullMQ processor for message dispatch) enforces per-tenant rate limiting with provider-aware error handling:

  • Per-tenant rate limiting: Each tenant holds an isolated in-memory token bucket per channel (WhatsApp, Email, SMS). WhatsApp refill rate adjusts by the tenant's messaging_tier (Tier 1: 1 msg/sec, Tier 4: 50 msg/sec).
  • Error classification: Each CPaaS adapter (Meta, SES, SNS) classifies errors as transient (BullMQ retries with exponential backoff) or permanent (triggers fallback chain to next channel).
  • Circuit breaker: The NestJS โ†’ n8n webhook call uses opossum circuit breaking. On circuit open, payloads route to a notifications_recovery BullMQ queue.

For the complete error classification table, adapter implementations, and fallback chain logic, see notification-pipeline-protocol.md ยง4.

Key Design Decisions โ€‹

  • State persistence in PostgreSQL, not Redis: Workflow instances may wait days for human review; Redis is ephemeral. BullMQ handles queueing; Postgres owns state.
  • No BPMN: We define workflows in TypeScript code, not XML/visual diagrams. Keeps everything in a single language, version-controlled, and unit-testable.
  • Bull Board (@bull-board/nestjs) provides queue monitoring out of the box.

Boundary Rules โ€‹

SignalRoute to
Something changed in the DB / time-based scheduleHasura Event / Cron Trigger โ†’ NestJS webhook
Finite states with defined transitions (UI or domain)XState
External API calls, webhook chains, visual automationn8n
Async background work, retries, scheduled jobsBullMQ
Agent task with human approval gatesBullMQ + PostgreSQL workflow state
  • Hasura triggers are the primary entry point. DB changes and schedules flow through Hasura, not BullMQ cron or application-level polling.
  • n8n may enqueue BullMQ jobs (e.g., n8n webhook triggers a heavy AI task via Redis). The reverse should not happen โ€” NestJS workers do not call n8n.
  • XState machines may live inside BullMQ processors when a job follows a complex state lifecycle internally.

Reference Examples โ€‹

1. "Group Booking" Negotiator โ€‹

Problem: A school emails asking for a quote for 3 buses, 2-day trip to Berlin. A human currently checks availability, calculates costs, and drafts a PDF quote manually.

StepToolWhy
Email arrivesn8n (webhook)External channel ingestion
Parse email, extract intent via AIn8nAPI integration chain (LLM call)
Enqueue agent job with structured datan8n โ†’ BullMQHeavy work ahead
Check bus + driver availabilityBullMQ processorDB queries, domain logic
Calculate cost breakdownBullMQ processorComplex domain rules
Draft PDF quoteBullMQ processorSlow, retryable
Pause for dispatcher reviewBullMQ โ†’ pending_reviewHuman-in-the-loop
Quote lifecycle in UIXStatedraft โ†’ pending_review โ†’ approved โ†’ sent
Dispatcher approves โ†’ send quoteBullMQ โ†’ n8nOutbound email + PDF

2. "Driver Compliance" Monitor โ€‹

Problem: EU regulation VO (EG) 561/2006 strictly governs driving/resting hours. Tracking is tedious and error-prone.

StepToolWhy
Daily scan at 06:00Hasura Cron TriggerScheduled trigger
For each driver: calculate remaining legal hoursBullMQ processorDomain logic, per-driver fan-out
Flag violations or upcoming limitsBullMQ processorRule evaluation
Notify dispatcher of violationsBullMQ โ†’ n8n (optional)Outbound alert (Slack/email) if not in-app
Dispatcher acknowledges alertBullMQ โ†’ pending_reviewHuman-in-the-loop
Driver compliance status in UIXStatecompliant โ†’ warning โ†’ violation โ†’ acknowledged

3. "Pre-Trip Logistics" Coordinator โ€‹

Problem: 48 hours before a multi-bus tour, passengers need reminders and drivers need manifests.

StepToolWhy
Tour status updated to pre_tripHasura Event TriggerReactive trigger on DB change
Enqueue T-48h job (or fire immediately)NestJS webhook โ†’ BullMQ or n8nDepends on timing
Fetch passenger list, generate remindersn8nMulti-channel comms (WhatsApp, email)
Send batch messagesn8nExternal API calls (WhatsApp Business, SES)
Generate + send driver manifests (PDF)n8nExternal comms chain
Mark pre-trip checklist completedBullMQStatus tracking
Tour lifecycle in UIXStateplanned โ†’ pre_trip_sent โ†’ ready โ†’ active

4. "Legacy Data ETL" Pipeline โ€‹

Problem: A new tenant uploads a 50k-row CSV of past customers from their legacy ERP in Windows-1252 ANSI format, requiring deterministic mapping to Busflow GraphQL schemas without crushing memory or risking an all-or-nothing rollback.

StepToolWhy
User drops CSV into UIDirect GraphQLSaves S3 key to import_jobs
Hasura Event Trigger firesHasura โ†’ NestJSBridges DB insert to queuing system
Enqueue magic-etl-queue jobNestJS โ†’ BullMQDecoupled background processing
Stream+Decode CSV from S3BullMQ processorMemory-safe ingestion
Map chunks to GraphQL formatBullMQ processorDomain logic implementation
GraphQL Batch MutationsBullMQ processorFast bulk inserts (insert_table)
Dead Letter Queue insertionBullMQ processorCaptures validation errors gracefully
Cleanup S3 BlobBullMQ processorGDPR hygiene requirement

Pattern Summary โ€‹

Use CaseTriggerPrimary Drivern8nBullMQXState
Group Bookingn8n (email webhook)BullMQ (agent + approval)Email I/O, AI parseCore orchestrationQuote lifecycle
Driver ComplianceHasura CronBullMQ (domain logic)Alert delivery onlyCore logic + schedulingCompliance status
Pre-Trip CoordinatorHasura Eventn8n (external comms)Core orchestrationScheduled trigger, statusTour lifecycle
Legacy Data ETLHasura EventBullMQ (ETL processor)โ€”Core orchestrationโ€”

Rejected Alternatives โ€‹

Camunda โ€‹

  • What: Enterprise BPMN process orchestration engine.
  • Why considered: Complex business processes (pre/post-tour cost breakdowns, compliance, multi-step agent approval flows).
  • Why rejected:
    • JVM runtime โ€” adds a second language ecosystem and deployment target to a TypeScript stack.
    • Requires BPMN XML modeling; no team members benefit from visual process design at current scale.
    • Massive infrastructure overhead for a 2-node Docker Swarm.
    • The workflows in scope (cost calculations, agent approvals) are better expressed as domain logic + durable job queues than as orchestrated BPMN processes.
    • Revisit trigger: Enterprise customers demanding ISO-auditable process definitions, or >20 distinct multi-party approval workflows.

Inngest / Trigger.dev โ€‹

  • What: Managed TypeScript-native durable workflow platforms.
  • Why considered: Elegant step.waitForEvent() API for human-in-the-loop, zero infrastructure management.
  • Why rejected:
    • Managed cloud conflicts with self-hosted Hetzner strategy (data sovereignty, GDPR, cost control).
    • Inngest self-hosted option is less mature than the cloud offering.
    • BullMQ + NestJS achieves the same durability guarantees with infrastructure already in the stack (Redis, Postgres, NestJS workers).
    • Revisit trigger: If BullMQ workflow boilerplate becomes a maintenance burden across >5 distinct workflow types.

Temporal โ€‹

  • What: Open-source durable execution framework (Go/Java/TypeScript SDKs).
  • Why not evaluated in depth: Similar trade-off profile to Camunda โ€” powerful but heavy. Requires dedicated Temporal Server cluster. Overkill for current scale; same problems solved more simply by BullMQ.

Internal documentation โ€” Busflow