ADR-032: Collaborative Trip Planning via CRDTs
Date: 2026-05-04
Status: Proposed [future → Phase 2+]
Deciders: Julian Brüning
Context
Busflow's trip planning product does not exist yet. As we design it, a fundamental architecture decision arises: should multiple users be able to edit the same tour plan concurrently, and if so, how?
Two product scenarios motivate this question:
- Multi-dispatcher template editing: A small office (2–3 dispatchers) builds and maintains tour templates. Without collaboration support, dispatchers must work sequentially: one edits, saves, then the next takes over. Collaboration enables simultaneous work on the same template.
- Charter collaboration portal: An operator and a group leader (e.g., a club board member or a school teacher) jointly define boarding points and stopovers for a charter trip. A live, shared planning surface replaces email round-trips. (See STRATEGY_future-ideas.md §3.)
The trip planning data model does not exist in the schema yet. A working hypothesis for the structure:
```
TourTemplate
└─ Day[]
   └─ Stop[]
      ├─ Timing constraints
      └─ Supplier references (FK → Allotment, Hotel, etc.)
```

The exact depth and entity set remain open design decisions. The key architectural characteristic is that the data is hierarchical (nested ordered lists) and heterogeneous (different node types carry different schemas and FK constraints).
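To make the working hypothesis concrete, here is a minimal TypeScript sketch of the node types; names and fields are illustrative only, not a committed schema:

```ts
// Working hypothesis only: entity names and fields are illustrative.
interface TourTemplate {
  id: string;
  title: string;
  days: Day[];               // ordered list
}

interface Day {
  id: string;
  stops: Stop[];             // ordered list; stops can move between days
}

interface Stop {
  id: string;
  name: string;
  arrivalTime?: string;      // timing constraints, e.g. ISO 8601 strings
  departureTime?: string;
  supplierId?: string;       // plain UUID in the CRDT; FK enforced only at projection time
}
```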
Two fundamental approaches exist for enabling concurrent editing on this data:
- Hasura Subscription LWW (Last-Write-Wins): Use existing Hasura mutations + subscriptions. First write wins, subsequent editors receive a conflict notification and resolve manually in the UI.
- CRDTs (Conflict-free Replicated Data Types): Use a CRDT library to enable automatic, conflict-free merging of concurrent edits, including offline edits.
Alternatives Evaluated
Alternative 1: Hasura Subscription LWW with UI-Led Conflict Resolution
The simpler approach: leverage the existing stack entirely.
```
User A edits Stop 3 → Hasura mutation → DB write
                                        ↓ Subscription fires
User B sees notification: "Sabine changed Stop 3. [Reload] [Keep mine] [Merge]"
```

Advantages:
- Zero new dependencies: uses Hasura mutations + subscriptions as-is
- Ships fast (~40h for basic implementation)
- Simple mental model: "you were too slow, here's what changed"
- Sufficient for Year 1 operating scale. Most Busflow target customers are 1–3 person offices. Sequential editing with conflict notifications covers the common case where dispatchers rarely edit the same template simultaneously
Disadvantages (emerge at scale / in advanced scenarios):
- Non-conflicting edits are still flagged as conflicts. If User A edits the stop name and User B edits the stop time, the first-write-wins model discards B's change or forces a manual merge, even though the edits do not conflict
- Offline reconnect produces large diffs. A dispatcher who edits a tour plan offline for an extended period returns to a "your version vs. server version" screen requiring manual reconciliation
- Array operations are destructive. Drag-and-drop reordering of stops overwrites the entire array. If User B reorders stops while User A edits a stop's name, one change can silently overwrite the other
- Charter portal UX degrades. Two people collaborating in real-time experience a "take turns" workflow rather than genuine co-editing
Alternative 2: CRDT-Based Collaborative Editing
Use a CRDT library to maintain a shared, conflict-free document state.
Advantages:
- True concurrent editing: non-conflicting changes merge automatically
- Offline-first by design: edits accumulate locally and merge seamlessly on reconnect
- Presence awareness (cursors, selections, "User X edits Day 3") becomes a natural extension
- Full edit history enables point-in-time reconstruction (GoBD-relevant)
Disadvantages:
- New library dependency with no multi-tenant SaaS production references
- CRDT state lives as opaque binary blobs that are not directly queryable via SQL
- Requires a custom projection layer to keep the relational model in sync
- Introduces a second real-time channel alongside Hasura subscriptions
- Hierarchical tree-move operations (reordering days, moving stops between days) remain an active research problem in the CRDT community
Decision
Pursue CRDT-based collaborative trip planning. The Hasura LWW approach remains a viable product fallback if the CRDT integration proves architecturally infeasible after the spike phase.
Technology Candidates
| Criterion | Loro | Yjs | Automerge |
|---|---|---|---|
| Tree-move operations | ✅ Native Tree type with cycle-safe moves | ❌ Delete+reinsert (loses node identity) | ❌ Delete+reinsert |
| History retention | ✅ Full history by default | ⚠️ GC by default (lossy) | ✅ Full history by default |
| Maturity | ⚠️ Young (< 2 years, small community) | ✅ Battle-tested, large ecosystem | ✅ Mature, well-researched |
| Performance | ✅ Rust/WASM, efficient | ✅ Optimized JS, fast | ⚠️ Heavier, WASM available |
| GoBD fit | ✅ Full history helps audit trail | ⚠️ GC conflicts with audit | ✅ Full history helps audit trail |
| Ecosystem | ⚠️ Small | ✅ ProseMirror, Quill, Monaco bindings | ⚠️ Moderate |
Primary candidate: Loro. Native tree-move operations are essential for the trip planning data model (reordering days, moving stops between days). Neither Yjs nor Automerge solves this natively.
Fallback candidate: Yjs. If Loro's maturity proves insufficient, Yjs offers the largest ecosystem and battle-tested reliability. Tree-move operations would require a custom application-layer implementation on top of Y.Map/Y.Array.
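To make the Phase 1 evaluation concrete, here is a minimal sketch of the Day → Stop hierarchy against Loro's movable-tree API. It assumes the loro-crdt 1.x surface (`LoroDoc`, `getTree`, `createNode`, `node.move`, `export`); exact method names and signatures must be re-verified against the Loro docs during the spike.

```ts
import { LoroDoc } from "loro-crdt";

// Assumed loro-crdt 1.x API; method names to be confirmed during the spike.
const doc = new LoroDoc();
const tree = doc.getTree("plan");          // movable tree container

const day1 = tree.createNode();            // Day 1 (root-level node)
day1.data.set("type", "day");
const day2 = tree.createNode();            // Day 2
day2.data.set("type", "day");

const stop = day2.createNode();            // Stop created under Day 2
stop.data.set("type", "stop");
stop.data.set("name", "City tour");
stop.data.set("supplier_id", "<supplier uuid>"); // plain string; no FK semantics in the CRDT

// The evaluation target: move the stop from Day 2 to Day 1 as a native tree
// operation that preserves node identity under concurrent edits.
stop.move(day1, 0);

doc.commit();
const snapshot = doc.export({ mode: "snapshot" }); // candidate bytea payload for crdt_documents
```

If Loro falls through, the same scenario would be rebuilt on Yjs with an application-level move (delete + reinsert plus identity bookkeeping), which is exactly the trade-off the spike should measure.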
The technology choice is an explicit output of Phase 1 (spike). The decision above is a starting hypothesis, not a commitment.
Architecture
Dual-Layer Model
The architecture separates collaborative editing state from authoritative domain state:
```
┌───────────────────────────────────────────────────────────────┐
│ Workspace UI                                                  │
│                                                               │
│   ┌────────────────────┐         ┌─────────────────────────┐ │
│   │ CRDT Document      │◄───────►│ Collaboration Gateway   │ │
│   │ (Loro/Yjs)         │   WS    │ (NestJS WebSocket)      │ │
│   │                    │         │                         │ │
│   │ Day[], Stop[]      │         │ • tenant_id validation  │ │
│   │                    │         │ • permission enforcement│ │
│   └────────────────────┘         │ • operation relay       │ │
│                                  │ • operation journal     │ │
│                                  └────────────┬────────────┘ │
└───────────────────────────────────────────────┼───────────────┘
                                                │
                                 ┌──────────────▼──────────────┐
                                 │ Projection Service          │
                                 │ (NestJS)                    │
                                 │                             │
                                 │ • debounce (2s idle)        │
                                 │ • doc.toJSON() → diff       │
                                 │ • FK validation             │
                                 │ • SQL write (transaction)   │
                                 │ • change_events INSERT      │
                                 └──────────────┬──────────────┘
                                                │
                                 ┌──────────────▼──────────────┐
                                 │ PostgreSQL                  │
                                 │                             │
                                 │ tour_templates, days, stops │
                                 │ (authoritative relational   │
                                 │  model for downstream       │
                                 │  systems: pricing, booking, │
                                 │  dispatch)                  │
                                 │                             │
                                 │ + crdt_documents (bytea,    │
                                 │   projection_status,        │
                                 │   last_projected_frontier)  │
                                 │                             │
                                 │ + published_versions        │
                                 │   (frontier checkpoint,     │
                                 │   snapshot, immutable)      │
                                 └─────────────────────────────┘
```

Collaborative layer (CRDT): Each TourTemplate gets a CRDT document containing the hierarchical Day → Stop structure. The CRDT handles conflict-free merging during concurrent editing sessions.
Authoritative layer (PostgreSQL): The relational model (tour_templates, days, stops) remains the source of truth for all downstream systems: the pricing engine (AP6), the booking pipeline, the dispatch board. These systems never read from the CRDT directly. They read only from published versions.
Projection Service
The projection service bridges the CRDT and relational layers. This is the most complex component in the architecture.
When does projection fire?
- Not on every keystroke. A user editing a stop's description generates dozens of operations per second. Projecting each one produces unacceptable write amplification (WAL bloat, transaction overhead)
- Debounced: 2 seconds after the last operation. The projection service buffers CRDT merge events and projects only when editing activity pauses. The relational model lags up to 2 seconds behind the CRDT during active editing, which is acceptable because no downstream system reads trip plan data while it is actively being edited (a minimal trigger sketch follows this list)
- On document close. When the last editor disconnects, the gateway triggers a final projection
- On explicit publish. When a dispatcher clicks "Veröffentlichen," the system forces an immediate projection before creating the published version (see §Versioning below)
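A minimal sketch of the debounce trigger described above; `projectDocument` is a hypothetical name for the projection pass detailed below:

```ts
// Debounced projection trigger: reset a per-document idle timer on every
// incoming CRDT merge event; project only after 2s of inactivity.
const IDLE_MS = 2_000;
const timers = new Map<string, NodeJS.Timeout>();

declare function projectDocument(docId: string): Promise<void>; // hypothetical entry point

export function onCrdtMerge(docId: string): void {
  clearTimeout(timers.get(docId));
  timers.set(
    docId,
    setTimeout(() => {
      timers.delete(docId);
      void projectDocument(docId);
    }, IDLE_MS),
  );
}

// Document close and explicit publish bypass the debounce entirely.
export async function flushProjection(docId: string): Promise<void> {
  clearTimeout(timers.get(docId));
  timers.delete(docId);
  await projectDocument(docId);
}
```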
Projection mechanism:
- Call `doc.toJSON()` to extract the current CRDT state as a plain JSON object
- Diff the extracted JSON against the last projected state (stored alongside the CRDT blob)
- Generate SQL `INSERT`/`UPDATE`/`DELETE` operations from the diff
- Execute all SQL operations + update `crdt_documents.last_projected_frontier` + insert `change_events` entries in a single PostgreSQL transaction (a sketch of one pass follows this list)
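A sketch of a single projection pass under these assumptions. The helpers `loadDocument` and `diffTripPlan` are hypothetical, the `change_events` columns are abbreviated (attribution columns are added in Phase 3), and only the transactional shape is the point:

```ts
import { Pool } from "pg";

// Hypothetical helpers: load the CRDT doc + last projected state, and diff
// two plain-JSON trip plans into row-level changes.
declare function loadDocument(docId: string): Promise<{
  doc: { toJSON(): unknown; frontiers(): unknown };
  lastProjectedState: unknown;
}>;
declare function diffTripPlan(prev: unknown, next: unknown): Array<{
  sql: string;                 // generated INSERT/UPDATE/DELETE
  params: unknown[];
  entity: string;
  entityId: string;
  op: "INSERT" | "UPDATE" | "DELETE";
  oldValues: unknown;
  newValues: unknown;
}>;

const pool = new Pool();

export async function projectDocument(docId: string): Promise<void> {
  const { doc, lastProjectedState } = await loadDocument(docId);
  const current = doc.toJSON();                              // plain JSON view of the CRDT
  const changes = diffTripPlan(lastProjectedState, current); // structural diff

  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    for (const c of changes) {
      await client.query(c.sql, c.params);
      // Abbreviated audit entry; attribution columns (user_id, workspace_id) come from Phase 3.
      await client.query(
        `INSERT INTO change_events (entity, entity_id, operation_type, old_values, new_values)
         VALUES ($1, $2, $3, $4, $5)`,
        [c.entity, c.entityId, c.op, c.oldValues, c.newValues],
      );
    }
    await client.query(
      `UPDATE crdt_documents
          SET last_projected_frontier = $2, projection_status = 'ok'
        WHERE id = $1`,
      [docId, JSON.stringify(doc.frontiers())],
    );
    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK"); // CRDT blob stays intact; retry handles the failure
    throw err;
  } finally {
    client.release();
  }
}
```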
Failure handling:
- If projection fails, the CRDT blob is intact, so no data is lost. The projection service retries with exponential backoff (3 attempts); a sketch of the retry policy follows this list
- After exhausting retries, `crdt_documents.projection_status` is set to `'stale'`
- A reconciliation cron (every 5 minutes) picks up stale documents and re-projects them
- Downstream systems check `projection_status` before reading. If stale, they either wait for reconciliation or force a synchronous projection
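A compact sketch of this retry/stale-marking policy; the backoff parameters and the `markStale` helper are illustrative:

```ts
// Retry the projection a fixed number of times, then flag the document for
// the reconciliation cron. The CRDT blob is never touched on failure.
declare function projectDocument(docId: string): Promise<void>;
declare function markStale(docId: string): Promise<void>; // sets projection_status = 'stale'

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

export async function projectWithRetry(docId: string, attempts = 3): Promise<void> {
  for (let i = 0; i < attempts; i++) {
    try {
      await projectDocument(docId);
      return;
    } catch {
      if (i < attempts - 1) await sleep(1_000 * 2 ** i); // 1s, 2s exponential backoff
    }
  }
  await markStale(docId); // picked up by the 5-minute reconciliation cron
}
```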
FK constraint validation: CRDT documents have no concept of foreign keys. A stop references a supplier (supplier_id), but the CRDT stores this as a plain UUID string. Referential integrity is enforced at the projection layer:
- When projecting a stop, the service validates that `supplier_id` references an existing supplier record
- If the FK target does not exist (e.g., the supplier was deleted between the CRDT edit and the projection), the projection writes the stop with `supplier_id = NULL` and logs a `projection_warning` (a sketch of this check follows the list)
- The editing UI pre-constrains selections via dropdowns (fetched from Hasura), so FK violations are rare edge cases, but they must be handled because the CRDT merges operations without validation
- Open question: how to handle cascading FK scenarios (e.g., a supplier deletion that invalidates stops across multiple tour templates currently being edited collaboratively)
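A sketch of the supplier check at projection time; the `suppliers` and `projection_warnings` table names are assumptions standing in for the real schema:

```ts
import type { PoolClient } from "pg";

// Resolve a supplier reference from the CRDT to a valid FK, or NULL + warning.
export async function resolveSupplierFk(
  client: PoolClient,
  tenantId: string,
  supplierId: string | null,
): Promise<string | null> {
  if (!supplierId) return null;

  const { rowCount } = await client.query(
    "SELECT 1 FROM suppliers WHERE id = $1 AND tenant_id = $2",
    [supplierId, tenantId],
  );
  if (rowCount && rowCount > 0) return supplierId;

  // FK target vanished between the CRDT edit and the projection:
  // write NULL and record a projection warning for follow-up.
  await client.query(
    `INSERT INTO projection_warnings (tenant_id, kind, details)
     VALUES ($1, 'missing_supplier_fk', $2)`,
    [tenantId, { supplierId }],
  );
  return null;
}
```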
Schema migration:
- CRDT documents are schema-less (JSON-like). When the relational schema adds a column (e.g., `stops.accessibility_rating`), the CRDT does not need a migration. The projection service maps missing CRDT fields to column defaults. The next time a user opens the editor, the UI shows the new field
- When the relational schema removes a column, the projection service stops reading that key from the CRDT. The key persists in the CRDT blob (harmless) but is no longer projected
Versioning and Publish Lifecycle
CRDTs merge continuously; they have no concept of "done." The business requires discrete lifecycle states for tour templates: draft → published → locked.
Draft mode (default): The CRDT document is live and editable. Multiple dispatchers collaborate freely. The relational projection updates on a debounced basis. No downstream system (pricing, booking) consumes draft state.
Publish action: A dispatcher with manager or dispatcher role clicks "Veröffentlichen." The system:
- Forces an immediate, synchronous projection (CRDT → relational)
- Stores the current Loro `frontiers()` as an immutable version checkpoint in `published_versions`
- Snapshots the projected relational state (JSON) alongside the frontier for fast reconstruction
- Downstream systems (pricing engine, booking widget, dispatch board) reference the published version, not the live CRDT (a minimal publish sketch follows this list)
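A sketch of the publish transition, reusing the hypothetical projection helpers from above and assuming a 1:1 mapping between CRDT document and tour template; the `published_versions` columns are illustrative:

```ts
import { Pool } from "pg";

// Hypothetical helpers from the projection sketches above.
declare function flushProjection(docId: string): Promise<void>;
declare function loadDocument(docId: string): Promise<{
  doc: { toJSON(): unknown; frontiers(): unknown };
}>;

const pool = new Pool();

export async function publishTemplate(docId: string, userId: string): Promise<void> {
  await flushProjection(docId);                      // 1. force a synchronous projection

  const { doc } = await loadDocument(docId);
  const frontier = JSON.stringify(doc.frontiers());  // 2. immutable version checkpoint
  const snapshot = JSON.stringify(doc.toJSON());     // 3. projected state for fast reconstruction

  await pool.query(
    `INSERT INTO published_versions (tour_template_id, frontier, snapshot, published_by)
     VALUES ($1, $2, $3, $4)`,
    [docId, frontier, snapshot, userId],
  );
  // 4. Downstream systems resolve the template through published_versions,
  //    never through the live CRDT document.
}
```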
Post-publish editing: The CRDT document remains editable after publishing; it is a template, not a frozen artifact. The next publish creates a new version. Existing departures that reference a previous published version are unaffected.
Version history: Loro's `checkout(frontiers)` API enters a read-only detached mode showing the document at a historical point. The UI provides a version history sidebar where dispatchers can:
- Browse published versions with timestamps and author attribution
- Preview any version (read-only)
- Restore a historical version (creates a new CRDT state from the checkpoint, which becomes the new head)
Charter collaboration: The group leader cannot publish. Only the operator's dispatchers can transition from draft to published. The group leader edits in draft mode; the operator reviews and publishes.
Multi-Tenant Integration
CRDT operations bypass Hasura; they flow through a dedicated WebSocket gateway. This gateway becomes the tenant isolation enforcement point for collaborative editing:
- Every WebSocket connection authenticates via the same JWT that Hasura uses (`x-hasura-tenant-id`, `x-hasura-user-id`)
- The gateway validates that the requested CRDT document belongs to the authenticated tenant's workspace before accepting or relaying operations (a gateway sketch follows this list)
- The gateway extends the two-layer isolation model from ADR-004: Layer 1 = gateway-level tenant validation (analogous to Hasura permission rules), Layer 2 = CRDT documents stored in PostgreSQL with a `tenant_id` FK and RLS policies on `crdt_documents`
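A sketch of the gateway-side checks (Layer 1), assuming a socket.io transport, the standard Hasura JWT claims namespace, and hypothetical `docBelongsToTenant` / `journalOperation` helpers backed by `crdt_documents`:

```ts
import {
  ConnectedSocket,
  MessageBody,
  OnGatewayConnection,
  SubscribeMessage,
  WebSocketGateway,
} from "@nestjs/websockets";
import { Socket } from "socket.io";
import { verify } from "jsonwebtoken";

// Hypothetical helpers backed by crdt_documents and the operation journal.
declare function docBelongsToTenant(docId: string, tenantId: string): Promise<boolean>;
declare function journalOperation(docId: string, userId: string, update: Uint8Array): Promise<void>;

@WebSocketGateway({ namespace: "/collab" })
export class CollaborationGateway implements OnGatewayConnection {
  handleConnection(client: Socket): void {
    try {
      // Same JWT Hasura consumes; the claims namespace depends on the Hasura JWT config.
      const claims = verify(client.handshake.auth.token, process.env.JWT_SECRET!) as any;
      const hasura = claims["https://hasura.io/jwt/claims"];
      client.data.tenantId = hasura["x-hasura-tenant-id"];
      client.data.userId = hasura["x-hasura-user-id"];
    } catch {
      client.disconnect(true); // unauthenticated connections never reach a document
    }
  }

  @SubscribeMessage("join")
  async onJoin(@ConnectedSocket() client: Socket, @MessageBody() msg: { docId: string }) {
    // Layer 1: the document must belong to the caller's tenant before joining its room.
    if (await docBelongsToTenant(msg.docId, client.data.tenantId)) {
      await client.join(`doc:${msg.docId}`);
    }
  }

  @SubscribeMessage("crdt-op")
  async onOperation(
    @ConnectedSocket() client: Socket,
    @MessageBody() msg: { docId: string; update: Uint8Array },
  ) {
    if (!(await docBelongsToTenant(msg.docId, client.data.tenantId))) return;
    await journalOperation(msg.docId, client.data.userId, msg.update); // attribution journal
    client.to(`doc:${msg.docId}`).emit("crdt-op", msg);                // relay to peers only
  }
}
```

Field-level permission enforcement for charter collaborators would hook into the same relay path, inspecting operations before they are journaled and forwarded.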
Cross-tenant charter collaboration: The gateway supports explicit collaboration_invite tokens that grant a group leader scoped access to a specific CRDT document. The gateway enforces field-level permissions server-side: the group leader can edit stops and boarding points but not pricing fields, compliance data, or supplier references. The CRDT library itself has no concept of permissions β enforcement happens at the gateway before relaying operations to other peers.
Permission revocation: When a collaboration invite expires or the operator revokes access, the gateway stops relaying operations from the revoked user. Any operations the revoked user generated while offline are rejected on reconnect. This breaks CRDT convergence guarantees for the revoked user (their local state diverges), which is the correct behavior: a revoked user's edits should not propagate.
Real-Time Channel Exception
realtime-strategy.md §1 establishes: "All Live surfaces use Hasura's native subscription infrastructure. No custom WebSocket server for domain data."
AP18 introduces a second real-time channel: a NestJS WebSocket gateway for CRDT synchronization. This is an explicit, documented exception:
- Why Hasura subscriptions cannot serve this purpose: Hasura subscriptions poll the database at ~1s intervals. CRDT sync requires sub-100ms operation relay for a responsive editing experience. Hasura also cannot relay opaque CRDT binary operations; it operates on SQL query results
- Scope of the exception: The CRDT WebSocket carries collaborative editing state only (CRDT operations, presence/awareness). All other domain data (dispatch board, inbox, seat maps, dashboards) continues to use Hasura subscriptions exclusively
- Precedent: `realtime-strategy.md` already documents one exception: "BullMQ workflow escalations use a dedicated NestJS WebSocket Gateway." AP18 adds a second, scoped exception
Interaction with AP2 (Offline Sync Protocol)
AP18's CRDT sync and AP2's relational offline sync (ADR-017) coexist in the same application:
| Concern | AP2 (Relational Sync) | AP18 (CRDT Sync) |
|---|---|---|
| Data | Boarding events, onboard sales, expense receipts, incidents | Tour plan structure (days, stops) |
| Mechanism | RxDB queue → POST /api/sync/batch → server-wins conflict resolution | CRDT local state → WebSocket relay → mathematical merge |
| Conflict model | Server wins (explicit) | Conflict-free (mathematical) |
| Offline | Queue mutations, replay on reconnect | Accumulate CRDT ops, merge on reconnect |
| Materialization | Immediate (server processes batch on receipt) | Debounced (2s idle) or on explicit publish |
The two protocols operate on non-overlapping data sets. Connection recovery handles both streams independently: the AP2 batch sync and the AP18 CRDT merge proceed in parallel without ordering constraints.
The only identified cross-protocol interaction surface is driver annotations during tour execution. A driver might add a field note to the trip plan (AP18 CRDT) triggered by an operational event like a delay (AP2 domain). These flow through separate protocols on the same device; no protocol-level coupling exists. If automated triggers become necessary in the future (e.g., a boarding event auto-generates a trip plan annotation), the integration runs server-side via domain events consumed by the projection service, not as a client-side protocol dependency.
The publish lifecycle further isolates the protocols: AP2 data (boarding events, sales) flows against the published relational snapshot, never against the live CRDT state. The two data flows only meet in the relational model after an explicit publish action.
Research Phases
Phase 1: Technology Evaluation Spike
- Implement the hierarchical trip plan data model (Day → Stop) in Loro and Yjs
- Evaluate move-operation semantics: reorder stops within a day, move a stop from Day 2 to Day 1
- Concurrent edit scenarios: two users edit different fields on the same stop, two users reorder the same day's stops simultaneously
- Benchmarks: document size after 1,000 edits, memory usage, merge latency
- Output: Technology decision (Loro vs. Yjs) with documented trade-offs
Phase 2: Multi-Tenant Integration + Projection
- Build the collaboration gateway (NestJS WebSocket, tenant validation, operation relay, operation journal)
- Implement the debounced projection service (CRDT `toJSON()` → diff → transactional SQL write)
- FK constraint validation at projection layer (supplier references, allotment links)
- Cross-tenant charter collaboration with scoped field-level permissions
- Dual-sync coexistence with AP2 (both protocols active in the Driver Hub)
- Output: Working prototype with tenant-isolated collaborative editing and relational projection
Phase 3: Versioning + GoBD Compliance Layer
- Implement the draft → published lifecycle with Loro `frontiers()` checkpoints
- Derive an append-only audit trail from the projection service's diff output (not raw CRDT operations)
- Each `change_events` entry carries `user_id`, `workspace_id`, `timestamp`, `operation_type`, `old_values`, `new_values`
- Attribution: the collaboration gateway maintains an operation journal mapping CRDT operations to user IDs. The projection service consults this journal to attribute each diff entry to the originating user. For merge events combining multiple users' concurrent edits, the entry carries `action = 'CRDT_MERGE'` with a `correlation_id` referencing all contributing users (an entry-shape sketch follows this list)
- Point-in-time reconstruction: given a published version's frontier, `doc.checkout(frontier)` → `doc.toJSON()` reconstructs the exact state
- Validate the audit trail against the existing `change_events` pattern from ADR-019
- Output: GoBD-compliant audit trail over CRDT state with version history
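A sketch of the attributed entry shape derived from a projection diff; the `entity`/`entity_id` fields are assumptions following the polymorphic `change_events` pattern (ADR-019), the rest mirrors the fields listed above:

```ts
// Shape of one audit entry produced by the projection service.
interface ChangeEventEntry {
  user_id: string | null;              // null only when action = 'CRDT_MERGE'
  workspace_id: string;
  timestamp: string;                    // ISO 8601
  entity: "tour_template" | "day" | "stop"; // assumed polymorphic target (ADR-019 style)
  entity_id: string;
  operation_type: "INSERT" | "UPDATE" | "DELETE";
  old_values: Record<string, unknown> | null;
  new_values: Record<string, unknown> | null;
  action?: "CRDT_MERGE";                // merge of concurrent edits from multiple users
  correlation_id?: string;              // references all contributing users' operations
}
```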
Phase 4: Semantic Conflict UX
- CRDTs resolve data conflicts automatically but not intent conflicts. When concurrent operations produce a technically valid but semantically surprising result, the UI must communicate what happened and offer recovery
- Design and test UX patterns for domain-specific conflict scenarios: concurrent stop reordering, delete-while-editing, concurrent field edits on the same stop
- Implement version history UI with Loro `checkout()` for preview and restore
- Evaluate fine-grained locking (optional field-level locks for high-stakes fields like pricing) as a prevention mechanism
- Output: Semantic conflict UX guidelines for domain-specific collaborative editing
Consequences
Positive
- True concurrent editing: dispatchers see each other's changes in real time without "take turns" UX
- Offline trip planning with automatic, conflict-free merge on reconnect (a capability no competitor offers)
- Charter collaboration portal becomes a natural extension: scoped cross-tenant access to a shared CRDT document
- Passenger-facing live itinerary as a downstream benefit: the relational projection naturally feeds a read-only itinerary view via standard Hasura subscriptions (no CRDT involvement on the passenger side)
- Version history derived from Loro's frontier checkpoints provides point-in-time reconstruction and a clear draft → published lifecycle
- Audit trail derived from projection-time diffs provides GoBD-relevant traceability using the existing `change_events` pattern (ADR-019)
- Architectural alignment with AP2 (both pursue distributed data consistency, different mechanisms)
Negative
- New library dependency (Loro or Yjs); neither has production references in multi-tenant SaaS architectures
- Dual sync protocol increases operational complexity (AP2 relational + AP18 CRDT)
- Custom projection layer required to keep the relational model synchronized with the CRDT state. This projection layer is a maintenance burden with no standard tooling
- FK constraint validation at the projection layer adds complexity: the CRDT merges operations without referential integrity awareness, so the projection must validate and handle broken references
- Second real-time channel (CRDT WebSocket) breaks the Hasura-only rule, adding infrastructure to operate
- Loro maturity risk: the library is less than 2 years old with a small community. Mitigated by keeping Yjs as a fallback
- Semantic conflict UX requires domain-specific design: CRDTs resolve data conflicts automatically but can produce surprising results (e.g., concurrent stop reordering). The UI must communicate these outcomes to non-technical dispatchers
Risks
| Risk | Likelihood | Mitigation |
|---|---|---|
| Loro proves too immature for production | Medium | Yjs fallback; technology decision is an explicit Phase 1 output |
| CRDT document size grows beyond memory budget for large tour templates (14 days × 5 stops) | Low–Medium | Evaluate document splitting (one CRDT doc per Day instead of per TourTemplate); GC strategies if using Yjs |
| Relational projection drifts from CRDT state | Medium | Debounced projection with transactional writes; reconciliation cron detects and alerts on drift |
| FK constraint violations during projection (e.g., supplier deleted while CRDT edit references it) | Medium | Projection-layer validation; UI pre-constrains via dropdowns; cascading FK scenarios remain an open question |
| GoBD attribution model for concurrent CRDT merge events lacks legal precedent | High | Requires compliance/legal review before production deployment; Phase 3 output |
| Semantic conflicts produce surprising UX outcomes for non-technical dispatchers | Medium | Phase 4 UX research; toast notifications, version history, and optional fine-grained field locking |
| CRDT integration proves architecturally infeasible | Low | Fall back to Hasura LWW with UI-led conflict resolution (see Alternative 1). The Hasura approach ships in ~40h and serves 80% of use cases |
References
- Kleppmann et al., "A highly-available move operation for replicated trees" (IEEE TPDS, 2022)
- Kleppmann & Beresford, "A Conflict-Free Replicated JSON Datatype" (IEEE TPDS, 2017)
- Loro documentation: Movable Tree (2024)
- ADR-004: Multi-Tenant Data Isolation Strategy
- ADR-017: Offline Sync Protocol for Driver Hub
- ADR-019: Change Events Polymorphic Audit Trail
- realtime-strategy.md §1 Hasura as only real-time channel (exception documented here)
- STRATEGY_future-ideas.md Β§3 B2B Charter Collaboration Portal