ADR-032: Collaborative Trip Planning via CRDTs
Date: 2026-05-04
Status: Proposed [future → Phase 2+]
Deciders: Julian Brüning
Context
Busflow's trip planning product does not exist yet. As we design it, a fundamental architecture decision arises: should multiple users be able to edit the same tour plan concurrently, and if so, how?
Two product scenarios motivate this question:
- Multi-dispatcher template editing: A small office (2–3 dispatchers) builds and maintains tour templates. Without collaboration support, dispatchers must work sequentially: one edits, saves, then the next takes over. Collaboration enables simultaneous work on the same template.
- Charter collaboration portal: An operator and a group leader (e.g., a club board member or a school teacher) jointly define boarding points and stopovers for a charter trip. A live, shared planning surface replaces email round-trips. (See STRATEGY_future-ideas.md §3.)
The trip planning data model does not exist in the schema yet. A working hypothesis for the structure:
```
TourTemplate
└─ Day[]
   └─ Stop[]
      ├─ Timing constraints
      └─ Supplier references (FK → Allotment, Hotel, etc.)
```

The exact depth and entity set remain open design decisions. The key architectural characteristic is that the data is hierarchical (nested ordered lists) and heterogeneous (different node types carry different schemas and FK constraints).
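To make the working hypothesis concrete, here is a minimal TypeScript sketch of the node types; names and fields are illustrative only, not a committed schema:

```ts
// Working hypothesis only: entity names and fields are illustrative.
interface TourTemplate {
  id: string;
  title: string;
  days: Day[];               // ordered list
}

interface Day {
  id: string;
  stops: Stop[];             // ordered list; stops can move between days
}

interface Stop {
  id: string;
  name: string;
  arrivalTime?: string;      // timing constraints, e.g. ISO 8601 strings
  departureTime?: string;
  supplierId?: string;       // plain UUID in the CRDT; FK enforced only at projection time
}
```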
Two fundamental approaches exist for enabling concurrent editing on this data:
- Hasura Subscription LWW (Last-Write-Wins): Use existing Hasura mutations + subscriptions. First write wins, subsequent editors receive a conflict notification and resolve manually in the UI.
- CRDTs (Conflict-free Replicated Data Types): Use a CRDT library to enable automatic, conflict-free merging of concurrent edits, including offline edits.
Alternatives Evaluated
Alternative 1: Hasura Subscription LWW with UI-Led Conflict Resolution
The simpler approach: leverage the existing stack entirely.
```
User A edits Stop 3 → Hasura mutation → DB write
                                        ↓ Subscription fires
User B sees notification: "Sabine changed Stop 3. [Reload] [Keep mine] [Merge]"
```

Advantages:
- Zero new dependencies: uses Hasura mutations + subscriptions as-is
- Ships fast (~40h for basic implementation)
- Simple mental model: "you were too slow, here's what changed"
- Sufficient for Year 1 operating scale. Most Busflow target customers are 1–3 person offices. Sequential editing with conflict notifications covers the common case where dispatchers rarely edit the same template simultaneously
Disadvantages (emerge at scale / in advanced scenarios):
- Non-conflicting edits are still flagged as conflicts. If User A edits the stop name and User B edits the stop time, the first-write-wins model discards B's change or forces a manual merge, even though the edits do not conflict
- Offline reconnect produces large diffs. A dispatcher who edits a tour plan offline for an extended period returns to a "your version vs. server version" screen requiring manual reconciliation
- Array operations are destructive. Drag-and-drop reordering of stops overwrites the entire array. If User B reorders stops while User A edits a stop's name, one change can silently overwrite the other
- Charter portal UX degrades. Two people collaborating in real-time experience a "take turns" workflow rather than genuine co-editing
Alternative 2: CRDT-Based Collaborative Editing
Use a CRDT library to maintain a shared, conflict-free document state.
Advantages:
- True concurrent editing: non-conflicting changes merge automatically
- Offline-first by design: edits accumulate locally and merge seamlessly on reconnect
- Presence awareness (cursors, selections, "User X edits Day 3") becomes a natural extension
- Full edit history enables point-in-time reconstruction (GoBD-relevant)
Disadvantages:
- New library dependency with no multi-tenant SaaS production references
- CRDT state lives as opaque binary blobs that are not directly queryable via SQL
- Requires a custom projection layer to keep the relational model in sync
- Introduces a second real-time channel alongside Hasura subscriptions
- Hierarchical tree-move operations (reordering days, moving stops between days) remain an active research problem in the CRDT community
Decision
Pursue CRDT-based collaborative trip planning. The Hasura LWW approach remains a viable product fallback if the CRDT integration proves architecturally infeasible after the spike phase.
Technology Candidates
| Criterion | Loro | Yjs | Automerge |
|---|---|---|---|
| Tree-move operations | ✅ Native Tree type with cycle-safe moves | ❌ Delete+reinsert (loses node identity) | ❌ Delete+reinsert |
| History retention | ✅ Full history by default | ⚠️ GC by default (lossy) | ✅ Full history by default |
| Maturity | ⚠️ Young (< 2 years, small community) | ✅ Battle-tested, large ecosystem | ✅ Mature, well-researched |
| Performance | ✅ Rust/WASM, efficient | ✅ Optimized JS, fast | ⚠️ Heavier, WASM available |
| GoBD fit | ✅ Full history helps audit trail | ⚠️ GC conflicts with audit | ✅ Full history helps audit trail |
| Ecosystem | ⚠️ Small | ✅ ProseMirror, Quill, Monaco bindings | ⚠️ Moderate |
Primary candidate: Loro. Native tree-move operations are essential for the trip planning data model (reordering days, moving stops between days). Neither Yjs nor Automerge solves this natively.
Fallback candidate: Yjs. If Loro's maturity proves insufficient, Yjs offers the largest ecosystem and battle-tested reliability. Tree-move operations would require a custom application-layer implementation on top of Y.Map/Y.Array.
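To make the Phase 1 evaluation concrete, here is a minimal sketch of the Day → Stop hierarchy against Loro's movable-tree API. It assumes the loro-crdt 1.x surface (`LoroDoc`, `getTree`, `createNode`, `node.move`, `export`); exact method names and signatures must be re-verified against the Loro docs during the spike.

```ts
import { LoroDoc } from "loro-crdt";

// Assumed loro-crdt 1.x API; method names to be confirmed during the spike.
const doc = new LoroDoc();
const tree = doc.getTree("plan");          // movable tree container

const day1 = tree.createNode();            // Day 1 (root-level node)
day1.data.set("type", "day");
const day2 = tree.createNode();            // Day 2
day2.data.set("type", "day");

const stop = day2.createNode();            // Stop created under Day 2
stop.data.set("type", "stop");
stop.data.set("name", "City tour");
stop.data.set("supplier_id", "<supplier uuid>"); // plain string; no FK semantics in the CRDT

// The evaluation target: move the stop from Day 2 to Day 1 as a native tree
// operation that preserves node identity under concurrent edits.
stop.move(day1, 0);

doc.commit();
const snapshot = doc.export({ mode: "snapshot" }); // candidate bytea payload for crdt_documents
```

If Loro falls through, the same scenario would be rebuilt on Yjs with an application-level move (delete + reinsert plus identity bookkeeping), which is exactly the trade-off the spike should measure.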
The technology choice is an explicit output of Phase 1 (spike). The decision above is a starting hypothesis, not a commitment.
Architecture
Dual-Layer Model
The architecture separates collaborative editing state from authoritative domain state:
```
┌───────────────────────────────────────────────────────────────┐
│ Workspace UI                                                  │
│                                                               │
│   ┌────────────────────┐         ┌─────────────────────────┐ │
│   │ CRDT Document      │◄───────►│ Collaboration Gateway   │ │
│   │ (Loro/Yjs)         │   WS    │ (NestJS WebSocket)      │ │
│   │                    │         │                         │ │
│   │ Day[], Stop[]      │         │ • tenant_id validation  │ │
│   │                    │         │ • permission enforcement│ │
│   └────────────────────┘         │ • operation relay       │ │
│                                  │ • operation journal     │ │
│                                  └────────────┬────────────┘ │
└───────────────────────────────────────────────┼───────────────┘
                                                │
                                 ┌──────────────▼──────────────┐
                                 │ Projection Service          │
                                 │ (NestJS)                    │
                                 │                             │
                                 │ • debounce (2s idle)        │
                                 │ • doc.toJSON() → diff       │
                                 │ • FK validation             │
                                 │ • SQL write (transaction)   │
                                 │ • change_events INSERT      │
                                 └──────────────┬──────────────┘
                                                │
                                 ┌──────────────▼──────────────┐
                                 │ PostgreSQL                  │
                                 │                             │
                                 │ tour_templates, days, stops │
                                 │ (authoritative relational   │
                                 │  model for downstream       │
                                 │  systems: pricing, booking, │
                                 │  dispatch)                  │
                                 │                             │
                                 │ + crdt_documents (bytea,    │
                                 │   projection_status,        │
                                 │   last_projected_frontier)  │
                                 │                             │
                                 │ + published_versions        │
                                 │   (frontier checkpoint,     │
                                 │   snapshot, immutable)      │
                                 └─────────────────────────────┘
```

Collaborative layer (CRDT): Each TourTemplate gets a CRDT document containing the hierarchical Day → Stop structure. The CRDT handles conflict-free merging during concurrent editing sessions.
Authoritative layer (PostgreSQL): The relational model (tour_templates, days, stops) remains the source of truth for all downstream systems: the pricing engine (AP6), the booking pipeline, the dispatch board. These systems never read from the CRDT directly. They read only from published versions.
Projection Service
The projection service bridges the CRDT and relational layers. This is the most complex component in the architecture.
When does projection fire?
- Not on every keystroke. A user editing a stop's description generates dozens of operations per second. Projecting each one produces unacceptable write amplification (WAL bloat, transaction overhead)
- Debounced: 2 seconds after the last operation. The projection service buffers CRDT merge events and projects only when editing activity pauses. The relational model lags up to 2 seconds behind the CRDT during active editing, which is acceptable because no downstream system reads trip plan data while it is actively being edited (a minimal trigger sketch follows this list)
- On document close. When the last editor disconnects, the gateway triggers a final projection
- On explicit publish. When a dispatcher clicks "Veröffentlichen," the system forces an immediate projection before creating the published version (see §Versioning below)
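A minimal sketch of the debounce trigger described above; `projectDocument` is a hypothetical name for the projection pass detailed below:

```ts
// Debounced projection trigger: reset a per-document idle timer on every
// incoming CRDT merge event; project only after 2s of inactivity.
const IDLE_MS = 2_000;
const timers = new Map<string, NodeJS.Timeout>();

declare function projectDocument(docId: string): Promise<void>; // hypothetical entry point

export function onCrdtMerge(docId: string): void {
  clearTimeout(timers.get(docId));
  timers.set(
    docId,
    setTimeout(() => {
      timers.delete(docId);
      void projectDocument(docId);
    }, IDLE_MS),
  );
}

// Document close and explicit publish bypass the debounce entirely.
export async function flushProjection(docId: string): Promise<void> {
  clearTimeout(timers.get(docId));
  timers.delete(docId);
  await projectDocument(docId);
}
```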
Projection mechanism:
- Call `doc.toJSON()` to extract the current CRDT state as a plain JSON object
- Diff the extracted JSON against the last projected state (stored alongside the CRDT blob)
- Generate SQL `INSERT`/`UPDATE`/`DELETE` operations from the diff
- Execute all SQL operations + update `crdt_documents.last_projected_frontier` + insert `change_events` entries in a single PostgreSQL transaction (a sketch of one pass follows this list)
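A sketch of a single projection pass under these assumptions. The helpers `loadDocument` and `diffTripPlan` are hypothetical, the `change_events` columns are abbreviated (attribution columns are added in Phase 3), and only the transactional shape is the point:

```ts
import { Pool } from "pg";

// Hypothetical helpers: load the CRDT doc + last projected state, and diff
// two plain-JSON trip plans into row-level changes.
declare function loadDocument(docId: string): Promise<{
  doc: { toJSON(): unknown; frontiers(): unknown };
  lastProjectedState: unknown;
}>;
declare function diffTripPlan(prev: unknown, next: unknown): Array<{
  sql: string;                 // generated INSERT/UPDATE/DELETE
  params: unknown[];
  entity: string;
  entityId: string;
  op: "INSERT" | "UPDATE" | "DELETE";
  oldValues: unknown;
  newValues: unknown;
}>;

const pool = new Pool();

export async function projectDocument(docId: string): Promise<void> {
  const { doc, lastProjectedState } = await loadDocument(docId);
  const current = doc.toJSON();                              // plain JSON view of the CRDT
  const changes = diffTripPlan(lastProjectedState, current); // structural diff

  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    for (const c of changes) {
      await client.query(c.sql, c.params);
      // Abbreviated audit entry; attribution columns (user_id, workspace_id) come from Phase 3.
      await client.query(
        `INSERT INTO change_events (entity, entity_id, operation_type, old_values, new_values)
         VALUES ($1, $2, $3, $4, $5)`,
        [c.entity, c.entityId, c.op, c.oldValues, c.newValues],
      );
    }
    await client.query(
      `UPDATE crdt_documents
          SET last_projected_frontier = $2, projection_status = 'ok'
        WHERE id = $1`,
      [docId, JSON.stringify(doc.frontiers())],
    );
    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK"); // CRDT blob stays intact; retry handles the failure
    throw err;
  } finally {
    client.release();
  }
}
```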
Failure handling:
- If projection fails, the CRDT blob is intact, so no data is lost. The projection service retries with exponential backoff (3 attempts); a sketch of the retry policy follows this list
- After exhausting retries, `crdt_documents.projection_status` is set to `'stale'`
- A reconciliation cron (every 5 minutes) picks up stale documents and re-projects them
- Downstream systems check `projection_status` before reading. If stale, they either wait for reconciliation or force a synchronous projection
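A compact sketch of this retry/stale-marking policy; the backoff parameters and the `markStale` helper are illustrative:

```ts
// Retry the projection a fixed number of times, then flag the document for
// the reconciliation cron. The CRDT blob is never touched on failure.
declare function projectDocument(docId: string): Promise<void>;
declare function markStale(docId: string): Promise<void>; // sets projection_status = 'stale'

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

export async function projectWithRetry(docId: string, attempts = 3): Promise<void> {
  for (let i = 0; i < attempts; i++) {
    try {
      await projectDocument(docId);
      return;
    } catch {
      if (i < attempts - 1) await sleep(1_000 * 2 ** i); // 1s, 2s exponential backoff
    }
  }
  await markStale(docId); // picked up by the 5-minute reconciliation cron
}
```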
FK constraint validation: CRDT documents have no concept of foreign keys. A stop references a supplier (supplier_id), but the CRDT stores this as a plain UUID string. Referential integrity is enforced at the projection layer:
- When projecting a stop, the service validates that `supplier_id` references an existing supplier record
- If the FK target does not exist (e.g., the supplier was deleted between the CRDT edit and the projection), the projection writes the stop with `supplier_id = NULL` and logs a `projection_warning` (a sketch of this check follows the list)
- The editing UI pre-constrains selections via dropdowns (fetched from Hasura), so FK violations are rare edge cases, but they must be handled because the CRDT merges operations without validation
- Open question: how to handle cascading FK scenarios (e.g., a supplier deletion that invalidates stops across multiple tour templates currently being edited collaboratively)
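A sketch of the supplier check at projection time; the `suppliers` and `projection_warnings` table names are assumptions standing in for the real schema:

```ts
import type { PoolClient } from "pg";

// Resolve a supplier reference from the CRDT to a valid FK, or NULL + warning.
export async function resolveSupplierFk(
  client: PoolClient,
  tenantId: string,
  supplierId: string | null,
): Promise<string | null> {
  if (!supplierId) return null;

  const { rowCount } = await client.query(
    "SELECT 1 FROM suppliers WHERE id = $1 AND tenant_id = $2",
    [supplierId, tenantId],
  );
  if (rowCount && rowCount > 0) return supplierId;

  // FK target vanished between the CRDT edit and the projection:
  // write NULL and record a projection warning for follow-up.
  await client.query(
    `INSERT INTO projection_warnings (tenant_id, kind, details)
     VALUES ($1, 'missing_supplier_fk', $2)`,
    [tenantId, { supplierId }],
  );
  return null;
}
```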
Schema migration:
- CRDT documents are schema-less (JSON-like). When the relational schema adds a column (e.g., `stops.accessibility_rating`), the CRDT does not need a migration. The projection service maps missing CRDT fields to column defaults. The next time a user opens the editor, the UI shows the new field
- When the relational schema removes a column, the projection service stops reading that key from the CRDT. The key persists in the CRDT blob (harmless) but is no longer projected
Versioning and Publish Lifecycle
CRDTs merge continuously; they have no concept of "done." The business requires discrete lifecycle states for tour templates: draft → published → locked.
Draft mode (default): The CRDT document is live and editable. Multiple dispatchers collaborate freely. The relational projection updates on a debounced basis. No downstream system (pricing, booking) consumes draft state.
Publish action: A dispatcher with manager or dispatcher role clicks "Veröffentlichen." The system:
- Forces an immediate, synchronous projection (CRDT → relational)
- Stores the current Loro `frontiers()` as an immutable version checkpoint in `published_versions`
- Snapshots the projected relational state (JSON) alongside the frontier for fast reconstruction
- Downstream systems (pricing engine, booking widget, dispatch board) reference the published version, not the live CRDT (a minimal publish sketch follows this list)
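A sketch of the publish transition, reusing the hypothetical projection helpers from above and assuming a 1:1 mapping between CRDT document and tour template; the `published_versions` columns are illustrative:

```ts
import { Pool } from "pg";

// Hypothetical helpers from the projection sketches above.
declare function flushProjection(docId: string): Promise<void>;
declare function loadDocument(docId: string): Promise<{
  doc: { toJSON(): unknown; frontiers(): unknown };
}>;

const pool = new Pool();

export async function publishTemplate(docId: string, userId: string): Promise<void> {
  await flushProjection(docId);                      // 1. force a synchronous projection

  const { doc } = await loadDocument(docId);
  const frontier = JSON.stringify(doc.frontiers());  // 2. immutable version checkpoint
  const snapshot = JSON.stringify(doc.toJSON());     // 3. projected state for fast reconstruction

  await pool.query(
    `INSERT INTO published_versions (tour_template_id, frontier, snapshot, published_by)
     VALUES ($1, $2, $3, $4)`,
    [docId, frontier, snapshot, userId],
  );
  // 4. Downstream systems resolve the template through published_versions,
  //    never through the live CRDT document.
}
```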
Post-publish editing: The CRDT document remains editable after publishing; it is a template, not a frozen artifact. The next publish creates a new version. Existing departures that reference a previous published version are unaffected.
Version history: Loro's `checkout(frontiers)` API enters a read-only detached mode showing the document at a historical point. The UI provides a version history sidebar where dispatchers can:
- Browse published versions with timestamps and author attribution
- Preview any version (read-only)
- Restore a historical version (creates a new CRDT state from the checkpoint, which becomes the new head)
Charter collaboration: The group leader cannot publish. Only the operator's dispatchers can transition from draft to published. The group leader edits in draft mode; the operator reviews and publishes.
Multi-Tenant Integration
CRDT operations bypass Hasura; they flow through a dedicated WebSocket gateway. This gateway becomes the tenant isolation enforcement point for collaborative editing:
- Every WebSocket connection authenticates via the same JWT that Hasura uses (`x-hasura-tenant-id`, `x-hasura-user-id`)
- The gateway validates that the requested CRDT document belongs to the authenticated tenant's workspace before accepting or relaying operations (a gateway sketch follows this list)
- The gateway extends the two-layer isolation model from ADR-004: Layer 1 = gateway-level tenant validation (analogous to Hasura permission rules), Layer 2 = CRDT documents stored in PostgreSQL with a `tenant_id` FK and RLS policies on `crdt_documents`
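A sketch of the gateway-side checks (Layer 1), assuming a socket.io transport, the standard Hasura JWT claims namespace, and hypothetical `docBelongsToTenant` / `journalOperation` helpers backed by `crdt_documents`:

```ts
import {
  ConnectedSocket,
  MessageBody,
  OnGatewayConnection,
  SubscribeMessage,
  WebSocketGateway,
} from "@nestjs/websockets";
import { Socket } from "socket.io";
import { verify } from "jsonwebtoken";

// Hypothetical helpers backed by crdt_documents and the operation journal.
declare function docBelongsToTenant(docId: string, tenantId: string): Promise<boolean>;
declare function journalOperation(docId: string, userId: string, update: Uint8Array): Promise<void>;

@WebSocketGateway({ namespace: "/collab" })
export class CollaborationGateway implements OnGatewayConnection {
  handleConnection(client: Socket): void {
    try {
      // Same JWT Hasura consumes; the claims namespace depends on the Hasura JWT config.
      const claims = verify(client.handshake.auth.token, process.env.JWT_SECRET!) as any;
      const hasura = claims["https://hasura.io/jwt/claims"];
      client.data.tenantId = hasura["x-hasura-tenant-id"];
      client.data.userId = hasura["x-hasura-user-id"];
    } catch {
      client.disconnect(true); // unauthenticated connections never reach a document
    }
  }

  @SubscribeMessage("join")
  async onJoin(@ConnectedSocket() client: Socket, @MessageBody() msg: { docId: string }) {
    // Layer 1: the document must belong to the caller's tenant before joining its room.
    if (await docBelongsToTenant(msg.docId, client.data.tenantId)) {
      await client.join(`doc:${msg.docId}`);
    }
  }

  @SubscribeMessage("crdt-op")
  async onOperation(
    @ConnectedSocket() client: Socket,
    @MessageBody() msg: { docId: string; update: Uint8Array },
  ) {
    if (!(await docBelongsToTenant(msg.docId, client.data.tenantId))) return;
    await journalOperation(msg.docId, client.data.userId, msg.update); // attribution journal
    client.to(`doc:${msg.docId}`).emit("crdt-op", msg);                // relay to peers only
  }
}
```

Field-level permission enforcement for charter collaborators would hook into the same relay path, inspecting operations before they are journaled and forwarded.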
Cross-tenant charter collaboration: The gateway supports explicit collaboration_invite tokens that grant a group leader scoped access to a specific CRDT document. The gateway enforces field-level permissions server-side: the group leader can edit stops and boarding points but not pricing fields, compliance data, or supplier references. The CRDT library itself has no concept of permissions β enforcement happens at the gateway before relaying operations to other peers.
Permission revocation: When a collaboration invite expires or the operator revokes access, the gateway stops relaying operations from the revoked user. Any operations the revoked user generated while offline are rejected on reconnect. This breaks CRDT convergence guarantees for the revoked user (their local state diverges), which is the correct behavior: a revoked user's edits should not propagate.
Real-Time Channel Exception
realtime-strategy.md §1 establishes: "All Live surfaces use Hasura's native subscription infrastructure. No custom WebSocket server for domain data."
AP18 introduces a second real-time channel: a NestJS WebSocket gateway for CRDT synchronization. This is an explicit, documented exception:
- Why Hasura subscriptions cannot serve this purpose: Hasura subscriptions poll the database at ~1s intervals. CRDT sync requires sub-100ms operation relay for a responsive editing experience. Hasura also cannot relay opaque CRDT binary operations; it operates on SQL query results
- Scope of the exception: The CRDT WebSocket carries collaborative editing state only (CRDT operations, presence/awareness). All other domain data (dispatch board, inbox, seat maps, dashboards) continues to use Hasura subscriptions exclusively
- Precedent: `realtime-strategy.md` already documents one exception: "BullMQ workflow escalations use a dedicated NestJS WebSocket Gateway." AP18 adds a second, scoped exception
Interaction with AP2 (Offline Sync Protocol)
AP18's CRDT sync and AP2's relational offline sync (ADR-017) coexist in the same application:
| Concern | AP2 (Relational Sync) | AP18 (CRDT Sync) |
|---|---|---|
| Data | Boarding events, onboard sales, expense receipts, incidents | Tour plan structure (days, stops) |
| Mechanism | RxDB queue → POST /api/sync/batch → server-wins conflict resolution | CRDT local state → WebSocket relay → mathematical merge |
| Conflict model | Server wins (explicit) | Conflict-free (mathematical) |
| Offline | Queue mutations, replay on reconnect | Accumulate CRDT ops, merge on reconnect |
| Materialization | Immediate (server processes batch on receipt) | Debounced (2s idle) or on explicit publish |
The two protocols operate on non-overlapping data sets. Connection recovery handles both streams independently: the AP2 batch sync and the AP18 CRDT merge proceed in parallel without ordering constraints.
The only identified cross-protocol interaction surface is driver annotations during tour execution. A driver might add a field note to the trip plan (AP18 CRDT) triggered by an operational event like a delay (AP2 domain). These flow through separate protocols on the same device; no protocol-level coupling exists. If automated triggers become necessary in the future (e.g., a boarding event auto-generates a trip plan annotation), the integration runs server-side via domain events consumed by the projection service, not as a client-side protocol dependency.
The publish lifecycle further isolates the protocols: AP2 data (boarding events, sales) flows against the published relational snapshot, never against the live CRDT state. The two data flows only meet in the relational model after an explicit publish action.
Research Phases
Phase 1: Technology Evaluation Spike
- Implement the hierarchical trip plan data model (Day → Stop) in Loro and Yjs
- Evaluate move-operation semantics: reorder stops within a day, move a stop from Day 2 to Day 1
- Concurrent edit scenarios: two users edit different fields on the same stop, two users reorder the same day's stops simultaneously
- Benchmarks: document size after 1,000 edits, memory usage, merge latency
- Output: Technology decision (Loro vs. Yjs) with documented trade-offs
Phase 2: Multi-Tenant Integration + Projection
- Build the collaboration gateway (NestJS WebSocket, tenant validation, operation relay, operation journal)
- Implement the debounced projection service (CRDT `toJSON()` → diff → transactional SQL write)
- FK constraint validation at projection layer (supplier references, allotment links)
- Cross-tenant charter collaboration with scoped field-level permissions
- Dual-sync coexistence with AP2 (both protocols active in the Driver Hub)
- Output: Working prototype with tenant-isolated collaborative editing and relational projection
Phase 3: Versioning + GoBD Compliance Layer
- Implement the draft → published lifecycle with Loro `frontiers()` checkpoints
- Derive an append-only audit trail from the projection service's diff output (not raw CRDT operations)
- Each `change_events` entry carries `user_id`, `workspace_id`, `timestamp`, `operation_type`, `old_values`, `new_values`
- Attribution: the collaboration gateway maintains an operation journal mapping CRDT operations to user IDs. The projection service consults this journal to attribute each diff entry to the originating user. For merge events combining multiple users' concurrent edits, the entry carries `action = 'CRDT_MERGE'` with a `correlation_id` referencing all contributing users (an entry-shape sketch follows this list)
- Point-in-time reconstruction: given a published version's frontier, `doc.checkout(frontier)` → `doc.toJSON()` reconstructs the exact state
- Validate the audit trail against the existing `change_events` pattern from ADR-019
- Output: GoBD-compliant audit trail over CRDT state with version history
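A sketch of the attributed entry shape derived from a projection diff; the `entity`/`entity_id` fields are assumptions following the polymorphic `change_events` pattern (ADR-019), the rest mirrors the fields listed above:

```ts
// Shape of one audit entry produced by the projection service.
interface ChangeEventEntry {
  user_id: string | null;              // null only when action = 'CRDT_MERGE'
  workspace_id: string;
  timestamp: string;                    // ISO 8601
  entity: "tour_template" | "day" | "stop"; // assumed polymorphic target (ADR-019 style)
  entity_id: string;
  operation_type: "INSERT" | "UPDATE" | "DELETE";
  old_values: Record<string, unknown> | null;
  new_values: Record<string, unknown> | null;
  action?: "CRDT_MERGE";                // merge of concurrent edits from multiple users
  correlation_id?: string;              // references all contributing users' operations
}
```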
Phase 4: Semantic Conflict UX
- CRDTs resolve data conflicts automatically but not intent conflicts. When concurrent operations produce a technically valid but semantically surprising result, the UI must communicate what happened and offer recovery
- Design and test UX patterns for domain-specific conflict scenarios: concurrent stop reordering, delete-while-editing, concurrent field edits on the same stop
- Implement version history UI with Loro `checkout()` for preview and restore
- Evaluate fine-grained locking (optional field-level locks for high-stakes fields like pricing) as a prevention mechanism
- Output: Semantic conflict UX guidelines for domain-specific collaborative editing
Consequences
Positive
- True concurrent editing: dispatchers see each other's changes in real time without "take turns" UX
- Offline trip planning with automatic, conflict-free merge on reconnect (a capability no competitor offers)
- Charter collaboration portal becomes a natural extension: scoped cross-tenant access to a shared CRDT document
- Passenger-facing live itinerary as a downstream benefit: the relational projection naturally feeds a read-only itinerary view via standard Hasura subscriptions (no CRDT involvement on the passenger side)
- Version history derived from Loro's frontier checkpoints provides point-in-time reconstruction and a clear draft → published lifecycle
- Audit trail derived from projection-time diffs provides GoBD-relevant traceability using the existing `change_events` pattern (ADR-019)
- Architectural alignment with AP2 (both pursue distributed data consistency, different mechanisms)
Negative
- New library dependency (Loro or Yjs); neither has production references in multi-tenant SaaS architectures
- Dual sync protocol increases operational complexity (AP2 relational + AP18 CRDT)
- Custom projection layer required to keep the relational model synchronized with the CRDT state. This projection layer is a maintenance burden with no standard tooling
- FK constraint validation at the projection layer adds complexity: the CRDT merges operations without referential integrity awareness, so the projection must validate and handle broken references
- Second real-time channel (CRDT WebSocket) breaks the Hasura-only rule, adding infrastructure to operate
- Loro maturity risk: the library is less than 2 years old with a small community. Mitigated by keeping Yjs as a fallback
- Semantic conflict UX requires domain-specific design: CRDTs resolve data conflicts automatically but can produce surprising results (e.g., concurrent stop reordering). The UI must communicate these outcomes to non-technical dispatchers
Risks
| Risk | Likelihood | Mitigation |
|---|---|---|
| Loro proves too immature for production | Medium | Yjs fallback; technology decision is an explicit Phase 1 output |
| CRDT document size grows beyond memory budget for large tour templates (14 days × 5 stops) | Low–Medium | Evaluate document splitting (one CRDT doc per Day instead of per TourTemplate); GC strategies if using Yjs |
| Relational projection drifts from CRDT state | Medium | Debounced projection with transactional writes; reconciliation cron detects and alerts on drift |
| FK constraint violations during projection (e.g., supplier deleted while CRDT edit references it) | Medium | Projection-layer validation; UI pre-constrains via dropdowns; cascading FK scenarios remain an open question |
| GoBD attribution model for concurrent CRDT merge events lacks legal precedent | High | Requires compliance/legal review before production deployment; Phase 3 output |
| Semantic conflicts produce surprising UX outcomes for non-technical dispatchers | Medium | Phase 4 UX research; toast notifications, version history, and optional fine-grained field locking |
| CRDT integration proves architecturally infeasible | Low | Fall back to Hasura LWW with UI-led conflict resolution (see Alternative 1). The Hasura approach ships in ~40h and serves 80% of use cases |
References
- Kleppmann et al., "A highly-available move operation for replicated trees" (IEEE TPDS, 2022)
- Kleppmann & Beresford, "A Conflict-Free Replicated JSON Datatype" (IEEE TPDS, 2017)
- Loro documentation: Movable Tree (2024)
- ADR-004: Multi-Tenant Data Isolation Strategy
- ADR-017: Offline Sync Protocol for Driver Hub
- ADR-019: Change Events Polymorphic Audit Trail
- realtime-strategy.md §1 Hasura as only real-time channel (exception documented here)
- STRATEGY_future-ideas.md Β§3 B2B Charter Collaboration Portal