Busflow Docs

Internal documentation portal

Skip to content

GDPR TTL Retention — Migration Plan

Status: 📋 Ready to apply once the target schemas (commerce.passengers, backoffice.resellers, commerce.invoices) exist. Scope: Concrete DDL + function bodies + pg_cron schedules implementing ADR-028. Why this file exists: This plan was originally drafted as a Hasura migration (apps/hasura/migrations/default/20260420120000_gdpr_ttl_retention/). The migration was removed because the backoffice database is still empty and the target tables do not exist yet — a migration applied today would fail on every ALTER TABLE referenced below. This document preserves the concrete SQL so nothing is lost when the schema lands.


Preconditions

  1. commerce.passengers exists with at least (id UUID, tenant_id UUID, first_name TEXT, last_name TEXT, email TEXT, phone TEXT).
  2. backoffice.resellers exists with at least (id UUID, tenant_id UUID, contact_name TEXT, contact_email TEXT, contact_phone TEXT).
  3. commerce.invoices exists with at least (id UUID, tenant_id UUID, recipient_snapshot JSONB).
  4. Postgres 15+ with shared_preload_libraries = 'pg_cron,pgsodium'. On Ubicloud standard-2 both extensions are pre-enabled (confirmed in ADR-022 §Consequences).
  5. backoffice.* schema exists and the migration role can CREATE within it.

If any of these are not true, land them as prerequisite migrations first — do not weaken the DDL below to work around missing columns.


Ordering Contract

The migration must be applied top-to-bottom in exactly this order. Later sections read tables/columns created by earlier sections.

  1. Extensions
  2. Audit ledger (backoffice.tenant_scrub_logs) — must exist before scrub functions can INSERT
  3. Legal-hold table (backoffice.legal_holds) — must exist before scrub functions can SELECT
  4. Idempotency columns on PII entities (pii_redacted_at + per-entity "last activity" column)
  5. Scrub functions (per entity)
  6. pg_cron schedules
  7. fillfactor tuning

1. Extensions

sql
CREATE EXTENSION IF NOT EXISTS pgcrypto;
CREATE EXTENSION IF NOT EXISTS pg_cron;

pgcrypto provides gen_random_uuid(). pg_cron provides the scheduler.


2. Append-Only Audit Ledger

tenant_scrub_logs is the only proof-of-action surface for auditors/DPO. It must reject UPDATE and DELETE at the role level — a superuser slip then requires an explicit GRANT, not just a typo.

sql
CREATE TABLE IF NOT EXISTS backoffice.tenant_scrub_logs (
  id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id     UUID NOT NULL,
  entity_type   TEXT NOT NULL CHECK (entity_type IN ('passenger','reseller','invoice','conversation')),
  entity_id     UUID NOT NULL,
  scrubbed_at   TIMESTAMPTZ NOT NULL DEFAULT now() AT TIME ZONE 'UTC',
  action        TEXT NOT NULL CHECK (action IN ('REDACTED','SKIPPED_LEGAL_HOLD','SKIPPED_NULL_TRIGGER')),
  skip_reason   TEXT,
  job_run_id    UUID
);

COMMENT ON TABLE backoffice.tenant_scrub_logs IS
  'Append-only GDPR scrub ledger. Does NOT store redacted values — ADR-028.';

CREATE INDEX IF NOT EXISTS idx_scrub_logs_tenant_time
  ON backoffice.tenant_scrub_logs (tenant_id, scrubbed_at DESC);

REVOKE UPDATE, DELETE ON backoffice.tenant_scrub_logs FROM PUBLIC;

Why a dedicated job_run_id: one scheduled run may produce thousands of log rows; correlating them back to a single cron.job_run_details record is done via this UUID.


Rows in legal_holds pin a specific entity against redaction. The partial index keeps open-hold lookups O(1) regardless of historical volume.

sql
CREATE TABLE IF NOT EXISTS backoffice.legal_holds (
  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id   UUID NOT NULL,
  entity_type TEXT NOT NULL CHECK (entity_type IN ('passenger','reseller','invoice','conversation')),
  entity_id   UUID NOT NULL,
  reason      TEXT NOT NULL,
  until       TIMESTAMPTZ,
  created_by  UUID NOT NULL,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT now() AT TIME ZONE 'UTC',
  closed_at   TIMESTAMPTZ,
  closed_by   UUID
);

CREATE INDEX IF NOT EXISTS idx_legal_holds_open_by_entity
  ON backoffice.legal_holds (tenant_id, entity_type, entity_id)
  WHERE closed_at IS NULL;

Operational procedures live in docs/protocols/legal-hold-runbook.md.


4. Idempotency Columns on PII Entities

Every scrub-eligible entity gets two additions:

  • A trigger column (last_booking_at, last_active_at, issued_at) — nullable forever; NULL means "unknown, skip in scrub".
  • A proof-of-redaction column (pii_redacted_at TIMESTAMPTZ) — NULL means "not yet redacted". This replaces the fragile text-sentinel pattern (first_name = '[REDACTED]') that fails against real travellers whose legal names happen to match the sentinel. See ADR-028 §Decision #2.
sql
ALTER TABLE commerce.passengers
  ADD COLUMN IF NOT EXISTS last_booking_at   TIMESTAMPTZ,
  ADD COLUMN IF NOT EXISTS pii_redacted_at   TIMESTAMPTZ;

ALTER TABLE backoffice.resellers
  ADD COLUMN IF NOT EXISTS last_active_at    TIMESTAMPTZ,
  ADD COLUMN IF NOT EXISTS pii_redacted_at   TIMESTAMPTZ;

ALTER TABLE commerce.invoices
  ADD COLUMN IF NOT EXISTS issued_at         TIMESTAMPTZ,
  ADD COLUMN IF NOT EXISTS pii_redacted_at   TIMESTAMPTZ;

The online back-fill for last_booking_at is handled by apps/api/src/workers/pii-backfill.worker.ts (to be written) in 10 000-row chunks with a 1 s delay — see ADR-028 §Decision #4.


5. Scrub Functions

Every scrub function follows the same CTE-structured pipeline:

  1. candidates — rows where the trigger column crosses the per-entity retention window and pii_redacted_at IS NULL.
  2. held — subset of candidates with an open legal_holds row. These are not redacted.
  3. redactUPDATE that tombstones PII columns (set to NULL or sentinel JSONB) and stamps pii_redacted_at = now().
  4. logged_redact / logged_hold — dual INSERT into tenant_scrub_logs so both actions (REDACTED and SKIPPED_LEGAL_HOLD) are recorded in the same transaction.

All functions are SECURITY DEFINER so they can write to tenant_scrub_logs without granting each scheduled cron role broad INSERT rights on the audit table.

5.1 scrub_passengers — 3-year window (BDSG § 35)

sql
CREATE OR REPLACE FUNCTION backoffice.scrub_passengers(p_run_id UUID DEFAULT gen_random_uuid())
RETURNS INTEGER
LANGUAGE plpgsql
SECURITY DEFINER
AS $$
DECLARE
  v_rows INTEGER := 0;
BEGIN
  WITH candidates AS (
    SELECT p.id, p.tenant_id
      FROM commerce.passengers p
     WHERE p.pii_redacted_at IS NULL
       AND p.last_booking_at IS NOT NULL
       AND p.last_booking_at < (now() AT TIME ZONE 'UTC') - INTERVAL '3 years'
  ),
  held AS (
    SELECT c.id, c.tenant_id
      FROM candidates c
      JOIN backoffice.legal_holds h
        ON h.tenant_id = c.tenant_id
       AND h.entity_type = 'passenger'
       AND h.entity_id   = c.id
       AND h.closed_at   IS NULL
  ),
  redact AS (
    UPDATE commerce.passengers p
       SET first_name = NULL,
           last_name  = NULL,
           email      = NULL,
           phone      = NULL,
           pii_redacted_at = now() AT TIME ZONE 'UTC'
      FROM candidates c
     WHERE p.id = c.id
       AND NOT EXISTS (SELECT 1 FROM held h WHERE h.id = c.id)
     RETURNING p.id, p.tenant_id
  ),
  logged_redact AS (
    INSERT INTO backoffice.tenant_scrub_logs (tenant_id, entity_type, entity_id, action, job_run_id)
    SELECT tenant_id, 'passenger', id, 'REDACTED', p_run_id FROM redact
    RETURNING 1
  ),
  logged_hold AS (
    INSERT INTO backoffice.tenant_scrub_logs (tenant_id, entity_type, entity_id, action, skip_reason, job_run_id)
    SELECT tenant_id, 'passenger', id, 'SKIPPED_LEGAL_HOLD', 'active_hold', p_run_id FROM held
    RETURNING 1
  )
  SELECT COUNT(*) INTO v_rows FROM logged_redact;

  RETURN v_rows;
END;
$$;

5.2 scrub_resellers — 2-year window (BDSG § 35)

sql
CREATE OR REPLACE FUNCTION backoffice.scrub_resellers(p_run_id UUID DEFAULT gen_random_uuid())
RETURNS INTEGER
LANGUAGE plpgsql
SECURITY DEFINER
AS $$
DECLARE v_rows INTEGER := 0;
BEGIN
  WITH candidates AS (
    SELECT r.id, r.tenant_id
      FROM backoffice.resellers r
     WHERE r.pii_redacted_at IS NULL
       AND r.last_active_at IS NOT NULL
       AND r.last_active_at < (now() AT TIME ZONE 'UTC') - INTERVAL '2 years'
  ),
  held AS (
    SELECT c.id, c.tenant_id FROM candidates c
      JOIN backoffice.legal_holds h
        ON h.tenant_id = c.tenant_id AND h.entity_type='reseller'
       AND h.entity_id = c.id AND h.closed_at IS NULL
  ),
  redact AS (
    UPDATE backoffice.resellers r
       SET contact_name = NULL, contact_email = NULL, contact_phone = NULL,
           pii_redacted_at = now() AT TIME ZONE 'UTC'
      FROM candidates c
     WHERE r.id = c.id AND NOT EXISTS (SELECT 1 FROM held h WHERE h.id = c.id)
     RETURNING r.id, r.tenant_id
  )
  INSERT INTO backoffice.tenant_scrub_logs (tenant_id, entity_type, entity_id, action, job_run_id)
  SELECT tenant_id, 'reseller', id, 'REDACTED', p_run_id FROM redact;

  GET DIAGNOSTICS v_rows = ROW_COUNT;
  RETURN v_rows;
END;
$$;

5.3 scrub_invoices — 10-year window (GoBD § 147 AO / § 14b UStG)

Invoice PII is embedded in recipient_snapshot (JSONB). Redaction replaces the whole object with a sentinel rather than NULL-ing individual keys — auditors need to see that the row was processed, not that it was empty.

sql
CREATE OR REPLACE FUNCTION backoffice.scrub_invoices(p_run_id UUID DEFAULT gen_random_uuid())
RETURNS INTEGER
LANGUAGE plpgsql
SECURITY DEFINER
AS $$
DECLARE v_rows INTEGER := 0;
BEGIN
  WITH candidates AS (
    SELECT i.id, i.tenant_id
      FROM commerce.invoices i
     WHERE i.pii_redacted_at IS NULL
       AND i.issued_at IS NOT NULL
       AND i.issued_at < (now() AT TIME ZONE 'UTC') - INTERVAL '10 years'
  ),
  held AS (
    SELECT c.id, c.tenant_id FROM candidates c
      JOIN backoffice.legal_holds h
        ON h.tenant_id=c.tenant_id AND h.entity_type='invoice'
       AND h.entity_id=c.id AND h.closed_at IS NULL
  ),
  redact AS (
    UPDATE commerce.invoices i
       SET recipient_snapshot = jsonb_build_object('redacted', true),
           pii_redacted_at = now() AT TIME ZONE 'UTC'
      FROM candidates c
     WHERE i.id = c.id AND NOT EXISTS (SELECT 1 FROM held h WHERE h.id = c.id)
     RETURNING i.id, i.tenant_id
  )
  INSERT INTO backoffice.tenant_scrub_logs (tenant_id, entity_type, entity_id, action, job_run_id)
  SELECT tenant_id, 'invoice', id, 'REDACTED', p_run_id FROM redact;

  GET DIAGNOSTICS v_rows = ROW_COUNT;
  RETURN v_rows;
END;
$$;

5.4 Conversations / communications.messages

Deliberately not implemented as a database scrub. Per ADR-028 §Decision #10, rendered names in communications.messages.rendered_content are treated as derived from commerce.passengers: once the source passenger is scrubbed, the rendered message contains an orphaned name that can no longer be re-identified. Chat transcripts in Loki log streams expire under the 14-day compactor retention (ADR-027). If the DPO rejects this position, the follow-up work is a cascading text-replace function gated on a new ADR — do not add it here.


6. pg_cron Schedules

All schedules are UTC. Off-peak for DACH; 15-minute offsets prevent three scrub functions from contending for the same buffer pool. 03:00 UTC aligns with the backup-verify cadence in backup-verify-runbook.md.

sql
SELECT cron.schedule('busflow_scrub_passengers', '0 3 * * *',  $$SELECT backoffice.scrub_passengers();$$);
SELECT cron.schedule('busflow_scrub_resellers',  '15 3 * * *', $$SELECT backoffice.scrub_resellers();$$);
SELECT cron.schedule('busflow_scrub_invoices',   '30 3 * * *', $$SELECT backoffice.scrub_invoices();$$);

The NestJS cron-health.worker.ts (ADR-028 §Decision #8) runs at 03:30 UTC and posts failures from cron.job_run_details to Slack #ops — Ubicloud does not expose cron.job_run_details to external Prometheus scrapers.


7. fillfactor Tuning

The scrub functions tombstone many rows per night, which churns HOT update chains. A greenfield fillfactor = 80 leaves 20 % free space per page for HOT updates, cutting index bloat. Set at migration time — not later — because ALTER TABLE ... SET (fillfactor) on a populated table doesn't retroactively rewrite pages.

sql
ALTER TABLE commerce.passengers    SET (fillfactor = 80);
ALTER TABLE backoffice.resellers   SET (fillfactor = 80);
ALTER TABLE commerce.invoices      SET (fillfactor = 80);

Down Migration

sql
SELECT cron.unschedule('busflow_scrub_passengers');
SELECT cron.unschedule('busflow_scrub_resellers');
SELECT cron.unschedule('busflow_scrub_invoices');

DROP FUNCTION IF EXISTS backoffice.scrub_invoices(UUID);
DROP FUNCTION IF EXISTS backoffice.scrub_resellers(UUID);
DROP FUNCTION IF EXISTS backoffice.scrub_passengers(UUID);

ALTER TABLE commerce.invoices    DROP COLUMN IF EXISTS pii_redacted_at;
ALTER TABLE commerce.invoices    DROP COLUMN IF EXISTS issued_at;
ALTER TABLE backoffice.resellers DROP COLUMN IF EXISTS pii_redacted_at;
ALTER TABLE backoffice.resellers DROP COLUMN IF EXISTS last_active_at;
ALTER TABLE commerce.passengers  DROP COLUMN IF EXISTS pii_redacted_at;
ALTER TABLE commerce.passengers  DROP COLUMN IF EXISTS last_booking_at;

DROP INDEX IF EXISTS backoffice.idx_legal_holds_open_by_entity;
DROP TABLE IF EXISTS backoffice.legal_holds;

DROP INDEX IF EXISTS backoffice.idx_scrub_logs_tenant_time;
DROP TABLE IF EXISTS backoffice.tenant_scrub_logs;

The down migration does not restore redacted PII. It exists for structural rollback during migration development only. Any real post-scrub recovery must come from a PITR restore per postgres-cutover.md §PITR.


When to Land This Migration

  1. After commerce.passengers, backoffice.resellers, and commerce.invoices exist with the columns listed in §Preconditions.
  2. After Legal/DPO has signed off on the retention table and the communications.messages separation-of-concerns position in ADR-028.
  3. Before the first production tenant row is written — the trigger columns need to be in place so pii_backfill.worker.ts can populate last_booking_at in-line rather than via a costly table rewrite later.

Internal documentation — Busflow