Verdict Observability

Mipiti's verdict pipeline normally answers one question per control: do the assertions submitted for this control suffice to prove it is implemented? The verdict-observability layer adds two more questions, each backed by a cheap LLM call and surfaced where the operator can act on disagreement:

Coverage — does this control actually cover the control objective it is mapped to? And conversely, does this control cover a control objective it is not mapped to?
Group sufficiency — do the controls in a mitigation group actually defeat the attacker, given the attacker's capability?

Observability is off by default. When the platform operator enables it (see Disabling and tuning), the LLM verdicts run in the background and surface as a divergence list on the model — operator-attributable triage work where the LLM judgment disagrees with what the model currently asserts. Committed coverage and compliance numbers are never changed by the observability layer; it only observes.

What this protects against

Two failure modes the structural assurance layer can't catch by itself:

A control claims to cover CO7 by virtue of its (asset, attacker, properties) mapping, but the control's actual description doesn't address CO7's threat. The structural rollup still credits CO7 as covered. The LLM coverage check disagrees and flags the mapping as spurious.
A control objective has all its mitigation-group members verified by evidence, and the structural rollup says Mitigated. But the group as a whole leaves a critical defensive layer missing for the attacker's actual capability. The LLM group-sufficiency check disagrees and flags the group as insufficient.

Both cases are silent under the structural-only path. Verdict observability surfaces them so the operator can either accept the LLM's view (re-mapping the control, adding a missing layer) or dismiss it (the structural model was right; the LLM was wrong).

Reading the divergence list

A model's divergences live on the Verdict Divergence panel (one per model), which has two sections:

Coverage divergences: one row per (control, control objective) pair where the LLM and the recorded mapping disagree. Each row has a kind:

missing_mapping — the LLM is confident this control covers a control objective it is not mapped to. Typical action: add the mapping (Accepting a divergence).
spurious_mapping — the mapping exists, but the LLM is confident this control does not actually cover the control objective. Typical action: remove the mapping.

Each row carries the LLM's probability (p_covers), a one-sentence rationale, and the timestamp at which the verdict was computed.

Group-sufficiency divergences: one row per (control objective, mitigation group) pair where the LLM rates the group as clearly insufficient (p_suffices ≤ 0.3 by default). Each row is decorated with:

member_control_ids: the controls in the group as currently authored.
all_members_verified: true when every member control's evidence-sufficiency verdict is sufficient. This is the asymmetrically concerning case: the structural assurance rollup considers the group complete (everything implemented), but the LLM thinks the group still doesn't defeat the attacker. A dashboard's critical_count counts exactly this subset.
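As a rough illustration of how the two decorations relate, the sketch below derives all_members_verified from per-control sufficiency verdicts and counts the critical subset. The GroupDivergence type, the decorate and critical_count helpers, and the dictionary of verdict strings are assumptions made for the example, not Mipiti's actual schema or code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GroupDivergence:
    # Hypothetical row shape; only the two decorated fields come from the text above.
    control_objective_id: str
    member_control_ids: List[str]
    all_members_verified: bool


def decorate(co_id: str, member_ids: List[str],
             sufficiency_by_control: Dict[str, str]) -> GroupDivergence:
    # True only when every member control has a "sufficient" evidence verdict.
    # Treating an empty group as not verified is an assumption of this sketch.
    verified = bool(member_ids) and all(
        sufficiency_by_control.get(cid) == "sufficient" for cid in member_ids
    )
    return GroupDivergence(co_id, list(member_ids), verified)


def critical_count(rows: List[GroupDivergence]) -> int:
    # Groups the structural rollup considers complete but the LLM rates insufficient.
    return sum(1 for row in rows if row.all_members_verified)


verdicts = {"C1": "sufficient", "C2": "sufficient", "C3": "insufficient"}
rows = [
    decorate("CO7", ["C1", "C2"], verdicts),
    decorate("CO9", ["C2", "C3"], verdicts),
]
print(critical_count(rows))  # prints 1
```

Here both rows are assumed to have already surfaced as group-sufficiency divergences; only the CO7 group, whose members are all verified, counts as critical.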

The confidence band

Not every LLM verdict surfaces. The divergence read endpoint applies a dual-band confidence floor so only confident disagreements show up:

A coverage row surfaces as missing_mapping only when p_covers ≥ 0.7.
A coverage row surfaces as spurious_mapping only when p_covers ≤ 0.3.
A group_sufficiency row surfaces only when p_suffices ≤ 0.3.

Probabilities outside these bands (with the default floors, p_covers strictly between 0.3 and 0.7, or p_suffices > 0.3) mean the LLM is either uncertain or not confidently disagreeing. Surfacing those would generate triage work the operator would reasonably ignore. The verdicts are still cached for later inspection; they're just not surfaced as actionable rows.
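The banding rules above can be sketched as a small classifier. This is a minimal illustration under stated assumptions: the function names and the is_mapped flag are hypothetical, and only the thresholds and kind names come from this page.

```python
from typing import Optional


def classify_coverage(p_covers: float, is_mapped: bool,
                      high_floor: float = 0.7,
                      low_floor: float = 0.3) -> Optional[str]:
    """Return the divergence kind for a coverage verdict, or None if it does not surface."""
    # Unmapped pair, but the LLM is confident the control covers the objective.
    if not is_mapped and p_covers >= high_floor:
        return "missing_mapping"
    # Mapped pair, but the LLM is confident the control does not cover it.
    if is_mapped and p_covers <= low_floor:
        return "spurious_mapping"
    # Middle band, or the LLM agrees with the recorded mapping.
    return None


def group_divergence_surfaces(p_suffices: float, floor: float = 0.3) -> bool:
    """A group-sufficiency verdict surfaces only when the LLM rates it clearly insufficient."""
    return p_suffices <= floor


print(classify_coverage(0.85, is_mapped=False))  # prints missing_mapping
print(classify_coverage(0.50, is_mapped=True))   # prints None
print(group_divergence_surfaces(0.2))            # prints True
```

Note that both floors are inclusive, so a verdict at exactly 0.7 or 0.3 surfaces.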

The floors are tunable per deployment via environment variables — see Disabling and tuning.

Accepting a divergence

Coverage divergences support a one-click accept that applies the LLM's view as a mapping update:

Accept missing_mapping: adds the LLM-suggested CO to the control's mappings. The change goes through the standard remap pipeline: new control version, audit text from the operator's confirmation, evidence-completeness invalidated, evidence-sufficiency verdict re-enqueued.
Accept spurious_mapping: removes the flagged CO from the control's mappings. The same audit and re-evaluation pipeline applies. If the spurious mapping is the control's only mapping, accept is refused with an orphan warning; soft-delete the control instead.

A change reason is required (10-character minimum) — the same audit floor the manual co-mapping PATCH endpoint enforces. The accepted change records the operator's reasoning on the new version row, so the audit trail names what changed and why.
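The accept rules described in this section can be sketched as a pure function over a control's set of mapped COs. Everything named here (AcceptRefused, accept_coverage_divergence, the error wording, and stripping whitespace before measuring the reason) is an assumption for illustration; the real pipeline also creates a new control version and re-enqueues verdicts, which this sketch omits.

```python
from typing import FrozenSet

MIN_REASON_LENGTH = 10  # the 10-character audit floor described above


class AcceptRefused(Exception):
    """Hypothetical error raised when an accept cannot be applied."""


def accept_coverage_divergence(kind: str, mapped_cos: FrozenSet[str],
                               co_id: str, change_reason: str) -> FrozenSet[str]:
    """Return the control's new CO mapping set after accepting a coverage divergence."""
    # Whether surrounding whitespace counts toward the minimum is an assumption.
    if len(change_reason.strip()) < MIN_REASON_LENGTH:
        raise AcceptRefused("change reason must be at least 10 characters")
    if kind == "missing_mapping":
        return mapped_cos | {co_id}
    if kind == "spurious_mapping":
        # Removing the only mapping would orphan the control.
        if mapped_cos == {co_id}:
            raise AcceptRefused("would orphan the control; soft-delete it instead")
        return mapped_cos - {co_id}
    # Group-sufficiency divergences have no one-click accept.
    raise ValueError(f"no one-click accept for kind {kind!r}")


current = frozenset({"CO3", "CO7"})
updated = accept_coverage_divergence(
    "spurious_mapping", current, "CO7", "Description does not address CO7"
)
print(sorted(updated))  # prints ['CO3']
```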

Group-sufficiency divergences are observation-only — there's no one-click accept because the natural fix isn't mechanical (it's "add a control" or "restructure the group", which is operator-judgment work). When the divergence list reports a critical group-sufficiency divergence, the action is in the methodology layer, not the click layer.

Stale divergences

Once a divergence is accepted (or fixed via any other route — manual remap, control refinement, etc.), the underlying inputs change, and the cached LLM verdict goes stale. The next entity mutation enqueues a re-evaluation; on completion, the divergence either updates with new probabilities or disappears entirely. Attempting to re-accept a divergence that no longer applies (e.g., the mapping was already added by something else) returns an explicit "divergence may be stale — re-fetch" error rather than silently no-op'ing.

If the surfaced list looks stale but no recent mutation has fired, POST /api/models/{id}/verdict-divergence/recompute force-enqueues a fresh re-evaluation for every control and live CO on the model. The worker debounces, so a flurry of recompute calls collapses to one re-eval per (model, control, kind) — safe to call from a UI "refresh" button.
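To make the collapsing behaviour concrete, here is a minimal sketch that treats debouncing as de-duplication of pending jobs by their (model, control, kind) key. The plan_recompute and debounce helpers are hypothetical, and the real worker presumably debounces over a time window rather than over a single in-memory batch.

```python
from typing import Iterable, List, Sequence, Set, Tuple

Job = Tuple[str, str, str]  # (model_id, control_id, kind)


def plan_recompute(model_id: str, control_ids: Sequence[str],
                   kinds: Sequence[str] = ("coverage", "group_sufficiency")) -> List[Job]:
    """Jobs one recompute call would request: one per control and verdict kind."""
    return [(model_id, cid, kind) for cid in control_ids for kind in kinds]


def debounce(requested: Iterable[Job]) -> List[Job]:
    """Collapse repeated requests so each key is evaluated once, keeping first-seen order."""
    seen: Set[Job] = set()
    pending: List[Job] = []
    for job in requested:
        if job not in seen:
            seen.add(job)
            pending.append(job)
    return pending


# Three rapid "refresh" clicks on a model with two controls.
burst = plan_recompute("m1", ["C1", "C2"]) * 3
print(len(burst), len(debounce(burst)))  # prints 12 4
```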

Filtering and pagination

The read endpoint supports three query parameters for triage at scale:

limit / offset (defaults 100 / 0, max limit=500) — slice the divergences array. Each section's pagination block carries filtered_total so the UI can render "showing N of M" without a second request. The summary block stays unfiltered so dashboard counts remain accurate regardless of which page is open.
kind — one of missing_mapping, spurious_mapping, or group_sufficiency. Filters the response to the named kind (clears the unrelated section). Omitted = all kinds. Powers tabbed UIs cleanly: Missing | Spurious | Group | All.

The summary block always reflects the full unfiltered state; only pagination.filtered_total reflects the kind-filtered subset. A response with summary.missing_mapping_count = 47 and pagination.filtered_total = 47 on a ?kind=missing_mapping request is the same model state viewed two ways — totals (for dashboards) and the active tab (for triage).
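The relationship between summary and pagination.filtered_total can be sketched as follows. This is a simplified illustration: it flattens both sections into one list, clamps an over-large limit rather than rejecting it, and assumes summary keys named by analogy with missing_mapping_count; none of those choices are confirmed by this page.

```python
from collections import Counter
from typing import Any, Dict, List, Optional

MAX_LIMIT = 500
KINDS = ("missing_mapping", "spurious_mapping", "group_sufficiency")


def read_divergences(rows: List[Dict[str, Any]], kind: Optional[str] = None,
                     limit: int = 100, offset: int = 0) -> Dict[str, Any]:
    # Summary is computed before any filtering, so dashboard counts never change.
    counts = Counter(row["kind"] for row in rows)
    summary = {f"{k}_count": counts.get(k, 0) for k in KINDS}

    # Omitted kind means all kinds.
    filtered = rows if kind is None else [r for r in rows if r["kind"] == kind]

    # Clamping to the documented maximum is an assumption of this sketch.
    limit = max(0, min(limit, MAX_LIMIT))
    page = filtered[offset:offset + limit]

    return {
        "summary": summary,
        "divergences": page,
        "pagination": {"limit": limit, "offset": offset,
                       "filtered_total": len(filtered)},
    }


rows = ([{"kind": "missing_mapping", "id": i} for i in range(47)]
        + [{"kind": "spurious_mapping", "id": i} for i in range(3)])
resp = read_divergences(rows, kind="missing_mapping", limit=10)
print(len(resp["divergences"]), resp["pagination"]["filtered_total"])  # prints 10 47
print(resp["summary"]["spurious_mapping_count"])  # prints 3
```

The spurious count stays at 3 even though the request filtered to missing_mapping, which is what lets a tabbed UI show badge counts on inactive tabs.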

Disabling and tuning

The observability layer is gated by the DERIVATION_GRAPH_OBSERVABILITY_ENABLED environment variable on the backend, off by default. When off:

The verdict-divergence endpoint still exists, but returns flag_enabled: false with zeroed counters — the UI renders the panel in a disabled state with no triage work.
No new LLM verdicts are computed for coverage or group_sufficiency. The existing evidence-sufficiency verdicts continue to run as before.

When enabled, three further env vars tune the surfacing thresholds:

DIVERGENCE_COVERAGE_HIGH_FLOOR (default 0.7) — minimum p_covers for missing_mapping to surface.
DIVERGENCE_COVERAGE_LOW_FLOOR (default 0.3) — maximum p_covers for spurious_mapping to surface.
DIVERGENCE_GROUP_SUFFICIENCY_FLOOR (default 0.3) — maximum p_suffices for a group-sufficiency divergence to surface.

A deployment that wants to see more candidates can loosen the bounds (e.g., 0.6 / 0.4); a deployment hitting a noisy surface, or one that wants only the very confident disagreements, can tighten them (e.g., 0.9 / 0.1).
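For reference, a backend might read these settings roughly as below. The environment variable names and defaults come from this page; the ObservabilityConfig type, the load_observability_config helper, and the set of strings accepted as "enabled" are assumptions of the sketch.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ObservabilityConfig:
    enabled: bool
    coverage_high_floor: float
    coverage_low_floor: float
    group_sufficiency_floor: float


def load_observability_config(env: Mapping[str, str] = os.environ) -> ObservabilityConfig:
    # Which strings count as "on" is an assumption; the layer is off by default.
    flag = env.get("DERIVATION_GRAPH_OBSERVABILITY_ENABLED", "").strip().lower()
    return ObservabilityConfig(
        enabled=flag in {"1", "true", "yes", "on"},
        coverage_high_floor=float(env.get("DIVERGENCE_COVERAGE_HIGH_FLOOR", "0.7")),
        coverage_low_floor=float(env.get("DIVERGENCE_COVERAGE_LOW_FLOOR", "0.3")),
        group_sufficiency_floor=float(env.get("DIVERGENCE_GROUP_SUFFICIENCY_FLOOR", "0.3")),
    )


# A tightened deployment that surfaces only very confident coverage disagreements.
tight = load_observability_config({
    "DERIVATION_GRAPH_OBSERVABILITY_ENABLED": "true",
    "DIVERGENCE_COVERAGE_HIGH_FLOOR": "0.9",
    "DIVERGENCE_COVERAGE_LOW_FLOOR": "0.1",
})
print(tight.enabled, tight.coverage_high_floor, tight.coverage_low_floor)  # prints True 0.9 0.1
```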

How it relates to evidence verification

Verdict observability runs in addition to the regular evidence-sufficiency verdict pipeline (Evidence Verification). The three verdict kinds answer three distinct questions:

Verdict kind	Question	Affects committed math?
`sufficiency`	Does the evidence on this control suffice to prove it is implemented?	Yes (feeds the `Mitigated` rollup).
`coverage` (this layer)	Does this control cover this control objective?	No (observation only).
`group_sufficiency` (this layer)	Do this group's controls defeat the attacker?	No (observation only).

The split is deliberate: committed posture remains structural and deterministic (no LLM in the rollup); the observability layer adds LLM judgment as a separate signal the operator can act on, without ever silently overwriting authored or structurally-derived state.