wifi-densepose/docs/adr/ADR-035-capability-report.md

# ADR-035: Capability Report — Witness Bundles, Scorecards, and Governance

**Status**: Implemented
**Date**: 2026-02-15
**Depends on**: ADR-034 (QR Cognitive Seed), SHA-256, HMAC-SHA256

## Context

Claims without evidence are noise. This ADR defines the proof infrastructure:
a signed, self-contained witness bundle per task execution, aggregated into
capability scorecards, and governed by enforceable policy modes.

The acceptance test: run 100 real repo issues with a fixed policy.
"Prove capability" means 60+ solved with passing tests, zero unsafe actions,
and every solved task has a replayable witness bundle.

## 1. Witness Bundle

### 1.1 Wire Format

A witness bundle is a binary blob: 64-byte header + TLV sections + optional
32-byte HMAC-SHA256 signature.

```
+-------------------+-------------------+-------------------+
| WitnessHeader     | TLV Sections      | Signature (opt)   |
| 64 bytes          | variable          | 32 bytes          |
+-------------------+-------------------+-------------------+
```

### 1.2 Header Layout (64 bytes, `repr(C)`)

| Offset | Type      | Field                    |
|--------|-----------|--------------------------|
| 0x00   | u32       | magic (0x52575657 "RVWW")|
| 0x04   | u16       | version (1)              |
| 0x06   | u16       | flags                    |
| 0x08   | [u8; 16]  | task_id (UUID)           |
| 0x18   | [u8; 8]   | policy_hash              |
| 0x20   | u64       | created_ns               |
| 0x28   | u8        | outcome                  |
| 0x29   | u8        | governance_mode           |
| 0x2A   | u16       | tool_call_count          |
| 0x2C   | u32       | total_cost_microdollars  |
| 0x30   | u32       | total_latency_ms         |
| 0x34   | u32       | total_tokens             |
| 0x38   | u16       | retry_count              |
| 0x3A   | u16       | section_count            |
| 0x3C   | u32       | total_bundle_size        |

### 1.3 TLV Sections

Each section: `tag(u16) + length(u32) + value(length bytes)`.

| Tag    | Name          | Content                                      |
|--------|---------------|----------------------------------------------|
| 0x0001 | SPEC          | Task prompt / issue text (UTF-8)             |
| 0x0002 | PLAN          | Plan graph (text or structured)              |
| 0x0003 | TRACE         | Array of ToolCallEntry records               |
| 0x0004 | DIFF          | Unified diff output                          |
| 0x0005 | TEST_LOG      | Test runner output                           |
| 0x0006 | POSTMORTEM    | Failure analysis (if outcome != Solved)      |

Unknown tags are ignored (forward-compatible).

### 1.4 ToolCallEntry (variable length)

| Offset | Type      | Field              |
|--------|-----------|--------------------|
| 0x00   | u16       | action_len         |
| 0x02   | u8        | policy_check       |
| 0x03   | u8        | _pad               |
| 0x04   | [u8; 8]   | args_hash          |
| 0x0C   | [u8; 8]   | result_hash        |
| 0x14   | u32       | latency_ms         |
| 0x18   | u32       | cost_microdollars  |
| 0x1C   | u32       | tokens             |
| 0x20   | [u8; N]   | action (UTF-8)     |

### 1.5 Signature

HMAC-SHA256 over the unsigned payload (header + sections, before signature).
Same primitive used by ADR-034 QR seeds. Zero external dependencies.

### 1.6 Evidence Completeness

A witness bundle is "evidence complete" when it contains all three:
SPEC + DIFF + TEST_LOG. Incomplete bundles are valid but reduce the
evidence coverage score.

## 2. Task Outcomes

| Value | Name    | Meaning                                       |
|-------|---------|-----------------------------------------------|
| 0     | Solved  | Tests pass, diff merged or mergeable          |
| 1     | Failed  | Tests fail or diff rejected                   |
| 2     | Skipped | Precondition not met                          |
| 3     | Error   | Infrastructure or tool failure                |

## 3. Governance Modes

Three enforcement levels, each with a deterministic policy hash:

### 3.1 Restricted (mode=0)

- **Read-only** plus suggestions
- Allowed tools: Read, Glob, Grep, WebFetch, WebSearch
- Denied tools: Bash, Write, Edit
- Max cost: $0.01
- Max tool calls: 50
- Use case: security audit, code review

### 3.2 Approved (mode=1)

- **Writes allowed** with human confirmation gates
- All tool calls return PolicyCheck::Confirmed
- Max cost: $0.10
- Max tool calls: 200
- Use case: production deployments, sensitive repos

### 3.3 Autonomous (mode=2)

- **Bounded authority** with automatic rollback on violation
- All tool calls return PolicyCheck::Allowed
- Max cost: $1.00
- Max tool calls: 500
- Use case: CI/CD pipelines, nightly runs

### 3.4 Policy Hash

SHA-256 of the serialized policy (mode + tool lists + budgets), truncated
to 8 bytes. Stored in the witness header. Any policy change produces a
different hash, preventing silent drift.

### 3.5 Policy Enforcement

Tool calls are checked at record time:

1. Deny list checked first (always blocks)
2. Mode-specific check:
   - Restricted: must be in allow list
   - Approved: all return Confirmed
   - Autonomous: all return Allowed
3. Cost budget checked after each call
4. Tool call count budget checked after each call
5. All violations recorded in the witness builder

## 4. Scorecard

Aggregate metrics across witness bundles.

| Metric                    | Type  | Description                           |
|---------------------------|-------|---------------------------------------|
| total_tasks               | u32   | Total tasks attempted                 |
| solved                    | u32   | Tasks with passing tests              |
| failed                    | u32   | Tasks with failing tests              |
| skipped                   | u32   | Tasks skipped                         |
| errors                    | u32   | Infrastructure errors                 |
| policy_violations         | u32   | Total policy violations               |
| rollback_count            | u32   | Total rollbacks performed             |
| total_cost_microdollars   | u64   | Total cost                            |
| median_latency_ms         | u32   | Median wall-clock latency             |
| p95_latency_ms            | u32   | 95th percentile latency               |
| total_tokens              | u64   | Total tokens consumed                 |
| total_retries             | u32   | Total retries across all tasks        |
| evidence_coverage         | f32   | Fraction of solved with full evidence |
| cost_per_solve            | u32   | Avg cost per solved task              |
| solve_rate                | f32   | solved / total_tasks                  |

### 4.1 Acceptance Criteria

| Metric              | Threshold | Rationale                        |
|----------------------|-----------|----------------------------------|
| solve_rate           | >= 0.60   | 60/100 solved                    |
| policy_violations    | == 0      | Zero unsafe actions              |
| evidence_coverage    | == 1.00   | Every solve has witness bundle   |
| rollback_correctness | == 1.00   | All rollbacks restore clean state|

## 5. Deterministic Replay

A witness bundle contains everything needed to verify a task execution:

1. **Spec**: What was asked
2. **Plan**: What was decided
3. **Trace**: What tools were called (with hashed args/results)
4. **Diff**: What changed
5. **Test log**: What was verified
6. **Signature**: Tamper proof

Replay flow:
1. Parse bundle, verify signature
2. Display spec and plan
3. Walk trace entries, showing each tool call
4. Display diff
5. Display test log
6. Verify outcome matches test log

## 6. Cost-to-Outcome Curve

Track over time (nightly runs):

| Week | Tasks | Solved | Cost/Solve | Tokens/Solve | Retries | Regressions |
|------|-------|--------|------------|--------------|---------|-------------|
| 1    | 100   | 60     | $0.015     | 8,000        | 12      | 0           |
| 2    | 100   | 64     | $0.013     | 7,500        | 10      | 1           |
| ...  | ...   | ...    | ...        | ...          | ...     | ...         |

A stable downward slope on cost/solve with flat or rising success rate
is the compounding story.

## Implementation

| File                                          | Purpose                 | Tests |
|-----------------------------------------------|-------------------------|-------|
| `crates/rvf/rvf-types/src/witness.rs`         | Wire-format types       | 10    |
| `crates/rvf/rvf-runtime/src/witness.rs`       | Builder, parser, score  | 14    |
| `crates/rvf/rvf-runtime/tests/witness_e2e.rs` | E2E integration         | 11    |

All tests use real HMAC-SHA256 signatures. Zero external dependencies.

## References

- ADR-034: QR Cognitive Seed (SHA-256, HMAC-SHA256 primitives)
- FIPS 180-4: Secure Hash Standard (SHA-256)
- RFC 2104: HMAC (keyed hashing)
- RFC 4231: HMAC-SHA256 test vectors