Files
wifi-densepose/docs/adr/ADR-008-distributed-consensus-multi-ap.md
Claude 337dd9652f feat: Add 12 ADRs for RuVector RVF integration and proof-of-reality
Comprehensive architecture decision records for integrating ruvnet/ruvector
into wifi-densepose, covering:

- ADR-002: Master integration strategy (phased rollout, new crate design)
- ADR-003: RVF cognitive containers for CSI data persistence
- ADR-004: HNSW vector search replacing fixed-threshold detection
- ADR-005: SONA self-learning with LoRA + EWC++ for online adaptation
- ADR-006: GNN-enhanced pattern recognition with temporal modeling
- ADR-007: Post-quantum cryptography (ML-DSA-65 hybrid signatures)
- ADR-008: Raft consensus for multi-AP distributed coordination
- ADR-009: RVF WASM runtime for edge/browser/IoT deployment
- ADR-010: Witness chains for tamper-evident audit trails
- ADR-011: Mock elimination and proof-of-reality (fixes np.random.rand
           placeholders, ships CSI capture + SHA-256 verified pipeline)
- ADR-012: ESP32 CSI sensor mesh ($54 starter kit specification)
- ADR-013: Feature-level sensing on commodity gear (zero-cost RSSI path)

ADR-011 directly addresses the credibility gap by cataloging every
mock/placeholder in the Python codebase and specifying concrete fixes.

https://claude.ai/code/session_01Ki7pvEZtJDvqJkmyn6B714
2026-02-28 06:13:04 +00:00

285 lines
12 KiB
Markdown

# ADR-008: Distributed Consensus for Multi-AP Coordination
## Status
Proposed
## Date
2026-02-28
## Context
### Multi-AP Sensing Architecture
WiFi-DensePose achieves higher accuracy and coverage with multiple access points (APs) observing the same space from different angles. The disaster detection module (wifi-densepose-mat, ADR-001) explicitly requires distributed deployment:
- **Portable**: Single TX/RX units deployed around a collapse site
- **Distributed**: Multiple APs covering a large disaster zone
- **Drone-mounted**: UAVs scanning from above with coordinated flight paths
Each AP independently captures CSI data, extracts features, and runs local inference. But the distributed system needs coordination:
1. **Consistent survivor registry**: All nodes must agree on the set of detected survivors, their locations, and triage classifications. Conflicting records cause rescue teams to waste time.
2. **Coordinated scanning**: Avoid redundant scans of the same zone. Dynamically reassign APs as zones are cleared.
3. **Model synchronization**: When SONA adapts a model on one node (ADR-005), other nodes should benefit from the adaptation without re-learning.
4. **Clock synchronization**: CSI timestamps must be aligned across nodes for multi-view pose fusion (the GNN multi-person disentanglement in ADR-006 requires temporal alignment).
5. **Partition tolerance**: In disaster scenarios, network connectivity is unreliable. The system must function during partitions and reconcile when connectivity restores.
### Current State
No distributed coordination exists. Each node operates independently. The Rust workspace has no consensus crate.
### RuVector's Distributed Capabilities
RuVector provides:
- **Raft consensus**: Leader election and replicated log for strong consistency
- **Vector clocks**: Logical timestamps for causal ordering without synchronized clocks
- **Multi-master replication**: Concurrent writes with conflict resolution
- **Delta consensus**: Tracks behavioral changes across nodes for anomaly detection
- **Auto-sharding**: Distributes data based on access patterns
## Decision
We will integrate RuVector's Raft consensus implementation as the coordination backbone for multi-AP WiFi-DensePose deployments, with vector clocks for causal ordering and CRDT-based conflict resolution for partition-tolerant operation.
### Consensus Architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│ Multi-AP Coordination Architecture │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Normal Operation (Connected): │
│ │
│ ┌─────────┐ Raft ┌─────────┐ Raft ┌─────────┐ │
│ │ AP-1 │◀────────────▶│ AP-2 │◀────────────▶│ AP-3 │ │
│ │ (Leader)│ Replicated │(Follower│ Replicated │(Follower│ │
│ │ │ Log │ )│ Log │ )│ │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Local │ │ Local │ │ Local │ │
│ │ RVF │ │ RVF │ │ RVF │ │
│ │Container│ │Container│ │Container│ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ Partitioned Operation (Disconnected): │
│ │
│ ┌─────────┐ ┌──────────────────────┐ │
│ │ AP-1 │ ← operates independently → │ AP-2 AP-3 │ │
│ │ │ │ (form sub-cluster) │ │
│ │ Local │ │ Raft between 2+3 │ │
│ │ writes │ │ │ │
│ └─────────┘ └──────────────────────┘ │
│ │ │ │
│ └──────── Reconnect: CRDT merge ─────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
```
### Replicated State Machine
The Raft log replicates these operations across all nodes:
```rust
/// Operations replicated via Raft consensus
#[derive(Serialize, Deserialize, Clone)]
pub enum ConsensusOp {
/// New survivor detected
SurvivorDetected {
survivor_id: Uuid,
location: GeoCoord,
triage: TriageLevel,
detecting_ap: ApId,
confidence: f64,
timestamp: VectorClock,
},
/// Survivor status updated (e.g., triage reclassification)
SurvivorUpdated {
survivor_id: Uuid,
new_triage: TriageLevel,
updating_ap: ApId,
evidence: DetectionEvidence,
},
/// Zone assignment changed
ZoneAssignment {
zone_id: ZoneId,
assigned_aps: Vec<ApId>,
priority: ScanPriority,
},
/// Model adaptation delta shared
ModelDelta {
source_ap: ApId,
lora_delta: Vec<u8>, // Serialized LoRA matrices
environment_hash: [u8; 32],
performance_metrics: AdaptationMetrics,
},
/// AP joined or left the cluster
MembershipChange {
ap_id: ApId,
action: MembershipAction, // Join | Leave | Suspect
},
}
```
### Vector Clocks for Causal Ordering
Since APs may have unsynchronized physical clocks, vector clocks provide causal ordering:
```rust
/// Vector clock for causal ordering across APs
#[derive(Clone, Serialize, Deserialize)]
pub struct VectorClock {
/// Map from AP ID to logical timestamp
clocks: HashMap<ApId, u64>,
}
impl VectorClock {
/// Increment this AP's clock
pub fn tick(&mut self, ap_id: &ApId) {
*self.clocks.entry(ap_id.clone()).or_insert(0) += 1;
}
/// Merge with another clock (take max of each component)
pub fn merge(&mut self, other: &VectorClock) {
for (ap_id, &ts) in &other.clocks {
let entry = self.clocks.entry(ap_id.clone()).or_insert(0);
*entry = (*entry).max(ts);
}
}
/// Check if self happened-before other
pub fn happened_before(&self, other: &VectorClock) -> bool {
self.clocks.iter().all(|(k, &v)| {
other.clocks.get(k).map_or(false, |&ov| v <= ov)
}) && self.clocks != other.clocks
}
}
```
### CRDT-Based Conflict Resolution
During network partitions, concurrent updates may conflict. We use CRDTs (Conflict-free Replicated Data Types) for automatic resolution:
```rust
/// Survivor registry using Last-Writer-Wins Register CRDT
pub struct SurvivorRegistry {
/// LWW-Element-Set: each survivor has a timestamp-tagged state
survivors: HashMap<Uuid, LwwRegister<SurvivorState>>,
}
/// Triage uses Max-wins semantics:
/// If partition A says P1 (Red/Immediate) and partition B says P2 (Yellow/Delayed),
/// after merge the survivor is classified P1 (more urgent wins)
/// Rationale: false negative (missing critical) is worse than false positive
impl CrdtMerge for TriageLevel {
fn merge(a: Self, b: Self) -> Self {
// Lower numeric priority = more urgent
if a.urgency() >= b.urgency() { a } else { b }
}
}
```
**CRDT merge strategies by data type**:
| Data Type | CRDT Type | Merge Strategy | Rationale |
|-----------|-----------|---------------|-----------|
| Survivor set | OR-Set | Union (never lose a detection) | Missing survivors = fatal |
| Triage level | Max-Register | Most urgent wins | Err toward caution |
| Location | LWW-Register | Latest timestamp wins | Survivors may move |
| Zone assignment | LWW-Map | Leader's assignment wins | Need authoritative coord |
| Model deltas | G-Set | Accumulate all deltas | All adaptations valuable |
### Node Discovery and Health
```rust
/// AP cluster management
pub struct ApCluster {
/// This node's identity
local_ap: ApId,
/// Raft consensus engine
raft: RaftEngine<ConsensusOp>,
/// Failure detector (phi-accrual)
failure_detector: PhiAccrualDetector,
/// Cluster membership
members: HashSet<ApId>,
}
impl ApCluster {
/// Heartbeat interval for failure detection
const HEARTBEAT_MS: u64 = 500;
/// Phi threshold for suspecting node failure
const PHI_THRESHOLD: f64 = 8.0;
/// Minimum cluster size for Raft (need majority)
const MIN_CLUSTER_SIZE: usize = 3;
}
```
### Performance Characteristics
| Operation | Latency | Notes |
|-----------|---------|-------|
| Raft heartbeat | 500 ms interval | Configurable |
| Log replication | 1-5 ms (LAN) | Depends on payload size |
| Leader election | 1-3 seconds | After leader failure detected |
| CRDT merge (partition heal) | 10-100 ms | Proportional to divergence |
| Vector clock comparison | <0.01 ms | O(n) where n = cluster size |
| Model delta replication | 50-200 ms | ~70 KB LoRA delta |
### Deployment Configurations
| Scenario | Nodes | Consensus | Partition Strategy |
|----------|-------|-----------|-------------------|
| Single room | 1-2 | None (local only) | N/A |
| Building floor | 3-5 | Raft (3-node quorum) | CRDT merge on heal |
| Disaster site | 5-20 | Raft (5-node quorum) + zones | Zone-level sub-clusters |
| Urban search | 20-100 | Hierarchical Raft | Regional leaders |
## Consequences
### Positive
- **Consistent state**: All APs agree on survivor registry via Raft
- **Partition tolerant**: CRDT merge allows operation during disconnection
- **Causal ordering**: Vector clocks provide logical time without NTP
- **Automatic failover**: Raft leader election handles AP failures
- **Model sharing**: SONA adaptations propagate across cluster
### Negative
- **Minimum 3 nodes**: Raft requires odd-numbered quorum for leader election
- **Network overhead**: Heartbeats and log replication consume bandwidth (~1-10 KB/s per node)
- **Complexity**: Distributed systems are inherently harder to debug
- **Latency for writes**: Raft requires majority acknowledgment before commit (1-5ms LAN)
- **Split-brain risk**: If cluster splits evenly (2+2), neither partition has quorum
### Disaster-Specific Considerations
| Challenge | Mitigation |
|-----------|------------|
| Intermittent connectivity | Aggressive CRDT merge on reconnect; local operation during partition |
| Power failures | Raft log persisted to local SSD; recovery on restart |
| Node destruction | Raft tolerates minority failure; data replicated across survivors |
| Drone mobility | Drone APs treated as ephemeral members; data synced on landing |
| Bandwidth constraints | Delta-only replication; compress LoRA deltas |
## References
- [Raft Consensus Algorithm](https://raft.github.io/raft.pdf)
- [CRDTs: Conflict-free Replicated Data Types](https://hal.inria.fr/inria-00609399)
- [Vector Clocks](https://en.wikipedia.org/wiki/Vector_clock)
- [Phi Accrual Failure Detector](https://www.computer.org/csdl/proceedings-article/srds/2004/22390066/12OmNyQYtlC)
- [RuVector Distributed Consensus](https://github.com/ruvnet/ruvector)
- ADR-001: WiFi-Mat Disaster Detection Architecture
- ADR-002: RuVector RVF Integration Strategy