feat: Add 12 ADRs for RuVector RVF integration and proof-of-reality
Comprehensive architecture decision records for integrating ruvnet/ruvector
into wifi-densepose, covering:
- ADR-002: Master integration strategy (phased rollout, new crate design)
- ADR-003: RVF cognitive containers for CSI data persistence
- ADR-004: HNSW vector search replacing fixed-threshold detection
- ADR-005: SONA self-learning with LoRA + EWC++ for online adaptation
- ADR-006: GNN-enhanced pattern recognition with temporal modeling
- ADR-007: Post-quantum cryptography (ML-DSA-65 hybrid signatures)
- ADR-008: Raft consensus for multi-AP distributed coordination
- ADR-009: RVF WASM runtime for edge/browser/IoT deployment
- ADR-010: Witness chains for tamper-evident audit trails
- ADR-011: Mock elimination and proof-of-reality (fixes np.random.rand
placeholders, ships CSI capture + SHA-256 verified pipeline)
- ADR-012: ESP32 CSI sensor mesh ($54 starter kit specification)
- ADR-013: Feature-level sensing on commodity gear (zero-cost RSSI path)
ADR-011 directly addresses the credibility gap by cataloging every
mock/placeholder in the Python codebase and specifying concrete fixes.
https://claude.ai/code/session_01Ki7pvEZtJDvqJkmyn6B714
docs/adr/ADR-008-distributed-consensus-multi-ap.md (new file)
@@ -0,0 +1,284 @@

# ADR-008: Distributed Consensus for Multi-AP Coordination

## Status

Proposed

## Date

2026-02-28

## Context

### Multi-AP Sensing Architecture

WiFi-DensePose achieves higher accuracy and coverage with multiple access points (APs) observing the same space from different angles. The disaster detection module (wifi-densepose-mat, ADR-001) explicitly requires distributed deployment:

- **Portable**: Single TX/RX units deployed around a collapse site
- **Distributed**: Multiple APs covering a large disaster zone
- **Drone-mounted**: UAVs scanning from above with coordinated flight paths

Each AP independently captures CSI data, extracts features, and runs local inference. But the distributed system needs coordination:

1. **Consistent survivor registry**: All nodes must agree on the set of detected survivors, their locations, and triage classifications. Conflicting records cause rescue teams to waste time.

2. **Coordinated scanning**: Avoid redundant scans of the same zone. Dynamically reassign APs as zones are cleared.

3. **Model synchronization**: When SONA adapts a model on one node (ADR-005), other nodes should benefit from the adaptation without re-learning.

4. **Clock synchronization**: CSI timestamps must be aligned across nodes for multi-view pose fusion (the GNN multi-person disentanglement in ADR-006 requires temporal alignment).

5. **Partition tolerance**: In disaster scenarios, network connectivity is unreliable. The system must function during partitions and reconcile when connectivity is restored.

### Current State

No distributed coordination exists. Each node operates independently. The Rust workspace has no consensus crate.

### RuVector's Distributed Capabilities

RuVector provides:
- **Raft consensus**: Leader election and replicated log for strong consistency
- **Vector clocks**: Logical timestamps for causal ordering without synchronized clocks
- **Multi-master replication**: Concurrent writes with conflict resolution
- **Delta consensus**: Tracks behavioral changes across nodes for anomaly detection
- **Auto-sharding**: Distributes data based on access patterns

## Decision

We will integrate RuVector's Raft consensus implementation as the coordination backbone for multi-AP WiFi-DensePose deployments, with vector clocks for causal ordering and CRDT-based conflict resolution for partition-tolerant operation.

### Consensus Architecture

```
┌──────────────────────────────────────────────────────────────────────┐
│                  Multi-AP Coordination Architecture                  │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Normal Operation (Connected):                                       │
│                                                                      │
│  ┌──────────┐             ┌──────────┐             ┌──────────┐      │
│  │   AP-1   │◀───────────▶│   AP-2   │◀───────────▶│   AP-3   │      │
│  │ (Leader) │ Replicated  │(Follower)│ Replicated  │(Follower)│      │
│  │          │    Log      │          │    Log      │          │      │
│  └────┬─────┘             └────┬─────┘             └────┬─────┘      │
│       │                        │                        │            │
│       ▼                        ▼                        ▼            │
│  ┌──────────┐             ┌──────────┐             ┌──────────┐      │
│  │  Local   │             │  Local   │             │  Local   │      │
│  │   RVF    │             │   RVF    │             │   RVF    │      │
│  │Container │             │Container │             │Container │      │
│  └──────────┘             └──────────┘             └──────────┘      │
│                                                                      │
│  Partitioned Operation (Disconnected):                               │
│                                                                      │
│  ┌──────────┐                             ┌──────────────────────┐   │
│  │   AP-1   │ ← operates independently →  │  AP-2      AP-3      │   │
│  │          │                             │  (form sub-cluster)  │   │
│  │  Local   │                             │   Raft between 2+3   │   │
│  │  writes  │                             │                      │   │
│  └────┬─────┘                             └──────────┬───────────┘   │
│       │                                              │               │
│       └───────── Reconnect: CRDT merge ──────────────┘               │
└──────────────────────────────────────────────────────────────────────┘
```

### Replicated State Machine

The Raft log replicates these operations across all nodes:

```rust
/// Operations replicated via Raft consensus
#[derive(Serialize, Deserialize, Clone)]
pub enum ConsensusOp {
    /// New survivor detected
    SurvivorDetected {
        survivor_id: Uuid,
        location: GeoCoord,
        triage: TriageLevel,
        detecting_ap: ApId,
        confidence: f64,
        timestamp: VectorClock,
    },

    /// Survivor status updated (e.g., triage reclassification)
    SurvivorUpdated {
        survivor_id: Uuid,
        new_triage: TriageLevel,
        updating_ap: ApId,
        evidence: DetectionEvidence,
    },

    /// Zone assignment changed
    ZoneAssignment {
        zone_id: ZoneId,
        assigned_aps: Vec<ApId>,
        priority: ScanPriority,
    },

    /// Model adaptation delta shared
    ModelDelta {
        source_ap: ApId,
        lora_delta: Vec<u8>, // Serialized LoRA matrices
        environment_hash: [u8; 32],
        performance_metrics: AdaptationMetrics,
    },

    /// AP joined or left the cluster
    MembershipChange {
        ap_id: ApId,
        action: MembershipAction, // Join | Leave | Suspect
    },
}
```
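
To make the write path concrete, the sketch below shows how a detection event could be proposed to the replicated log. This is a minimal illustration, not RuVector's confirmed API: `propose`, `is_leader`, `forward_to_leader`, `ConsensusError`, and a `local_clock: VectorClock` field on the cluster handle are all assumed names.

```rust
// Hypothetical write path: stamp the event with this AP's logical time,
// then hand it to Raft. Assumes `ApCluster` (defined later in this ADR)
// also carries a `local_clock: VectorClock`.
impl ApCluster {
    pub async fn report_survivor(
        &mut self,
        location: GeoCoord,
        triage: TriageLevel,
        confidence: f64,
    ) -> Result<(), ConsensusError> {
        let mut ts = self.local_clock.clone();
        ts.tick(&self.local_ap); // advance this AP's component

        let op = ConsensusOp::SurvivorDetected {
            survivor_id: Uuid::new_v4(),
            location,
            triage,
            detecting_ap: self.local_ap.clone(),
            confidence,
            timestamp: ts,
        };

        if self.raft.is_leader() {
            // Leader appends to its log; the op commits once a
            // majority of followers acknowledge it.
            self.raft.propose(op).await
        } else {
            // Followers route writes through the current leader.
            self.raft.forward_to_leader(op).await
        }
    }
}
```
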

### Vector Clocks for Causal Ordering

Since APs may have unsynchronized physical clocks, vector clocks provide causal ordering:

```rust
/// Vector clock for causal ordering across APs
/// (`Default` gives an empty clock to start from)
#[derive(Clone, Default, Serialize, Deserialize)]
pub struct VectorClock {
    /// Map from AP ID to logical timestamp
    clocks: HashMap<ApId, u64>,
}

impl VectorClock {
    /// Increment this AP's clock
    pub fn tick(&mut self, ap_id: &ApId) {
        *self.clocks.entry(ap_id.clone()).or_insert(0) += 1;
    }

    /// Merge with another clock (take max of each component)
    pub fn merge(&mut self, other: &VectorClock) {
        for (ap_id, &ts) in &other.clocks {
            let entry = self.clocks.entry(ap_id.clone()).or_insert(0);
            *entry = (*entry).max(ts);
        }
    }

    /// Check if self happened-before other: every component of self is
    /// <= the matching component of other, and the clocks differ.
    /// (Entries start at 1 via `tick`, so a key missing from `other`
    /// correctly fails the comparison.)
    pub fn happened_before(&self, other: &VectorClock) -> bool {
        self.clocks.iter().all(|(k, &v)| {
            other.clocks.get(k).map_or(false, |&ov| v <= ov)
        }) && self.clocks != other.clocks
    }
}
```
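
A short usage sketch of the semantics, relying on the `Default` derive above and assuming `ApId` can be built from a string (illustrative only):

```rust
// Two APs observe detections without synchronized wall clocks.
let (ap1, ap2) = (ApId::from("ap-1"), ApId::from("ap-2"));

let mut a = VectorClock::default();
a.tick(&ap1); // AP-1 records a detection: {ap-1: 1}

let mut b = VectorClock::default();
b.tick(&ap2); // AP-2 records a detection: {ap-2: 1}

// Neither clock dominates the other, so the two detections are
// concurrent and must go through CRDT conflict resolution rather
// than causal ordering.
assert!(!a.happened_before(&b) && !b.happened_before(&a));

// Once AP-2 receives AP-1's event, its clock absorbs it...
b.merge(&a);
b.tick(&ap2); // {ap-1: 1, ap-2: 2}

// ...and AP-1's earlier event is now causally before AP-2's latest.
assert!(a.happened_before(&b));
```
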

### CRDT-Based Conflict Resolution

During network partitions, concurrent updates may conflict. We use CRDTs (Conflict-free Replicated Data Types) for automatic resolution:

```rust
/// Survivor registry using Last-Writer-Wins Register CRDT
pub struct SurvivorRegistry {
    /// LWW-Element-Set: each survivor has a timestamp-tagged state
    survivors: HashMap<Uuid, LwwRegister<SurvivorState>>,
}

/// Triage uses Max-wins semantics:
/// If partition A says P1 (Red/Immediate) and partition B says P2 (Yellow/Delayed),
/// after merge the survivor is classified P1 (more urgent wins).
/// Rationale: a false negative (missing a critical survivor) is worse
/// than a false positive.
impl CrdtMerge for TriageLevel {
    fn merge(a: Self, b: Self) -> Self {
        // `urgency()` scores triage levels with higher = more urgent,
        // so taking the max keeps the more urgent classification.
        if a.urgency() >= b.urgency() { a } else { b }
    }
}
```
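
`LwwRegister` is referenced above but not defined in this ADR. One possible shape, sketched here as an assumption rather than RuVector's implementation: since physical clocks are unsynchronized, a Lamport-style logical timestamp paired with the writing AP's ID supplies the total order that LWW needs (this assumes `ApId` is orderable and cloneable).

```rust
/// Minimal LWW (Last-Writer-Wins) register sketch. The (timestamp,
/// writer) pair is compared lexicographically; the writer ID breaks
/// ties so every replica converges to the same value.
#[derive(Clone, Serialize, Deserialize)]
pub struct LwwRegister<T> {
    value: T,
    /// (logical timestamp, writer that produced the value)
    stamp: (u64, ApId),
}

impl<T: Clone> LwwRegister<T> {
    /// Local write: only applied if it carries a greater stamp.
    pub fn set(&mut self, value: T, ts: u64, writer: ApId) {
        if (ts, writer.clone()) > self.stamp {
            self.value = value;
            self.stamp = (ts, writer);
        }
    }

    /// Merge keeps whichever side has the greater stamp. The operation
    /// is commutative, associative, and idempotent, so replicas
    /// converge regardless of delivery order.
    pub fn merge(&mut self, other: &LwwRegister<T>) {
        if other.stamp > self.stamp {
            self.value = other.value.clone();
            self.stamp = other.stamp.clone();
        }
    }
}
```
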

**CRDT merge strategies by data type**:

| Data Type | CRDT Type | Merge Strategy | Rationale |
|-----------|-----------|----------------|-----------|
| Survivor set | OR-Set | Union (never lose a detection) | Missing survivors = fatal |
| Triage level | Max-Register | Most urgent wins | Err toward caution |
| Location | LWW-Register | Latest timestamp wins | Survivors may move |
| Zone assignment | LWW-Map | Leader's assignment wins | Need authoritative coordination |
| Model deltas | G-Set | Accumulate all deltas | All adaptations valuable |

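Of these, only the LWW register is sketched above. For the survivor set row, a minimal OR-Set (Observed-Remove Set) could look like the following; this is an illustrative sketch, not RuVector's implementation:

```rust
use std::collections::HashSet;

/// Minimal OR-Set sketch. Each add carries a unique tag; a remove
/// tombstones only the tags observed locally, so a concurrent re-add
/// on another partition survives the merge ("never lose a detection").
pub struct OrSet<T: Eq + std::hash::Hash + Clone> {
    adds: HashSet<(T, Uuid)>,    // (element, unique add-tag)
    removes: HashSet<(T, Uuid)>, // tombstones for observed add-tags
}

impl<T: Eq + std::hash::Hash + Clone> OrSet<T> {
    pub fn add(&mut self, item: T) {
        self.adds.insert((item, Uuid::new_v4()));
    }

    /// Remove covers only the add-tags this replica has observed.
    pub fn remove(&mut self, item: &T) {
        let observed: Vec<_> = self
            .adds
            .iter()
            .filter(|(i, _)| i == item)
            .cloned()
            .collect();
        self.removes.extend(observed);
    }

    pub fn contains(&self, item: &T) -> bool {
        self.adds
            .iter()
            .any(|pair| &pair.0 == item && !self.removes.contains(pair))
    }

    /// Merge is a plain union of both tag sets: commutative and
    /// idempotent, so partitions converge after healing.
    pub fn merge(&mut self, other: &OrSet<T>) {
        self.adds.extend(other.adds.iter().cloned());
        self.removes.extend(other.removes.iter().cloned());
    }
}
```
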
### Node Discovery and Health
```rust
/// AP cluster management
pub struct ApCluster {
    /// This node's identity
    local_ap: ApId,

    /// Raft consensus engine
    raft: RaftEngine<ConsensusOp>,

    /// Failure detector (phi-accrual)
    failure_detector: PhiAccrualDetector,

    /// Cluster membership
    members: HashSet<ApId>,
}

impl ApCluster {
    /// Heartbeat interval for failure detection
    const HEARTBEAT_MS: u64 = 500;

    /// Phi threshold for suspecting node failure
    const PHI_THRESHOLD: f64 = 8.0;

    /// Minimum cluster size for Raft (need majority)
    const MIN_CLUSTER_SIZE: usize = 3;
}
```
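
The phi-accrual detector is only named above. In the standard construction (Hayashibara et al.), each silence is scored by how improbable it is given the observed heartbeat history: phi = -log10(P(a heartbeat is still pending after the elapsed time)). A minimal sketch under a normal-distribution fit, with a cheap logistic approximation to the normal CDF, is shown below; this is an assumed implementation, not RuVector's:

```rust
/// Sketch of a phi-accrual failure detector over a sliding window of
/// inter-heartbeat intervals. Phi grows the longer a node stays silent
/// relative to its usual cadence.
pub struct PhiAccrualDetector {
    intervals_ms: Vec<f64>, // recent observed heartbeat gaps
}

impl PhiAccrualDetector {
    /// Phi = -log10(P(interval > elapsed_ms)) under a normal fit.
    pub fn phi(&self, elapsed_ms: f64) -> f64 {
        if self.intervals_ms.is_empty() {
            return 0.0; // no history yet: suspect nothing
        }
        let n = self.intervals_ms.len() as f64;
        let mean = self.intervals_ms.iter().sum::<f64>() / n;
        let var = self
            .intervals_ms
            .iter()
            .map(|x| (x - mean).powi(2))
            .sum::<f64>()
            / n;
        let std = var.sqrt().max(1e-3); // guard against zero variance

        // Complementary CDF of the normal fit, approximated by the
        // logistic curve 1 / (1 + e^(1.702 z)).
        let z = (elapsed_ms - mean) / std;
        let p_later = 1.0 / (1.0 + (1.702 * z).exp());
        -p_later.max(f64::MIN_POSITIVE).log10()
    }

    /// A node is suspected once phi crosses the configured threshold.
    pub fn is_suspected(&self, elapsed_ms: f64) -> bool {
        self.phi(elapsed_ms) >= ApCluster::PHI_THRESHOLD
    }
}
```
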

### Performance Characteristics

| Operation | Latency | Notes |
|-----------|---------|-------|
| Raft heartbeat | 500 ms interval | Configurable |
| Log replication | 1-5 ms (LAN) | Depends on payload size |
| Leader election | 1-3 s | After leader failure detected |
| CRDT merge (partition heal) | 10-100 ms | Proportional to divergence |
| Vector clock comparison | <0.01 ms | O(n), n = cluster size |
| Model delta replication | 50-200 ms | ~70 KB LoRA delta |

### Deployment Configurations

| Scenario | Nodes | Consensus | Partition Strategy |
|----------|-------|-----------|--------------------|
| Single room | 1-2 | None (local only) | N/A |
| Building floor | 3-5 | Raft (3-node quorum) | CRDT merge on heal |
| Disaster site | 5-20 | Raft (5-node quorum) + zones | Zone-level sub-clusters |
| Urban search | 20-100 | Hierarchical Raft | Regional leaders |

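These tiers could map onto a single configuration type. A hypothetical sketch, with all names illustrative rather than an existing API:

```rust
/// Hypothetical per-deployment settings for the "Building floor" tier
/// in the table above. Not an existing RuVector API.
pub struct ClusterConfig {
    pub min_nodes: usize,         // smallest viable cluster for the tier
    pub quorum_size: usize,       // 0 disables consensus (single room)
    pub heartbeat_ms: u64,        // see ApCluster::HEARTBEAT_MS
    pub crdt_merge_on_heal: bool, // reconcile partitions via CRDTs
    pub hierarchical: bool,       // regional leaders for urban search
}

pub const BUILDING_FLOOR: ClusterConfig = ClusterConfig {
    min_nodes: 3,
    quorum_size: 3,
    heartbeat_ms: 500,
    crdt_merge_on_heal: true,
    hierarchical: false,
};
```
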
## Consequences

### Positive

- **Consistent state**: All APs agree on survivor registry via Raft
- **Partition tolerant**: CRDT merge allows operation during disconnection
- **Causal ordering**: Vector clocks provide logical time without NTP
- **Automatic failover**: Raft leader election handles AP failures
- **Model sharing**: SONA adaptations propagate across the cluster

### Negative

- **Minimum 3 nodes**: Raft needs a majority quorum; 3 nodes is the smallest cluster that tolerates a failure (odd sizes are preferred, since an even size adds no fault tolerance)
- **Network overhead**: Heartbeats and log replication consume bandwidth (~1-10 KB/s per node)
- **Complexity**: Distributed systems are inherently harder to debug
- **Latency for writes**: Raft requires majority acknowledgment before commit (1-5 ms on a LAN)
- **Split-brain risk**: If the cluster splits evenly (e.g., 2+2), neither partition has quorum

### Disaster-Specific Considerations

| Challenge | Mitigation |
|-----------|------------|
| Intermittent connectivity | Aggressive CRDT merge on reconnect; local operation during partition |
| Power failures | Raft log persisted to local SSD; recovery on restart |
| Node destruction | Raft tolerates minority failure; data replicated across survivors |
| Drone mobility | Drone APs treated as ephemeral members; data synced on landing |
| Bandwidth constraints | Delta-only replication; compress LoRA deltas |

## References

- [Raft Consensus Algorithm](https://raft.github.io/raft.pdf)
- [CRDTs: Conflict-free Replicated Data Types](https://hal.inria.fr/inria-00609399)
- [Vector Clocks](https://en.wikipedia.org/wiki/Vector_clock)
- [Phi Accrual Failure Detector](https://www.computer.org/csdl/proceedings-article/srds/2004/22390066/12OmNyQYtlC)
- [RuVector Distributed Consensus](https://github.com/ruvnet/ruvector)
- ADR-001: WiFi-Mat Disaster Detection Architecture
- ADR-002: RuVector RVF Integration Strategy