Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'

This commit is contained in:
ruv
2026-02-28 14:39:40 -05:00
7854 changed files with 3522914 additions and 0 deletions

View File

@@ -0,0 +1,333 @@
# rvf-federation
[![Crates.io](https://img.shields.io/crates/v/rvf-federation.svg)](https://crates.io/crates/rvf-federation)
[![docs.rs](https://img.shields.io/docsrs/rvf-federation)](https://docs.rs/rvf-federation)
[![License: MIT OR Apache-2.0](https://img.shields.io/badge/License-MIT%20OR%20Apache--2.0-blue.svg)](https://opensource.org/licenses/MIT)
[![Rust 1.87+](https://img.shields.io/badge/rust-1.87%2B-orange.svg)](https://www.rust-lang.org)
**Privacy-preserving federated transfer learning for the RVF format.**
```toml
rvf-federation = "0.1"
```
RuVector users independently accumulate learning patterns -- SONA weight trajectories, policy kernel configurations, domain expansion priors, HNSW tuning parameters. Today that learning is siloed. `rvf-federation` implements the inter-user federation layer defined in [ADR-057](../../../docs/adr/ADR-057-federated-rvf-transfer-learning.md): it strips PII, injects differential privacy noise, packages transferable learning as RVF segments, and merges incoming learning with formal privacy guarantees.
| | rvf-federation | Siloed learning | Manual sharing |
|---|---|---|---|
| **Privacy** | 3-stage PII stripping + calibrated DP noise | N/A -- nothing leaves the machine | Trust the sender |
| **Knowledge reuse** | New users bootstrap from community priors | Every deployment starts cold | Copy-paste config files |
| **Integrity** | Witness chain + Ed25519/ML-DSA-65 signatures | N/A | No verification |
| **Aggregation** | FedAvg, FedProx, Byzantine-tolerant averaging | N/A | Manual merge |
| **Privacy accounting** | RDP composition with formal epsilon budget | N/A | N/A |
## Quick Start
```rust
use rvf_federation::{
ExportBuilder, DiffPrivacyEngine, FederationPolicy,
TransferPriorSet, TransferPriorEntry, BetaParams,
};
// 1. Build an export from local learning
let priors = TransferPriorSet {
source_domain: "code_review".into(),
entries: vec![TransferPriorEntry {
bucket_id: "medium_algorithm".into(),
arm_id: "arm_0".into(),
params: BetaParams::new(10.0, 5.0),
observation_count: 50,
}],
cost_ema: 0.85,
};
// 2. Configure differential privacy (epsilon=1.0, delta=1e-5)
let mut dp = DiffPrivacyEngine::gaussian(1.0, 1e-5, 1.0, 1.0).unwrap();
// 3. Build: PII strip -> DP noise -> assemble manifest
let export = ExportBuilder::new("alice_pseudo".into(), "code_review".into())
.with_policy(FederationPolicy::default())
.add_priors(priors)
.add_string_field("config_path".into(), "/home/alice/project/.config".into())
.build(&mut dp)
.unwrap();
assert_eq!(export.manifest.format_version, 1);
assert!(export.redaction_log.total_redactions >= 1); // PII was stripped
assert!(export.privacy_proof.epsilon > 0.0); // DP noise was applied
```
## Key Features
| Feature | What It Does | Why It Matters |
|---|---|---|
| **PII stripping** | 3-stage pipeline: detect, redact, attest | No personal data leaves the local machine |
| **Differential privacy** | Gaussian/Laplace noise with RDP accounting | Formal mathematical privacy guarantee per export |
| **Gradient clipping** | Bound L2 norms before aggregation | Limits any single user's influence on the aggregate |
| **FedAvg / FedProx** | Federated averaging with optional proximal term | Industry-standard aggregation (McMahan et al. 2017) |
| **Byzantine tolerance** | Outlier detection by L2-norm z-score | Malicious contributions are excluded automatically |
| **Version-aware merging** | Dampened confidence for cross-version imports | Older learning still helps, with reduced weight |
| **Selective sharing** | Allowlist/denylist for segments and domains | Users control exactly what they share |
## Architecture
```
Local Engine Remote
+------------------+ +------------+ +---------+ +----------+
| TransferPriors |--->| |--->| |---->| |
| PolicyKernels | | PII Strip | | DP | | RVF | Registry
| CostCurves | | (3-stage) | | Noise | | Export |----> (GCS)
| LoRA Weights | | | | | | Builder | |
+------------------+ +------------+ +---------+ +----------+ |
v
+------------------+ +------------+ +---------+ +----------+ +--------+
| Merged Learning |<---| Version- |<---| Import |<----| Validate |<-| Import |
| (local engines) | | Aware | | Merger | | (sig + | | (pull) |
| | | Merge | | | | witness) | +--------+
+------------------+ +------------+ +---------+ +----------+
```
## Modules
| Module | Description |
|---|---|
| `types` | Four new RVF segment payload types (0x33-0x36) plus federation data structures |
| `error` | 15 error variants covering privacy, validation, aggregation, and I/O failures |
| `pii_strip` | Three-stage PII stripping pipeline with 12 built-in detection rules |
| `diff_privacy` | Gaussian/Laplace noise engines, gradient clipping, RDP privacy accountant |
| `federation` | `ExportBuilder` and `ImportMerger` implementing the ADR-057 transfer protocol |
| `aggregate` | `FederatedAggregator` with FedAvg, FedProx, and Byzantine-tolerant strategies |
| `policy` | `FederationPolicy` for selective sharing with allowlists, denylists, and rate limits |
## Segment Types
Four new RVF segment types extend the `0x30-0x32` domain expansion range:
| Code | Name | Purpose |
|---|---|---|
| `0x33` | `FederatedManifest` | Describes the export: contributor pseudonym, timestamp, included segments, privacy budget spent |
| `0x34` | `DiffPrivacyProof` | Privacy attestation: epsilon/delta, mechanism, sensitivity, clipping norm, noise scale |
| `0x35` | `RedactionLog` | PII stripping attestation: redaction counts by category, pre-redaction content hash, rules fired |
| `0x36` | `AggregateWeights` | Federated-averaged LoRA deltas with participation count, round number, confidence scores |
Readers that do not recognize these segment types skip them per the RVF forward-compatibility rule. Existing `TransferPrior (0x30)`, `PolicyKernel (0x31)`, `CostCurve (0x32)`, `Witness`, and `Crypto` segments are reused as-is.
## PII Stripping Pipeline
`PiiStripper` runs a three-stage pipeline on every string field before it leaves the local machine.
**Stage 1 -- Detection.** Twelve built-in regex rules scan for:
- Unix and Windows file paths (`/home/user/...`, `C:\Users\...`)
- IPv4 and IPv6 addresses
- Email addresses
- API keys (`sk-...`, `AKIA...`, `ghp_...`, Bearer tokens)
- Environment variable references (`$HOME`, `%USERPROFILE%`)
- Usernames (`@handle`)
Custom rules can be registered with `add_rule()`.
**Stage 2 -- Redaction.** Detected PII is replaced with deterministic pseudonyms (`<PATH_1>`, `<IP_2>`, `<REDACTED_KEY>`). The same original value always maps to the same pseudonym within a single export, preserving structural relationships without revealing content.
**Stage 3 -- Attestation.** A `RedactionLog (0x35)` segment is generated containing redaction counts by category, the SHAKE-256 hash of the pre-redaction content (proves scanning happened without revealing it), and the rules that fired.
```rust
use rvf_federation::PiiStripper;
let mut stripper = PiiStripper::new();
let fields = vec![
("config", "/home/alice/project/.env"),
("server", "connecting to 10.0.0.1:8080"),
("note", "no pii here"),
];
let (redacted, log) = stripper.strip_fields(&fields);
assert_eq!(log.fields_scanned, 3);
assert!(log.total_redactions >= 2);
assert!(redacted[2].1 == "no pii here"); // clean fields pass through
```
## Differential Privacy
### Noise Mechanisms
| Mechanism | Privacy Model | Noise Distribution | Use Case |
|---|---|---|---|
| Gaussian | (epsilon, delta)-DP | N(0, sigma^2) where sigma = S * sqrt(2 ln(1.25/delta)) / epsilon | Default; tighter for large parameter counts |
| Laplace | Pure epsilon-DP | Laplace(0, S/epsilon) | Stronger guarantee; no delta term |
### Gradient Clipping
Before noise injection, all parameter vectors are clipped to a configurable L2 norm bound. This limits the sensitivity of the aggregation to any single user's contribution.
### Privacy Accountant
`PrivacyAccountant` tracks cumulative privacy loss using Renyi Differential Privacy (RDP) composition across 16 alpha orders. RDP composition is tighter than naive (epsilon, delta)-DP composition, meaning more exports fit within the same budget.
```rust
use rvf_federation::PrivacyAccountant;
let mut accountant = PrivacyAccountant::new(10.0, 1e-5); // budget: eps=10, delta=1e-5
accountant.record_gaussian(1.0, 1.0, 1e-5, 100);
assert!(accountant.remaining_budget() > 0.0);
assert!(!accountant.is_exhausted());
```
## Federation Strategies
| Strategy | Algorithm | Weighting | When to Use |
|---|---|---|---|
| `FedAvg` | Federated Averaging (McMahan et al.) | Trajectory count | Default; most scenarios |
| `FedProx` | Proximal regularization | Trajectory count + mu penalty | Heterogeneous data distributions |
| `WeightedAverage` | Simple weighted mean | Quality/reputation score | When contributor reputation varies widely |
| Byzantine detection | L2-norm z-score filtering | Outliers > 2 std removed | Always runs before aggregation |
```rust
use rvf_federation::{FederatedAggregator, AggregationStrategy};
use rvf_federation::aggregate::Contribution;
let mut agg = FederatedAggregator::new("code_review".into(), AggregationStrategy::FedAvg)
.with_min_contributions(2)
.with_byzantine_threshold(2.0);
agg.add_contribution(Contribution {
contributor: "alice".into(),
weights: vec![1.0, 2.0, 3.0],
quality_weight: 0.9,
trajectory_count: 100,
});
agg.add_contribution(Contribution {
contributor: "bob".into(),
weights: vec![1.2, 1.8, 3.1],
quality_weight: 0.85,
trajectory_count: 80,
});
let result = agg.aggregate().unwrap();
assert_eq!(result.participation_count, 2);
assert_eq!(result.lora_deltas.len(), 3);
```
## Performance Benchmarks
Measured on an AMD64 Linux system with Criterion.
| Benchmark | Time |
|---|---|
| PII detect (single string) | 756 ns |
| PII strip (10 fields) | 44 us |
| PII strip (100 fields) | 303 us |
| Gaussian noise (100 params) | 4.7 us |
| Gaussian noise (10k params) | 334 us |
| Gradient clipping (1k params) | 487 ns |
| Privacy accountant (100 rounds) | 1.0 us |
| FedAvg (10 contrib, 100 dim) | 3.9 us |
| FedAvg (100 contrib, 1k dim) | 365 us |
| Byzantine detection (50 contrib) | 12 us |
| Full export pipeline | 1.2 ms |
| Merge 100 priors | 28 us |
## Feature Flags
| Flag | Default | What It Enables |
|---|---|---|
| `std` | Yes | Standard library support (required) |
| `serde` | No | Derive `Serialize`/`Deserialize` on all public types |
```toml
[dependencies]
rvf-federation = { version = "0.1", features = ["serde"] }
```
## API Overview
### Core Types
| Type | Description |
|---|---|
| `FederatedManifest` | Export metadata: contributor pseudonym, domain, timestamp, privacy budget spent |
| `DiffPrivacyProof` | Privacy attestation: epsilon, delta, mechanism, sensitivity, noise scale |
| `RedactionLog` | PII stripping attestation: entries by category, pre-redaction hash, field count |
| `AggregateWeights` | Federated-averaged LoRA deltas with round number, participation count, confidences |
| `BetaParams` | Beta distribution parameters for Thompson Sampling priors (merge, dampen, mean) |
### Transfer Types
| Type | Description |
|---|---|
| `TransferPriorEntry` | Single context bucket prior: bucket ID, arm ID, Beta params, observation count |
| `TransferPriorSet` | Collection of priors from a trained domain with cost EMA |
| `PolicyKernelSnapshot` | Snapshot of tunable policy knob values with fitness score |
| `CostCurveSnapshot` | Ordered (step, cost) points with acceleration factor |
### Aggregation Types
| Type | Description |
|---|---|
| `FederatedAggregator` | Aggregation server: collects contributions, detects outliers, produces `AggregateWeights` |
| `AggregationStrategy` | `FedAvg`, `FedProx { mu }`, or `WeightedAverage` |
| `Contribution` | Single participant's weight vector with quality and trajectory metadata |
### Protocol Types
| Type | Description |
|---|---|
| `ExportBuilder` | Builder pattern: add priors/kernels/weights, PII-strip, DP-noise, produce `FederatedExport` |
| `ImportMerger` | Validate imports, merge priors with version-aware dampening, merge weights |
| `FederatedExport` | Completed export: manifest + redaction log + privacy proof + learning data |
| `FederationPolicy` | Selective sharing: allowlists, denylists, quality gate, rate limit, privacy budget |
| `PiiStripper` | Three-stage PII pipeline: detect, redact, attest |
| `DiffPrivacyEngine` | Noise injection with Gaussian or Laplace mechanism and gradient clipping |
| `PrivacyAccountant` | RDP-based cumulative privacy loss tracker |
### Error Types
`FederationError` covers 15 variants:
| Variant | Trigger |
|---|---|
| `PrivacyBudgetExhausted` | Cumulative epsilon exceeds limit |
| `InvalidEpsilon` | Epsilon <= 0 |
| `InvalidDelta` | Delta outside (0, 1) |
| `SegmentValidation` | Malformed segment data |
| `VersionMismatch` | Incompatible format version |
| `SignatureVerification` | Ed25519/ML-DSA-65 signature check failed |
| `WitnessChainBroken` | Witness chain has a gap or tampered entry |
| `InsufficientObservations` | Prior has too few observations for export |
| `QualityBelowThreshold` | Trajectory quality below policy minimum |
| `RateLimited` | Export rate limit exceeded |
| `PiiLeakDetected` | PII found after stripping (defense-in-depth) |
| `ByzantineOutlier` | Contribution flagged as adversarial |
| `InsufficientContributions` | Not enough participants for aggregation round |
| `Serialization` | Encoding/decoding failure |
| `Io` | I/O operation failure |
## Related Crates
| Crate | Relationship |
|---|---|
| [`rvf-types`](../rvf-types) | Core RVF segment definitions; `rvf-federation` defines its own payload types to avoid circular deps |
| [`ruvector-domain-expansion`](../../ruvector-domain-expansion) | Source of `TransferPrior`, `PolicyKernel`, `CostCurve`; federation exports these as RVF segments |
| [`sona`](../../sona) | SONA learning engine; `FederatedCoordinator` handles intra-deployment aggregation, `rvf-federation` handles inter-user |
| [`rvf-crypto`](../rvf-crypto) | Ed25519 signatures and SHAKE-256 hashing used for witness chains and segment integrity |
## Testing
54 tests across all modules:
```bash
cargo test -p rvf-federation
```
Benchmarks:
```bash
cargo bench -p rvf-federation
```
## License
MIT OR Apache-2.0
---
Part of [RuVector](https://github.com/ruvnet/ruvector) -- the self-learning vector database.