Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'

2026-02-28 14:39:40 -05:00
parent 7885bf6278 d803bfe2b1
commit cd5943df23
7854 changed files with 3522914 additions and 0 deletions
--- a/vendor/ruvector/crates/ruvector-sparse-inference/README.md
+++ b/vendor/ruvector/crates/ruvector-sparse-inference/README.md
@@ -0,0 +1,339 @@
+# ruvector-sparse-inference
+
+PowerInfer-style Activation Locality Inference Engine for RuVector.
+
+A high-performance sparse inference engine that exploits neural network activation patterns to achieve 2×–10× speedups with <1% accuracy loss.
+
+## Features
+
+### Core Capabilities
+
+- **Activation Locality**: Exploits the power-law distribution of activations, in which ~10% of neurons handle ~90% of activations
+- **Low-Rank Prediction**: Fast P·Q matrix factorization predicts active neurons in O(r·(d + h)) time (input dimension d, hidden dimension h, rank r)
+- **Sparse FFN**: Computes only active neurons, skipping cold weights entirely
+- **SIMD Optimization**: AVX2/FMA (GELU, SiLU, axpy), SSE4.1, NEON, and WASM SIMD backends
+- **GGUF Support**: Full compatibility with quantized Llama models (Q4_0 through Q6_K)
+- **Hot/Cold Caching**: LRU/LFU strategies for intelligent neuron weight management
+
+### Precision Lanes (3/5/7-bit)
+
+Layered quantization that turns activation selectivity into anatomical control:
+
+| Lane | Bits | Range | Use Case |
+|------|------|-------|----------|
+| **Bit3** | 3 | -4..3 | Reflex signals, gating, anomaly triggers |
+| **Bit5** | 5 | -16..15 | Streaming embeddings, drift detection |
+| **Bit7** | 7 | -64..63 | Reasoning, synthesis, micro-LoRA |
+| **Float** | 32 | Full | Training, offline calibration |
+
+**Graduation Rules**: Signals move UP lanes on novelty/drift, DOWN on stability/stall.
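+
+The ranges in the table are the signed two's-complement ranges for each bit width. A minimal sketch of saturating a value into a lane, assuming a hypothetical `Lane` enum and `quantize` helper (illustrative names, not the crate's `PrecisionLane` API):
+
+```rust
+/// Hypothetical stand-in for the crate's precision lanes (illustration only).
+#[derive(Clone, Copy, Debug, PartialEq)]
+pub enum Lane {
+    Bit3,
+    Bit5,
+    Bit7,
+}
+
+impl Lane {
+    /// Number of bits in the lane.
+    pub fn bits(self) -> u32 {
+        match self {
+            Lane::Bit3 => 3,
+            Lane::Bit5 => 5,
+            Lane::Bit7 => 7,
+        }
+    }
+
+    /// Inclusive signed range: -(2^(bits-1)) ..= 2^(bits-1) - 1.
+    pub fn range(self) -> (i8, i8) {
+        let half = 1i16 << (self.bits() - 1);
+        ((-half) as i8, (half - 1) as i8)
+    }
+
+    /// Scale `x`, round to nearest, and saturate into the lane's range.
+    pub fn quantize(self, x: f32, scale: f32) -> i8 {
+        let (lo, hi) = self.range();
+        (x / scale).round().clamp(lo as f32, hi as f32) as i8
+    }
+}
+```
+
+Under these assumptions, `Lane::Bit3.quantize(10.0, 1.0)` saturates at the top of the 3-bit range rather than wrapping around.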
+
+### π Integration
+
+π (pi) provides structural constants for low-precision systems:
+
+```
+π breaks symmetry.
+```
+
+| Module | Purpose |
+|--------|---------|
+| **Calibration** | π-derived constants avoid power-of-2 resonance |
+| **Drift Detection** | Quantization honesty signals via π transforms |
+| **Angular Embeddings** | Hyperspherical projections with π phase encoding |
+| **Chaos Seeding** | Deterministic pseudo-randomness from π digits |
+
+## Performance (v0.1.31)
+
+**6× speedup** over previous version through W2 transpose optimization and SIMD-accelerated activations.
+
+| Sparsity Level | Latency | vs Dense | Improvement vs Previous Version |
+|----------------|---------|----------|---------------------------------|
+| 10% active | 130µs | 52× faster | **83% reduction** |
+| 30% active | 383µs | 18× faster | **83% reduction** |
+| 50% active | 651µs | 10× faster | **83% reduction** |
+| 70% active | 912µs | 7× faster | **83% reduction** |
+
+### Key Optimizations (v0.1.31)
+
+- **W2 Transpose Storage**: Column access becomes contiguous row access
+- **SIMD GELU/SiLU**: AVX2 polynomial approximations for activations
+- **Cached Feature Detection**: OnceLock eliminates runtime CPUID calls
+- **SIMD axpy**: Vectorized accumulation in sparse second layer
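+
+The W2 transpose and axpy items can be sketched in scalar form. Assuming W2 is stored transposed (one contiguous row per hidden neuron), the second layer reduces to one `out += a_j * row_j` update per active neuron. The function names are hypothetical, and the plain loop stands in for the crate's SIMD kernel:
+
+```rust
+/// y += a * x (scalar stand-in for a SIMD axpy kernel).
+pub fn axpy(a: f32, x: &[f32], y: &mut [f32]) {
+    for (yi, xi) in y.iter_mut().zip(x) {
+        *yi += a * xi;
+    }
+}
+
+/// Sparse second layer. `w2_t` is W2 stored transposed, row-major with shape
+/// [hidden x out_dim], so the weights leaving neuron `j` are contiguous.
+/// `active[i]` is a neuron index and `activated[i]` its activation value.
+pub fn sparse_second_layer(
+    w2_t: &[f32],
+    out_dim: usize,
+    active: &[usize],
+    activated: &[f32],
+) -> Vec<f32> {
+    let mut out = vec![0.0f32; out_dim];
+    for (&j, &a) in active.iter().zip(activated) {
+        let row = &w2_t[j * out_dim..(j + 1) * out_dim];
+        axpy(a, row, &mut out);
+    }
+    out
+}
+```
+
+Inactive neurons are never touched, so the work scales with the number of active neurons rather than with the hidden dimension.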
+
+### Target Performance
+
+| Model | Target Latency | Speedup | Memory Reduction |
+|-------|----------------|---------|------------------|
+| LFM2 350M | ~5-10ms/sentence | 2.5× | 40% |
+| Sentence-transformers | ~2-5ms/sentence | 2× | 30% |
+| Llama 7B | 50-100ms/token | 5-10× | 50% |
+
+## Quick Start
+
+```rust
+use ruvector_sparse_inference::{
+    SparseInferenceEngine, SparsityConfig, PiContext, PrecisionLane
+};
+
+// Create sparse inference engine
+let engine = SparseInferenceEngine::new_sparse(512, 2048, 0.1)?;
+
+// Run inference
+let input = vec![0.1f32; 512];
+let output = engine.infer(&input)?;
+
+// Use π context for calibration
+let pi_ctx = PiContext::new(PrecisionLane::Bit5);
+let calibrated = pi_ctx.calibrate(1.0);
+
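+// Check quantization honesty (`original` and `quantized` hold the values
+// before and after quantization and are defined elsewhere)
+let honesty = pi_ctx.check_honesty(&original, &quantized);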
+if !honesty.is_honest {
+    // Escalate to higher precision lane
+}
+```
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│                    Input Embedding                          │
+└─────────────────────────────────────────────────────────────┘
+                              │
+                              ▼
+┌─────────────────────────────────────────────────────────────┐
+│              Low-Rank Predictor (P·Q)                       │
+│  ┌───────────┐    ┌───────────┐    ┌──────────────────┐    │
+│  │ Input x   │───▶│  P matrix │───▶│  Q matrix        │    │
+│  │ [d×1]     │    │  [d×r]    │    │  [r×hidden]      │    │
+│  └───────────┘    └───────────┘    └──────────────────┘    │
+│                                             │               │
+│                                             ▼               │
+│                              ┌──────────────────────────┐  │
+│                              │ Threshold/Top-K Selection │  │
+│                              │ Active Neuron Indices    │  │
+│                              └──────────────────────────┘  │
+└─────────────────────────────────────────────────────────────┘
+                              │
+                              ▼
+┌─────────────────────────────────────────────────────────────┐
+│                    Sparse FFN Forward                       │
+│  ┌─────────────────┐                                        │
+│  │ Hot Weights     │◀── Always in memory                    │
+│  │ (20% neurons)   │                                        │
+│  └─────────────────┘                                        │
+│           │                                                 │
+│           ▼                                                 │
+│  ┌─────────────────┐    ┌──────────────────────────────┐   │
+│  │ W1[active] @ x  │───▶│ Activation (ReLU/GELU/SiLU)  │   │
+│  └─────────────────┘    └──────────────────────────────┘   │
+│           │                                                 │
+│           ▼                                                 │
+│  ┌─────────────────┐                                        │
+│  │ W2 @ activated  │───▶ Output                             │
+│  └─────────────────┘                                        │
+└─────────────────────────────────────────────────────────────┘
+```
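+
+The predictor stage of the diagram can be sketched as two small matrix-vector products followed by top-K selection. This is an illustration under assumed row-major layouts and hypothetical function names, not the crate's implementation:
+
+```rust
+/// Multiply a row vector `v` (length `rows`) by a row-major matrix `m`
+/// of shape [rows x cols], returning a vector of length `cols`.
+fn vec_mat(v: &[f32], m: &[f32], rows: usize, cols: usize) -> Vec<f32> {
+    assert_eq!(v.len(), rows);
+    assert_eq!(m.len(), rows * cols);
+    let mut out = vec![0.0f32; cols];
+    for (i, &vi) in v.iter().enumerate() {
+        for (o, &mij) in out.iter_mut().zip(&m[i * cols..(i + 1) * cols]) {
+            *o += vi * mij;
+        }
+    }
+    out
+}
+
+/// Score every hidden neuron as (x · P) · Q with P: [d x r], Q: [r x hidden].
+/// Cost is d·r + r·hidden multiply-adds instead of d·hidden for a dense layer.
+pub fn predict_scores(x: &[f32], p: &[f32], q: &[f32], r: usize, hidden: usize) -> Vec<f32> {
+    let z = vec_mat(x, p, x.len(), r);
+    vec_mat(&z, q, r, hidden)
+}
+
+/// Indices of the `k` highest-scoring neurons, returned in ascending order.
+pub fn top_k(scores: &[f32], k: usize) -> Vec<usize> {
+    let mut idx: Vec<usize> = (0..scores.len()).collect();
+    idx.sort_by(|&a, &b| scores[b].total_cmp(&scores[a]));
+    idx.truncate(k);
+    idx.sort_unstable();
+    idx
+}
+```
+
+The indices returned by `top_k` are the "Active Neuron Indices" that the sparse FFN stage consumes.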
+
+## π-Based Systems
+
+### Why π Matters
+
+In 3/5/7-bit math, you deliberately throw away bits. π lets you check whether the system is still behaving honestly.
+
+```rust
+use ruvector_sparse_inference::pi::*;
+
+// π as calibration constant
+let calibration = PiCalibration::for_lane(PrecisionLane::Bit5);
+let normalized = calibration.normalize(value);
+
+// π as drift detector
+let mut detector = DriftDetector::new(PrecisionLane::Bit5);
+let honesty = detector.check(&original, &quantized);
+if honesty.should_escalate {
+    // Precision too low or hardware misbehaving
+}
+
+// π for angular embeddings
+let angular = AngularEmbedding::new(PrecisionLane::Bit7);
+let projected = angular.project(&vector);
+let distance = angular.angular_distance(&a, &b);
+
+// π for deterministic chaos
+let chaos = PiChaos::new();
+let jitter = chaos.jitter(index);  // Same input = same output, always
+let schedule = chaos.schedule_order(n_agents, round);
+```
+
+### Key Constants
+
+```rust
+use std::f32::consts::PI;
+
+// π-based scale factors (avoid power-of-2 resonance)
+pub const PI_SCALE_3BIT: f32 = PI / 4.0;   // ~0.785
+pub const PI_SCALE_5BIT: f32 = PI / 16.0;  // ~0.196
+pub const PI_SCALE_7BIT: f32 = PI / 64.0;  // ~0.049
+```
+
+## Precision Lane Graduation
+
+```rust
+use ruvector_sparse_inference::precision::*;
+
+// Configure graduation policy
+let config = GraduationConfig {
+    novelty_threshold: 0.3,
+    drift_persistence_threshold: 5,
+    confidence_threshold: 0.8,
+    escalation_budget: 0.2,
+};
+
+let mut policy = GraduationPolicy::new(PrecisionLane::Bit5, config);
+
+// Update metrics during inference
+policy.update_metrics(GraduationMetrics {
+    novelty: 0.4,      // High novelty detected
+    drift_steps: 3,
+    confidence: 0.9,
+    cost_usage: 0.1,
+    ..Default::default()
+});
+
+// Check graduation decision (binding the target lane keeps the match exhaustive)
+match policy.decide() {
+    GraduationDecision::Stay => { /* Continue at Bit5 */ }
+    GraduationDecision::Escalate(_lane) => { /* Upgrade, e.g. to Bit7 */ }
+    GraduationDecision::Demote(_lane) => { /* Downgrade, e.g. to Bit3 */ }
+}
+```
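+
+How the thresholds combine is not spelled out here. One plausible reading, sketched with hypothetical types and an assumed rule (escalate on novelty or persistent drift while budget remains, demote when confident and quiet), looks like this:
+
+```rust
+#[derive(Clone, Copy, Debug, PartialEq)]
+pub enum Lane { Bit3, Bit5, Bit7 }
+
+#[derive(Debug, PartialEq)]
+pub enum Decision { Stay, Escalate(Lane), Demote(Lane) }
+
+/// Hypothetical mirror of the configuration fields shown above.
+pub struct Config {
+    pub novelty_threshold: f32,
+    pub drift_persistence_threshold: u32,
+    pub confidence_threshold: f32,
+    pub escalation_budget: f32,
+}
+
+/// Hypothetical mirror of the metrics fields shown above.
+pub struct Metrics {
+    pub novelty: f32,
+    pub drift_steps: u32,
+    pub confidence: f32,
+    pub cost_usage: f32,
+}
+
+/// Assumed rule, not the crate's actual policy.
+pub fn decide(current: Lane, cfg: &Config, m: &Metrics) -> Decision {
+    let wants_up = m.novelty > cfg.novelty_threshold
+        || m.drift_steps >= cfg.drift_persistence_threshold;
+    let can_afford = m.cost_usage < cfg.escalation_budget;
+    let stable = !wants_up && m.confidence >= cfg.confidence_threshold;
+    match current {
+        Lane::Bit3 if wants_up && can_afford => Decision::Escalate(Lane::Bit5),
+        Lane::Bit5 if wants_up && can_afford => Decision::Escalate(Lane::Bit7),
+        Lane::Bit5 if stable => Decision::Demote(Lane::Bit3),
+        Lane::Bit7 if stable => Decision::Demote(Lane::Bit5),
+        _ => Decision::Stay,
+    }
+}
+```
+
+With the configuration and metrics from the example above (novelty 0.4 against a 0.3 threshold, cost usage 0.1 against a 0.2 budget), this sketch escalates from `Bit5` to `Bit7`.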
+
+## Configuration Options
+
+### Sparsity Selection
+
+```rust
+// Top-K selection
+SparsityConfig::with_top_k(100);
+
+// Threshold-based selection
+SparsityConfig::with_threshold(0.01);
+
+// Target sparsity ratio
+SparsityConfig::with_target_sparsity(0.95); // 95% sparse (5% of neurons active)
+```
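+
+To make the relationship between the three modes concrete, here is a small sketch with hypothetical helpers. It assumes a target sparsity maps to a neuron count by rounding, and that threshold selection keeps neurons whose predicted score exceeds the threshold:
+
+```rust
+/// Neurons kept for a target sparsity: k = round((1 - sparsity) * hidden).
+pub fn active_count(hidden: usize, target_sparsity: f32) -> usize {
+    ((1.0 - target_sparsity) * hidden as f32).round() as usize
+}
+
+/// Threshold-based selection: indices whose score is above `threshold`.
+pub fn select_by_threshold(scores: &[f32], threshold: f32) -> Vec<usize> {
+    scores
+        .iter()
+        .enumerate()
+        .filter(|&(_, &s)| s > threshold)
+        .map(|(i, _)| i)
+        .collect()
+}
+```
+
+Top-K fixes the compute cost per token, while a threshold lets the number of active neurons vary with the input.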
+
+### Activation Functions
+
+- `Relu`: max(0, x)
+- `Gelu`: Gaussian Error Linear Unit
+- `Silu`/`Swish`: x * sigmoid(x)
+- `Identity`: No activation
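+
+For reference, scalar versions of these activations can be written as below. The GELU here uses the common tanh approximation, which is an assumption and not necessarily the polynomial the crate's SIMD kernels use:
+
+```rust
+/// ReLU: max(0, x).
+pub fn relu(x: f32) -> f32 {
+    x.max(0.0)
+}
+
+/// SiLU / Swish: x * sigmoid(x), written as x / (1 + e^(-x)).
+pub fn silu(x: f32) -> f32 {
+    x / (1.0 + (-x).exp())
+}
+
+/// GELU, tanh approximation:
+/// 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).
+pub fn gelu_tanh(x: f32) -> f32 {
+    const SQRT_2_OVER_PI: f32 = 0.797_884_6;
+    0.5 * x * (1.0 + (SQRT_2_OVER_PI * (x + 0.044715 * x * x * x)).tanh())
+}
+```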
+
+### Quantization
+
+```rust
+use ruvector_sparse_inference::memory::QuantizedWeights;
+
+// Int8 quantization
+let weights = QuantizedWeights::quantize_int8(&original);
+let dequantized = weights.dequantize_row(0);
+
+// Int4 quantization (GGUF-style)
+let weights = QuantizedWeights::quantize_int4(&original, 32);
+```
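+
+As a rough picture of what Int8 quantization of one weight row involves, the sketch below uses symmetric per-row scaling. The `QuantizedRow` type and both functions are hypothetical and are not the crate's `QuantizedWeights` implementation:
+
+```rust
+/// One quantized row: i8 values plus the scale needed to recover f32.
+pub struct QuantizedRow {
+    pub values: Vec<i8>,
+    pub scale: f32,
+}
+
+/// Symmetric quantization: map [-max_abs, max_abs] onto [-127, 127].
+pub fn quantize_row_int8(row: &[f32]) -> QuantizedRow {
+    let max_abs = row.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
+    // An all-zero row gets scale 1.0 to avoid dividing by zero.
+    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
+    let values = row
+        .iter()
+        .map(|&x| (x / scale).round().clamp(-127.0, 127.0) as i8)
+        .collect();
+    QuantizedRow { values, scale }
+}
+
+/// Recover approximate f32 values: x ≈ q * scale.
+pub fn dequantize_row(q: &QuantizedRow) -> Vec<f32> {
+    q.values.iter().map(|&v| v as f32 * q.scale).collect()
+}
+```
+
+Rounding to nearest keeps the round-trip error of each element to about half a quantization step (`scale / 2`).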
+
+## WASM Support
+
+```rust
+// In ruvector-sparse-inference-wasm (signatures only; bodies omitted)
+use wasm_bindgen::prelude::*;
+
+#[wasm_bindgen]
+pub fn create_sparse_engine(
+    input_dim: usize,
+    hidden_dim: usize,
+    sparsity: f32,
+) -> Result<SparseEngineWasm, JsValue>;
+
+#[wasm_bindgen]
+pub fn infer(
+    engine: &SparseEngineWasm,
+    input: &[f32],
+) -> Result<Vec<f32>, JsValue>;
+```
+
+## Integration
+
+### With RuVector (EmbeddingProvider)
+
+```rust
+use ruvector_sparse_inference::integration::SparseEmbeddingProvider;
+
+let provider = SparseEmbeddingProvider::new(config)?;
+let embedding = provider.embed("Hello world")?;
+```
+
+### With RuvLLM (InferenceBackend)
+
+```rust
+use ruvector_sparse_inference::integration::SparseInferenceBackend;
+
+let backend = SparseInferenceBackend::new(model_path)?;
+let output = backend.generate(tokens, &config)?;
+```
+
+## Benchmarks
+
+Run benchmarks:
+
+```bash
+cargo bench -p ruvector-sparse-inference
+```
+
+SIMD kernel benchmarks:
+
+```bash
+cargo bench -p ruvector-sparse-inference --bench simd_kernels
+```
+
+## Testing
+
+```bash
+# Unit tests
+cargo test -p ruvector-sparse-inference
+
+# Integration tests
+cargo test -p ruvector-sparse-inference --test '*'
+```
+
+## Hardware Targets
+
+| Platform | SIMD Backend | Precision Lanes |
+|----------|--------------|-----------------|
+| x86_64 (AVX2) | 256-bit vectors | All |
+| x86_64 (SSE4.1) | 128-bit vectors | All |
+| ARM (NEON) | 128-bit vectors | All |
+| WASM | 128-bit SIMD | Bit5, Bit7 |
+| ESP32 | Scalar | Bit3 only |
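+
+Choosing among these backends at runtime, and caching the result as the "Cached Feature Detection" item describes, can be sketched with the standard library alone. The `SimdBackend` enum and function names are hypothetical, and the WASM and ESP32 rows are left to the scalar fallback here:
+
+```rust
+use std::sync::OnceLock;
+
+#[derive(Clone, Copy, Debug, PartialEq)]
+pub enum SimdBackend {
+    Avx2,
+    Sse41,
+    Neon,
+    Scalar,
+}
+
+/// Detect once, then return the cached answer on every later call.
+pub fn simd_backend() -> SimdBackend {
+    static BACKEND: OnceLock<SimdBackend> = OnceLock::new();
+    *BACKEND.get_or_init(detect)
+}
+
+fn detect() -> SimdBackend {
+    #[cfg(target_arch = "x86_64")]
+    {
+        if std::arch::is_x86_feature_detected!("avx2")
+            && std::arch::is_x86_feature_detected!("fma")
+        {
+            return SimdBackend::Avx2;
+        }
+        if std::arch::is_x86_feature_detected!("sse4.1") {
+            return SimdBackend::Sse41;
+        }
+    }
+    #[cfg(target_arch = "aarch64")]
+    {
+        if std::arch::is_aarch64_feature_detected!("neon") {
+            return SimdBackend::Neon;
+        }
+    }
+    SimdBackend::Scalar
+}
+```
+
+Hot loops can then branch on `simd_backend()` without repeating the detection work.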
+
+## The Deeper Insight
+
+> π is not about geometry here. It is about injecting infinite structure into finite machines without breaking determinism.
+
+Low-bit quantization simplifies the math. π reintroduces richness without cost.
+
+- Quantization makes systems stable
+- π makes them expressive
+- Together: the math stays boring, the behavior stays interesting, the proofs stay simple
+
+## Cargo Features
+
+- `default = ["simd"]`
+- `simd`: Enable SIMD optimizations
+- `parallel`: Enable parallel computation with rayon
+- `quantization`: Enable quantization support
+- `npu`: Enable ARM NPU support (experimental)
+
+## License
+
+MIT OR Apache-2.0