Squashed 'vendor/ruvector/' content from commit b64c2172

git-subtree-dir: vendor/ruvector git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
2026-02-28 14:39:40 -05:00
commit d803bfe2b1
7854 changed files with 3522914 additions and 0 deletions
--- a/crates/ruvector-attention/README.md
+++ b/crates/ruvector-attention/README.md
@@ -0,0 +1,861 @@
+# ruvector-attention
+
+[![Crates.io](https://img.shields.io/crates/v/ruvector-attention.svg)](https://crates.io/crates/ruvector-attention)
+[![Documentation](https://docs.rs/ruvector-attention/badge.svg)](https://docs.rs/ruvector-attention)
+[![License](https://img.shields.io/badge/license-MIT%2FApache--2.0-blue.svg)](LICENSE)
+[![Tests](https://img.shields.io/badge/tests-142%20passing-brightgreen.svg)]()
+
+**46 attention mechanisms grounded in 7 mathematical theories -- from Flash Attention to optimal transport -- in one crate.**
+
+```bash
+cargo add ruvector-attention
+```
+
+Attention is the core operation in transformers, vector search, and graph neural networks, but most libraries give you one or two flavors and call it done. `ruvector-attention` ships 46 mechanisms spanning standard dot-product, sparse (Flash, linear, local-global), geometric (hyperbolic, mixed-curvature), graph (GAT, RoPE), and mixture-of-experts -- all SIMD-accelerated with quantization support. Pick the right attention for your data shape instead of forcing everything through softmax(QK^T/sqrt(d))V.
+
+| | ruvector-attention | PyTorch `nn.MultiheadAttention` | FlashAttention (standalone) | xFormers |
+|---|---|---|---|---|
+| **Mechanism count** | 46 | 1 (scaled dot-product) | 1 (Flash) | ~5 |
+| **Geometric attention** | Hyperbolic, spherical, mixed-curvature | No | No | No |
+| **Graph attention** | Edge-featured GAT, RoPE for graphs | No | No | Limited |
+| **Optimal transport** | Sliced Wasserstein, centroid OT | No | No | No |
+| **Topology-gated** | Coherence-based mode switching | No | No | No |
+| **Quantization** | Per-component (8-bit E, 5-bit H/S) | Via separate tools | No | Limited |
+| **Language** | Rust (with WASM target) | Python/C++ | CUDA only | Python/CUDA |
+| **SIMD acceleration** | Built in (4-way unrolled) | Via backend | CUDA only | Via backend |
+
+| Feature | What It Does | Why It Matters |
+|---------|-------------|----------------|
+| **Flash Attention** | O(n) memory tiled computation | Process long sequences without running out of memory |
+| **Mixed Curvature Fusion** | Combines Euclidean, hyperbolic, and spherical spaces in one pass | Model hierarchies, clusters, and flat data simultaneously |
+| **Optimal Transport Attention** | Uses Wasserstein distance instead of dot-product similarity | Better distribution matching for retrieval and generation |
+| **Topology-Gated Switching** | Automatically picks attention mode based on local coherence | Self-adapts to data characteristics without manual tuning |
+| **Information Bottleneck** | Compresses attention via KL minimization | Keeps only the signal, discards noise |
+| **PDE/Diffusion Attention** | Runs heat equation on a similarity graph | Smooth, noise-robust attention for irregular data |
+| **Unified Diagnostics** | Health monitoring and automatic mode selection across all 7 theories | One report tells you which attention works best for your data |
+
+> Part of the [RuVector](https://github.com/ruvnet/ruvector) ecosystem -- the self-learning vector database with graph intelligence.
+
+## Supported Attention Mechanisms
+
+### Standard Attention
+- **Scaled Dot-Product**: `softmax(QK^T / √d)V`
+- **Multi-Head**: Parallel attention heads with diverse representations
+
+### Sparse Attention (Memory Efficient)
+- **Flash Attention**: O(n) memory complexity with tiled computation
+- **Linear Attention**: O(n) complexity using kernel approximation
+- **Local-Global**: Sliding window + global tokens (Longformer-style)
+
+### Geometric Attention
+- **Hyperbolic Attention**: Attention in hyperbolic space for hierarchical data
+- **Mixed Curvature**: Dynamic curvature for complex geometries
+
+### Graph Attention
+- **Edge-Featured GAT**: Graph attention with edge features
+- **RoPE**: Rotary Position Embeddings for graphs
+
+### Mixture-of-Experts
+- **MoE Attention**: Learned routing to specialized expert modules
+- **Top-k Routing**: Efficient expert selection
+
+## 7 Mathematical Theories
+
+This crate implements attention mechanisms grounded in 7 distinct mathematical theories:
+
+| # | Theory | Module | Key Types | Use Case |
+|---|--------|--------|-----------|----------|
+| 1 | **Optimal Transport** | `transport` | `SlicedWassersteinAttention`, `CentroidOTAttention` | Distribution matching, Earth mover distance |
+| 2 | **Mixed Curvature** | `curvature` | `MixedCurvatureFusedAttention`, `TangentSpaceMapper` | Product spaces E^e × H^h × S^s |
+| 3 | **Topology** | `topology` | `TopologyGatedAttention`, `WindowCoherence` | Coherence-based mode switching |
+| 4 | **Information Geometry** | `info_geometry` | `FisherMetric`, `NaturalGradient` | Natural gradient descent |
+| 5 | **Information Bottleneck** | `info_bottleneck` | `InformationBottleneck`, `KLDivergence` | Compression via KL minimization |
+| 6 | **PDE/Diffusion** | `pde_attention` | `DiffusionAttention`, `GraphLaplacian` | Heat equation on similarity graph |
+| 7 | **Unified Diagnostics** | `unified_report` | `GeometryReport`, `ReportBuilder` | Health monitoring & mode selection |
+
+### Theory 1: Optimal Transport Attention
+
+Attention as mass transport between query and key distributions using Wasserstein distance.
+
+```rust
+use ruvector_attention::{SlicedWassersteinAttention, SlicedWassersteinConfig};
+
+// Configure Sliced Wasserstein with 16 random projections
+let config = SlicedWassersteinConfig {
+    num_projections: 16,
+    num_candidates: 64,
+    dim: 512,
+    ..Default::default()
+};
+
+let ot_attention = SlicedWassersteinAttention::new(config);
+
+// Compute OT-based attention scores
+let query = vec![0.5; 512];
+let keys: Vec<&[f32]> = key_data.iter().map(|k| k.as_slice()).collect();
+let values: Vec<&[f32]> = value_data.iter().map(|v| v.as_slice()).collect();
+
+let output = ot_attention.compute_sliced(&query, &keys, &values)?;
+```
+
+**Key Features:**
+- Sliced Wasserstein with cached sorted projections
+- Two-stage filtering: cheap dot-product → expensive OT kernel
+- Centroid OT: cluster keys into M centroids for O(M) transport
+
+### Theory 2: Mixed Curvature Attention
+
+Attention in product manifolds combining Euclidean (E), Hyperbolic (H), and Spherical (S) spaces.
+
+```rust
+use ruvector_attention::{
+    MixedCurvatureFusedAttention, FusedCurvatureConfig,
+    TangentSpaceMapper, TangentSpaceConfig
+};
+
+// Configure mixed curvature with component dimensions
+let config = FusedCurvatureConfig {
+    euclidean_dim: 256,
+    hyperbolic_dim: 128,
+    spherical_dim: 128,
+    curvature_h: -1.0,  // Negative for hyperbolic
+    curvature_s: 1.0,   // Positive for spherical
+    ..Default::default()
+};
+
+let mixed_attention = MixedCurvatureFusedAttention::new(config);
+
+// Map hyperbolic vectors to tangent space for efficient computation
+let mapper = TangentSpaceMapper::new(TangentSpaceConfig::default());
+let tangent_keys = mapper.map_to_tangent(&hyperbolic_keys);
+```
+
+**Key Features:**
+- Tangent space mapping (avoids expensive geodesic computations)
+- Fused dot kernel: single vectorized loop for E+H+S similarities
+- Per-head learned mixing weights
+- Component quantization: 8-bit E, 5-bit H/S
+
+### Theory 3: Topology-Gated Attention
+
+Adaptive attention that switches modes based on local coherence metrics.
+
+```rust
+use ruvector_attention::{
+    TopologyGatedAttention, TopologyGatedConfig,
+    AttentionMode, PolicyConfig, CoherenceMetric
+};
+
+let config = TopologyGatedConfig {
+    dim: 512,
+    policy: PolicyConfig {
+        stable_threshold: 0.8,    // High coherence → Stable mode
+        cautious_threshold: 0.5,  // Medium → Cautious mode
+        freeze_threshold: 0.3,    // Low → Freeze mode
+        hysteresis: 0.05,         // Prevents mode oscillation
+        ..Default::default()
+    },
+    ..Default::default()
+};
+
+let gated = TopologyGatedAttention::new(config);
+
+// Attention automatically adjusts based on window coherence
+let output = gated.compute_gated(&query, &keys, &values)?;
+let mode = gated.current_mode(); // Stable, Cautious, or Freeze
+```
+
+**Coherence Metrics:**
+| Metric | Description |
+|--------|-------------|
+| `BoundaryMass` | Mass near window boundaries |
+| `CutProxy` | Proxy for graph cut quality |
+| `Disagreement` | Variance in attention weights |
+| `SimilarityVariance` | Local similarity variance |
+
+### Theory 4: Information Geometry
+
+Natural gradient optimization using the Fisher Information Matrix.
+
+```rust
+use ruvector_attention::{FisherMetric, FisherConfig, NaturalGradient, NaturalGradientConfig};
+
+// Fisher metric for probability distributions
+let fisher = FisherMetric::new(FisherConfig {
+    eps: 1e-8,
+    max_cg_iters: 50,
+    cg_tol: 1e-6,
+});
+
+// Compute F * v (Fisher-vector product)
+let probs = vec![0.25, 0.25, 0.25, 0.25];
+let direction = vec![0.1, -0.1, 0.05, -0.05];
+let fv = fisher.apply(&probs, &direction);
+
+// Natural gradient optimizer
+let ng = NaturalGradient::new(NaturalGradientConfig {
+    lr: 0.1,
+    use_diagonal: false,  // Full CG solve (more accurate)
+    fisher: FisherConfig::default(),
+});
+
+// Update logits using natural gradient: θ ← θ - lr * F^{-1} * ∇L
+let new_logits = ng.step_logits(&logits, &grad_logits);
+```
+
+**Key Features:**
+- Conjugate gradient solver for F^{-1} * v
+- Diagonal approximation for speed
+- SIMD-accelerated matrix-vector operations
+
+### Theory 5: Information Bottleneck
+
+Attention compression via the Information Bottleneck principle.
+
+```rust
+use ruvector_attention::{InformationBottleneck, IBConfig, KLDivergence, DiagonalGaussian};
+
+// Information bottleneck layer
+let ib = InformationBottleneck::new(IBConfig {
+    beta: 0.1,        // Compression strength
+    z_dim: 64,        // Bottleneck dimension
+    anneal_steps: 1000,
+    ..Default::default()
+});
+
+// Compute KL divergence between Gaussian and unit normal
+let gaussian = DiagonalGaussian {
+    mean: vec![0.1; 64],
+    log_var: vec![-1.0; 64],
+};
+let kl = KLDivergence::gaussian_to_unit(&gaussian);
+
+// Compress attention weights
+let (compressed, kl_loss) = ib.compress_attention_weights(&weights, temperature);
+
+// Reparameterized sampling
+let z = ib.sample(&mean, &log_var, &epsilon);
+```
+
+**Key Features:**
+- KL divergence: Gaussian→Unit, Categorical, Jensen-Shannon
+- Variational Information Bottleneck (VIB)
+- Temperature annealing for curriculum learning
+
+### Theory 6: PDE/Diffusion Attention
+
+Attention as heat diffusion on the key similarity graph.
+
+```rust
+use ruvector_attention::{
+    DiffusionAttention, DiffusionConfig,
+    GraphLaplacian, LaplacianType
+};
+
+// Build graph Laplacian from keys
+let laplacian = GraphLaplacian::from_keys(
+    &keys,
+    sigma,  // Gaussian kernel bandwidth
+    LaplacianType::SymmetricNormalized
+);
+
+// Diffusion attention with heat equation
+let config = DiffusionConfig {
+    t: 1.0,           // Diffusion time
+    num_steps: 10,    // Discretization steps
+    sigma: 1.0,       // Kernel bandwidth
+    use_knn: true,    // Sparse Laplacian
+    k: 16,            // k-NN neighbors
+    laplacian_type: LaplacianType::SymmetricNormalized,
+    ..Default::default()
+};
+
+let diffusion = DiffusionAttention::new(config);
+
+// Compute diffused attention
+let output = diffusion.compute_diffusion(&query, &keys, &values)?;
+
+// Multi-scale diffusion (captures different granularities)
+let scales = diffusion.compute_multiscale(&query, &keys, 4);
+```
+
+**Laplacian Types:**
+| Type | Formula | Properties |
+|------|---------|------------|
+| `Unnormalized` | D - W | Graph spectrum analysis |
+| `SymmetricNormalized` | I - D^{-1/2}WD^{-1/2} | Symmetric, eigenvalues in [0,2] |
+| `RandomWalk` | I - D^{-1}W | Probability transitions |
+
+### Theory 7: Unified Geometry Report
+
+Diagnostic dashboard combining all metrics for intelligent attention mode selection.
+
+```rust
+use ruvector_attention::{
+    ReportBuilder, ReportConfig, GeometryReport,
+    MetricType, AttentionRecommendation
+};
+
+// Build comprehensive geometry report
+let report = ReportBuilder::new(ReportConfig::default())
+    .with_ot_distance(0.15)
+    .with_topology_coherence(0.82)
+    .with_ib_kl(0.05)
+    .with_diffusion_energy(0.3)
+    .with_attention_entropy(2.1)
+    .build();
+
+// Get health score (0-1)
+println!("Health: {:.2}", report.health_score);
+
+// Get automatic attention mode recommendation
+match report.recommendation {
+    AttentionRecommendation::Standard => { /* Use standard attention */ }
+    AttentionRecommendation::Sparse => { /* Switch to sparse */ }
+    AttentionRecommendation::Geometric => { /* Use hyperbolic/mixed */ }
+    AttentionRecommendation::Diffusion => { /* Use diffusion attention */ }
+}
+
+// Check individual metrics
+for metric in &report.metrics {
+    println!("{:?}: {} ({})",
+        metric.metric_type,
+        metric.value,
+        metric.status()
+    );
+}
+```
+
+**Metrics Tracked:**
+| Metric | Healthy Range | Warning | Critical |
+|--------|---------------|---------|----------|
+| OT Distance | 0.0 - 0.5 | > 0.3 | > 0.7 |
+| Topology Coherence | 0.5 - 1.0 | < 0.3 | < 0.1 |
+| IB KL | 0.0 - 0.2 | > 0.5 | > 1.0 |
+| Diffusion Energy | 0.0 - 1.0 | > 2.0 | > 5.0 |
+| Attention Entropy | 1.0 - 4.0 | < 0.5 | < 0.1 |
+
+## Quick Start
+
+```rust
+use ruvector_attention::sdk::*;
+
+// Simple multi-head attention
+let attention = multi_head(768, 12)
+    .dropout(0.1)
+    .causal(true)
+    .build()?;
+
+// Use preset configurations
+let bert = AttentionPreset::Bert.builder(768).build()?;
+let gpt = AttentionPreset::Gpt.builder(768).build()?;
+
+// Build pipelines with normalization
+let pipeline = AttentionPipeline::new()
+    .add_attention(attention)
+    .add_norm(NormType::LayerNorm)
+    .add_residual();
+
+// Compute attention
+let query = vec![0.5; 768];
+let keys = vec![&query[..]; 10];
+let values = vec![&query[..]; 10];
+
+let output = pipeline.run(&query, &keys, &values)?;
+```
+
+## Installation
+
+Add to your `Cargo.toml`:
+
+```toml
+[dependencies]
+ruvector-attention = "0.1"
+```
+
+Or with specific features:
+
+```toml
+[dependencies]
+ruvector-attention = { version = "0.1", features = ["simd", "wasm"] }
+```
+
+## SDK Overview
+
+### Builder API
+
+The builder provides a fluent interface for configuring attention:
+
+```rust
+use ruvector_attention::sdk::*;
+
+// Flash attention for long sequences
+let flash = flash(1024, 128)  // dim, block_size
+    .causal(true)
+    .dropout(0.1)
+    .build()?;
+
+// Linear attention for O(n) complexity
+let linear = linear(512, 256)  // dim, num_features
+    .build()?;
+
+// MoE attention with 8 experts
+let moe = moe(512, 8, 2)  // dim, num_experts, top_k
+    .expert_capacity(1.25)
+    .jitter_noise(0.01)
+    .build()?;
+
+// Hyperbolic attention for hierarchies
+let hyperbolic = hyperbolic(512, -1.0)  // dim, curvature
+    .build()?;
+```
+
+### Pipeline API
+
+Compose attention with pre/post processing:
+
+```rust
+use ruvector_attention::sdk::*;
+
+let attention = multi_head(768, 12).build()?;
+
+let pipeline = AttentionPipeline::new()
+    .add_norm(NormType::LayerNorm)     // Pre-normalization
+    .add_attention(attention)           // Attention layer
+    .add_dropout(0.1)                   // Dropout
+    .add_residual()                     // Residual connection
+    .add_norm(NormType::RMSNorm);      // Post-normalization
+
+let output = pipeline.run(&query, &keys, &values)?;
+```
+
+### Preset Configurations
+
+Pre-configured attention for popular models:
+
+```rust
+use ruvector_attention::sdk::presets::*;
+
+// Model-specific presets
+let bert = AttentionPreset::Bert.builder(768).build()?;
+let gpt = AttentionPreset::Gpt.builder(768).build()?;
+let longformer = AttentionPreset::Longformer.builder(512).build()?;
+let flash = AttentionPreset::FlashOptimized.builder(1024).build()?;
+let t5 = AttentionPreset::T5.builder(768).build()?;
+let vit = AttentionPreset::ViT.builder(768).build()?;
+
+// Smart selection based on use case
+let attention = for_sequences(512, max_len).build()?;  // Auto-select by length
+let graph_attn = for_graphs(256, hierarchical).build()?;  // Graph attention
+let fast_attn = for_large_scale(1024).build()?;  // Flash attention
+
+// By model name
+let bert = from_model_name("bert", 768)?;
+let gpt2 = from_model_name("gpt2", 768)?;
+```
+
+## Architecture
+
+```
+ruvector-attention/
+├── src/
+│   ├── lib.rs                 # Main crate entry
+│   ├── error.rs               # Error types
+│   ├── traits.rs              # Core attention traits
+│   │
+│   ├── attention/             # Standard attention
+│   │   ├── scaled_dot_product.rs
+│   │   └── multi_head.rs
+│   │
+│   ├── sparse/                # Sparse attention (O(n) memory)
+│   │   ├── flash.rs           # Flash attention (tiled)
+│   │   ├── linear.rs          # Kernel approximation
+│   │   └── local_global.rs    # Longformer-style
+│   │
+│   ├── graph/                 # Graph attention
+│   │   ├── edge_featured.rs   # GAT with edge features
+│   │   ├── dual_space.rs      # Dual-space attention
+│   │   └── rope.rs            # Rotary embeddings
+│   │
+│   ├── hyperbolic/            # Hyperbolic geometry
+│   │   ├── hyperbolic_attention.rs
+│   │   ├── mixed_curvature.rs
+│   │   └── poincare.rs
+│   │
+│   ├── moe/                   # Mixture-of-Experts
+│   │   ├── expert.rs          # Expert modules
+│   │   ├── router.rs          # Top-k routing
+│   │   └── moe_attention.rs
+│   │
+│   ├── transport/             # [Theory 1] Optimal Transport
+│   │   ├── sliced_wasserstein.rs   # Sliced OT attention
+│   │   ├── centroid_ot.rs          # Centroid-based OT
+│   │   └── cached_projections.rs   # Projection caching
+│   │
+│   ├── curvature/             # [Theory 2] Mixed Curvature
+│   │   ├── tangent_space.rs        # Tangent space mapping
+│   │   ├── fused_attention.rs      # Fused E+H+S kernel
+│   │   └── component_quantizer.rs  # 8-bit/5-bit quantization
+│   │
+│   ├── topology/              # [Theory 3] Topology Gating
+│   │   ├── coherence.rs            # Window coherence metrics
+│   │   ├── policy.rs               # 3-mode policy (Stable/Cautious/Freeze)
+│   │   └── gated_attention.rs      # Adaptive gated attention
+│   │
+│   ├── info_geometry/         # [Theory 4] Information Geometry
+│   │   ├── fisher.rs               # Fisher information matrix
+│   │   └── natural_gradient.rs     # Natural gradient descent
+│   │
+│   ├── info_bottleneck/       # [Theory 5] Information Bottleneck
+│   │   ├── kl_divergence.rs        # KL, JS divergences
+│   │   └── bottleneck.rs           # VIB layer
+│   │
+│   ├── pde_attention/         # [Theory 6] PDE/Diffusion
+│   │   ├── laplacian.rs            # Graph Laplacian construction
+│   │   └── diffusion.rs            # Heat equation attention
+│   │
+│   ├── unified_report/        # [Theory 7] Unified Diagnostics
+│   │   ├── metrics.rs              # Metric types and values
+│   │   ├── report.rs               # Geometry report builder
+│   │   └── recommendation.rs       # Attention mode recommendations
+│   │
+│   ├── training/              # Training utilities
+│   │   ├── loss.rs            # InfoNCE, contrastive losses
+│   │   ├── optimizer.rs       # SGD, Adam, AdamW
+│   │   └── curriculum.rs      # Curriculum scheduling
+│   │
+│   └── sdk/                   # High-level SDK
+│       ├── builder.rs         # Fluent builder API
+│       ├── pipeline.rs        # Composable pipelines
+│       └── presets.rs         # Model presets (BERT, GPT, etc.)
+```
+
+## Examples
+
+### Transformer Block
+
+```rust
+use ruvector_attention::sdk::*;
+
+fn create_transformer_block(dim: usize) -> AttentionResult<AttentionPipeline> {
+    let attention = multi_head(dim, 12)
+        .dropout(0.1)
+        .build()?;
+
+    Ok(AttentionPipeline::new()
+        .add_norm(NormType::LayerNorm)
+        .add_attention(attention)
+        .add_dropout(0.1)
+        .add_residual())
+}
+```
+
+### Long Context Processing
+
+```rust
+use ruvector_attention::sdk::*;
+
+fn create_long_context_attention(dim: usize, max_len: usize)
+    -> AttentionResult<Box<dyn Attention>> {
+    if max_len <= 2048 {
+        multi_head(dim, 12).build()
+    } else if max_len <= 16384 {
+        local_global(dim, 512).build()
+    } else {
+        linear(dim, dim / 4).build()
+    }
+}
+```
+
+### Graph Neural Network
+
+```rust
+use ruvector_attention::sdk::*;
+
+fn create_graph_attention(dim: usize, is_tree: bool)
+    -> AttentionResult<Box<dyn Attention>> {
+    if is_tree {
+        hyperbolic(dim, -1.0).build()  // Hyperbolic for tree-like
+    } else {
+        multi_head(dim, 8).build()     // Standard for general graphs
+    }
+}
+```
+
+## Performance
+
+### Complexity Comparison
+
+| Mechanism | Time | Memory | Use Case |
+|-----------|------|--------|----------|
+| Scaled Dot-Product | O(n²) | O(n²) | Short sequences |
+| Multi-Head | O(n²) | O(n²) | Standard transformers |
+| Flash Attention | O(n²) | O(n) | Long sequences |
+| Linear Attention | O(n) | O(n) | Very long sequences |
+| Local-Global | O(n·w) | O(n·w) | Document processing |
+| Hyperbolic | O(n²) | O(n²) | Hierarchical data |
+| MoE | O(n²/E) | O(n²) | Specialized tasks |
+
+### Advanced Mechanisms Complexity
+
+| Theory | Mechanism | Time | Memory | Notes |
+|--------|-----------|------|--------|-------|
+| OT | Sliced Wasserstein | O(n·P·log n) | O(n·P) | P = num projections |
+| OT | Centroid OT | O(n + M²) | O(M·d) | M = num centroids |
+| Curvature | Mixed Curvature | O(n²) | O(n²) | Fused E+H+S kernel |
+| Topology | Gated Attention | O(n²) | O(n²) | + O(n) coherence |
+| Info Geo | Natural Gradient | O(n²) | O(n) | CG solver |
+| Info Bottle | VIB | O(n·z) | O(z) | z = bottleneck dim |
+| PDE | Diffusion | O(n²·T) | O(n²) | T = diffusion steps |
+
+Where:
+- `n` = sequence length
+- `w` = local window size
+- `E` = number of experts
+- `P` = number of random projections (typically 8-16)
+- `M` = number of centroids (typically 16-32)
+- `z` = bottleneck dimension
+- `T` = number of diffusion time steps
+
+### Benchmarks
+
+On a typical workload (batch_size=32, seq_len=512, dim=768):
+
+- **Flash Attention**: 2.3x faster, 5x less memory than standard
+- **Linear Attention**: O(n) scaling for sequences >4096
+- **Local-Global**: 60% of standard attention cost for w=256
+- **Sliced Wasserstein**: 1.8x slower than standard, but better distribution matching
+- **Mixed Curvature**: ~1.3x standard with tangent space optimization
+- **Diffusion Attention**: 2-10x slower depending on T, but captures multi-scale structure
+
+## Tutorials
+
+### Tutorial 1: Building a Geometry-Aware Transformer
+
+Combine multiple geometric attention mechanisms for hierarchical data.
+
+```rust
+use ruvector_attention::*;
+use ruvector_attention::sdk::*;
+
+fn create_geometry_aware_block(dim: usize) -> AttentionResult<AttentionPipeline> {
+    // Use hyperbolic attention for hierarchy + standard for local patterns
+    let hyperbolic_attn = hyperbolic(dim, -1.0).build()?;
+
+    // Create a pipeline with pre-norm
+    Ok(AttentionPipeline::new()
+        .add_norm(NormType::RMSNorm)
+        .add_attention(hyperbolic_attn)
+        .add_dropout(0.1)
+        .add_residual())
+}
+```
+
+### Tutorial 2: Adaptive Attention with Unified Report
+
+Use the unified report to automatically select the best attention mode.
+
+```rust
+use ruvector_attention::*;
+
+fn adaptive_attention(
+    query: &[f32],
+    keys: &[&[f32]],
+    values: &[&[f32]],
+) -> AttentionResult<Vec<f32>> {
+    // Build a diagnostic report
+    let report = ReportBuilder::new(ReportConfig::default())
+        .analyze_keys(keys)  // Automatically compute metrics
+        .build();
+
+    // Select attention based on recommendation
+    match report.recommendation {
+        AttentionRecommendation::Standard => {
+            let attn = ScaledDotProductAttention::new(query.len());
+            attn.compute(query, keys, values)
+        }
+        AttentionRecommendation::Sparse => {
+            let attn = FlashAttention::new(query.len(), 64);
+            attn.compute(query, keys, values)
+        }
+        AttentionRecommendation::Geometric => {
+            let config = HyperbolicAttentionConfig {
+                dim: query.len(),
+                curvature: -1.0,
+                ..Default::default()
+            };
+            let attn = HyperbolicAttention::new(config);
+            attn.compute(query, keys, values)
+        }
+        AttentionRecommendation::Diffusion => {
+            let config = DiffusionConfig::default();
+            let attn = DiffusionAttention::new(config);
+            attn.compute_diffusion(query, keys, values)
+        }
+    }
+}
+```
+
+### Tutorial 3: Information Bottleneck for Attention Compression
+
+Use VIB to learn compressed attention representations.
+
+```rust
+use ruvector_attention::*;
+
+struct CompressedAttention {
+    ib: InformationBottleneck,
+    encoder_mean: Vec<f32>,     // Learned weights
+    encoder_log_var: Vec<f32>,  // Learned weights
+}
+
+impl CompressedAttention {
+    fn new(input_dim: usize, bottleneck_dim: usize) -> Self {
+        let ib = InformationBottleneck::new(IBConfig {
+            beta: 0.1,
+            z_dim: bottleneck_dim,
+            ..Default::default()
+        });
+
+        Self {
+            ib,
+            encoder_mean: vec![0.0; input_dim * bottleneck_dim],
+            encoder_log_var: vec![0.0; input_dim * bottleneck_dim],
+        }
+    }
+
+    fn forward(&self, x: &[f32], epsilon: &[f32]) -> (Vec<f32>, f32) {
+        // Encode to mean and log_var (simplified)
+        let mean = self.encode_mean(x);
+        let log_var = self.encode_log_var(x);
+
+        // Sample from posterior
+        let z = self.ib.sample(&mean, &log_var, epsilon);
+
+        // Compute KL loss
+        let kl_loss = self.ib.compute_kl_loss(&mean, &log_var);
+
+        (z, kl_loss)
+    }
+
+    fn encode_mean(&self, _x: &[f32]) -> Vec<f32> {
+        // Linear transform (simplified)
+        vec![0.0; self.ib.config().z_dim]
+    }
+
+    fn encode_log_var(&self, _x: &[f32]) -> Vec<f32> {
+        vec![-1.0; self.ib.config().z_dim]  // Initialize to low variance
+    }
+}
+```
+
+### Tutorial 4: Multi-Scale Diffusion for Document Understanding
+
+Use diffusion attention at multiple scales for long documents.
+
+```rust
+use ruvector_attention::*;
+
+fn document_understanding(
+    query: &[f32],
+    document_keys: &[&[f32]],  // Keys from document chunks
+) -> Vec<Vec<f32>> {
+    // Configure diffusion with k-NN sparsity for large documents
+    let config = DiffusionConfig {
+        t: 2.0,           // Larger t for more diffusion
+        num_steps: 20,
+        sigma: 1.0,
+        use_knn: true,
+        k: 32,            // Sparse Laplacian
+        laplacian_type: LaplacianType::SymmetricNormalized,
+    };
+
+    let diffusion = DiffusionAttention::new(config);
+
+    // Get attention at 4 different scales
+    // Scale 0: Local (small t) - captures nearby relationships
+    // Scale 3: Global (large t) - captures document-level structure
+    let scales = diffusion.compute_multiscale(query, document_keys, 4);
+
+    scales
+}
+```
+
+### Tutorial 5: Natural Gradient Training Loop
+
+Train attention parameters with geometry-aware optimization.
+
+```rust
+use ruvector_attention::*;
+
+fn natural_gradient_step(
+    logits: &[f32],
+    target_probs: &[f32],
+    config: &NaturalGradientConfig,
+) -> Vec<f32> {
+    let ng = NaturalGradient::new(config.clone());
+
+    // Compute cross-entropy gradient w.r.t. logits
+    let probs = softmax(logits);
+    let grad: Vec<f32> = probs.iter()
+        .zip(target_probs.iter())
+        .map(|(p, t)| p - t)
+        .collect();
+
+    // Apply natural gradient update
+    // This uses F^{-1} to rescale gradients, accounting for
+    // the geometry of the probability simplex
+    ng.step_logits(logits, &grad)
+}
+
+fn softmax(logits: &[f32]) -> Vec<f32> {
+    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
+    let exp: Vec<f32> = logits.iter().map(|&l| (l - max).exp()).collect();
+    let sum: f32 = exp.iter().sum();
+    exp.iter().map(|&e| e / sum).collect()
+}
+```
+
+## Features
+
+- `simd` - SIMD acceleration (default, enabled)
+- `wasm` - WebAssembly support
+- `napi` - Node.js bindings
+
+## Documentation
+
+- [SDK Guide](docs/SDK_GUIDE.md) - Comprehensive SDK usage guide
+- [API Documentation](https://docs.rs/ruvector-attention) - Full API reference
+- [Examples](examples/) - Working code examples
+
+## Contributing
+
+Contributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md).
+
+## License
+
+Licensed under either of:
+
+- Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE))
+- MIT License ([LICENSE-MIT](LICENSE-MIT))
+
+at your option.
+
+## Citation
+
+If you use this crate in your research, please cite:
+
+```bibtex
+@software{ruvector_attention,
+  title = {ruvector-attention: Advanced Attention Mechanisms for Vector Search},
+  author = {ruvector contributors},
+  year = {2025},
+  url = {https://github.com/ruvnet/ruvector}
+}
+```
+
+## Related Projects
+
+- [ruvector](../ruvector) - Core vector search engine
+- [ruvector-graph](../ruvector-graph) - Graph neural networks
+- [ruvector-gnn](../ruvector-gnn) - Geometric neural networks