git-subtree-dir: vendor/ruvector git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
862 lines
28 KiB
Markdown
862 lines
28 KiB
Markdown
# ruvector-attention
|
||
|
||
[](https://crates.io/crates/ruvector-attention)
|
||
[](https://docs.rs/ruvector-attention)
|
||
[](LICENSE)
|
||
[]()
|
||
|
||
**46 attention mechanisms grounded in 7 mathematical theories -- from Flash Attention to optimal transport -- in one crate.**
|
||
|
||
```bash
|
||
cargo add ruvector-attention
|
||
```
|
||
|
||
Attention is the core operation in transformers, vector search, and graph neural networks, but most libraries give you one or two flavors and call it done. `ruvector-attention` ships 46 mechanisms spanning standard dot-product, sparse (Flash, linear, local-global), geometric (hyperbolic, mixed-curvature), graph (GAT, RoPE), and mixture-of-experts -- all SIMD-accelerated with quantization support. Pick the right attention for your data shape instead of forcing everything through softmax(QK^T/sqrt(d))V.
|
||
|
||
| | ruvector-attention | PyTorch `nn.MultiheadAttention` | FlashAttention (standalone) | xFormers |
|
||
|---|---|---|---|---|
|
||
| **Mechanism count** | 46 | 1 (scaled dot-product) | 1 (Flash) | ~5 |
|
||
| **Geometric attention** | Hyperbolic, spherical, mixed-curvature | No | No | No |
|
||
| **Graph attention** | Edge-featured GAT, RoPE for graphs | No | No | Limited |
|
||
| **Optimal transport** | Sliced Wasserstein, centroid OT | No | No | No |
|
||
| **Topology-gated** | Coherence-based mode switching | No | No | No |
|
||
| **Quantization** | Per-component (8-bit E, 5-bit H/S) | Via separate tools | No | Limited |
|
||
| **Language** | Rust (with WASM target) | Python/C++ | CUDA only | Python/CUDA |
|
||
| **SIMD acceleration** | Built in (4-way unrolled) | Via backend | CUDA only | Via backend |
|
||
|
||
| Feature | What It Does | Why It Matters |
|
||
|---------|-------------|----------------|
|
||
| **Flash Attention** | O(n) memory tiled computation | Process long sequences without running out of memory |
|
||
| **Mixed Curvature Fusion** | Combines Euclidean, hyperbolic, and spherical spaces in one pass | Model hierarchies, clusters, and flat data simultaneously |
|
||
| **Optimal Transport Attention** | Uses Wasserstein distance instead of dot-product similarity | Better distribution matching for retrieval and generation |
|
||
| **Topology-Gated Switching** | Automatically picks attention mode based on local coherence | Self-adapts to data characteristics without manual tuning |
|
||
| **Information Bottleneck** | Compresses attention via KL minimization | Keeps only the signal, discards noise |
|
||
| **PDE/Diffusion Attention** | Runs heat equation on a similarity graph | Smooth, noise-robust attention for irregular data |
|
||
| **Unified Diagnostics** | Health monitoring and automatic mode selection across all 7 theories | One report tells you which attention works best for your data |
|
||
|
||
> Part of the [RuVector](https://github.com/ruvnet/ruvector) ecosystem -- the self-learning vector database with graph intelligence.
|
||
|
||
## Supported Attention Mechanisms
|
||
|
||
### Standard Attention
|
||
- **Scaled Dot-Product**: `softmax(QK^T / √d)V`
|
||
- **Multi-Head**: Parallel attention heads with diverse representations
|
||
|
||
### Sparse Attention (Memory Efficient)
|
||
- **Flash Attention**: O(n) memory complexity with tiled computation
|
||
- **Linear Attention**: O(n) complexity using kernel approximation
|
||
- **Local-Global**: Sliding window + global tokens (Longformer-style)
|
||
|
||
### Geometric Attention
|
||
- **Hyperbolic Attention**: Attention in hyperbolic space for hierarchical data
|
||
- **Mixed Curvature**: Dynamic curvature for complex geometries
|
||
|
||
### Graph Attention
|
||
- **Edge-Featured GAT**: Graph attention with edge features
|
||
- **RoPE**: Rotary Position Embeddings for graphs
|
||
|
||
### Mixture-of-Experts
|
||
- **MoE Attention**: Learned routing to specialized expert modules
|
||
- **Top-k Routing**: Efficient expert selection
|
||
|
||
## 7 Mathematical Theories
|
||
|
||
This crate implements attention mechanisms grounded in 7 distinct mathematical theories:
|
||
|
||
| # | Theory | Module | Key Types | Use Case |
|
||
|---|--------|--------|-----------|----------|
|
||
| 1 | **Optimal Transport** | `transport` | `SlicedWassersteinAttention`, `CentroidOTAttention` | Distribution matching, Earth mover distance |
|
||
| 2 | **Mixed Curvature** | `curvature` | `MixedCurvatureFusedAttention`, `TangentSpaceMapper` | Product spaces E^e × H^h × S^s |
|
||
| 3 | **Topology** | `topology` | `TopologyGatedAttention`, `WindowCoherence` | Coherence-based mode switching |
|
||
| 4 | **Information Geometry** | `info_geometry` | `FisherMetric`, `NaturalGradient` | Natural gradient descent |
|
||
| 5 | **Information Bottleneck** | `info_bottleneck` | `InformationBottleneck`, `KLDivergence` | Compression via KL minimization |
|
||
| 6 | **PDE/Diffusion** | `pde_attention` | `DiffusionAttention`, `GraphLaplacian` | Heat equation on similarity graph |
|
||
| 7 | **Unified Diagnostics** | `unified_report` | `GeometryReport`, `ReportBuilder` | Health monitoring & mode selection |
|
||
|
||
### Theory 1: Optimal Transport Attention
|
||
|
||
Attention as mass transport between query and key distributions using Wasserstein distance.
|
||
|
||
```rust
|
||
use ruvector_attention::{SlicedWassersteinAttention, SlicedWassersteinConfig};
|
||
|
||
// Configure Sliced Wasserstein with 16 random projections
|
||
let config = SlicedWassersteinConfig {
|
||
num_projections: 16,
|
||
num_candidates: 64,
|
||
dim: 512,
|
||
..Default::default()
|
||
};
|
||
|
||
let ot_attention = SlicedWassersteinAttention::new(config);
|
||
|
||
// Compute OT-based attention scores
|
||
let query = vec![0.5; 512];
|
||
let keys: Vec<&[f32]> = key_data.iter().map(|k| k.as_slice()).collect();
|
||
let values: Vec<&[f32]> = value_data.iter().map(|v| v.as_slice()).collect();
|
||
|
||
let output = ot_attention.compute_sliced(&query, &keys, &values)?;
|
||
```
|
||
|
||
**Key Features:**
|
||
- Sliced Wasserstein with cached sorted projections
|
||
- Two-stage filtering: cheap dot-product → expensive OT kernel
|
||
- Centroid OT: cluster keys into M centroids for O(M) transport
|
||
|
||
### Theory 2: Mixed Curvature Attention
|
||
|
||
Attention in product manifolds combining Euclidean (E), Hyperbolic (H), and Spherical (S) spaces.
|
||
|
||
```rust
|
||
use ruvector_attention::{
|
||
MixedCurvatureFusedAttention, FusedCurvatureConfig,
|
||
TangentSpaceMapper, TangentSpaceConfig
|
||
};
|
||
|
||
// Configure mixed curvature with component dimensions
|
||
let config = FusedCurvatureConfig {
|
||
euclidean_dim: 256,
|
||
hyperbolic_dim: 128,
|
||
spherical_dim: 128,
|
||
curvature_h: -1.0, // Negative for hyperbolic
|
||
curvature_s: 1.0, // Positive for spherical
|
||
..Default::default()
|
||
};
|
||
|
||
let mixed_attention = MixedCurvatureFusedAttention::new(config);
|
||
|
||
// Map hyperbolic vectors to tangent space for efficient computation
|
||
let mapper = TangentSpaceMapper::new(TangentSpaceConfig::default());
|
||
let tangent_keys = mapper.map_to_tangent(&hyperbolic_keys);
|
||
```
|
||
|
||
**Key Features:**
|
||
- Tangent space mapping (avoids expensive geodesic computations)
|
||
- Fused dot kernel: single vectorized loop for E+H+S similarities
|
||
- Per-head learned mixing weights
|
||
- Component quantization: 8-bit E, 5-bit H/S
|
||
|
||
### Theory 3: Topology-Gated Attention
|
||
|
||
Adaptive attention that switches modes based on local coherence metrics.
|
||
|
||
```rust
|
||
use ruvector_attention::{
|
||
TopologyGatedAttention, TopologyGatedConfig,
|
||
AttentionMode, PolicyConfig, CoherenceMetric
|
||
};
|
||
|
||
let config = TopologyGatedConfig {
|
||
dim: 512,
|
||
policy: PolicyConfig {
|
||
stable_threshold: 0.8, // High coherence → Stable mode
|
||
cautious_threshold: 0.5, // Medium → Cautious mode
|
||
freeze_threshold: 0.3, // Low → Freeze mode
|
||
hysteresis: 0.05, // Prevents mode oscillation
|
||
..Default::default()
|
||
},
|
||
..Default::default()
|
||
};
|
||
|
||
let gated = TopologyGatedAttention::new(config);
|
||
|
||
// Attention automatically adjusts based on window coherence
|
||
let output = gated.compute_gated(&query, &keys, &values)?;
|
||
let mode = gated.current_mode(); // Stable, Cautious, or Freeze
|
||
```
|
||
|
||
**Coherence Metrics:**
|
||
| Metric | Description |
|
||
|--------|-------------|
|
||
| `BoundaryMass` | Mass near window boundaries |
|
||
| `CutProxy` | Proxy for graph cut quality |
|
||
| `Disagreement` | Variance in attention weights |
|
||
| `SimilarityVariance` | Local similarity variance |
|
||
|
||
### Theory 4: Information Geometry
|
||
|
||
Natural gradient optimization using the Fisher Information Matrix.
|
||
|
||
```rust
|
||
use ruvector_attention::{FisherMetric, FisherConfig, NaturalGradient, NaturalGradientConfig};
|
||
|
||
// Fisher metric for probability distributions
|
||
let fisher = FisherMetric::new(FisherConfig {
|
||
eps: 1e-8,
|
||
max_cg_iters: 50,
|
||
cg_tol: 1e-6,
|
||
});
|
||
|
||
// Compute F * v (Fisher-vector product)
|
||
let probs = vec![0.25, 0.25, 0.25, 0.25];
|
||
let direction = vec![0.1, -0.1, 0.05, -0.05];
|
||
let fv = fisher.apply(&probs, &direction);
|
||
|
||
// Natural gradient optimizer
|
||
let ng = NaturalGradient::new(NaturalGradientConfig {
|
||
lr: 0.1,
|
||
use_diagonal: false, // Full CG solve (more accurate)
|
||
fisher: FisherConfig::default(),
|
||
});
|
||
|
||
// Update logits using natural gradient: θ ← θ - lr * F^{-1} * ∇L
|
||
let new_logits = ng.step_logits(&logits, &grad_logits);
|
||
```
|
||
|
||
**Key Features:**
|
||
- Conjugate gradient solver for F^{-1} * v
|
||
- Diagonal approximation for speed
|
||
- SIMD-accelerated matrix-vector operations
|
||
|
||
### Theory 5: Information Bottleneck
|
||
|
||
Attention compression via the Information Bottleneck principle.
|
||
|
||
```rust
|
||
use ruvector_attention::{InformationBottleneck, IBConfig, KLDivergence, DiagonalGaussian};
|
||
|
||
// Information bottleneck layer
|
||
let ib = InformationBottleneck::new(IBConfig {
|
||
beta: 0.1, // Compression strength
|
||
z_dim: 64, // Bottleneck dimension
|
||
anneal_steps: 1000,
|
||
..Default::default()
|
||
});
|
||
|
||
// Compute KL divergence between Gaussian and unit normal
|
||
let gaussian = DiagonalGaussian {
|
||
mean: vec![0.1; 64],
|
||
log_var: vec![-1.0; 64],
|
||
};
|
||
let kl = KLDivergence::gaussian_to_unit(&gaussian);
|
||
|
||
// Compress attention weights
|
||
let (compressed, kl_loss) = ib.compress_attention_weights(&weights, temperature);
|
||
|
||
// Reparameterized sampling
|
||
let z = ib.sample(&mean, &log_var, &epsilon);
|
||
```
|
||
|
||
**Key Features:**
|
||
- KL divergence: Gaussian→Unit, Categorical, Jensen-Shannon
|
||
- Variational Information Bottleneck (VIB)
|
||
- Temperature annealing for curriculum learning
|
||
|
||
### Theory 6: PDE/Diffusion Attention
|
||
|
||
Attention as heat diffusion on the key similarity graph.
|
||
|
||
```rust
|
||
use ruvector_attention::{
|
||
DiffusionAttention, DiffusionConfig,
|
||
GraphLaplacian, LaplacianType
|
||
};
|
||
|
||
// Build graph Laplacian from keys
|
||
let laplacian = GraphLaplacian::from_keys(
|
||
&keys,
|
||
sigma, // Gaussian kernel bandwidth
|
||
LaplacianType::SymmetricNormalized
|
||
);
|
||
|
||
// Diffusion attention with heat equation
|
||
let config = DiffusionConfig {
|
||
t: 1.0, // Diffusion time
|
||
num_steps: 10, // Discretization steps
|
||
sigma: 1.0, // Kernel bandwidth
|
||
use_knn: true, // Sparse Laplacian
|
||
k: 16, // k-NN neighbors
|
||
laplacian_type: LaplacianType::SymmetricNormalized,
|
||
..Default::default()
|
||
};
|
||
|
||
let diffusion = DiffusionAttention::new(config);
|
||
|
||
// Compute diffused attention
|
||
let output = diffusion.compute_diffusion(&query, &keys, &values)?;
|
||
|
||
// Multi-scale diffusion (captures different granularities)
|
||
let scales = diffusion.compute_multiscale(&query, &keys, 4);
|
||
```
|
||
|
||
**Laplacian Types:**
|
||
| Type | Formula | Properties |
|
||
|------|---------|------------|
|
||
| `Unnormalized` | D - W | Graph spectrum analysis |
|
||
| `SymmetricNormalized` | I - D^{-1/2}WD^{-1/2} | Symmetric, eigenvalues in [0,2] |
|
||
| `RandomWalk` | I - D^{-1}W | Probability transitions |
|
||
|
||
### Theory 7: Unified Geometry Report
|
||
|
||
Diagnostic dashboard combining all metrics for intelligent attention mode selection.
|
||
|
||
```rust
|
||
use ruvector_attention::{
|
||
ReportBuilder, ReportConfig, GeometryReport,
|
||
MetricType, AttentionRecommendation
|
||
};
|
||
|
||
// Build comprehensive geometry report
|
||
let report = ReportBuilder::new(ReportConfig::default())
|
||
.with_ot_distance(0.15)
|
||
.with_topology_coherence(0.82)
|
||
.with_ib_kl(0.05)
|
||
.with_diffusion_energy(0.3)
|
||
.with_attention_entropy(2.1)
|
||
.build();
|
||
|
||
// Get health score (0-1)
|
||
println!("Health: {:.2}", report.health_score);
|
||
|
||
// Get automatic attention mode recommendation
|
||
match report.recommendation {
|
||
AttentionRecommendation::Standard => { /* Use standard attention */ }
|
||
AttentionRecommendation::Sparse => { /* Switch to sparse */ }
|
||
AttentionRecommendation::Geometric => { /* Use hyperbolic/mixed */ }
|
||
AttentionRecommendation::Diffusion => { /* Use diffusion attention */ }
|
||
}
|
||
|
||
// Check individual metrics
|
||
for metric in &report.metrics {
|
||
println!("{:?}: {} ({})",
|
||
metric.metric_type,
|
||
metric.value,
|
||
metric.status()
|
||
);
|
||
}
|
||
```
|
||
|
||
**Metrics Tracked:**
|
||
| Metric | Healthy Range | Warning | Critical |
|
||
|--------|---------------|---------|----------|
|
||
| OT Distance | 0.0 - 0.5 | > 0.3 | > 0.7 |
|
||
| Topology Coherence | 0.5 - 1.0 | < 0.3 | < 0.1 |
|
||
| IB KL | 0.0 - 0.2 | > 0.5 | > 1.0 |
|
||
| Diffusion Energy | 0.0 - 1.0 | > 2.0 | > 5.0 |
|
||
| Attention Entropy | 1.0 - 4.0 | < 0.5 | < 0.1 |
|
||
|
||
## Quick Start
|
||
|
||
```rust
|
||
use ruvector_attention::sdk::*;
|
||
|
||
// Simple multi-head attention
|
||
let attention = multi_head(768, 12)
|
||
.dropout(0.1)
|
||
.causal(true)
|
||
.build()?;
|
||
|
||
// Use preset configurations
|
||
let bert = AttentionPreset::Bert.builder(768).build()?;
|
||
let gpt = AttentionPreset::Gpt.builder(768).build()?;
|
||
|
||
// Build pipelines with normalization
|
||
let pipeline = AttentionPipeline::new()
|
||
.add_attention(attention)
|
||
.add_norm(NormType::LayerNorm)
|
||
.add_residual();
|
||
|
||
// Compute attention
|
||
let query = vec![0.5; 768];
|
||
let keys = vec![&query[..]; 10];
|
||
let values = vec![&query[..]; 10];
|
||
|
||
let output = pipeline.run(&query, &keys, &values)?;
|
||
```
|
||
|
||
## Installation
|
||
|
||
Add to your `Cargo.toml`:
|
||
|
||
```toml
|
||
[dependencies]
|
||
ruvector-attention = "0.1"
|
||
```
|
||
|
||
Or with specific features:
|
||
|
||
```toml
|
||
[dependencies]
|
||
ruvector-attention = { version = "0.1", features = ["simd", "wasm"] }
|
||
```
|
||
|
||
## SDK Overview
|
||
|
||
### Builder API
|
||
|
||
The builder provides a fluent interface for configuring attention:
|
||
|
||
```rust
|
||
use ruvector_attention::sdk::*;
|
||
|
||
// Flash attention for long sequences
|
||
let flash = flash(1024, 128) // dim, block_size
|
||
.causal(true)
|
||
.dropout(0.1)
|
||
.build()?;
|
||
|
||
// Linear attention for O(n) complexity
|
||
let linear = linear(512, 256) // dim, num_features
|
||
.build()?;
|
||
|
||
// MoE attention with 8 experts
|
||
let moe = moe(512, 8, 2) // dim, num_experts, top_k
|
||
.expert_capacity(1.25)
|
||
.jitter_noise(0.01)
|
||
.build()?;
|
||
|
||
// Hyperbolic attention for hierarchies
|
||
let hyperbolic = hyperbolic(512, -1.0) // dim, curvature
|
||
.build()?;
|
||
```
|
||
|
||
### Pipeline API
|
||
|
||
Compose attention with pre/post processing:
|
||
|
||
```rust
|
||
use ruvector_attention::sdk::*;
|
||
|
||
let attention = multi_head(768, 12).build()?;
|
||
|
||
let pipeline = AttentionPipeline::new()
|
||
.add_norm(NormType::LayerNorm) // Pre-normalization
|
||
.add_attention(attention) // Attention layer
|
||
.add_dropout(0.1) // Dropout
|
||
.add_residual() // Residual connection
|
||
.add_norm(NormType::RMSNorm); // Post-normalization
|
||
|
||
let output = pipeline.run(&query, &keys, &values)?;
|
||
```
|
||
|
||
### Preset Configurations
|
||
|
||
Pre-configured attention for popular models:
|
||
|
||
```rust
|
||
use ruvector_attention::sdk::presets::*;
|
||
|
||
// Model-specific presets
|
||
let bert = AttentionPreset::Bert.builder(768).build()?;
|
||
let gpt = AttentionPreset::Gpt.builder(768).build()?;
|
||
let longformer = AttentionPreset::Longformer.builder(512).build()?;
|
||
let flash = AttentionPreset::FlashOptimized.builder(1024).build()?;
|
||
let t5 = AttentionPreset::T5.builder(768).build()?;
|
||
let vit = AttentionPreset::ViT.builder(768).build()?;
|
||
|
||
// Smart selection based on use case
|
||
let attention = for_sequences(512, max_len).build()?; // Auto-select by length
|
||
let graph_attn = for_graphs(256, hierarchical).build()?; // Graph attention
|
||
let fast_attn = for_large_scale(1024).build()?; // Flash attention
|
||
|
||
// By model name
|
||
let bert = from_model_name("bert", 768)?;
|
||
let gpt2 = from_model_name("gpt2", 768)?;
|
||
```
|
||
|
||
## Architecture
|
||
|
||
```
|
||
ruvector-attention/
|
||
├── src/
|
||
│ ├── lib.rs # Main crate entry
|
||
│ ├── error.rs # Error types
|
||
│ ├── traits.rs # Core attention traits
|
||
│ │
|
||
│ ├── attention/ # Standard attention
|
||
│ │ ├── scaled_dot_product.rs
|
||
│ │ └── multi_head.rs
|
||
│ │
|
||
│ ├── sparse/ # Sparse attention (O(n) memory)
|
||
│ │ ├── flash.rs # Flash attention (tiled)
|
||
│ │ ├── linear.rs # Kernel approximation
|
||
│ │ └── local_global.rs # Longformer-style
|
||
│ │
|
||
│ ├── graph/ # Graph attention
|
||
│ │ ├── edge_featured.rs # GAT with edge features
|
||
│ │ ├── dual_space.rs # Dual-space attention
|
||
│ │ └── rope.rs # Rotary embeddings
|
||
│ │
|
||
│ ├── hyperbolic/ # Hyperbolic geometry
|
||
│ │ ├── hyperbolic_attention.rs
|
||
│ │ ├── mixed_curvature.rs
|
||
│ │ └── poincare.rs
|
||
│ │
|
||
│ ├── moe/ # Mixture-of-Experts
|
||
│ │ ├── expert.rs # Expert modules
|
||
│ │ ├── router.rs # Top-k routing
|
||
│ │ └── moe_attention.rs
|
||
│ │
|
||
│ ├── transport/ # [Theory 1] Optimal Transport
|
||
│ │ ├── sliced_wasserstein.rs # Sliced OT attention
|
||
│ │ ├── centroid_ot.rs # Centroid-based OT
|
||
│ │ └── cached_projections.rs # Projection caching
|
||
│ │
|
||
│ ├── curvature/ # [Theory 2] Mixed Curvature
|
||
│ │ ├── tangent_space.rs # Tangent space mapping
|
||
│ │ ├── fused_attention.rs # Fused E+H+S kernel
|
||
│ │ └── component_quantizer.rs # 8-bit/5-bit quantization
|
||
│ │
|
||
│ ├── topology/ # [Theory 3] Topology Gating
|
||
│ │ ├── coherence.rs # Window coherence metrics
|
||
│ │ ├── policy.rs # 3-mode policy (Stable/Cautious/Freeze)
|
||
│ │ └── gated_attention.rs # Adaptive gated attention
|
||
│ │
|
||
│ ├── info_geometry/ # [Theory 4] Information Geometry
|
||
│ │ ├── fisher.rs # Fisher information matrix
|
||
│ │ └── natural_gradient.rs # Natural gradient descent
|
||
│ │
|
||
│ ├── info_bottleneck/ # [Theory 5] Information Bottleneck
|
||
│ │ ├── kl_divergence.rs # KL, JS divergences
|
||
│ │ └── bottleneck.rs # VIB layer
|
||
│ │
|
||
│ ├── pde_attention/ # [Theory 6] PDE/Diffusion
|
||
│ │ ├── laplacian.rs # Graph Laplacian construction
|
||
│ │ └── diffusion.rs # Heat equation attention
|
||
│ │
|
||
│ ├── unified_report/ # [Theory 7] Unified Diagnostics
|
||
│ │ ├── metrics.rs # Metric types and values
|
||
│ │ ├── report.rs # Geometry report builder
|
||
│ │ └── recommendation.rs # Attention mode recommendations
|
||
│ │
|
||
│ ├── training/ # Training utilities
|
||
│ │ ├── loss.rs # InfoNCE, contrastive losses
|
||
│ │ ├── optimizer.rs # SGD, Adam, AdamW
|
||
│ │ └── curriculum.rs # Curriculum scheduling
|
||
│ │
|
||
│ └── sdk/ # High-level SDK
|
||
│ ├── builder.rs # Fluent builder API
|
||
│ ├── pipeline.rs # Composable pipelines
|
||
│ └── presets.rs # Model presets (BERT, GPT, etc.)
|
||
```
|
||
|
||
## Examples
|
||
|
||
### Transformer Block
|
||
|
||
```rust
|
||
use ruvector_attention::sdk::*;
|
||
|
||
fn create_transformer_block(dim: usize) -> AttentionResult<AttentionPipeline> {
|
||
let attention = multi_head(dim, 12)
|
||
.dropout(0.1)
|
||
.build()?;
|
||
|
||
Ok(AttentionPipeline::new()
|
||
.add_norm(NormType::LayerNorm)
|
||
.add_attention(attention)
|
||
.add_dropout(0.1)
|
||
.add_residual())
|
||
}
|
||
```
|
||
|
||
### Long Context Processing
|
||
|
||
```rust
|
||
use ruvector_attention::sdk::*;
|
||
|
||
fn create_long_context_attention(dim: usize, max_len: usize)
|
||
-> AttentionResult<Box<dyn Attention>> {
|
||
if max_len <= 2048 {
|
||
multi_head(dim, 12).build()
|
||
} else if max_len <= 16384 {
|
||
local_global(dim, 512).build()
|
||
} else {
|
||
linear(dim, dim / 4).build()
|
||
}
|
||
}
|
||
```
|
||
|
||
### Graph Neural Network
|
||
|
||
```rust
|
||
use ruvector_attention::sdk::*;
|
||
|
||
fn create_graph_attention(dim: usize, is_tree: bool)
|
||
-> AttentionResult<Box<dyn Attention>> {
|
||
if is_tree {
|
||
hyperbolic(dim, -1.0).build() // Hyperbolic for tree-like
|
||
} else {
|
||
multi_head(dim, 8).build() // Standard for general graphs
|
||
}
|
||
}
|
||
```
|
||
|
||
## Performance
|
||
|
||
### Complexity Comparison
|
||
|
||
| Mechanism | Time | Memory | Use Case |
|
||
|-----------|------|--------|----------|
|
||
| Scaled Dot-Product | O(n²) | O(n²) | Short sequences |
|
||
| Multi-Head | O(n²) | O(n²) | Standard transformers |
|
||
| Flash Attention | O(n²) | O(n) | Long sequences |
|
||
| Linear Attention | O(n) | O(n) | Very long sequences |
|
||
| Local-Global | O(n·w) | O(n·w) | Document processing |
|
||
| Hyperbolic | O(n²) | O(n²) | Hierarchical data |
|
||
| MoE | O(n²/E) | O(n²) | Specialized tasks |
|
||
|
||
### Advanced Mechanisms Complexity
|
||
|
||
| Theory | Mechanism | Time | Memory | Notes |
|
||
|--------|-----------|------|--------|-------|
|
||
| OT | Sliced Wasserstein | O(n·P·log n) | O(n·P) | P = num projections |
|
||
| OT | Centroid OT | O(n + M²) | O(M·d) | M = num centroids |
|
||
| Curvature | Mixed Curvature | O(n²) | O(n²) | Fused E+H+S kernel |
|
||
| Topology | Gated Attention | O(n²) | O(n²) | + O(n) coherence |
|
||
| Info Geo | Natural Gradient | O(n²) | O(n) | CG solver |
|
||
| Info Bottle | VIB | O(n·z) | O(z) | z = bottleneck dim |
|
||
| PDE | Diffusion | O(n²·T) | O(n²) | T = diffusion steps |
|
||
|
||
Where:
|
||
- `n` = sequence length
|
||
- `w` = local window size
|
||
- `E` = number of experts
|
||
- `P` = number of random projections (typically 8-16)
|
||
- `M` = number of centroids (typically 16-32)
|
||
- `z` = bottleneck dimension
|
||
- `T` = number of diffusion time steps
|
||
|
||
### Benchmarks
|
||
|
||
On a typical workload (batch_size=32, seq_len=512, dim=768):
|
||
|
||
- **Flash Attention**: 2.3x faster, 5x less memory than standard
|
||
- **Linear Attention**: O(n) scaling for sequences >4096
|
||
- **Local-Global**: 60% of standard attention cost for w=256
|
||
- **Sliced Wasserstein**: 1.8x slower than standard, but better distribution matching
|
||
- **Mixed Curvature**: ~1.3x standard with tangent space optimization
|
||
- **Diffusion Attention**: 2-10x slower depending on T, but captures multi-scale structure
|
||
|
||
## Tutorials
|
||
|
||
### Tutorial 1: Building a Geometry-Aware Transformer
|
||
|
||
Combine multiple geometric attention mechanisms for hierarchical data.
|
||
|
||
```rust
|
||
use ruvector_attention::*;
|
||
use ruvector_attention::sdk::*;
|
||
|
||
fn create_geometry_aware_block(dim: usize) -> AttentionResult<AttentionPipeline> {
|
||
// Use hyperbolic attention for hierarchy + standard for local patterns
|
||
let hyperbolic_attn = hyperbolic(dim, -1.0).build()?;
|
||
|
||
// Create a pipeline with pre-norm
|
||
Ok(AttentionPipeline::new()
|
||
.add_norm(NormType::RMSNorm)
|
||
.add_attention(hyperbolic_attn)
|
||
.add_dropout(0.1)
|
||
.add_residual())
|
||
}
|
||
```
|
||
|
||
### Tutorial 2: Adaptive Attention with Unified Report
|
||
|
||
Use the unified report to automatically select the best attention mode.
|
||
|
||
```rust
|
||
use ruvector_attention::*;
|
||
|
||
fn adaptive_attention(
|
||
query: &[f32],
|
||
keys: &[&[f32]],
|
||
values: &[&[f32]],
|
||
) -> AttentionResult<Vec<f32>> {
|
||
// Build a diagnostic report
|
||
let report = ReportBuilder::new(ReportConfig::default())
|
||
.analyze_keys(keys) // Automatically compute metrics
|
||
.build();
|
||
|
||
// Select attention based on recommendation
|
||
match report.recommendation {
|
||
AttentionRecommendation::Standard => {
|
||
let attn = ScaledDotProductAttention::new(query.len());
|
||
attn.compute(query, keys, values)
|
||
}
|
||
AttentionRecommendation::Sparse => {
|
||
let attn = FlashAttention::new(query.len(), 64);
|
||
attn.compute(query, keys, values)
|
||
}
|
||
AttentionRecommendation::Geometric => {
|
||
let config = HyperbolicAttentionConfig {
|
||
dim: query.len(),
|
||
curvature: -1.0,
|
||
..Default::default()
|
||
};
|
||
let attn = HyperbolicAttention::new(config);
|
||
attn.compute(query, keys, values)
|
||
}
|
||
AttentionRecommendation::Diffusion => {
|
||
let config = DiffusionConfig::default();
|
||
let attn = DiffusionAttention::new(config);
|
||
attn.compute_diffusion(query, keys, values)
|
||
}
|
||
}
|
||
}
|
||
```
|
||
|
||
### Tutorial 3: Information Bottleneck for Attention Compression
|
||
|
||
Use VIB to learn compressed attention representations.
|
||
|
||
```rust
|
||
use ruvector_attention::*;
|
||
|
||
struct CompressedAttention {
|
||
ib: InformationBottleneck,
|
||
encoder_mean: Vec<f32>, // Learned weights
|
||
encoder_log_var: Vec<f32>, // Learned weights
|
||
}
|
||
|
||
impl CompressedAttention {
|
||
fn new(input_dim: usize, bottleneck_dim: usize) -> Self {
|
||
let ib = InformationBottleneck::new(IBConfig {
|
||
beta: 0.1,
|
||
z_dim: bottleneck_dim,
|
||
..Default::default()
|
||
});
|
||
|
||
Self {
|
||
ib,
|
||
encoder_mean: vec![0.0; input_dim * bottleneck_dim],
|
||
encoder_log_var: vec![0.0; input_dim * bottleneck_dim],
|
||
}
|
||
}
|
||
|
||
fn forward(&self, x: &[f32], epsilon: &[f32]) -> (Vec<f32>, f32) {
|
||
// Encode to mean and log_var (simplified)
|
||
let mean = self.encode_mean(x);
|
||
let log_var = self.encode_log_var(x);
|
||
|
||
// Sample from posterior
|
||
let z = self.ib.sample(&mean, &log_var, epsilon);
|
||
|
||
// Compute KL loss
|
||
let kl_loss = self.ib.compute_kl_loss(&mean, &log_var);
|
||
|
||
(z, kl_loss)
|
||
}
|
||
|
||
fn encode_mean(&self, _x: &[f32]) -> Vec<f32> {
|
||
// Linear transform (simplified)
|
||
vec![0.0; self.ib.config().z_dim]
|
||
}
|
||
|
||
fn encode_log_var(&self, _x: &[f32]) -> Vec<f32> {
|
||
vec![-1.0; self.ib.config().z_dim] // Initialize to low variance
|
||
}
|
||
}
|
||
```
|
||
|
||
### Tutorial 4: Multi-Scale Diffusion for Document Understanding
|
||
|
||
Use diffusion attention at multiple scales for long documents.
|
||
|
||
```rust
|
||
use ruvector_attention::*;
|
||
|
||
fn document_understanding(
|
||
query: &[f32],
|
||
document_keys: &[&[f32]], // Keys from document chunks
|
||
) -> Vec<Vec<f32>> {
|
||
// Configure diffusion with k-NN sparsity for large documents
|
||
let config = DiffusionConfig {
|
||
t: 2.0, // Larger t for more diffusion
|
||
num_steps: 20,
|
||
sigma: 1.0,
|
||
use_knn: true,
|
||
k: 32, // Sparse Laplacian
|
||
laplacian_type: LaplacianType::SymmetricNormalized,
|
||
};
|
||
|
||
let diffusion = DiffusionAttention::new(config);
|
||
|
||
// Get attention at 4 different scales
|
||
// Scale 0: Local (small t) - captures nearby relationships
|
||
// Scale 3: Global (large t) - captures document-level structure
|
||
let scales = diffusion.compute_multiscale(query, document_keys, 4);
|
||
|
||
scales
|
||
}
|
||
```
|
||
|
||
### Tutorial 5: Natural Gradient Training Loop
|
||
|
||
Train attention parameters with geometry-aware optimization.
|
||
|
||
```rust
|
||
use ruvector_attention::*;
|
||
|
||
fn natural_gradient_step(
|
||
logits: &[f32],
|
||
target_probs: &[f32],
|
||
config: &NaturalGradientConfig,
|
||
) -> Vec<f32> {
|
||
let ng = NaturalGradient::new(config.clone());
|
||
|
||
// Compute cross-entropy gradient w.r.t. logits
|
||
let probs = softmax(logits);
|
||
let grad: Vec<f32> = probs.iter()
|
||
.zip(target_probs.iter())
|
||
.map(|(p, t)| p - t)
|
||
.collect();
|
||
|
||
// Apply natural gradient update
|
||
// This uses F^{-1} to rescale gradients, accounting for
|
||
// the geometry of the probability simplex
|
||
ng.step_logits(logits, &grad)
|
||
}
|
||
|
||
fn softmax(logits: &[f32]) -> Vec<f32> {
|
||
let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
|
||
let exp: Vec<f32> = logits.iter().map(|&l| (l - max).exp()).collect();
|
||
let sum: f32 = exp.iter().sum();
|
||
exp.iter().map(|&e| e / sum).collect()
|
||
}
|
||
```
|
||
|
||
## Features
|
||
|
||
- `simd` - SIMD acceleration (default, enabled)
|
||
- `wasm` - WebAssembly support
|
||
- `napi` - Node.js bindings
|
||
|
||
## Documentation
|
||
|
||
- [SDK Guide](docs/SDK_GUIDE.md) - Comprehensive SDK usage guide
|
||
- [API Documentation](https://docs.rs/ruvector-attention) - Full API reference
|
||
- [Examples](examples/) - Working code examples
|
||
|
||
## Contributing
|
||
|
||
Contributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md).
|
||
|
||
## License
|
||
|
||
Licensed under either of:
|
||
|
||
- Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE))
|
||
- MIT License ([LICENSE-MIT](LICENSE-MIT))
|
||
|
||
at your option.
|
||
|
||
## Citation
|
||
|
||
If you use this crate in your research, please cite:
|
||
|
||
```bibtex
|
||
@software{ruvector_attention,
|
||
title = {ruvector-attention: Advanced Attention Mechanisms for Vector Search},
|
||
author = {ruvector contributors},
|
||
year = {2025},
|
||
url = {https://github.com/ruvnet/ruvector}
|
||
}
|
||
```
|
||
|
||
## Related Projects
|
||
|
||
- [ruvector](../ruvector) - Core vector search engine
|
||
- [ruvector-graph](../ruvector-graph) - Graph neural networks
|
||
- [ruvector-gnn](../ruvector-gnn) - Geometric neural networks
|