# SPARC Specification: ruvector-attention Crate
**Version**: 1.0.0
**Date**: 2025-11-30
**Status**: Draft
**Authors**: RuVector Research Team
**SPARC Phase**: Specification
---
## Table of Contents
1. [Executive Summary](#1-executive-summary)
2. [Requirements Analysis](#2-requirements-analysis)
3. [Module Architecture](#3-module-architecture)
4. [API Design](#4-api-design)
5. [Performance Targets](#5-performance-targets)
6. [Compatibility Matrix](#6-compatibility-matrix)
7. [Testing Strategy](#7-testing-strategy)
8. [Success Criteria](#8-success-criteria)
9. [Constraints and Dependencies](#9-constraints-and-dependencies)
10. [Risk Assessment](#10-risk-assessment)
---
## 1. Executive Summary
### 1.1 Vision
Create a modular, high-performance attention mechanism library specifically designed for GNN latent space operations in RuVector. The `ruvector-attention` crate will implement **10 distinct attention mechanisms** from research literature, enabling researchers and practitioners to experiment with different attention strategies for graph-structured data.
**Core Mission**: Bridge the gap between latent space representations and graph topology through specialized attention mechanisms optimized for HNSW-based vector databases.
### 1.2 Goals
**Primary Goals**:
1. **Modularity**: Each attention mechanism is a standalone, composable component
2. **Performance**: Achieve <200ms latency for 95% of attention operations on 1000-neighbor graphs
3. **Compatibility**: Support WASM, NAPI-RS (Node.js), CLI, and Rust SDK environments
4. **Extensibility**: Easy to add new attention mechanisms without modifying core APIs
5. **Research-Driven**: Implement cutting-edge attention mechanisms from academic literature
**Secondary Goals**:
1. Provide benchmarking tools for comparing attention mechanisms
2. Enable automatic mechanism selection based on graph properties
3. Support distributed/parallel attention computation
4. Maintain numerical stability across all implementations
### 1.3 Performance Targets
| Metric | Target | Stretch Goal |
|--------|--------|--------------|
| **Latency (p95)** | <200ms @ 1K neighbors | <100ms @ 1K neighbors |
| **Throughput** | 5,000 ops/sec | 10,000 ops/sec |
| **Memory (per op)** | <50MB @ 1K neighbors | <25MB @ 1K neighbors |
| **WASM Binary Size** | <2MB (gzipped) | <1MB (gzipped) |
| **Compilation Time** | <60s (release) | <30s (release) |
| **Test Coverage** | >90% | >95% |
### 1.4 Timeline Overview
**Phase 1 (Weeks 1-4)**: Core attention primitives + Multi-head attention
**Phase 2 (Weeks 5-8)**: Geometric attention (Hyperbolic, Edge-featured)
**Phase 3 (Weeks 9-12)**: Sparse and efficient mechanisms (Flash, Linear)
**Phase 4 (Weeks 13-16)**: Adaptive mechanisms (MoE, Cross-attention)
**Phase 5 (Weeks 17-20)**: Integration, optimization, documentation
---
## 2. Requirements Analysis
### 2.1 Functional Requirements
#### FR-001: Core Attention Mechanisms
**Priority**: CRITICAL
**Description**: Implement foundational attention mechanisms
**Acceptance Criteria**:
- [x] FR-001.1: Scaled Dot-Product Attention (baseline)
- [ ] FR-001.2: Multi-Head Attention (2-16 heads configurable)
- [ ] FR-001.3: Supports variable-length input sequences
- [ ] FR-001.4: Numerically stable softmax implementation
- [ ] FR-001.5: Gradient computation for backpropagation
**Test Cases**:
```rust
#[test]
fn test_scaled_dot_product_attention() {
    let attn = ScaledDotProductAttention::new(128).unwrap();
    let query = vec![1.0; 128];
    let keys = vec![vec![1.0; 128]; 10];
    let values = vec![vec![1.0; 128]; 10];

    let output = attn.forward(&query, &keys, &values).unwrap();
    assert_eq!(output.len(), 128);
    assert!(output.iter().all(|&x| x.is_finite()));
}
```
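FR-001.4 calls for a numerically stable softmax. Below is a minimal sketch of the standard max-subtraction trick (illustrative only; the final `utils::softmax` signature may differ):

```rust
/// Numerically stable softmax: subtracting the maximum score before
/// exponentiating keeps exp() from overflowing for large scores.
fn stable_softmax(scores: &[f32]) -> Vec<f32> {
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|&s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

fn main() {
    // Scores this large would overflow a naive softmax; here they stay finite.
    let weights = stable_softmax(&[1000.0, 999.0, 998.0]);
    assert!(weights.iter().all(|w| w.is_finite()));
    assert!((weights.iter().sum::<f32>() - 1.0).abs() < 1e-5);
}
```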
---
#### FR-002: Geometric Attention Mechanisms
**Priority**: HIGH
**Description**: Implement attention mechanisms aware of geometric structure
**Acceptance Criteria**:
- [ ] FR-002.1: Edge-Featured Attention (GAT-style with edge attributes)
- [ ] FR-002.2: Hyperbolic Attention (Poincaré ball model)
- [ ] FR-002.3: Mixed-Curvature Attention (Euclidean + Hyperbolic fusion)
- [ ] FR-002.4: Manifold-Aware Attention
**Edge-Featured Attention**:
```
score(i, j) = LeakyReLU(a^T [W·h_i || W·h_j || W_e·edge_ij])
```
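For illustration, a self-contained sketch of this scoring rule with the learned projections `W` and `W_e` replaced by the identity (the helper name is hypothetical, not the crate API):

```rust
/// Edge-featured score: concatenate the two node embeddings and the edge
/// feature, dot with the attention vector `a`, then apply LeakyReLU.
fn edge_featured_score(h_i: &[f32], h_j: &[f32], edge: &[f32], a: &[f32]) -> f32 {
    let concat: Vec<f32> = h_i.iter().chain(h_j).chain(edge).copied().collect();
    let z: f32 = concat.iter().zip(a).map(|(x, w)| x * w).sum();
    if z > 0.0 { z } else { 0.2 * z } // LeakyReLU with slope 0.2 (GAT default)
}

fn main() {
    let (h_i, h_j, e) = (vec![0.5_f32; 4], vec![0.1_f32; 4], vec![1.0_f32; 2]);
    let a = vec![0.3_f32; 10]; // matches the 4 + 4 + 2 concatenated dimensions
    let score = edge_featured_score(&h_i, &h_j, &e, &a);
    assert!(score.is_finite());
}
```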
**Hyperbolic Attention**:
```
distance_poincare(x, y) = arccosh(1 + 2||x-y||² / ((1-||x||²)(1-||y||²)))
score(i, j) = -distance_poincare(q_i, k_j)
```
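A minimal sketch of the Poincaré distance above, assuming the inputs have already been projected inside the open unit ball:

```rust
/// Poincaré distance for curvature -1, following the formula above.
fn poincare_distance(x: &[f32], y: &[f32]) -> f32 {
    let sq_norm = |v: &[f32]| v.iter().map(|a| a * a).sum::<f32>();
    let diff_sq: f32 = x.iter().zip(y).map(|(a, b)| (a - b) * (a - b)).sum();
    let denom = (1.0 - sq_norm(x)) * (1.0 - sq_norm(y));
    // arccosh(z) = ln(z + sqrt(z² - 1))
    let z = 1.0 + 2.0 * diff_sq / denom.max(f32::EPSILON);
    (z + (z * z - 1.0).sqrt()).ln()
}

fn main() {
    let (x, y) = (vec![0.1_f32, 0.2], vec![0.3_f32, -0.1]);
    let d = poincare_distance(&x, &y);
    assert!(d.is_finite() && d >= 0.0);
    // Attention score as defined above: score = -distance
    assert!(-d <= 0.0);
}
```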
**Test Cases**:
```rust
#[test]
fn test_edge_featured_attention() {
    let attn = EdgeFeaturedAttention::new(128, 32).unwrap();
    let query = vec![1.0; 128];
    let keys = vec![vec![1.0; 128]; 10];
    let values = keys.clone();
    let edge_features = vec![vec![1.0; 32]; 10];

    let output = attn
        .forward_with_edges(&query, &keys, &values, &edge_features)
        .unwrap();
    assert_eq!(output.len(), 128);
}

#[test]
fn test_hyperbolic_attention_bounds() {
    let attn = HyperbolicAttention::new(128, -1.0).unwrap();
    let query = vec![0.05; 128]; // ||query|| ≈ 0.57, inside the Poincaré ball
    let keys = vec![vec![0.05; 128]; 10];
    let values = keys.clone();

    // Ensure the output stays inside the ball (||x|| < 1)
    let output = attn.forward(&query, &keys, &values).unwrap();
    assert!(l2_norm(&output) < 0.99);
}
```
---
#### FR-003: Sparse Attention Patterns
**Priority**: HIGH
**Description**: Reduce O(n²) complexity through sparsity
**Acceptance Criteria**:
- [ ] FR-003.1: Local + Global Attention (Longformer-style)
- [ ] FR-003.2: Linear Attention (Performer/FAVOR+)
- [ ] FR-003.3: Flash Attention (memory-efficient tiling)
- [ ] FR-003.4: Configurable sparsity patterns
**Local + Global Pattern**:
```
Attention Matrix:
[L L L G 0 0 0 0] L = Local (1-hop neighbors)
[L L L L G 0 0 0] G = Global (HNSW higher layers)
[L L L L L G 0 0] 0 = No attention
...
```
**Complexity Requirements**:
- Local + Global: O(k_local + k_global) where k << n
- Linear: O(n·d) where d = feature dimension
- Flash: O(n) memory (vs O(n²) standard)
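As a sketch of how the local + global pattern keeps the attended set small (the helper name `local_global_indices` is illustrative, not the crate API):

```rust
use std::collections::BTreeSet;

/// Build the set of neighbor indices a node attends to under a
/// local + global pattern: the first `local_window` graph neighbors plus a
/// handful of global entry points (e.g. HNSW upper-layer nodes).
fn local_global_indices(
    num_neighbors: usize,
    local_window: usize,
    global_indices: &[usize],
) -> Vec<usize> {
    let mut selected: BTreeSet<usize> = (0..local_window.min(num_neighbors)).collect();
    selected.extend(global_indices.iter().copied().filter(|&i| i < num_neighbors));
    selected.into_iter().collect()
}

fn main() {
    // 1000 neighbors, but attention touches only 10 local + 5 global nodes.
    let attended = local_global_indices(1000, 10, &[100, 250, 500, 750, 999]);
    assert!(attended.len() <= 15);
}
```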
**Test Cases**:
```rust
#[test]
fn test_sparse_attention_complexity() {
    // local_window = 10, global_nodes = 5
    let sparse_attn = SparseGraphAttention::new(10, 5);

    // Should only attend to 15 nodes, not all 1000
    let num_neighbors = 1000;
    let attention_mask = sparse_attn.get_attention_mask(num_neighbors);
    assert!(attention_mask.count_nonzero() <= 15);
}
```
---
#### FR-004: Graph-Aware Mechanisms
**Priority**: HIGH
**Description**: Attention specialized for graph structure
**Acceptance Criteria**:
- [ ] FR-004.1: RoPE (Rotary Position Embeddings) for graph distance
- [ ] FR-004.2: HNSW-layer encoding in attention
- [ ] FR-004.3: Cross-Attention (Dual-Space: graph + latent)
- [ ] FR-004.4: Structural feature integration (degree, centrality)
**RoPE for Graphs**:
```
// Encode graph distance via rotation
rotation_angle = graph_distance / base^(2i/d)
rotated[i] = emb[i] * cos(θ) - emb[i+1] * sin(θ)
```
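A sketch of the pairwise rotation this pseudocode describes, with the graph distance standing in for the token position (illustrative; the crate's `GraphRoPE` may pack dimensions differently):

```rust
/// RoPE applied to graph distance: rotate each pair of dimensions by an
/// angle that shrinks with the dimension index.
fn rope_rotate(embedding: &[f32], graph_distance: f32, base: f32) -> Vec<f32> {
    let d = embedding.len();
    let mut out = embedding.to_vec();
    for i in (0..d.saturating_sub(1)).step_by(2) {
        let theta = graph_distance / base.powf(i as f32 / d as f32);
        let (sin, cos) = theta.sin_cos();
        let (a, b) = (embedding[i], embedding[i + 1]);
        out[i] = a * cos - b * sin;
        out[i + 1] = a * sin + b * cos;
    }
    out
}

fn main() {
    let emb = vec![1.0_f32; 8];
    // Distance 0 leaves the embedding unchanged; larger distances rotate it.
    assert_eq!(rope_rotate(&emb, 0.0, 10_000.0), emb);
    let rotated = rope_rotate(&emb, 3.0, 10_000.0);
    assert_eq!(rotated.len(), emb.len());
}
```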
**Cross-Attention**:
```
graph_attn = Attention(h, N_graph(h), N_graph(h))
latent_attn = Attention(h, N_latent(h), N_latent(h))
cross_attn = Attention(graph_attn, N_latent(h), N_latent(h))
output = Fusion(graph_attn, latent_attn, cross_attn)
```
---
#### FR-005: Adaptive Mechanisms
**Priority**: MEDIUM
**Description**: Attention that adapts to input patterns
**Acceptance Criteria**:
- [ ] FR-005.1: Mixture of Experts (MoE) Attention
- [ ] FR-005.2: Learned routing between attention types
- [ ] FR-005.3: RL-based navigation function learning
- [ ] FR-005.4: Dynamic head count adjustment
**MoE Attention**:
```
router_scores = Router(query)
expert_indices = topk(router_scores, k=2)
output = Σ router_scores[i] * Expert[i](query, keys, values)
```
**Experts**:
1. Local Expert: Standard attention for 1-hop neighbors
2. Hierarchical Expert: Hyperbolic attention for HNSW layers
3. Global Expert: Linear attention for distant nodes
4. Structural Expert: Edge-featured attention
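A minimal sketch of the routing step above: select the top-k experts and renormalize their weights (the real `Router` is a learned linear layer; this standalone helper is illustrative only):

```rust
/// Pick the k experts with the highest router scores and renormalize the
/// selected scores so the mixture weights sum to one.
fn top_k_routing(router_scores: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut indexed: Vec<(usize, f32)> = router_scores.iter().copied().enumerate().collect();
    indexed.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
    indexed.truncate(k);
    let total: f32 = indexed.iter().map(|(_, s)| s).sum();
    indexed.into_iter().map(|(i, s)| (i, s / total)).collect()
}

fn main() {
    // Four experts (local, hierarchical, global, structural), activate top-2.
    let routing = top_k_routing(&[0.1, 0.7, 0.05, 0.15], 2);
    assert_eq!(routing.len(), 2);
    assert_eq!(routing[0].0, 1); // the hierarchical expert has the highest score
    let weight_sum: f32 = routing.iter().map(|(_, w)| w).sum();
    assert!((weight_sum - 1.0).abs() < 1e-6);
}
```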
---
#### FR-006: Training and Optimization Utilities
**Priority**: HIGH
**Description**: Tools for training attention-based models
**Acceptance Criteria**:
- [ ] FR-006.1: Contrastive losses (InfoNCE, Local Contrastive)
- [ ] FR-006.2: Spectral regularization (Laplacian smoothness)
- [ ] FR-006.3: Multi-objective loss balancing
- [ ] FR-006.4: Curriculum learning schedules
- [ ] FR-006.5: Hard negative mining
---
#### FR-007: Tensor Compression
**Priority**: MEDIUM
**Description**: Memory-efficient tensor operations
**Acceptance Criteria**:
- [ ] FR-007.1: Quantization (INT8, INT4)
- [ ] FR-007.2: Low-rank factorization
- [ ] FR-007.3: Sparse tensor storage
- [ ] FR-007.4: Hierarchical compression for HNSW layers
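To make FR-007.1 concrete, a sketch of symmetric per-tensor INT8 quantization (illustrative; the crate's `Quantizer` API is not yet fixed):

```rust
/// Symmetric INT8 quantization: scale = max|x| / 127, stored values are
/// round(x / scale). Reconstruction error is bounded by half a step.
fn quantize_int8(values: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = values.iter().fold(0.0_f32, |m, &v| m.max(v.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q = values.iter().map(|&v| (v / scale).round() as i8).collect();
    (q, scale)
}

fn dequantize_int8(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}

fn main() {
    let original = vec![0.8, -0.25, 0.031, 0.0];
    let (q, scale) = quantize_int8(&original);
    let restored = dequantize_int8(&q, scale);
    for (a, b) in original.iter().zip(&restored) {
        // Error is at most half a quantization step.
        assert!((a - b).abs() <= scale / 2.0 + f32::EPSILON);
    }
}
```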
---
#### FR-008: SIMD Optimizations
**Priority**: MEDIUM
**Description**: Vectorized operations for performance
**Acceptance Criteria**:
- [ ] FR-008.1: AVX2/AVX-512 support for x86_64
- [ ] FR-008.2: NEON support for ARM
- [ ] FR-008.3: WASM SIMD support
- [ ] FR-008.4: Automatic fallback to scalar operations
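A sketch of the runtime-dispatch-with-scalar-fallback pattern behind FR-008.1 through FR-008.4, shown for the dot product used in attention scoring (the AVX2 kernel body is omitted; `is_x86_feature_detected!` is a stable standard-library macro):

```rust
/// Dot product with runtime CPU-feature dispatch and a portable fallback.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // A real AVX2 kernel (an `unsafe fn` annotated with
            // #[target_feature(enable = "avx2")]) would be dispatched here;
            // this sketch simply falls through to the portable path below.
        }
    }
    // Portable scalar fallback (also the path on WASM and non-NEON targets).
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let a = vec![1.0_f32, 2.0, 3.0];
    let b = vec![4.0_f32, 5.0, 6.0];
    assert_eq!(dot(&a, &b), 32.0);
}
```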
---
### 2.2 Non-Functional Requirements
#### NFR-001: Performance
**NFR-001.1**: Latency
- **Requirement**: p95 latency <200ms for 1000-neighbor attention
- **Measurement**: Benchmark suite with synthetic graphs
- **Verification**: CI/CD performance regression tests
**NFR-001.2**: Throughput
- **Requirement**: 5,000 attention operations per second
- **Measurement**: Batch processing benchmarks
- **Verification**: Load testing with real HNSW graphs
**NFR-001.3**: Memory
- **Requirement**: Peak memory <50MB per operation
- **Measurement**: Memory profiling with valgrind/heaptrack
- **Verification**: Memory regression tests in CI
**NFR-001.4**: Scalability
- **Requirement**: Linear scaling up to 10K neighbors
- **Measurement**: Complexity analysis and empirical benchmarks
- **Verification**: Big-O complexity proofs + empirical validation
---
#### NFR-002: Reliability
**NFR-002.1**: Numerical Stability
- **Requirement**: All outputs finite (no NaN, Inf) across 10M operations
- **Measurement**: Fuzzing with random inputs
- **Verification**: Property-based testing with proptest
**NFR-002.2**: Error Handling
- **Requirement**: All errors recoverable, 100% error path coverage
- **Measurement**: Error injection testing
- **Verification**: Unit tests for error cases
**NFR-002.3**: Determinism
- **Requirement**: Same inputs produce same outputs (no random behavior)
- **Measurement**: Repeated execution tests
- **Verification**: Determinism tests in CI
---
#### NFR-003: Maintainability
**NFR-003.1**: Code Quality
- **Requirement**: Clippy clean (zero warnings), rustfmt formatted
- **Measurement**: CI linting checks
- **Verification**: Pre-commit hooks + CI gates
**NFR-003.2**: Documentation
- **Requirement**: 100% public API documented with examples
- **Measurement**: rustdoc coverage tool
- **Verification**: Doc tests pass, examples compile
**NFR-003.3**: Test Coverage
- **Requirement**: >90% line coverage, >95% branch coverage
- **Measurement**: cargo-tarpaulin
- **Verification**: CI coverage reports
---
#### NFR-004: Portability
**NFR-004.1**: Platform Support
- **Requirement**: Linux, macOS, Windows support
- **Measurement**: CI testing on all platforms
- **Verification**: Cross-platform integration tests
**NFR-004.2**: WASM Compatibility
- **Requirement**: Full functionality in WASM (wasm32-unknown-unknown)
- **Measurement**: WASM-specific test suite
- **Verification**: Browser and Node.js WASM tests
**NFR-004.3**: NAPI-RS Support
- **Requirement**: All attention mechanisms callable from Node.js
- **Measurement**: Node.js integration tests
- **Verification**: NPM package smoke tests
---
#### NFR-005: Security
**NFR-005.1**: Memory Safety
- **Requirement**: Zero unsafe code blocks (or 100% audited unsafe)
- **Measurement**: Manual code review
- **Verification**: MIRI checks, cargo-geiger
**NFR-005.2**: Dependency Audit
- **Requirement**: All dependencies audited, no known CVEs
- **Measurement**: cargo-audit
- **Verification**: Automated dependency scanning in CI
---
### 2.3 Constraints
#### C-001: Compatibility Constraints
- **Rust Version**: MSRV 1.77+ (per workspace configuration)
- **No GPU**: All implementations must run on CPU (WASM/NAPI-RS requirement)
- **No Standard Library in WASM**: Must support `#![no_std]` for WASM32
#### C-002: API Constraints
- **Backwards Compatibility**: Once 1.0 released, follow SemVer strictly
- **Trait Consistency**: All attention mechanisms implement common `Attention` trait
- **Builder Pattern**: Configuration via builders, not constructors
#### C-003: Performance Constraints
- **Compilation Time**: Release build <60s on CI runners
- **Binary Size**: WASM bundle <2MB gzipped
- **Memory Footprint**: No global allocators, stack-preferred where possible
#### C-004: Licensing Constraints
- **License**: MIT (per workspace)
- **Dependency Licenses**: MIT/Apache-2.0 only (no GPL/LGPL)
---
## 3. Module Architecture
### 3.1 Crate Structure
```
ruvector-attention/
├── Cargo.toml
├── README.md
├── LICENSE
├── src/
│ ├── lib.rs # Public API, re-exports
│ │
│ ├── core/ # Core attention primitives
│ │ ├── mod.rs # Core module exports
│ │ ├── base.rs # Attention trait definition
│ │ ├── scaled_dot.rs # Scaled dot-product attention
│ │ ├── multi_head.rs # Multi-head attention
│ │ └── config.rs # Configuration structs
│ │
│ ├── geometric/ # Geometric attention
│ │ ├── mod.rs
│ │ ├── hyperbolic.rs # Poincaré ball attention
│ │ ├── edge_featured.rs # GAT-style edge attention
│ │ ├── mixed_curvature.rs # Euclidean + Hyperbolic
│ │ └── manifold.rs # General manifold attention
│ │
│ ├── sparse/ # Sparse patterns
│ │ ├── mod.rs
│ │ ├── local_global.rs # Longformer-style
│ │ ├── linear.rs # Performer/FAVOR+
│ │ ├── flash.rs # Flash Attention (tiled)
│ │ └── patterns.rs # Sparsity pattern utilities
│ │
│ ├── graph/ # Graph-aware attention
│ │ ├── mod.rs
│ │ ├── rope_graph.rs # RoPE for graph distances
│ │ ├── cross_space.rs # Dual-space cross-attention
│ │ ├── hnsw_aware.rs # HNSW layer encoding
│ │ └── structural.rs # Degree/centrality features
│ │
│ ├── adaptive/ # Adaptive/learned mechanisms
│ │ ├── mod.rs
│ │ ├── moe.rs # Mixture of Experts
│ │ ├── learned_routing.rs # Attention routing
│ │ ├── rl_navigator.rs # RL-based graph navigation
│ │ └── dynamic_heads.rs # Adaptive head count
│ │
│ ├── training/ # Training utilities
│ │ ├── mod.rs
│ │ ├── losses.rs # Contrastive, reconstruction
│ │ ├── optimizers.rs # SGD, Adam, etc.
│ │ ├── regularizers.rs # Spectral, L2, etc.
│ │ ├── curriculum.rs # Curriculum learning
│ │ └── hard_negatives.rs # Negative sampling
│ │
│ ├── compression/ # Tensor compression
│ │ ├── mod.rs
│ │ ├── quantization.rs # INT8/INT4 quantization
│ │ ├── low_rank.rs # SVD/Tucker decomposition
│ │ ├── sparse_storage.rs # CSR/COO sparse tensors
│ │ └── hierarchical.rs # Layer-wise compression
│ │
│ ├── simd/ # SIMD optimizations
│ │ ├── mod.rs
│ │ ├── avx2.rs # AVX2 kernels
│ │ ├── avx512.rs # AVX-512 kernels
│ │ ├── neon.rs # ARM NEON kernels
│ │ ├── wasm_simd.rs # WASM SIMD
│ │ └── dispatch.rs # Runtime detection
│ │
│ ├── utils/ # Utilities
│ │ ├── mod.rs
│ │ ├── math.rs # Math primitives
│ │ ├── tensor.rs # Tensor ops
│ │ ├── softmax.rs # Numerically stable softmax
│ │ └── distances.rs # Distance metrics
│ │
│ └── prelude.rs # Common imports
├── benches/ # Benchmarks
│ ├── attention_benchmark.rs # Core attention benchmarks
│ ├── geometric_benchmark.rs # Geometric attention
│ ├── sparse_benchmark.rs # Sparse patterns
│ └── comparison_benchmark.rs # Mechanism comparison
├── tests/ # Integration tests
│ ├── core_tests.rs # Core attention tests
│ ├── geometric_tests.rs # Geometric tests
│ ├── sparse_tests.rs # Sparse pattern tests
│ ├── numerical_stability.rs # Stability tests
│ └── property_tests.rs # Property-based tests
├── ffi/ # Foreign Function Interface
│ ├── wasm/ # WASM bindings
│ │ ├── Cargo.toml
│ │ ├── src/
│ │ │ └── lib.rs # wasm-bindgen exports
│ │ └── tests/
│ │ └── web.rs # Browser tests
│ │
│ └── napi/ # NAPI-RS bindings
│ ├── Cargo.toml
│ ├── src/
│ │ └── lib.rs # napi-derive exports
│ └── index.d.ts # TypeScript definitions
├── cli/ # CLI interface
│ ├── Cargo.toml
│ └── src/
│ ├── main.rs # CLI entry point
│ ├── commands/ # CLI commands
│ │ ├── benchmark.rs # Run benchmarks
│ │ ├── compare.rs # Compare mechanisms
│ │ └── analyze.rs # Analyze attention patterns
│ └── output.rs # Formatting
├── examples/ # Examples
│ ├── basic_attention.rs # Hello world
│ ├── graph_attention.rs # Graph-aware usage
│ ├── hnsw_integration.rs # HNSW integration
│ ├── custom_mechanism.rs # Extending the library
│ └── distributed_attention.rs # Parallel processing
└── docs/ # Documentation
├── design/ # Design documents
│ ├── architecture.md # Architecture overview
│ ├── api_design.md # API design rationale
│ └── performance.md # Performance analysis
├── guides/ # User guides
│ ├── getting_started.md # Quick start
│ ├── mechanism_guide.md # Choosing mechanisms
│ └── integration.md # Integration guide
└── research/ # Research notes
├── attention_mechanisms.md
├── benchmarks.md
└── experiments.md
```
### 3.2 Module Responsibilities
#### Core Module (`src/core/`)
**Responsibility**: Foundational attention mechanisms and trait definitions
**Key Components**:
- `Attention` trait: Common interface for all mechanisms
- `ScaledDotProductAttention`: Baseline implementation
- `MultiHeadAttention`: Standard multi-head decomposition
- `AttentionConfig`: Configuration builders
**Dependencies**: `utils` only
---
#### Geometric Module (`src/geometric/`)
**Responsibility**: Geometry-aware attention mechanisms
**Key Components**:
- `HyperbolicAttention`: Poincaré ball operations
- `EdgeFeaturedAttention`: GAT-style with edge features
- `MixedCurvatureAttention`: Product space (Euclidean × Hyperbolic)
**Dependencies**: `core`, `utils`
---
#### Sparse Module (`src/sparse/`)
**Responsibility**: Efficient sparse attention patterns
**Key Components**:
- `LocalGlobalAttention`: Longformer-style
- `LinearAttention`: Kernel-based approximation
- `FlashAttention`: Memory-efficient tiling
**Dependencies**: `core`, `utils`, `simd` (optional)
---
#### Graph Module (`src/graph/`)
**Responsibility**: Graph structure-aware mechanisms
**Key Components**:
- `GraphRoPE`: Rotary embeddings for graph distance
- `CrossSpaceAttention`: Dual topology + latent space
- `HNSWAwareAttention`: HNSW layer encoding
**Dependencies**: `core`, `geometric`, `utils`
---
#### Adaptive Module (`src/adaptive/`)
**Responsibility**: Learned and adaptive attention
**Key Components**:
- `MoEAttention`: Mixture of experts routing
- `RLNavigator`: Reinforcement learning-based navigation
- `DynamicHeadAttention`: Runtime head count adjustment
**Dependencies**: `core`, `geometric`, `sparse`, `graph`, `training`
---
#### Training Module (`src/training/`)
**Responsibility**: Loss functions and optimization
**Key Components**:
- `ContrastiveLoss`: InfoNCE, Triplet
- `SpectralRegularizer`: Laplacian smoothness
- `HardNegativeSampler`: Mining hard negatives
- `CurriculumScheduler`: Loss weight scheduling
**Dependencies**: `utils` only
---
#### Compression Module (`src/compression/`)
**Responsibility**: Memory-efficient tensor storage
**Key Components**:
- `Quantizer`: INT8/INT4 quantization
- `LowRankFactorizer`: SVD compression
- `SparseStorage`: CSR/COO formats
**Dependencies**: `utils`, `simd` (optional)
---
#### SIMD Module (`src/simd/`)
**Responsibility**: Vectorized operations
**Key Components**:
- `SimdDispatcher`: Runtime CPU feature detection
- Platform-specific kernels: AVX2, AVX-512, NEON, WASM SIMD
**Dependencies**: `utils` only
---
### 3.3 Dependency Graph
```
          ┌─────────────────────────────────────┐
          │         Public API (lib.rs)         │
          └─────────────────────────────────────┘
           ┌─────────────────┬─────────────────┐
           │                 │                 │
           v                 v                 v
     ┌──────────┐      ┌──────────┐      ┌──────────┐
     │   core   │      │ training │      │  utils   │
     └──────────┘      └──────────┘      └──────────┘
           │                                   │
           ├─────────────────┬─────────────────┤
           │                 │                 │
           v                 v                 v
     ┌──────────┐      ┌──────────┐      ┌──────────┐
     │geometric │      │  sparse  │      │   simd   │
     └──────────┘      └──────────┘      └──────────┘
           │                 │
           └────────┬────────┘
                    │
                    v
              ┌──────────┐
              │  graph   │
              └──────────┘
                    │
                    v
              ┌──────────┐
              │ adaptive │
              └──────────┘
                    │
                    v
             ┌─────────────┐
             │ compression │
             └─────────────┘
```
**Design Principles**:
1. **Acyclic**: No circular dependencies
2. **Layered**: Lower layers have fewer dependencies
3. **Optional Features**: SIMD, compression via feature flags
4. **Core Stability**: `core` and `utils` are most stable
---
## 4. API Design
### 4.1 Core Trait: `Attention`
```rust
/// Core trait for all attention mechanisms
pub trait Attention: Send + Sync {
    /// Forward pass: compute attention over keys/values given query
    ///
    /// # Arguments
    /// * `query` - Query vector (d-dimensional)
    /// * `keys` - Key vectors (n × d)
    /// * `values` - Value vectors (n × d)
    ///
    /// # Returns
    /// Attention-weighted aggregation of values (d-dimensional)
    ///
    /// # Example
    /// ```
    /// use ruvector_attention::core::ScaledDotProductAttention;
    /// use ruvector_attention::Attention;
    ///
    /// let attn = ScaledDotProductAttention::new(128).unwrap();
    /// let query = vec![1.0; 128];
    /// let keys = vec![vec![0.5; 128]; 10];
    /// let values = keys.clone();
    ///
    /// let output = attn.forward(&query, &keys, &values).unwrap();
    /// assert_eq!(output.len(), 128);
    /// ```
    fn forward(
        &self,
        query: &[f32],
        keys: &[Vec<f32>],
        values: &[Vec<f32>],
    ) -> Result<Vec<f32>, AttentionError>;

    /// Get attention weights without computing the weighted sum
    ///
    /// Useful for visualization and debugging
    fn attention_weights(
        &self,
        query: &[f32],
        keys: &[Vec<f32>],
    ) -> Result<Vec<f32>, AttentionError>;

    /// Get hidden dimension
    fn hidden_dim(&self) -> usize;

    /// Check if mechanism supports variable-length inputs
    fn supports_variable_length(&self) -> bool {
        true
    }

    /// Estimated computational complexity (for documentation)
    fn complexity(&self) -> Complexity {
        Complexity::Quadratic
    }
}

/// Computational complexity categories
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Complexity {
    Linear,               // O(n)
    Linearithmic,         // O(n log n)
    Quadratic,            // O(n²)
    Custom(&'static str), // Custom complexity description
}
```
---
### 4.2 Error Handling
```rust
/// Errors that can occur during attention computation
#[derive(Debug, thiserror::Error)]
pub enum AttentionError {
    /// Input dimension mismatch
    #[error("Dimension mismatch: expected {expected}, got {actual}")]
    DimensionMismatch { expected: usize, actual: usize },

    /// Empty input
    #[error("Empty input: {context}")]
    EmptyInput { context: String },

    /// Numerical instability detected
    #[error("Numerical instability: {message}")]
    NumericalInstability { message: String },

    /// Invalid configuration
    #[error("Invalid configuration: {message}")]
    InvalidConfig { message: String },

    /// Out of bounds access
    #[error("Index out of bounds: {index} >= {len}")]
    OutOfBounds { index: usize, len: usize },

    /// Unsupported operation
    #[error("Unsupported operation: {operation}")]
    Unsupported { operation: String },

    /// Internal error
    #[error("Internal error: {message}")]
    Internal { message: String },
}

pub type Result<T> = std::result::Result<T, AttentionError>;
```
---
### 4.3 Builder Pattern
```rust
/// Builder for ScaledDotProductAttention
#[derive(Debug, Clone)]
pub struct ScaledDotProductAttentionBuilder {
hidden_dim: usize,
dropout: Option<f32>,
temperature: f32,
normalize: bool,
}
impl ScaledDotProductAttentionBuilder {
pub fn new(hidden_dim: usize) -> Self {
Self {
hidden_dim,
dropout: None,
temperature: 1.0,
normalize: true,
}
}
pub fn dropout(mut self, rate: f32) -> Self {
assert!((0.0..=1.0).contains(&rate), "Dropout must be in [0, 1]");
self.dropout = Some(rate);
self
}
pub fn temperature(mut self, temp: f32) -> Self {
assert!(temp > 0.0, "Temperature must be positive");
self.temperature = temp;
self
}
pub fn normalize(mut self, normalize: bool) -> Self {
self.normalize = normalize;
self
}
pub fn build(self) -> Result<ScaledDotProductAttention> {
if self.hidden_dim == 0 {
return Err(AttentionError::InvalidConfig {
message: "hidden_dim must be > 0".to_string(),
});
}
Ok(ScaledDotProductAttention {
hidden_dim: self.hidden_dim,
scale: (self.hidden_dim as f32).sqrt().recip(),
dropout: self.dropout,
temperature: self.temperature,
normalize: self.normalize,
})
}
}
// Usage:
let attn = ScaledDotProductAttention::builder(128)
.dropout(0.1)
.temperature(0.07)
.build()?;
```
---
### 4.4 Multi-Head Attention API
```rust
/// Multi-head attention with configurable heads
pub struct MultiHeadAttention {
num_heads: usize,
head_dim: usize,
hidden_dim: usize,
w_q: Vec<Linear>, // Query projections per head
w_k: Vec<Linear>, // Key projections per head
w_v: Vec<Linear>, // Value projections per head
w_o: Linear, // Output projection
dropout: Option<f32>,
}
impl MultiHeadAttention {
pub fn builder(hidden_dim: usize, num_heads: usize) -> MultiHeadAttentionBuilder {
MultiHeadAttentionBuilder::new(hidden_dim, num_heads)
}
/// Get attention patterns for all heads
pub fn head_attention_weights(
&self,
query: &[f32],
keys: &[Vec<f32>],
) -> Result<Vec<Vec<f32>>> {
// Returns [num_heads × num_keys] attention weights
// Useful for interpretability
}
/// Get specific head output
pub fn head_output(
&self,
head_idx: usize,
query: &[f32],
keys: &[Vec<f32>],
values: &[Vec<f32>],
) -> Result<Vec<f32>> {
// Get output of a single head (for debugging)
}
}
impl Attention for MultiHeadAttention {
fn forward(
&self,
query: &[f32],
keys: &[Vec<f32>],
values: &[Vec<f32>],
) -> Result<Vec<f32>> {
// 1. Project to heads: Q_i, K_i, V_i for each head i
// 2. Compute attention per head: head_i = Attention(Q_i, K_i, V_i)
// 3. Concatenate heads: concat(head_1, ..., head_h)
// 4. Output projection: W_o @ concat
}
}
```
---
### 4.5 Geometric Attention API
```rust
/// Hyperbolic attention in Poincaré ball
pub struct HyperbolicAttention {
hidden_dim: usize,
curvature: f32, // Negative curvature (e.g., -1.0)
w_q: Linear,
w_k: Linear,
w_v: Linear,
}
impl HyperbolicAttention {
/// Create new hyperbolic attention
///
/// # Arguments
/// * `hidden_dim` - Embedding dimension
/// * `curvature` - Curvature of hyperbolic space (must be negative)
pub fn new(hidden_dim: usize, curvature: f32) -> Result<Self> {
if curvature >= 0.0 {
return Err(AttentionError::InvalidConfig {
message: "Hyperbolic curvature must be negative".to_string(),
});
}
// ...
}
/// Poincaré distance between two points
pub fn poincare_distance(&self, x: &[f32], y: &[f32]) -> f32 {
// d(x,y) = arccosh(1 + 2||x-y||² / ((1-||x||²)(1-||y||²)))
}
/// Möbius addition (hyperbolic vector addition)
pub fn mobius_add(&self, x: &[f32], y: &[f32]) -> Vec<f32> {
// ((1+2⟨x,y⟩+||y||²)x + (1-||x||²)y) / (1+2⟨x,y⟩+||x||²||y||²)
}
/// Project point onto Poincaré ball (clip to ||x|| < 1)
pub fn project_to_ball(&self, x: &mut [f32], eps: f32) {
let norm = l2_norm(x);
if norm >= 1.0 - eps {
let scale = (1.0 - eps) / norm;
for xi in x.iter_mut() {
*xi *= scale;
}
}
}
}
impl Attention for HyperbolicAttention {
fn forward(
&self,
query: &[f32],
keys: &[Vec<f32>],
values: &[Vec<f32>],
) -> Result<Vec<f32>> {
// 1. Compute hyperbolic similarities: -d_poincare(q, k_j)
// 2. Softmax attention weights
// 3. Aggregate in hyperbolic space via Möbius operations
}
}
```
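A standalone sketch of Möbius addition for curvature -1, matching the formula in the comment above (the crate's method would additionally fold in the configured curvature and re-project onto the ball):

```rust
/// Möbius addition x ⊕ y in the Poincaré ball (curvature -1).
fn mobius_add(x: &[f32], y: &[f32]) -> Vec<f32> {
    let dot: f32 = x.iter().zip(y).map(|(a, b)| a * b).sum();
    let x_sq: f32 = x.iter().map(|a| a * a).sum();
    let y_sq: f32 = y.iter().map(|a| a * a).sum();
    let denom = 1.0 + 2.0 * dot + x_sq * y_sq;
    let cx = (1.0 + 2.0 * dot + y_sq) / denom;
    let cy = (1.0 - x_sq) / denom;
    x.iter().zip(y).map(|(a, b)| cx * a + cy * b).collect()
}

fn main() {
    // Adding the origin is the identity, a quick sanity check.
    let x = vec![0.3_f32, -0.2];
    let origin = vec![0.0_f32, 0.0];
    let sum = mobius_add(&x, &origin);
    for (a, b) in x.iter().zip(&sum) {
        assert!((a - b).abs() < 1e-6);
    }
}
```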
---
### 4.6 Graph-Aware Attention API
```rust
/// Attention with graph-specific features
pub trait GraphAttention: Attention {
/// Forward pass with edge features
fn forward_with_edges(
&self,
query: &[f32],
keys: &[Vec<f32>],
values: &[Vec<f32>],
edge_features: &[Vec<f32>],
) -> Result<Vec<f32>>;
/// Forward pass with graph metadata
fn forward_with_metadata(
&self,
query: &[f32],
keys: &[Vec<f32>],
values: &[Vec<f32>],
metadata: &GraphMetadata,
) -> Result<Vec<f32>>;
}
/// Graph metadata for attention
#[derive(Debug, Clone)]
pub struct GraphMetadata {
/// Graph distances (e.g., shortest path lengths)
pub distances: Option<Vec<f32>>,
/// HNSW layer indices
pub hnsw_layers: Option<Vec<usize>>,
/// Edge weights
pub edge_weights: Option<Vec<f32>>,
/// Structural features (degree, centrality, etc.)
pub structural_features: Option<Vec<Vec<f32>>>,
}
/// RoPE-enhanced attention for graphs
pub struct GraphRoPE {
hidden_dim: usize,
base: f32, // Frequency base (default 10000)
w_q: Linear,
w_k: Linear,
w_v: Linear,
}
impl GraphRoPE {
/// Apply rotation based on graph distance
pub fn apply_rotation(&self, embedding: &[f32], distance: f32) -> Vec<f32> {
// Rotate embedding by angle proportional to distance
}
}
impl GraphAttention for GraphRoPE {
fn forward_with_metadata(
&self,
query: &[f32],
keys: &[Vec<f32>],
values: &[Vec<f32>],
metadata: &GraphMetadata,
) -> Result<Vec<f32>> {
let distances = metadata.distances.as_ref()
.ok_or_else(|| AttentionError::InvalidConfig {
message: "GraphRoPE requires distance metadata".to_string(),
})?;
// Apply rotations based on distances
// Compute attention with rotated embeddings
}
}
```
---
### 4.7 Adaptive Attention API
```rust
/// Mixture of Experts attention
pub struct MoEAttention {
router: Linear, // Maps query to expert scores
experts: Vec<Box<dyn Attention>>,
top_k: usize, // Number of experts to activate
}
impl MoEAttention {
pub fn builder() -> MoEAttentionBuilder {
MoEAttentionBuilder::new()
}
/// Get routing decisions
pub fn get_routing(
&self,
query: &[f32],
) -> Result<Vec<(usize, f32)>> {
// Returns (expert_index, weight) pairs
}
/// Add an expert to the mixture
pub fn add_expert(&mut self, expert: Box<dyn Attention>) {
self.experts.push(expert);
}
}
impl Attention for MoEAttention {
fn forward(
&self,
query: &[f32],
keys: &[Vec<f32>],
values: &[Vec<f32>],
) -> Result<Vec<f32>> {
// 1. Route: scores = Router(query)
// 2. Select top-k experts
// 3. Weighted combination of expert outputs
}
}
/// Builder for MoE attention
pub struct MoEAttentionBuilder {
router_hidden_dim: usize,
experts: Vec<Box<dyn Attention>>,
top_k: usize,
}
impl MoEAttentionBuilder {
pub fn add_local_expert(mut self, hidden_dim: usize) -> Self {
self.experts.push(Box::new(
ScaledDotProductAttention::new(hidden_dim).unwrap()
));
self
}
pub fn add_hyperbolic_expert(mut self, hidden_dim: usize, curvature: f32) -> Self {
self.experts.push(Box::new(
HyperbolicAttention::new(hidden_dim, curvature).unwrap()
));
self
}
pub fn add_sparse_expert(mut self, local_window: usize, global_nodes: usize) -> Self {
self.experts.push(Box::new(
LocalGlobalAttention::new(local_window, global_nodes).unwrap()
));
self
}
pub fn top_k(mut self, k: usize) -> Self {
self.top_k = k;
self
}
pub fn build(self) -> Result<MoEAttention> {
// Validation and construction
}
}
```
---
### 4.8 Training Utilities API
```rust
/// Contrastive loss functions
pub mod losses {
/// InfoNCE contrastive loss
pub fn info_nce(
anchor: &[f32],
positives: &[&[f32]],
negatives: &[&[f32]],
temperature: f32,
) -> f32;
/// Triplet loss
pub fn triplet(
anchor: &[f32],
positive: &[f32],
negative: &[f32],
margin: f32,
) -> f32;
/// Local contrastive loss (graph-specific)
pub fn local_contrastive(
node_embedding: &[f32],
neighbor_embeddings: &[Vec<f32>],
non_neighbor_embeddings: &[Vec<f32>],
temperature: f32,
) -> f32;
}
/// Hard negative mining
pub mod hard_negatives {
pub enum SamplingStrategy {
Distance, // Most similar non-neighbors
Degree, // Similar degree distribution
Mixed, // Combination
}
pub fn sample_hard_negatives(
anchor: &[f32],
all_embeddings: &[Vec<f32>],
positive_indices: &[usize],
k: usize,
strategy: SamplingStrategy,
) -> Vec<Vec<f32>>;
}
/// Spectral regularization
pub mod regularizers {
/// Laplacian smoothness
pub fn laplacian(
embeddings: &[Vec<f32>],
edges: &[(usize, usize)],
edge_weights: Option<&[f32]>,
) -> f32;
/// Orthogonality regularization
pub fn orthogonality(embeddings: &[Vec<f32>]) -> f32;
/// Embedding norm regularization
pub fn norm_penalty(embeddings: &[Vec<f32>], target_norm: f32) -> f32;
}
```
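To pin down the intended semantics of `losses::info_nce`, a minimal single-positive sketch with cosine similarity (the real function generalizes this to multiple positives and negatives as declared above):

```rust
/// Cosine similarity between two vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb).max(f32::EPSILON)
}

/// InfoNCE with one positive: -log( exp(sim(a,p)/τ) / Σ exp(sim(a,·)/τ) ).
fn info_nce_single(
    anchor: &[f32],
    positive: &[f32],
    negatives: &[Vec<f32>],
    temperature: f32,
) -> f32 {
    let pos = (cosine(anchor, positive) / temperature).exp();
    let neg: f32 = negatives
        .iter()
        .map(|n| (cosine(anchor, n) / temperature).exp())
        .sum();
    -(pos / (pos + neg)).ln()
}

fn main() {
    let anchor = vec![1.0_f32, 0.0];
    let positive = vec![0.9_f32, 0.1];
    let negatives = vec![vec![-1.0_f32, 0.0], vec![0.0_f32, 1.0]];
    let loss = info_nce_single(&anchor, &positive, &negatives, 0.07);
    assert!(loss.is_finite() && loss >= 0.0);
}
```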
---
## 5. Performance Targets
### 5.1 Latency Targets
| Operation | Input Size | p50 | p95 | p99 |
|-----------|------------|-----|-----|-----|
| Scaled Dot-Product | 100 neighbors | <5ms | <10ms | <20ms |
| Scaled Dot-Product | 1K neighbors | <50ms | <100ms | <150ms |
| Multi-Head (4 heads) | 100 neighbors | <10ms | <20ms | <30ms |
| Multi-Head (4 heads) | 1K neighbors | <80ms | <150ms | <200ms |
| Hyperbolic | 100 neighbors | <15ms | <30ms | <50ms |
| Sparse (Local+Global) | 1K neighbors | <30ms | <60ms | <100ms |
| Flash Attention | 1K neighbors | <40ms | <80ms | <120ms |
| MoE (4 experts, top-2) | 1K neighbors | <100ms | <180ms | <250ms |
**Measurement Method**: Criterion.rs benchmarks with 1000 iterations, warm cache
---
### 5.2 Throughput Targets
| Mechanism | Target (ops/sec) | Stretch (ops/sec) |
|-----------|------------------|-------------------|
| Scaled Dot-Product | 10,000 | 20,000 |
| Multi-Head | 5,000 | 10,000 |
| Hyperbolic | 3,000 | 6,000 |
| Sparse | 8,000 | 15,000 |
| Flash | 7,000 | 12,000 |
**Measurement**: Batch processing of 1000 operations, averaged over 10 runs
---
### 5.3 Memory Targets
| Mechanism | Peak Memory (1K neighbors) | Target | Stretch |
|-----------|---------------------------|--------|---------|
| Scaled Dot-Product | Full attention matrix | <50MB | <25MB |
| Multi-Head (4 heads) | 4× attention matrices | <100MB | <50MB |
| Flash Attention | Tiled computation | <20MB | <10MB |
| Sparse | Sparse patterns only | <15MB | <8MB |
**Measurement**: Valgrind/heaptrack during benchmark execution
---
### 5.4 Compilation Targets
| Configuration | Target | Stretch |
|---------------|--------|---------|
| Debug build | <10s | <5s |
| Release build (--release) | <60s | <30s |
| Release with LTO | <120s | <60s |
| WASM build | <90s | <45s |
**Measurement**: CI build times on GitHub Actions standard runners
---
### 5.5 Binary Size Targets
| Artifact | Typical Size (uncompressed) | Typical Size (gzipped) | Target | Stretch |
|--------|---------------------|----------------|--------|---------|
| WASM | 5-8 MB | 1.5-2 MB | <2MB | <1MB |
| Native (Linux x86_64) | 10-15 MB | N/A | <15MB | <10MB |
| NAPI-RS addon | 8-12 MB | N/A | <12MB | <8MB |
**Measurement**: `wasm-opt` for WASM, `strip` for native
---
### 5.6 Scalability Targets
**Linear Scaling**:
- Operations should scale O(n) or better up to 10K neighbors
- No quadratic blowup in standard use cases
**Benchmark**:
```rust
#[bench]
fn bench_scalability_attention(b: &mut Bencher) {
for n in [100, 500, 1000, 5000, 10000] {
let attn = ScaledDotProductAttention::new(128).unwrap();
let query = vec![1.0; 128];
let keys = vec![vec![1.0; 128]; n];
let values = keys.clone();
let start = Instant::now();
b.iter(|| attn.forward(&query, &keys, &values));
let elapsed = start.elapsed();
println!("n={}: {:?}", n, elapsed);
// Assert linear or sub-quadratic scaling
}
}
```
---
## 6. Compatibility Matrix
### 6.1 Rust Version Support
| Rust Version | Support Status | Notes |
|--------------|----------------|-------|
| 1.77.0 (MSRV) | ✅ Supported | Minimum supported version |
| 1.78.x | ✅ Supported | |
| 1.79.x | ✅ Supported | |
| 1.80.x+ | ✅ Supported | Latest stable |
| Nightly | ⚠️ Best-effort | May use unstable features behind flags |
**Testing**: CI runs on MSRV, stable, and nightly
---
### 6.2 Platform Support
#### Desktop Platforms
| Platform | Tier | Support Status | CI Testing |
|----------|------|----------------|------------|
| Linux x86_64 | Tier 1 | ✅ Full support | Yes |
| Linux ARM64 | Tier 2 | ✅ Full support | Yes |
| macOS x86_64 | Tier 1 | ✅ Full support | Yes |
| macOS ARM64 (M1/M2) | Tier 1 | ✅ Full support | Yes |
| Windows x86_64 | Tier 1 | ✅ Full support | Yes |
| Windows ARM64 | Tier 3 | ⚠️ Best-effort | No |
#### WASM Targets
| Target | Support Status | Notes |
|--------|----------------|-------|
| wasm32-unknown-unknown | ✅ Full support | Browser + Node.js |
| wasm32-wasi | ✅ Full support | WASI runtime |
| wasm32-unknown-emscripten | ⚠️ Untested | Should work |
**WASM Features**:
- ✅ All attention mechanisms
- ✅ SIMD support (where available)
- ✅ Multi-threading via Web Workers
- ❌ File I/O (not needed)
#### Mobile Platforms
| Platform | Support Status | Notes |
|----------|----------------|-------|
| iOS ARM64 | ⚠️ Untested | Should work via FFI |
| Android ARM64 | ⚠️ Untested | Should work via FFI |
---
### 6.3 Node.js Support (NAPI-RS)
| Node.js Version | Support Status | Notes |
|-----------------|----------------|-------|
| 18.x LTS | ✅ Supported | NAPI-RS requires N-API 9+ |
| 20.x LTS | ✅ Supported | Recommended |
| 21.x+ Current | ✅ Supported | Latest features |
**NAPI-RS Features**:
- ✅ All attention mechanisms exposed
- ✅ TypeScript definitions
- ✅ Async operations (Tokio runtime)
- ✅ Buffer zero-copy where possible
**Package Platforms**:
```json
{
  "napi": {
    "triples": {
      "defaults": true,
      "additional": [
        "x86_64-unknown-linux-musl",
        "aarch64-unknown-linux-gnu",
        "aarch64-apple-darwin",
        "x86_64-pc-windows-msvc"
      ]
    }
  }
}
```
---
### 6.4 Feature Flags
| Feature | Default | Description | Dependencies |
|---------|---------|-------------|--------------|
| `std` | ✅ | Standard library support | None |
| `simd` | ❌ | SIMD optimizations | `std` |
| `rayon` | ❌ | Parallel processing | `std`, `rayon` |
| `compression` | ❌ | Tensor compression | `std` |
| `wasm` | ❌ | WASM-specific bindings | `wasm-bindgen` |
| `napi` | ❌ | Node.js bindings | `napi-rs` |
| `cli` | ❌ | CLI interface | `std`, `clap` |
| `serde` | ✅ | Serialization support | `serde` |
**Example**:
```toml
[dependencies]
ruvector-attention = { version = "0.1", features = ["simd", "rayon"] }
```
---
## 7. Testing Strategy
### 7.1 Unit Tests
**Coverage Target**: >90% line coverage, >95% branch coverage
**Test Categories**:
#### 7.1.1 Correctness Tests
```rust
#[cfg(test)]
mod correctness_tests {
use super::*;
#[test]
fn test_attention_output_dimension() {
let attn = ScaledDotProductAttention::new(128).unwrap();
let output = attn.forward(&query, &keys, &values).unwrap();
assert_eq!(output.len(), 128);
}
#[test]
fn test_attention_weights_sum_to_one() {
let attn = ScaledDotProductAttention::new(128).unwrap();
let weights = attn.attention_weights(&query, &keys).unwrap();
let sum: f32 = weights.iter().sum();
assert!((sum - 1.0).abs() < 1e-5);
}
#[test]
fn test_empty_neighbors_handling() {
let attn = ScaledDotProductAttention::new(128).unwrap();
let result = attn.forward(&query, &[], &[]);
assert!(result.is_err());
assert!(matches!(result, Err(AttentionError::EmptyInput { .. })));
}
}
```
#### 7.1.2 Numerical Stability Tests
```rust
#[cfg(test)]
mod stability_tests {
#[test]
fn test_large_scores_softmax() {
// Test softmax with very large scores (overflow risk)
let scores = vec![1000.0, 999.0, 998.0];
let weights = softmax(&scores);
assert!(weights.iter().all(|&w| w.is_finite()));
}
#[test]
fn test_small_scores_softmax() {
// Test softmax with very small scores (underflow risk)
let scores = vec![-1000.0, -999.0, -998.0];
let weights = softmax(&scores);
assert!(weights.iter().all(|&w| w.is_finite()));
}
#[test]
fn test_hyperbolic_boundary() {
let attn = HyperbolicAttention::new(128, -1.0).unwrap();
let query = vec![0.088; 128]; // ||query|| ≈ 0.995, just inside the ball boundary
let output = attn.forward(&query, &keys, &values).unwrap();
// Output must stay inside ball
assert!(l2_norm(&output) < 1.0);
}
}
```
#### 7.1.3 Edge Case Tests
```rust
#[cfg(test)]
mod edge_case_tests {
    use super::*;

    #[test]
    fn test_single_neighbor() {
        let attn = ScaledDotProductAttention::new(128).unwrap();
        let query = vec![1.0; 128];
        let keys = vec![vec![1.0; 128]];
        // With a single neighbor, the attention weight must be 1.0
        let weights = attn.attention_weights(&query, &keys).unwrap();
        assert!((weights[0] - 1.0).abs() < 1e-6);
        let output = attn.forward(&query, &keys, &keys).unwrap();
        assert_eq!(output.len(), 128);
    }

    #[test]
    fn test_identical_keys() {
        // All keys identical -> uniform attention
        let attn = ScaledDotProductAttention::new(128).unwrap();
        let query = vec![1.0; 128];
        let keys = vec![vec![1.0; 128]; 10];
        let weights = attn.attention_weights(&query, &keys).unwrap();
        for w in &weights {
            assert!((w - 0.1).abs() < 1e-5); // 1/10
        }
    }

    #[test]
    fn test_zero_vectors() {
        let attn = ScaledDotProductAttention::new(128).unwrap();
        let query = vec![0.0; 128];
        let keys = vec![vec![0.0; 128]; 10];
        // Should handle gracefully (may return an error or uniform weights)
        let _result = attn.forward(&query, &keys, &keys);
    }
}
```
---
### 7.2 Integration Tests
**Goal**: Test interactions between modules
#### 7.2.1 Multi-Mechanism Pipeline
```rust
#[test]
fn test_moe_with_multiple_experts() {
let moe = MoEAttention::builder()
.add_local_expert(128)
.add_hyperbolic_expert(128, -1.0)
.add_sparse_expert(10, 5)
.top_k(2)
.build()
.unwrap();
let output = moe.forward(&query, &keys, &values).unwrap();
assert_eq!(output.len(), 128);
}
```
#### 7.2.2 Graph Attention with HNSW
```rust
#[test]
fn test_graph_rope_with_hnsw_layers() {
let rope = GraphRoPE::new(128, 10000.0).unwrap();
let metadata = GraphMetadata {
distances: Some(vec![1.0, 2.0, 3.0]),
hnsw_layers: Some(vec![0, 1, 2]),
..Default::default()
};
let output = rope.forward_with_metadata(
&query, &keys, &values, &metadata
).unwrap();
assert_eq!(output.len(), 128);
}
```
---
### 7.3 Property-Based Tests
**Tool**: `proptest`
```rust
use proptest::prelude::*;
proptest! {
#[test]
fn prop_attention_weights_normalized(
query in prop::collection::vec(-10.0f32..10.0, 128),
keys in prop::collection::vec(
prop::collection::vec(-10.0f32..10.0, 128),
1..100
)
) {
let attn = ScaledDotProductAttention::new(128).unwrap();
let weights = attn.attention_weights(&query, &keys).unwrap();
let sum: f32 = weights.iter().sum();
prop_assert!((sum - 1.0).abs() < 1e-4);
}
#[test]
fn prop_attention_output_finite(
query in prop::collection::vec(-100.0f32..100.0, 128),
keys in prop::collection::vec(
prop::collection::vec(-100.0f32..100.0, 128),
1..100
)
) {
let attn = ScaledDotProductAttention::new(128).unwrap();
let values = keys.clone();
let output = attn.forward(&query, &keys, &values).unwrap();
prop_assert!(output.iter().all(|&x| x.is_finite()));
}
}
```
---
### 7.4 Benchmark Tests
**Tool**: `criterion`
```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};
fn bench_scaled_dot_product(c: &mut Criterion) {
let attn = ScaledDotProductAttention::new(128).unwrap();
let query = vec![1.0; 128];
let keys = vec![vec![1.0; 128]; 1000];
let values = keys.clone();
c.bench_function("scaled_dot_product_1k", |b| {
b.iter(|| {
attn.forward(
black_box(&query),
black_box(&keys),
black_box(&values)
)
})
});
}
fn bench_multi_head(c: &mut Criterion) {
let mut group = c.benchmark_group("multi_head_attention");
for num_heads in [1, 2, 4, 8] {
let attn = MultiHeadAttention::builder(128, num_heads)
.build()
.unwrap();
group.bench_function(format!("heads_{}", num_heads), |b| {
b.iter(|| {
attn.forward(
black_box(&query),
black_box(&keys),
black_box(&values)
)
})
});
}
group.finish();
}
criterion_group!(benches, bench_scaled_dot_product, bench_multi_head);
criterion_main!(benches);
```
---
### 7.5 Fuzzing
**Tool**: `cargo-fuzz`
```rust
#![no_main]
use libfuzzer_sys::fuzz_target;
use ruvector_attention::core::ScaledDotProductAttention;
use ruvector_attention::Attention;
fuzz_target!(|data: &[u8]| {
if data.len() < 512 {
return;
}
// Parse fuzzer input into query, keys, values
let query: Vec<f32> = data[0..128]
.chunks(4)
.map(|chunk| f32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]))
.collect();
// ... similar for keys and values
let attn = ScaledDotProductAttention::new(32).unwrap();
// Fuzz target: should never panic
let _ = attn.forward(&query, &keys, &values);
});
```
---
### 7.6 WASM Tests
```rust
#[cfg(target_arch = "wasm32")]
#[cfg(test)]
mod wasm_tests {
use wasm_bindgen_test::*;
#[wasm_bindgen_test]
fn test_attention_in_wasm() {
let attn = ScaledDotProductAttention::new(128).unwrap();
let output = attn.forward(&query, &keys, &values).unwrap();
assert_eq!(output.len(), 128);
}
#[wasm_bindgen_test]
fn test_simd_in_wasm() {
#[cfg(feature = "simd")]
{
// Test WASM SIMD operations
}
}
}
```
---
### 7.7 Performance Regression Tests
**CI Check**: Fail if performance degrades >5% from baseline
```rust
#[test]
fn test_performance_regression() {
let baseline_latency_ms = 100.0; // From previous run
let start = Instant::now();
let attn = ScaledDotProductAttention::new(128).unwrap();
for _ in 0..1000 {
attn.forward(&query, &keys, &values).unwrap();
}
let elapsed = start.elapsed().as_secs_f64() * 1000.0;
let current_latency_ms = elapsed / 1000.0;
let regression = (current_latency_ms - baseline_latency_ms) / baseline_latency_ms;
assert!(
regression < 0.05,
"Performance regression detected: {}%", regression * 100.0
);
}
```
---
## 8. Success Criteria
### 8.1 Quantifiable Metrics
#### 8.1.1 Functional Completeness
- [ ] **10/10 attention mechanisms implemented** (100%)
- [ ] **All mechanisms pass unit tests** (100% pass rate)
- [ ] **Integration tests pass** (100% pass rate)
#### 8.1.2 Performance
- [ ] **Latency**: p95 <200ms @ 1K neighbors for all mechanisms
- [ ] **Throughput**: >5,000 ops/sec for scaled dot-product
- [ ] **Memory**: Peak usage <50MB per operation
- [ ] **Scalability**: Linear or sub-quadratic up to 10K neighbors
#### 8.1.3 Quality
- [ ] **Test coverage**: >90% line coverage
- [ ] **Documentation coverage**: 100% public APIs documented
- [ ] **Zero compiler warnings**: Clippy clean
- [ ] **Zero unsafe code**: Or 100% audited and justified
#### 8.1.4 Compatibility
- [ ] **Platforms**: Linux, macOS, Windows passing CI
- [ ] **WASM**: All tests pass in wasm32-unknown-unknown
- [ ] **NAPI-RS**: Node.js 18+, all platforms published
- [ ] **MSRV**: Rust 1.77+ supported
#### 8.1.5 Adoption
- [ ] **Examples**: 5+ runnable examples
- [ ] **Documentation**: Getting started guide, API docs, tutorials
- [ ] **Integration**: Used in ruvector-gnn crate
---
### 8.2 Acceptance Tests
#### Phase 1 Acceptance (Weeks 1-4)
```
✅ Core attention mechanisms (scaled dot-product, multi-head)
✅ Unit tests passing (>80% coverage)
✅ Basic benchmarks established
✅ API design finalized
```
#### Phase 2 Acceptance (Weeks 5-8)
```
✅ Geometric attention (hyperbolic, edge-featured)
✅ Integration tests with graph structures
✅ Performance targets met for core mechanisms
✅ WASM compatibility verified
```
#### Phase 3 Acceptance (Weeks 9-12)
```
✅ Sparse mechanisms (flash, linear, local+global)
✅ Memory targets met
✅ NAPI-RS bindings complete
✅ Documentation 50% complete
```
#### Phase 4 Acceptance (Weeks 13-16)
```
✅ Adaptive mechanisms (MoE, cross-attention)
✅ Training utilities complete
✅ CLI interface functional
✅ All performance targets met
```
#### Phase 5 Acceptance (Weeks 17-20)
```
✅ Full integration with ruvector-gnn
✅ Documentation 100% complete
✅ Optimization passes complete
✅ Ready for 1.0 release
```
---
### 8.3 Release Criteria (v1.0)
**Blocker Issues** (must fix before release):
- [ ] Zero failing tests
- [ ] Zero compiler warnings
- [ ] All performance targets met
- [ ] 100% public API documented
- [ ] Security audit complete
- [ ] Cross-platform CI passing
**Nice-to-Have** (can defer to 1.1):
- [ ] GPU acceleration (CUDA/Metal)
- [ ] Additional attention variants
- [ ] Advanced SIMD optimizations
- [ ] Distributed attention
---
## 9. Constraints and Dependencies
### 9.1 Technical Constraints
#### C-001: No GPU Dependency
**Constraint**: All implementations must run on CPU
**Rationale**: WASM and NAPI-RS environments lack GPU access
**Impact**: May limit performance for very large graphs
**Mitigation**: SIMD optimizations, algorithm choice (sparse/linear attention)
#### C-002: Memory Constraints in WASM
**Constraint**: WASM has limited memory (typically 2-4GB)
**Rationale**: Browser and Node.js WASM environments
**Impact**: Cannot materialize large attention matrices
**Mitigation**: Flash Attention, sparse patterns, streaming computation
#### C-003: Serialization Requirements
**Constraint**: All types must be serializable (serde)
**Rationale**: Model saving/loading, network transfer
**Impact**: Design complexity, trait object limitations
**Mitigation**: Enum-based polymorphism, careful trait design
---
### 9.2 Dependencies
#### Core Dependencies
```toml
[dependencies]
# Math and numerics
ndarray = { version = "0.16", default-features = false }
rand = { version = "0.8", default-features = false }
rand_distr = { version = "0.4", default-features = false }
# Serialization
serde = { version = "1.0", features = ["derive"], optional = true }
rkyv = { version = "0.8", optional = true }
# Error handling
thiserror = "2.0"
# Optional: SIMD
simsimd = { version = "5.9", optional = true, features = ["nightly"] }
# Optional: Parallel processing
rayon = { version = "1.10", optional = true }
# Optional: WASM
wasm-bindgen = { version = "0.2", optional = true }
js-sys = { version = "0.3", optional = true }
# Optional: NAPI-RS
napi = { version = "2.16", optional = true }
napi-derive = { version = "2.16", optional = true }
```
**Dependency Audit**: All dependencies must be MIT/Apache-2.0 licensed
---
### 9.3 Integration Dependencies
#### Downstream (Used By)
- `ruvector-gnn`: Uses attention mechanisms in GNN layers
- `ruvector-graph`: Graph construction with attention-based edge selection
#### Upstream (Depends On)
- `ruvector-core`: Core vector operations, distance metrics
- `hnsw_rs`: HNSW graph structure (optional, for examples)
---
## 10. Risk Assessment
### 10.1 Technical Risks
| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| **Hyperbolic numerical instability** | High | Medium | Careful boundary handling, epsilon clipping, extensive testing |
| **WASM performance degradation** | Medium | High | WASM SIMD, algorithmic optimizations, benchmarking |
| **Memory bloat in large graphs** | Medium | High | Flash Attention, sparse patterns, streaming |
| **API breaking changes** | Low | High | Careful API design, SemVer, deprecation warnings |
| **Dependency conflicts** | Low | Medium | Minimal dependencies, version pinning |
---
### 10.2 Schedule Risks
| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| **Hyperbolic implementation complexity** | Medium | Medium | 20% buffer time, fallback to Euclidean |
| **Performance targets not met** | Low | High | Early benchmarking, iterative optimization |
| **WASM/NAPI-RS compatibility issues** | Low | Medium | Early CI setup, continuous testing |
**Buffer**: 20% time buffer in each phase for unexpected issues
---
### 10.3 Operational Risks
| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| **CI infrastructure failures** | Low | Low | GitHub Actions redundancy, local testing |
| **Documentation drift** | Medium | Medium | Doc tests, CI doc generation checks |
| **Contributor onboarding difficulty** | Medium | Low | Comprehensive docs, clear examples |
---
## 11. Open Questions
### 11.1 Design Questions
**Q1**: Should we support dynamic mechanism selection at runtime?
**Options**:
- A) Enum-based (`AttentionMechanism::ScaledDotProduct`)
- B) Trait objects (`Box<dyn Attention>`)
- C) Both
**Q2**: How to handle attention visualization?
**Options**:
- A) Return attention weights separately
- B) Integrate with vis library (e.g., `plotters`)
- C) Export to JSON for external tools
**Q3**: Should we support distributed attention computation?
**Options**:
- A) In-crate via `rayon`
- B) External crate (e.g., `ruvector-attention-distributed`)
- C) Defer to v2.0
---
### 11.2 API Questions
**Q4**: Naming convention for attention mechanisms?
**Options**:
- A) Descriptive (`ScaledDotProductAttention`)
- B) Abbreviated (`SDPAttention`)
- C) Mixed (long in code, short in docs)
**Q5**: Should builders be mandatory or optional?
**Options**:
- A) Mandatory (always use builder)
- B) Optional (provide `new()` for defaults)
- C) Hybrid (simple types use `new()`, complex use builder)
---
## Appendix A: Glossary
| Term | Definition |
|------|------------|
| **Attention** | Mechanism for weighted aggregation based on learned similarities |
| **Scaled Dot-Product** | `Attention(Q,K,V) = softmax(QK^T/√d) V` |
| **Multi-Head** | Parallel attention mechanisms with different projections |
| **Hyperbolic** | Non-Euclidean geometry with negative curvature |
| **Poincaré Ball** | Model of hyperbolic space as unit ball |
| **GAT** | Graph Attention Networks |
| **RoPE** | Rotary Position Embeddings |
| **Flash Attention** | Memory-efficient tiled attention computation |
| **MoE** | Mixture of Experts (learned routing between mechanisms) |
| **InfoNCE** | Contrastive loss function |
| **HNSW** | Hierarchical Navigable Small World graphs |
---
## Appendix B: References
### Research Papers
1. **Attention Mechanism**: Vaswani et al. (2017) - "Attention Is All You Need"
2. **GAT**: Veličković et al. (2018) - "Graph Attention Networks"
3. **Hyperbolic**: Chami et al. (2019) - "Hyperbolic Graph Convolutional Neural Networks"
4. **Flash Attention**: Dao et al. (2022) - "FlashAttention: Fast and Memory-Efficient Exact Attention"
5. **Performer**: Choromanski et al. (2020) - "Rethinking Attention with Performers"
6. **RoPE**: Su et al. (2021) - "RoFormer: Enhanced Transformer with Rotary Position Embedding"
7. **MoE**: Shazeer et al. (2017) - "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer"
### RuVector Research Documents
- `/docs/latent-space/attention-mechanisms-research.md`
- `/docs/latent-space/gnn-architecture-analysis.md`
- `/docs/latent-space/optimization-strategies.md`
- `/docs/latent-space/implementation-roadmap.md`
### External Resources
- [Rust WASM Book](https://rustwasm.github.io/book/)
- [NAPI-RS Documentation](https://napi.rs/)
- [Criterion.rs Guide](https://bheisler.github.io/criterion.rs/book/)
- [Proptest Book](https://proptest-rs.github.io/proptest/)
---
## Document History
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0.0 | 2025-11-30 | RuVector Team | Initial specification |
---
## Approvals
| Role | Name | Signature | Date |
|------|------|-----------|------|
| **Technical Lead** | | | |
| **Architecture Review** | | | |
| **QA Lead** | | | |
| **Product Owner** | | | |
---
**END OF SPECIFICATION**
This document represents the complete specification for the `ruvector-attention` crate. Implementation should proceed according to the SPARC methodology:
- **S**pecification ✅ (this document)
- **P**seudocode (next phase)
- **A**rchitecture (detailed design)
- **R**efinement (iterative TDD)
- **C**ompletion (integration and release)