# RuvLLM: Integration and Deployment
## SPARC Phase 5: Completion
---
## 1. Integration Strategy
### 1.1 Crate Structure
```
ruvector/
├── crates/
│   ├── ruvector-core/           # Existing: Vector DB
│   ├── ruvector-gnn/            # Existing: GNN + EWC + Replay
│   ├── ruvector-attention/      # Existing: Attention mechanisms
│   ├── ruvector-graph/          # Existing: Graph storage
│   └── ruvector-router-core/    # Existing: Routing primitives
└── examples/
    └── ruvLLM/                  # NEW: Self-learning LLM
        ├── src/
        │   ├── lib.rs           # Main library entry
        │   ├── orchestrator.rs  # Request orchestration
        │   ├── embedding.rs     # LFM2 embedding service
        │   ├── router.rs        # FastGRNN router
        │   ├── memory.rs        # Ruvector memory layer
        │   ├── attention.rs     # Graph attention wrapper
        │   ├── inference.rs     # LFM2 model pool
        │   ├── learning.rs      # Self-learning service
        │   ├── compression.rs   # Concept abstraction
        │   ├── config.rs        # Configuration
        │   ├── types.rs         # Core types
        │   └── error.rs         # Error handling
        ├── tests/
        │   ├── unit/
        │   └── integration/
        ├── benches/
        ├── config/
        └── docs/                # SPARC documentation
```
### 1.2 Dependency Integration
```toml
# examples/ruvLLM/Cargo.toml
[package]
name = "ruvllm"
version = "0.1.0"
edition = "2021"
description = "Self-learning LLM with LFM2 and Ruvector integration"
[dependencies]
# Internal dependencies (path-based for development)
ruvector-core = { path = "../../crates/ruvector-core" }
ruvector-gnn = { path = "../../crates/ruvector-gnn" }
ruvector-attention = { path = "../../crates/ruvector-attention" }
ruvector-graph = { path = "../../crates/ruvector-graph" }
ruvector-router-core = { path = "../../crates/ruvector-router-core" }
# LLM inference
llama-cpp-rs = "0.3" # CPU inference via llama.cpp
tokenizers = "0.15" # Fast tokenization
# Async runtime
tokio = { version = "1.41", features = ["full"] }
futures = "0.3"
# Serialization
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
bincode = "2.0.0-rc.3"
# Numerics
ndarray = { version = "0.16", features = ["serde"] }
rand = "0.8"
# Utilities
uuid = { version = "1.11", features = ["v4", "serde"] }
chrono = { version = "0.4", features = ["serde"] }
thiserror = "2.0"
anyhow = "1.0"
tracing = "0.1"
# Performance
dashmap = "6.1"
parking_lot = "0.12"
lru = "0.12"
# Metrics
prometheus = "0.13"
[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports"] }
proptest = "1.5"
tokio-test = "0.4"
tempfile = "3.13"
tracing-subscriber = "0.3"
[features]
default = ["cpu"]
cpu = [] # llama.cpp CPU inference
gpu = ["vllm"] # vLLM GPU inference (optional)
vllm = []
[[bench]]
name = "pipeline"
harness = false
[[bench]]
name = "router"
harness = false
[[bench]]
name = "memory"
harness = false
```
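Because each `[[bench]]` target sets `harness = false`, every bench file must supply its own Criterion entry point. A minimal sketch for `benches/router.rs` follows; the dot-product workload is a self-contained stand-in, since the router's forward-pass API is not yet fixed:

```rust
// benches/router.rs — minimal Criterion harness sketch.
// The workload below is a placeholder for a FastGRNN forward pass.
use std::hint::black_box;

use criterion::{criterion_group, criterion_main, Criterion};

fn bench_router(c: &mut Criterion) {
    // Hypothetical 64-dim feature vector, matching the router's
    // hidden_dim in the server configs below.
    let features: Vec<f32> = vec![0.5; 64];

    c.bench_function("router_forward", |b| {
        b.iter(|| {
            // Stand-in computation; replace with the real forward
            // pass once the router API lands.
            black_box(&features).iter().map(|x| x * x).sum::<f32>()
        })
    });
}

criterion_group!(benches, bench_router);
criterion_main!(benches);
```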
### 1.3 API Surface
```rust
//! # RuvLLM - Self-Learning LLM
//!
//! A self-learning language model system integrating LFM2 with Ruvector.
//!
//! ## Architecture
//!
//! - **LFM2**: Frozen reasoning engine (350M-2.6B parameters)
//! - **Ruvector**: Living memory that adapts continuously
//! - **FastGRNN**: Control circuit for intelligent routing
//!
//! ## Quick Start
//!
//! ```rust,ignore
//! use ruvllm::{RuvLLM, Config};
//!
//! #[tokio::main]
//! async fn main() -> Result<()> {
//!     // Initialize system
//!     let config = Config::builder()
//!         .db_path("./memory.db")
//!         .model_path_350m("./models/lfm2-350m-q4.gguf")
//!         .model_path_700m("./models/lfm2-700m-q4.gguf")
//!         .build()?;
//!
//!     let llm = RuvLLM::new(config).await?;
//!
//!     // Process query
//!     let response = llm.query("What is machine learning?").await?;
//!     println!("Response: {}", response.text);
//!     println!("Confidence: {:.2}", response.confidence);
//!
//!     Ok(())
//! }
//! ```
//!
//! ## Self-Learning Loops
//!
//! The system learns through three feedback loops:
//!
//! 1. **Memory Growth**: Every interaction strengthens/weakens graph edges
//! 2. **Router Learning**: FastGRNN learns optimal model selection
//! 3. **Compression**: Periodic summarization creates concept hierarchies
pub mod attention;
pub mod compression;
pub mod config;
pub mod embedding;
pub mod error;
pub mod inference;
pub mod learning;
pub mod memory;
pub mod orchestrator;
pub mod router;
pub mod types;
// Re-exports for convenience
pub use config::{Config, ConfigBuilder};
pub use error::{Error, Result};
pub use orchestrator::RuvLLM;
pub use types::{Request, Response, Session};
/// Library version
pub const VERSION: &str = env!("CARGO_PKG_VERSION");
```
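The `Request`, `Response`, `Constraints`, and `Feedback` types referenced here and in the SDK examples of Section 6.2 are not defined in this document; a plausible sketch of `types.rs`, with field names inferred from those examples rather than taken from the codebase:

```rust
// src/types.rs — sketch only; field names are inferred from the SDK
// examples in Section 6.2 and may not match the real definitions.
use serde::{Deserialize, Serialize};
use uuid::Uuid;

/// Per-request generation constraints.
#[derive(Debug, Clone, Default, Serialize, Deserialize)]
pub struct Constraints {
    pub max_latency_ms: Option<u64>,
    pub max_tokens: Option<usize>,
    pub temperature: Option<f32>,
}

/// An incoming query, optionally bound to a session.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Request {
    pub query: String,
    pub session_id: Option<String>,
    pub constraints: Constraints,
}

/// A generated answer with its routing confidence.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Response {
    pub request_id: Uuid,
    pub text: String,
    pub confidence: f32,
}

/// User feedback tied back to a specific request.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Feedback {
    pub request_id: Uuid,
    pub rating: u8,
    pub correction: Option<String>,
}
```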
---
## 2. Implementation Checklist
### 2.1 Core Components
```
Phase 1: Foundation
━━━━━━━━━━━━━━━━━━━━
[x] Project structure setup
[x] Cargo.toml with dependencies
[ ] Error types definition
[ ] Configuration system
[ ] Core types (Request, Response, Session)
Phase 2: Services
━━━━━━━━━━━━━━━━━━
[ ] EmbeddingService
    [ ] LFM2 encoder wrapper
    [ ] Dimension projection
    [ ] Tokenization
    [ ] Batch processing
[ ] MemoryService
    [ ] VectorDB initialization
    [ ] GraphStore integration
    [ ] HNSW search wrapper
    [ ] Graph expansion
    [ ] Writeback queue
[ ] FastGRNNRouter
    [ ] Cell implementation
    [ ] Sparse matrix operations
    [ ] Low-rank matrices
    [ ] Output heads
    [ ] Training loop
[ ] GraphAttentionEngine
    [ ] Attention layer wrapper
    [ ] Edge feature encoding
    [ ] Multi-head aggregation
    [ ] Context ranking
[ ] InferencePool
    [ ] Model loading
    [ ] Lazy initialization
    [ ] KV cache management
    [ ] LRU eviction
[ ] LearningService
    [ ] Quality judge
    [ ] Replay buffer
    [ ] EWC integration
    [ ] Background training
    [ ] Compression jobs

Phase 3: Orchestration
━━━━━━━━━━━━━━━━━━━━━━
[ ] Orchestrator
    [ ] Request routing
    [ ] Session management
    [ ] Pipeline coordination
    [ ] Metrics collection
    [ ] Error handling
Phase 4: Integration
━━━━━━━━━━━━━━━━━━━━
[ ] Integration tests
[ ] Benchmark suite
[ ] Example applications
[ ] Documentation
```
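For the Phase 1 item "Error types definition", a starting point using the `thiserror` dependency declared in Section 1.2 could look like this; the variant set is an assumption based on the services listed above:

```rust
// src/error.rs — sketch of the error enum, using thiserror 2.x.
// Variants mirror the service breakdown in the checklist.
use thiserror::Error;

#[derive(Debug, Error)]
pub enum Error {
    #[error("embedding failed: {0}")]
    Embedding(String),

    #[error("memory operation failed: {0}")]
    Memory(String),

    #[error("router error: {0}")]
    Router(String),

    #[error("inference failed: {0}")]
    Inference(String),

    #[error("configuration error: {0}")]
    Config(String),

    #[error(transparent)]
    Io(#[from] std::io::Error),
}

/// Crate-wide result alias, re-exported from lib.rs.
pub type Result<T> = std::result::Result<T, Error>;
```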
### 2.2 Test Coverage Requirements
| Component | Unit Tests | Integration | Benchmark |
|-----------|------------|-------------|-----------|
| Embedding | 15+ | 3+ | 2 |
| Memory | 20+ | 5+ | 3 |
| Router | 25+ | 5+ | 2 |
| Attention | 15+ | 3+ | 2 |
| Inference | 10+ | 3+ | 2 |
| Learning | 20+ | 5+ | 1 |
| Orchestrator | 10+ | 5+ | 2 |
| **Total** | **115+** | **29+** | **14** |
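These counts are floors, not exact targets. As one example of the property-based style that `proptest` (in the dev-dependencies of Section 1.2) enables, a bounds check over a hypothetical confidence-squashing helper; `squash` is a placeholder for the router's real confidence head:

```rust
// tests/unit/router_props.rs — sketch of a property test.
// `squash` is a hypothetical stand-in; the real suite would test
// the router's actual confidence function.
use proptest::prelude::*;

fn squash(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

proptest! {
    #[test]
    fn confidence_stays_in_unit_interval(x in -1000.0f32..1000.0) {
        let c = squash(x);
        prop_assert!((0.0..=1.0).contains(&c));
    }
}
```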
---
## 3. Deployment Configurations
### 3.1 Edge Deployment (Raspberry Pi / Mobile)
```toml
# config/edge.toml
[system]
device_class = "edge"
max_memory_mb = 2048
max_concurrent_requests = 2
[embedding]
model = "onnx" # ONNX for portability
dimension = 384
batch_size = 1
[memory]
hnsw_m = 16
hnsw_ef_construction = 100
hnsw_ef_search = 32
max_nodes = 100_000
[router]
hidden_dim = 32
sparsity = 0.95
confidence_threshold = 0.6
[inference]
models = ["350m"]
quantization = "q4_k"
max_context = 1024
max_loaded_models = 1
[learning]
enabled = true
quality_threshold = 0.8
replay_capacity = 1000
training_interval_ms = 300_000 # 5 minutes
```
### 3.2 Server Deployment (CPU)
```toml
# config/server-cpu.toml
[system]
device_class = "server"
max_memory_mb = 16384
max_concurrent_requests = 20
[embedding]
model = "lfm2-encoder"
dimension = 768
batch_size = 8
[memory]
hnsw_m = 32
hnsw_ef_construction = 200
hnsw_ef_search = 64
max_nodes = 10_000_000
[router]
hidden_dim = 64
sparsity = 0.9
confidence_threshold = 0.7
[inference]
models = ["700m", "1.2b", "2.6b"]
quantization = "q5_k"
max_context = 4096
max_loaded_models = 2
[learning]
enabled = true
quality_threshold = 0.75
replay_capacity = 100_000
training_interval_ms = 60_000 # 1 minute
```
### 3.3 Server Deployment (GPU)
```toml
# config/server-gpu.toml
[system]
device_class = "gpu"
max_memory_mb = 32768
max_concurrent_requests = 100
[embedding]
model = "lfm2-encoder"
dimension = 1024
batch_size = 32
[memory]
hnsw_m = 48
hnsw_ef_construction = 300
hnsw_ef_search = 128
max_nodes = 100_000_000
[router]
hidden_dim = 64
sparsity = 0.85
confidence_threshold = 0.75
[inference]
models = ["1.2b", "2.6b"]
quantization = "fp16"
max_context = 8192
max_loaded_models = 2
use_vllm = true
tensor_parallel = 1
[learning]
enabled = true
quality_threshold = 0.7
replay_capacity = 1_000_000
training_interval_ms = 30_000 # 30 seconds
```
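These profiles can be deserialized straight into typed config structs. A minimal sketch, assuming a `toml` crate dependency is added alongside `serde` (only a subset of the sections above is modeled here):

```rust
// src/config.rs (sketch) — loading a deployment profile.
// Assumes `toml = "0.8"` is added to Cargo.toml; the struct subset
// shown is illustrative, not the full Config type.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
pub struct SystemConfig {
    pub device_class: String,
    pub max_memory_mb: u64,
    pub max_concurrent_requests: usize,
}

#[derive(Debug, Deserialize)]
pub struct InferenceConfig {
    pub models: Vec<String>,
    pub quantization: String,
    pub max_context: usize,
    pub max_loaded_models: usize,
}

#[derive(Debug, Deserialize)]
pub struct Config {
    pub system: SystemConfig,
    pub inference: InferenceConfig,
}

pub fn load(path: &str) -> anyhow::Result<Config> {
    let raw = std::fs::read_to_string(path)?;
    Ok(toml::from_str(&raw)?)
}
```

Underscore separators such as `300_000` are valid TOML integers, so the profiles above deserialize as-is.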
---
## 4. Operational Runbook
### 4.1 Startup Sequence
```bash
#!/bin/bash
# scripts/start.sh
set -e
CONFIG=${1:-"config/server-cpu.toml"}
LOG_LEVEL=${LOG_LEVEL:-"info"}
echo "Starting RuvLLM with config: $CONFIG"
# 1. Validate configuration
cargo run --release --bin ruvllm-validate -- --config "$CONFIG"
# 2. Initialize database if needed
if [ ! -f "data/memory.db" ]; then
echo "Initializing database..."
cargo run --release --bin ruvllm-init -- --config "$CONFIG"
fi
# 3. Download models if needed
cargo run --release --bin ruvllm-models -- --config "$CONFIG" --check-or-download
# 4. Start server
RUST_LOG=$LOG_LEVEL cargo run --release --bin ruvllm-server -- \
--config "$CONFIG" \
--metrics-port 9090 \
--http-port 8080
```
### 4.2 Health Checks
```rust
use std::sync::Arc;

use serde_json::json;

/// Health check endpoint implementation
pub struct HealthCheck {
    memory: Arc<RuvectorMemory>,
    router: Arc<FastGRNNRouter>,
    inference: Arc<InferencePool>,
}

impl HealthCheck {
    pub async fn check(&self) -> HealthStatus {
        let mut status = HealthStatus::default();

        // Check memory service
        status.memory = match self.memory.ping().await {
            Ok(latency) => ComponentHealth::Healthy { latency_ms: latency, details: None },
            Err(e) => ComponentHealth::Unhealthy { error: e.to_string() },
        };

        // Check router
        status.router = match self.router.ping() {
            Ok(latency) => ComponentHealth::Healthy { latency_ms: latency, details: None },
            Err(e) => ComponentHealth::Unhealthy { error: e.to_string() },
        };

        // Check inference (at least one model loadable)
        status.inference = match self.inference.health_check().await {
            Ok(info) => ComponentHealth::Healthy {
                latency_ms: info.latency,
                details: Some(json!({
                    "loaded_models": info.loaded_models,
                    "available_memory": info.available_memory,
                })),
            },
            Err(e) => ComponentHealth::Unhealthy { error: e.to_string() },
        };

        status.overall = if status.all_healthy() {
            OverallHealth::Healthy
        } else if status.any_critical() {
            OverallHealth::Critical
        } else {
            OverallHealth::Degraded
        };

        status
    }
}
```
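The `ComponentHealth`, `HealthStatus`, and `OverallHealth` types used above are not defined in this document; one plausible shape, inferred from the `check()` implementation and sketched here as an assumption:

```rust
// Sketch of the supporting health types; shapes are inferred from
// check() above, not taken from the codebase.
use serde_json::Value;

#[derive(Debug, Clone)]
pub enum ComponentHealth {
    Healthy { latency_ms: f64, details: Option<Value> },
    Unhealthy { error: String },
}

impl Default for ComponentHealth {
    fn default() -> Self {
        ComponentHealth::Unhealthy { error: "not yet checked".into() }
    }
}

#[derive(Debug, Clone, Copy, Default)]
pub enum OverallHealth {
    #[default]
    Degraded,
    Healthy,
    Critical,
}

#[derive(Debug, Default)]
pub struct HealthStatus {
    pub memory: ComponentHealth,
    pub router: ComponentHealth,
    pub inference: ComponentHealth,
    pub overall: OverallHealth,
}

impl HealthStatus {
    pub fn all_healthy(&self) -> bool {
        [&self.memory, &self.router, &self.inference]
            .iter()
            .all(|c| matches!(c, ComponentHealth::Healthy { .. }))
    }

    pub fn any_critical(&self) -> bool {
        // Policy assumption: an unusable inference pool is critical,
        // while degraded memory or routing is not.
        matches!(self.inference, ComponentHealth::Unhealthy { .. })
    }
}
```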
### 4.3 Monitoring Dashboards
```yaml
# Prometheus alerting rules
groups:
  - name: ruvllm
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(ruvllm_request_latency_seconds_bucket[5m]))) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "RuvLLM P95 latency above 1s"
      - alert: LowQualityScore
        expr: avg(ruvllm_quality_score) < 0.7
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Average quality score dropped below 0.7"
      - alert: MemoryPressure
        expr: ruvllm_memory_usage_bytes / ruvllm_memory_limit_bytes > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Memory usage above 90%"
      - alert: RouterLowConfidence
        expr: avg(ruvllm_router_confidence) < 0.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Router confidence consistently low"
      - alert: HighErrorRate
        expr: rate(ruvllm_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Errors occurring at more than 0.1/s"
```
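For these rules to fire, the server has to register and expose the metric names they reference. A sketch of that wiring with the `prometheus` crate from Section 1.2, showing two of the metrics (the others follow the same pattern):

```rust
// Sketch of registering the metrics the alerting rules reference,
// using the prometheus crate (0.13) declared in Cargo.toml.
use prometheus::{
    Encoder, Histogram, HistogramOpts, IntCounter, Registry, TextEncoder,
};

pub struct Metrics {
    pub registry: Registry,
    pub request_latency: Histogram,
    pub errors_total: IntCounter,
}

impl Metrics {
    pub fn new() -> prometheus::Result<Self> {
        let registry = Registry::new();

        let request_latency = Histogram::with_opts(HistogramOpts::new(
            "ruvllm_request_latency_seconds",
            "End-to-end request latency",
        ))?;
        registry.register(Box::new(request_latency.clone()))?;

        let errors_total =
            IntCounter::new("ruvllm_errors_total", "Total request errors")?;
        registry.register(Box::new(errors_total.clone()))?;

        Ok(Self { registry, request_latency, errors_total })
    }

    /// Render the registry in Prometheus text format for /v1/metrics.
    pub fn render(&self) -> prometheus::Result<String> {
        let mut buf = Vec::new();
        TextEncoder::new().encode(&self.registry.gather(), &mut buf)?;
        Ok(String::from_utf8(buf).expect("metrics are valid UTF-8"))
    }
}
```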
### 4.4 Backup and Recovery
```bash
#!/bin/bash
# scripts/backup.sh
BACKUP_DIR="/backups/ruvllm/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$BACKUP_DIR"
echo "Creating backup in $BACKUP_DIR"
# 1. Backup memory database
cp -r data/memory.db "$BACKUP_DIR/memory.db"
# 2. Backup router weights
cp -r data/router_weights.bin "$BACKUP_DIR/router_weights.bin"
# 3. Backup EWC state
cp -r data/ewc_state.bin "$BACKUP_DIR/ewc_state.bin"
# 4. Backup replay buffer
cp -r data/replay_buffer.bin "$BACKUP_DIR/replay_buffer.bin"
# 5. Backup configuration
cp -r config/ "$BACKUP_DIR/config/"
# 6. Create manifest
cat > "$BACKUP_DIR/manifest.json" << EOF
{
  "timestamp": "$(date -Iseconds)",
  "version": "$(cargo run --release --bin ruvllm-version)",
  "components": {
    "memory_db": "memory.db",
    "router_weights": "router_weights.bin",
    "ewc_state": "ewc_state.bin",
    "replay_buffer": "replay_buffer.bin",
    "config": "config/"
  }
}
EOF
echo "Backup complete: $BACKUP_DIR"
# 7. Upload to S3 if configured
if [ -n "$S3_BACKUP_BUCKET" ]; then
    aws s3 sync "$BACKUP_DIR" "s3://$S3_BACKUP_BUCKET/$(basename "$BACKUP_DIR")/"
    echo "Uploaded to S3: $S3_BACKUP_BUCKET"
fi
```
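Recovery should begin by checking that a backup is complete before anything is copied back. A small sketch that validates the `manifest.json` written above, using the `serde_json` and `anyhow` dependencies from Section 1.2 (`verify_backup` itself is a hypothetical helper):

```rust
// Sketch of a pre-restore manifest check; mirrors the manifest.json
// layout written by scripts/backup.sh.
use std::collections::HashMap;
use std::path::Path;

use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Manifest {
    timestamp: String,
    version: String,
    components: HashMap<String, String>,
}

fn verify_backup(dir: &Path) -> anyhow::Result<()> {
    let raw = std::fs::read_to_string(dir.join("manifest.json"))?;
    let manifest: Manifest = serde_json::from_str(&raw)?;

    // Every component the manifest lists must exist on disk.
    for (name, rel_path) in &manifest.components {
        let path = dir.join(rel_path);
        anyhow::ensure!(path.exists(), "missing component {name}: {path:?}");
    }

    println!(
        "backup from {} (version {}) looks complete",
        manifest.timestamp, manifest.version
    );
    Ok(())
}
```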
---
## 5. Production Checklist
### 5.1 Pre-Launch
```
Security
━━━━━━━━
[ ] Input validation and sanitization
[ ] Rate limiting configured
[ ] TLS/HTTPS enabled
[ ] API authentication (if public)
[ ] Secrets in environment variables
[ ] Model integrity verification
Performance
━━━━━━━━━━━
[ ] Load tested to expected traffic
[ ] Memory profiled (no leaks)
[ ] Latency targets met
[ ] Caching configured
[ ] Connection pooling
Reliability
━━━━━━━━━━━
[ ] Health checks implemented
[ ] Graceful shutdown
[ ] Automatic restarts (systemd/k8s)
[ ] Backup procedures tested
[ ] Recovery procedures documented
Observability
━━━━━━━━━━━━━
[ ] Structured logging
[ ] Metrics exported
[ ] Distributed tracing
[ ] Alerting rules configured
[ ] Dashboards created
```
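For the "Graceful shutdown" item above, the usual tokio pattern is to park on a signal future and drain in-flight work before exiting; a minimal sketch in which `Server` and its `shutdown` method are hypothetical placeholders:

```rust
// Sketch of graceful shutdown with tokio. `Server` is a stand-in
// for the real ruvllm server handle.
use tokio::signal;

struct Server;

impl Server {
    /// Stop accepting new requests and drain in-flight work.
    async fn shutdown(&self) {
        // Placeholder: real code would close listeners, flush the
        // writeback queue, and persist router/EWC state.
    }
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let server = Server;

    // Wait for Ctrl-C (SIGINT); under k8s/systemd, SIGTERM handling
    // would be added via tokio::signal::unix.
    signal::ctrl_c().await?;
    tracing::info!("shutdown signal received, draining requests");

    server.shutdown().await;
    Ok(())
}
```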
### 5.2 Post-Launch
```
Daily
━━━━━
[ ] Check error rates
[ ] Review quality scores
[ ] Monitor latency trends
[ ] Verify backup success
Weekly
━━━━━━
[ ] Review router decisions distribution
[ ] Analyze forgetting metrics
[ ] Check memory growth rate
[ ] Run compression job
[ ] Update router weights
Monthly
━━━━━━━
[ ] Full system backup
[ ] Performance benchmark
[ ] Security audit
[ ] Dependency updates
[ ] Evaluate student model candidates
```
---
## 6. API Reference
### 6.1 HTTP API
```yaml
openapi: "3.0.0"
info:
  title: RuvLLM API
  version: "0.1.0"
  description: Self-learning LLM with LFM2 and Ruvector
paths:
  /v1/query:
    post:
      summary: Process a query
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              required:
                - query
              properties:
                query:
                  type: string
                  description: The user query
                session_id:
                  type: string
                  description: Optional session for multi-turn
                constraints:
                  type: object
                  properties:
                    max_latency_ms:
                      type: integer
                    max_tokens:
                      type: integer
                    temperature:
                      type: number
      responses:
        "200":
          description: Successful response
          content:
            application/json:
              schema:
                type: object
                properties:
                  text:
                    type: string
                  confidence:
                    type: number
                  sources:
                    type: array
                    items:
                      type: object
                  routing_info:
                    type: object
  /v1/feedback:
    post:
      summary: Provide feedback on a response
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              required:
                - request_id
              properties:
                request_id:
                  type: string
                rating:
                  type: integer
                  minimum: 1
                  maximum: 5
                correction:
                  type: string
      responses:
        "200":
          description: Feedback recorded
  /v1/health:
    get:
      summary: Health check
      responses:
        "200":
          description: System healthy
        "503":
          description: System unhealthy
  /v1/metrics:
    get:
      summary: Prometheus metrics
      responses:
        "200":
          description: Metrics in Prometheus format
```
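For callers outside the Rust SDK, a plain HTTP client works against `/v1/query`. The sketch below assumes `reqwest` with its `json` feature as a client-side dependency, which is not part of the Cargo.toml in Section 1.2:

```rust
// Sketch of calling POST /v1/query; assumes reqwest (with the
// "json" feature) is added as a client-side dependency.
use serde_json::json;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = reqwest::Client::new();

    let body = json!({
        "query": "What is machine learning?",
        "constraints": { "max_tokens": 256, "temperature": 0.7 }
    });

    let resp: serde_json::Value = client
        .post("http://localhost:8080/v1/query")
        .json(&body)
        .send()
        .await?
        .error_for_status()?
        .json()
        .await?;

    println!("text: {}", resp["text"]);
    println!("confidence: {}", resp["confidence"]);
    Ok(())
}
```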
### 6.2 Rust SDK
```rust
use futures::StreamExt;
use ruvllm::{Constraints, Feedback, Request, Response, Result, RuvLLM};

/// Simple query
async fn simple_query(llm: &RuvLLM) -> Result<Response> {
    llm.query("What is Rust?").await
}

/// Query with options
async fn query_with_options(llm: &RuvLLM) -> Result<Response> {
    llm.query_with(Request {
        query: "Explain backpropagation".into(),
        session_id: Some("user-123".into()),
        constraints: Constraints {
            max_latency_ms: Some(500),
            max_tokens: Some(500),
            temperature: Some(0.7),
            ..Default::default()
        },
    }).await
}

/// Multi-turn conversation
async fn conversation(llm: &RuvLLM) -> Result<()> {
    let session = llm.new_session();

    let r1 = llm.query_session(&session, "What is a neural network?").await?;
    println!("Turn 1: {}", r1.text);

    let r2 = llm.query_session(&session, "How do you train one?").await?;
    println!("Turn 2: {}", r2.text);

    let r3 = llm.query_session(&session, "What about overfitting?").await?;
    println!("Turn 3: {}", r3.text);

    Ok(())
}

/// Provide feedback
async fn with_feedback(llm: &RuvLLM) -> Result<()> {
    let response = llm.query("What is 2+2?").await?;

    llm.feedback(Feedback {
        request_id: response.request_id,
        rating: 5,
        correction: None,
    }).await?;

    Ok(())
}

/// Stream response
async fn streaming(llm: &RuvLLM) -> Result<()> {
    let mut stream = llm.query_stream("Tell me a story").await?;
    while let Some(chunk) = stream.next().await {
        print!("{}", chunk?);
    }
    Ok(())
}
```
---
## 7. Future Roadmap
### 7.1 Short-Term (1-3 months)
- [ ] LFM2-VL integration (vision-language)
- [ ] Multi-GPU inference with tensor parallelism
- [ ] Retrieval-augmented fine-tuning pipeline
- [ ] Improved compression algorithms
- [ ] WebAssembly deployment target
### 7.2 Medium-Term (3-6 months)
- [ ] Federated learning across edge nodes
- [ ] LFM2-Audio integration (speech)
- [ ] Custom domain fine-tuning toolkit
- [ ] Advanced curriculum learning
- [ ] Hyperbolic embeddings for hierarchies
### 7.3 Long-Term (6-12 months)
- [ ] Multi-agent collaboration
- [ ] Neuro-symbolic reasoning integration
- [ ] Continuous pre-training pipeline
- [ ] Hardware-specific optimizations (NPU, TPU)
- [ ] Enterprise multi-tenancy
---
## 8. Success Criteria
### 8.1 Technical Metrics
| Metric | Target | Current |
|--------|--------|---------|
| Latency P50 | <500ms | - |
| Latency P99 | <2s | - |
| Quality Score | >0.8 | - |
| Router Accuracy | >90% | - |
| Memory Efficiency | <4GB (edge) | - |
| Throughput | 20 QPS (edge) | - |
| Forgetting Rate | <5% per 10K interactions | - |
| Test Coverage | >80% | - |
### 8.2 Business Metrics
| Metric | Target | Notes |
|--------|--------|-------|
| User Satisfaction | >4.0/5.0 | Survey scores |
| Response Relevance | >85% | Human eval |
| Knowledge Retention | >90% | Multi-turn coherence |
| Cost Reduction | >50% | vs. always-big baseline |
---
## 9. Conclusion
RuvLLM represents a paradigm shift from static LLMs to adaptive, self-learning systems. By treating:
- **LFM2 as the stable cortex** (reasoning)
- **Ruvector as the living synaptic mesh** (memory)
- **FastGRNN as the control circuit** (routing)
we create intelligence that emerges from the loop, not just from the model.
The three learning loops—memory growth, router optimization, and concept compression—enable continuous adaptation without the risks of in-place weight modification.
**The intelligence is not in one model anymore. It is in the loop.**
---
*Document Version: 1.0*
*Last Updated: 2025-12-02*
*Author: RuvLLM Architecture Team*