RuvLLM: Integration and Deployment
SPARC Phase 5: Completion
1. Integration Strategy
1.1 Crate Structure
ruvector/
├── crates/
│   ├── ruvector-core/        # Existing: Vector DB
│   ├── ruvector-gnn/         # Existing: GNN + EWC + Replay
│   ├── ruvector-attention/   # Existing: Attention mechanisms
│   ├── ruvector-graph/       # Existing: Graph storage
│   └── ruvector-router-core/ # Existing: Routing primitives
│
└── examples/
    └── ruvLLM/               # NEW: Self-learning LLM
        ├── src/
        │   ├── lib.rs            # Main library entry
        │   ├── orchestrator.rs   # Request orchestration
        │   ├── embedding.rs      # LFM2 embedding service
        │   ├── router.rs         # FastGRNN router
        │   ├── memory.rs         # Ruvector memory layer
        │   ├── attention.rs      # Graph attention wrapper
        │   ├── inference.rs      # LFM2 model pool
        │   ├── learning.rs       # Self-learning service
        │   ├── compression.rs    # Concept abstraction
        │   ├── config.rs         # Configuration
        │   ├── types.rs          # Core types
        │   └── error.rs          # Error handling
        ├── tests/
        │   ├── unit/
        │   └── integration/
        ├── benches/
        ├── config/
        └── docs/                 # SPARC documentation
1.2 Dependency Integration
# examples/ruvLLM/Cargo.toml
[package]
name = "ruvllm"
version = "0.1.0"
edition = "2021"
description = "Self-learning LLM with LFM2 and Ruvector integration"
[dependencies]
# Internal dependencies (path-based for development)
ruvector-core = { path = "../../crates/ruvector-core" }
ruvector-gnn = { path = "../../crates/ruvector-gnn" }
ruvector-attention = { path = "../../crates/ruvector-attention" }
ruvector-graph = { path = "../../crates/ruvector-graph" }
ruvector-router-core = { path = "../../crates/ruvector-router-core" }
# LLM inference
llama-cpp-rs = "0.3" # CPU inference via llama.cpp
tokenizers = "0.15" # Fast tokenization
# Async runtime
tokio = { version = "1.41", features = ["full"] }
futures = "0.3"
# Serialization
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
bincode = "2.0.0-rc.3"
# Numerics
ndarray = { version = "0.16", features = ["serde"] }
rand = "0.8"
# Utilities
uuid = { version = "1.11", features = ["v4", "serde"] }
chrono = { version = "0.4", features = ["serde"] }
thiserror = "2.0"
anyhow = "1.0"
tracing = "0.1"
# Performance
dashmap = "6.1"
parking_lot = "0.12"
lru = "0.12"
# Metrics
prometheus = "0.13"
[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports"] }
proptest = "1.5"
tokio-test = "0.4"
tempfile = "3.13"
tracing-subscriber = "0.3"
[features]
default = ["cpu"]
cpu = [] # llama.cpp CPU inference
gpu = ["vllm"] # vLLM GPU inference (optional)
vllm = []
[[bench]]
name = "pipeline"
harness = false
[[bench]]
name = "router"
harness = false
[[bench]]
name = "memory"
harness = false
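The `cpu` and `gpu` features above gate the inference backend at compile time. A minimal sketch of how `inference.rs` could wire this up; the `LlamaCppBackend` and `VllmBackend` names are illustrative stubs, not APIs defined elsewhere in this document:

```rust
// Sketch: compile-time backend selection driven by the Cargo features above.
// LlamaCppBackend and VllmBackend are illustrative stubs, not final types.
pub struct LlamaCppBackend; // would wrap llama-cpp-rs in the real module
pub struct VllmBackend;     // would wrap a vLLM client in the real module

pub enum InferenceBackend {
    #[cfg(feature = "cpu")]
    LlamaCpp(LlamaCppBackend),
    #[cfg(feature = "vllm")]
    Vllm(VllmBackend),
}

impl InferenceBackend {
    /// Prefer the GPU path when compiled in; otherwise fall back to CPU.
    /// (A build with neither backend feature fails here by design.)
    pub fn default_backend() -> Self {
        #[cfg(feature = "vllm")]
        return InferenceBackend::Vllm(VllmBackend);
        #[cfg(all(feature = "cpu", not(feature = "vllm")))]
        return InferenceBackend::LlamaCpp(LlamaCppBackend);
    }
}
```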
1.3 API Surface
//! # RuvLLM - Self-Learning LLM
//!
//! A self-learning language model system integrating LFM2 with Ruvector.
//!
//! ## Architecture
//!
//! - **LFM2**: Frozen reasoning engine (350M-2.6B parameters)
//! - **Ruvector**: Living memory that adapts continuously
//! - **FastGRNN**: Control circuit for intelligent routing
//!
//! ## Quick Start
//!
//! ```rust,ignore
//! use ruvllm::{Config, Result, RuvLLM};
//!
//! #[tokio::main]
//! async fn main() -> Result<()> {
//!     // Initialize system
//!     let config = Config::builder()
//!         .db_path("./memory.db")
//!         .model_path_350m("./models/lfm2-350m-q4.gguf")
//!         .model_path_700m("./models/lfm2-700m-q4.gguf")
//!         .build()?;
//!
//!     let llm = RuvLLM::new(config).await?;
//!
//!     // Process query
//!     let response = llm.query("What is machine learning?").await?;
//!     println!("Response: {}", response.text);
//!     println!("Confidence: {:.2}", response.confidence);
//!
//!     Ok(())
//! }
//! ```
//!
//! ## Self-Learning Loops
//!
//! The system learns through three feedback loops:
//!
//! 1. **Memory Growth**: Every interaction strengthens/weakens graph edges
//! 2. **Router Learning**: FastGRNN learns optimal model selection
//! 3. **Compression**: Periodic summarization creates concept hierarchies
pub mod attention;
pub mod compression;
pub mod config;
pub mod embedding;
pub mod error;
pub mod inference;
pub mod learning;
pub mod memory;
pub mod orchestrator;
pub mod router;
pub mod types;
// Re-exports for convenience
pub use config::{Config, ConfigBuilder};
pub use error::{Error, Result};
pub use orchestrator::RuvLLM;
pub use types::{Constraints, Feedback, Request, Response, Session};
/// Library version
pub const VERSION: &str = env!("CARGO_PKG_VERSION");
2. Implementation Checklist
2.1 Core Components
Phase 1: Foundation
━━━━━━━━━━━━━━━━━━━━
[x] Project structure setup
[x] Cargo.toml with dependencies
[ ] Error types definition
[ ] Configuration system
[ ] Core types (Request, Response, Session; see the sketch after this checklist)
Phase 2: Services
━━━━━━━━━━━━━━━━━━
[ ] EmbeddingService
    [ ] LFM2 encoder wrapper
    [ ] Dimension projection
    [ ] Tokenization
    [ ] Batch processing
[ ] MemoryService
    [ ] VectorDB initialization
    [ ] GraphStore integration
    [ ] HNSW search wrapper
    [ ] Graph expansion
    [ ] Writeback queue
[ ] FastGRNNRouter
    [ ] Cell implementation
    [ ] Sparse matrix operations
    [ ] Low-rank matrices
    [ ] Output heads
    [ ] Training loop
[ ] GraphAttentionEngine
    [ ] Attention layer wrapper
    [ ] Edge feature encoding
    [ ] Multi-head aggregation
    [ ] Context ranking
[ ] InferencePool
    [ ] Model loading
    [ ] Lazy initialization
    [ ] KV cache management
    [ ] LRU eviction
[ ] LearningService
    [ ] Quality judge
    [ ] Replay buffer
    [ ] EWC integration
    [ ] Background training
    [ ] Compression jobs
Phase 3: Orchestration
━━━━━━━━━━━━━━━━━━━━━━
[ ] Orchestrator
    [ ] Request routing
    [ ] Session management
    [ ] Pipeline coordination
    [ ] Metrics collection
    [ ] Error handling
Phase 4: Integration
━━━━━━━━━━━━━━━━━━━━
[ ] Integration tests
[ ] Benchmark suite
[ ] Example applications
[ ] Documentation
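For the core-types item referenced in Phase 1, a hedged sketch of the shapes implied by the SDK examples in Section 6.2; any field not visible in those examples is an assumption:

```rust
// types.rs (sketch): shapes implied by the Section 6.2 examples.
// Fields not shown in those examples are assumptions.
use serde::{Deserialize, Serialize};
use uuid::Uuid;

#[derive(Debug, Clone, Default, Serialize, Deserialize)]
pub struct Constraints {
    pub max_latency_ms: Option<u64>,
    pub max_tokens: Option<u32>,
    pub temperature: Option<f32>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Request {
    pub query: String,
    pub session_id: Option<String>,
    pub constraints: Constraints,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Response {
    pub request_id: Uuid,
    pub text: String,
    pub confidence: f32,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Feedback {
    pub request_id: Uuid,
    pub rating: u8,
    pub correction: Option<String>,
}

/// Opaque handle for multi-turn conversations.
#[derive(Debug, Clone)]
pub struct Session {
    pub id: Uuid,
}
```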
2.2 Test Coverage Requirements
| Component | Unit Tests | Integration | Benchmark |
|---|---|---|---|
| Embedding | 15+ | 3+ | 2 |
| Memory | 20+ | 5+ | 3 |
| Router | 25+ | 5+ | 2 |
| Attention | 15+ | 3+ | 2 |
| Inference | 10+ | 3+ | 2 |
| Learning | 20+ | 5+ | 1 |
| Orchestrator | 10+ | 5+ | 2 |
| Total | 115+ | 29+ | 14 |
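The benchmark column corresponds to the `[[bench]]` targets declared in Cargo.toml. A minimal criterion skeleton for `benches/router.rs`, assuming a synchronous `route()` entry point on the router (that API is an assumption):

```rust
// benches/router.rs (sketch): criterion harness for the FastGRNN router.
// FastGRNNRouter::default() and route() are assumed APIs for illustration.
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_route(c: &mut Criterion) {
    let router = ruvllm::router::FastGRNNRouter::default();
    let features = vec![0.5f32; 64]; // hidden_dim from the server profile

    c.bench_function("router_route", |b| {
        b.iter(|| router.route(black_box(&features)))
    });
}

criterion_group!(benches, bench_route);
criterion_main!(benches);
```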
3. Deployment Configurations
3.1 Edge Deployment (Raspberry Pi / Mobile)
# config/edge.toml
[system]
device_class = "edge"
max_memory_mb = 2048
max_concurrent_requests = 2
[embedding]
model = "onnx" # ONNX for portability
dimension = 384
batch_size = 1
[memory]
hnsw_m = 16
hnsw_ef_construction = 100
hnsw_ef_search = 32
max_nodes = 100_000
[router]
hidden_dim = 32
sparsity = 0.95
confidence_threshold = 0.6
[inference]
models = ["350m"]
quantization = "q4_k"
max_context = 1024
max_loaded_models = 1
[learning]
enabled = true
quality_threshold = 0.8
replay_capacity = 1000
training_interval_ms = 300_000 # 5 minutes
3.2 Server Deployment (CPU)
# config/server-cpu.toml
[system]
device_class = "server"
max_memory_mb = 16384
max_concurrent_requests = 20
[embedding]
model = "lfm2-encoder"
dimension = 768
batch_size = 8
[memory]
hnsw_m = 32
hnsw_ef_construction = 200
hnsw_ef_search = 64
max_nodes = 10_000_000
[router]
hidden_dim = 64
sparsity = 0.9
confidence_threshold = 0.7
[inference]
models = ["700m", "1.2b", "2.6b"]
quantization = "q5_k"
max_context = 4096
max_loaded_models = 2
[learning]
enabled = true
quality_threshold = 0.75
replay_capacity = 100_000
training_interval_ms = 60_000 # 1 minute
3.3 Server Deployment (GPU)
# config/server-gpu.toml
[system]
device_class = "gpu"
max_memory_mb = 32768
max_concurrent_requests = 100
[embedding]
model = "lfm2-encoder"
dimension = 1024
batch_size = 32
[memory]
hnsw_m = 48
hnsw_ef_construction = 300
hnsw_ef_search = 128
max_nodes = 100_000_000
[router]
hidden_dim = 64
sparsity = 0.85
confidence_threshold = 0.75
[inference]
models = ["1.2b", "2.6b"]
quantization = "fp16"
max_context = 8192
max_loaded_models = 2
use_vllm = true
tensor_parallel = 1
[learning]
enabled = true
quality_threshold = 0.7
replay_capacity = 1_000_000
training_interval_ms = 30_000 # 30 seconds
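All three profiles share one schema, so they can deserialize into a single typed struct. A sketch of the loading path in `config.rs`; it assumes a `toml` crate dependency, which Section 1.2 does not yet declare, and elides most sections for brevity:

```rust
// config.rs (sketch): typed deserialization of the TOML profiles above.
// Assumes a `toml` crate dependency not yet listed in Section 1.2.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
pub struct SystemConfig {
    pub device_class: String,
    pub max_memory_mb: u64,
    pub max_concurrent_requests: usize,
}

#[derive(Debug, Deserialize)]
pub struct InferenceConfig {
    pub models: Vec<String>,
    pub quantization: String,
    pub max_context: u32,
    pub max_loaded_models: usize,
}

#[derive(Debug, Deserialize)]
pub struct DeployConfig {
    pub system: SystemConfig,
    pub inference: InferenceConfig,
    // [embedding], [memory], [router], [learning] elided for brevity
}

pub fn load(path: &str) -> anyhow::Result<DeployConfig> {
    let raw = std::fs::read_to_string(path)?;
    Ok(toml::from_str(&raw)?)
}
```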
4. Operational Runbook
4.1 Startup Sequence
#!/bin/bash
# scripts/start.sh
set -e
CONFIG=${1:-"config/server-cpu.toml"}
LOG_LEVEL=${LOG_LEVEL:-"info"}
echo "Starting RuvLLM with config: $CONFIG"
# 1. Validate configuration
cargo run --release --bin ruvllm-validate -- --config "$CONFIG"
# 2. Initialize database if needed
if [ ! -f "data/memory.db" ]; then
    echo "Initializing database..."
    cargo run --release --bin ruvllm-init -- --config "$CONFIG"
fi

# 3. Download models if needed
cargo run --release --bin ruvllm-models -- --config "$CONFIG" --check-or-download

# 4. Start server
RUST_LOG=$LOG_LEVEL cargo run --release --bin ruvllm-server -- \
    --config "$CONFIG" \
    --metrics-port 9090 \
    --http-port 8080
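The flags passed to `ruvllm-server` above would be parsed in the binary's entry point. A sketch assuming `clap` with its derive feature (not in the Section 1.2 dependency list):

```rust
// bin/ruvllm-server.rs (sketch): flag parsing matching scripts/start.sh.
// Assumes clap with the derive feature, which Section 1.2 does not declare.
use clap::Parser;

#[derive(Parser, Debug)]
struct Args {
    #[arg(long)]
    config: String,
    #[arg(long, default_value_t = 9090)]
    metrics_port: u16,
    #[arg(long, default_value_t = 8080)]
    http_port: u16,
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let args = Args::parse();
    tracing::info!(config = %args.config, "starting ruvllm-server");
    // ... load the config, construct RuvLLM, serve the HTTP and metrics ports ...
    Ok(())
}
```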
4.2 Health Checks
use std::sync::Arc;
use serde_json::json;

/// Health check endpoint implementation
pub struct HealthCheck {
    memory: Arc<RuvectorMemory>,
    router: Arc<FastGRNNRouter>,
    inference: Arc<InferencePool>,
}

impl HealthCheck {
    pub async fn check(&self) -> HealthStatus {
        let mut status = HealthStatus::default();

        // Check memory service
        status.memory = match self.memory.ping().await {
            Ok(latency) => ComponentHealth::Healthy { latency_ms: latency, details: None },
            Err(e) => ComponentHealth::Unhealthy { error: e.to_string() },
        };

        // Check router
        status.router = match self.router.ping() {
            Ok(latency) => ComponentHealth::Healthy { latency_ms: latency, details: None },
            Err(e) => ComponentHealth::Unhealthy { error: e.to_string() },
        };

        // Check inference (at least one model loadable)
        status.inference = match self.inference.health_check().await {
            Ok(info) => ComponentHealth::Healthy {
                latency_ms: info.latency,
                details: Some(json!({
                    "loaded_models": info.loaded_models,
                    "available_memory": info.available_memory,
                })),
            },
            Err(e) => ComponentHealth::Unhealthy { error: e.to_string() },
        };

        status.overall = if status.all_healthy() {
            OverallHealth::Healthy
        } else if status.any_critical() {
            OverallHealth::Critical
        } else {
            OverallHealth::Degraded
        };
        status
    }
}
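Section 6.1 exposes this check as `/v1/health`, returning 200 or 503. A small sketch of the mapping, using the `HealthStatus` and `OverallHealth` types from the snippet above (the choice to keep serving while degraded is an assumption):

```rust
// Sketch: map overall health to the /v1/health status codes of Section 6.1.
pub fn health_status_code(status: &HealthStatus) -> u16 {
    match status.overall {
        OverallHealth::Healthy => 200,
        // Degraded still serves traffic; details go in the response body.
        OverallHealth::Degraded => 200,
        OverallHealth::Critical => 503,
    }
}
```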
4.3 Monitoring Dashboards
# Prometheus alerting rules
groups:
  - name: ruvllm
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum(rate(ruvllm_request_latency_seconds_bucket[5m])) by (le)) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "RuvLLM P95 latency above 1s"

      - alert: LowQualityScore
        expr: avg(ruvllm_quality_score) < 0.7
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Average quality score dropped below 0.7"

      - alert: MemoryPressure
        expr: ruvllm_memory_usage_bytes / ruvllm_memory_limit_bytes > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Memory usage above 90%"

      - alert: RouterLowConfidence
        expr: avg(ruvllm_router_confidence) < 0.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Router confidence consistently low"

      - alert: HighErrorRate
        # Failure ratio; assumes a ruvllm_requests_total counter is exported.
        expr: rate(ruvllm_errors_total[5m]) / rate(ruvllm_requests_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 10%"
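These expressions assume the metric families below are exported. A sketch of registering them with the `prometheus` crate from Section 1.2; bucket boundaries are assumptions:

```rust
// Sketch: register the metric families the alerting rules query.
// Bucket boundaries and the label-free shapes are assumptions.
use prometheus::{Gauge, Histogram, HistogramOpts, IntCounter, Registry};

pub fn register_metrics(
    registry: &Registry,
) -> prometheus::Result<(Histogram, IntCounter, Gauge)> {
    let latency = Histogram::with_opts(
        HistogramOpts::new(
            "ruvllm_request_latency_seconds",
            "End-to-end request latency",
        )
        .buckets(vec![0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0]),
    )?;
    let errors = IntCounter::new("ruvllm_errors_total", "Total failed requests")?;
    let quality = Gauge::new("ruvllm_quality_score", "Latest response quality score")?;

    registry.register(Box::new(latency.clone()))?;
    registry.register(Box::new(errors.clone()))?;
    registry.register(Box::new(quality.clone()))?;
    Ok((latency, errors, quality))
}
```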
4.4 Backup and Recovery
#!/bin/bash
# scripts/backup.sh
set -e

# NOTE: stop the server (or take a filesystem snapshot) before copying;
# backing up a live database can produce an inconsistent copy.

BACKUP_DIR="/backups/ruvllm/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$BACKUP_DIR"
echo "Creating backup in $BACKUP_DIR"

# 1. Backup memory database
cp data/memory.db "$BACKUP_DIR/memory.db"
# 2. Backup router weights
cp data/router_weights.bin "$BACKUP_DIR/router_weights.bin"
# 3. Backup EWC state
cp data/ewc_state.bin "$BACKUP_DIR/ewc_state.bin"
# 4. Backup replay buffer
cp data/replay_buffer.bin "$BACKUP_DIR/replay_buffer.bin"
# 5. Backup configuration
cp -r config/ "$BACKUP_DIR/config/"

# 6. Create manifest
cat > "$BACKUP_DIR/manifest.json" << EOF
{
  "timestamp": "$(date -Iseconds)",
  "version": "$(cargo run --release --bin ruvllm-version)",
  "components": {
    "memory_db": "memory.db",
    "router_weights": "router_weights.bin",
    "ewc_state": "ewc_state.bin",
    "replay_buffer": "replay_buffer.bin",
    "config": "config/"
  }
}
EOF

echo "Backup complete: $BACKUP_DIR"

# 7. Upload to S3 if configured
if [ -n "$S3_BACKUP_BUCKET" ]; then
    aws s3 sync "$BACKUP_DIR" "s3://$S3_BACKUP_BUCKET/$(basename "$BACKUP_DIR")/"
    echo "Uploaded to S3: $S3_BACKUP_BUCKET"
fi
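Recovery should verify a backup before restoring it. A minimal sketch that checks every component listed in the `manifest.json` written above actually exists in the backup directory:

```rust
// Sketch: pre-restore check that each manifest component file exists.
use std::path::Path;

pub fn verify_backup(dir: &Path) -> anyhow::Result<()> {
    let raw = std::fs::read_to_string(dir.join("manifest.json"))?;
    let manifest: serde_json::Value = serde_json::from_str(&raw)?;
    let components = manifest["components"]
        .as_object()
        .ok_or_else(|| anyhow::anyhow!("manifest missing components"))?;

    for (name, rel) in components {
        let path = dir.join(rel.as_str().unwrap_or_default());
        anyhow::ensure!(path.exists(), "missing backup component {name}: {path:?}");
    }
    Ok(())
}
```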
5. Production Checklist
5.1 Pre-Launch
Security
━━━━━━━━
[ ] Input validation and sanitization
[ ] Rate limiting configured
[ ] TLS/HTTPS enabled
[ ] API authentication (if public)
[ ] Secrets in environment variables
[ ] Model integrity verification
Performance
━━━━━━━━━━━
[ ] Load tested to expected traffic
[ ] Memory profiled (no leaks)
[ ] Latency targets met
[ ] Caching configured
[ ] Connection pooling
Reliability
━━━━━━━━━━━
[ ] Health checks implemented
[ ] Graceful shutdown
[ ] Automatic restarts (systemd/k8s)
[ ] Backup procedures tested
[ ] Recovery procedures documented
Observability
━━━━━━━━━━━━━
[ ] Structured logging
[ ] Metrics exported
[ ] Distributed tracing
[ ] Alerting rules configured
[ ] Dashboards created
5.2 Post-Launch
Daily
━━━━━
[ ] Check error rates
[ ] Review quality scores
[ ] Monitor latency trends
[ ] Verify backup success
Weekly
━━━━━━
[ ] Review router decisions distribution
[ ] Analyze forgetting metrics
[ ] Check memory growth rate
[ ] Run compression job
[ ] Update router weights
Monthly
━━━━━━━
[ ] Full system backup
[ ] Performance benchmark
[ ] Security audit
[ ] Dependency updates
[ ] Evaluate student model candidates
6. API Reference
6.1 HTTP API
openapi: "3.0.0"
info:
  title: RuvLLM API
  version: "0.1.0"
  description: Self-learning LLM with LFM2 and Ruvector
paths:
  /v1/query:
    post:
      summary: Process a query
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              required:
                - query
              properties:
                query:
                  type: string
                  description: The user query
                session_id:
                  type: string
                  description: Optional session for multi-turn
                constraints:
                  type: object
                  properties:
                    max_latency_ms:
                      type: integer
                    max_tokens:
                      type: integer
                    temperature:
                      type: number
      responses:
        "200":
          description: Successful response
          content:
            application/json:
              schema:
                type: object
                properties:
                  text:
                    type: string
                  confidence:
                    type: number
                  sources:
                    type: array
                    items:
                      type: object
                  routing_info:
                    type: object
  /v1/feedback:
    post:
      summary: Provide feedback on a response
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              required:
                - request_id
              properties:
                request_id:
                  type: string
                rating:
                  type: integer
                  minimum: 1
                  maximum: 5
                correction:
                  type: string
      responses:
        "200":
          description: Feedback recorded
  /v1/health:
    get:
      summary: Health check
      responses:
        "200":
          description: System healthy
        "503":
          description: System unhealthy
  /v1/metrics:
    get:
      summary: Prometheus metrics
      responses:
        "200":
          description: Metrics in Prometheus format
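For callers outside the Rust SDK, a query against `/v1/query` sketched with the `reqwest` crate (not among the Section 1.2 dependencies):

```rust
// Sketch: POST /v1/query over HTTP. Assumes the `reqwest` crate with its
// `json` feature, which Section 1.2 does not declare.
use serde_json::json;

async fn query_http(base: &str, query: &str) -> anyhow::Result<serde_json::Value> {
    let client = reqwest::Client::new();
    let resp = client
        .post(format!("{base}/v1/query"))
        .json(&json!({
            "query": query,
            "constraints": { "max_tokens": 256 }
        }))
        .send()
        .await?
        .error_for_status()?;
    Ok(resp.json().await?)
}
```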
6.2 Rust SDK
use futures::StreamExt;
use ruvllm::{Constraints, Feedback, Request, Response, Result, RuvLLM};

/// Simple query
async fn simple_query(llm: &RuvLLM) -> Result<Response> {
    llm.query("What is Rust?").await
}

/// Query with options
async fn query_with_options(llm: &RuvLLM) -> Result<Response> {
    llm.query_with(Request {
        query: "Explain backpropagation".into(),
        session_id: Some("user-123".into()),
        constraints: Constraints {
            max_latency_ms: Some(500),
            max_tokens: Some(500),
            temperature: Some(0.7),
            ..Default::default()
        },
    }).await
}

/// Multi-turn conversation
async fn conversation(llm: &RuvLLM) -> Result<()> {
    let session = llm.new_session();

    let r1 = llm.query_session(&session, "What is a neural network?").await?;
    println!("Turn 1: {}", r1.text);

    let r2 = llm.query_session(&session, "How do you train one?").await?;
    println!("Turn 2: {}", r2.text);

    let r3 = llm.query_session(&session, "What about overfitting?").await?;
    println!("Turn 3: {}", r3.text);

    Ok(())
}

/// Provide feedback
async fn with_feedback(llm: &RuvLLM) -> Result<()> {
    let response = llm.query("What is 2+2?").await?;

    llm.feedback(Feedback {
        request_id: response.request_id,
        rating: 5,
        correction: None,
    }).await?;

    Ok(())
}

/// Stream response
async fn streaming(llm: &RuvLLM) -> Result<()> {
    let mut stream = llm.query_stream("Tell me a story").await?;
    while let Some(chunk) = stream.next().await {
        print!("{}", chunk?);
    }
    Ok(())
}
7. Future Roadmap
7.1 Short-Term (1-3 months)
- LFM2-VL integration (vision-language)
- Multi-GPU inference with tensor parallelism
- Retrieval-augmented fine-tuning pipeline
- Improved compression algorithms
- WebAssembly deployment target
7.2 Medium-Term (3-6 months)
- Federated learning across edge nodes
- LFM2-Audio integration (speech)
- Custom domain fine-tuning toolkit
- Advanced curriculum learning
- Hyperbolic embeddings for hierarchies
7.3 Long-Term (6-12 months)
- Multi-agent collaboration
- Neuro-symbolic reasoning integration
- Continuous pre-training pipeline
- Hardware-specific optimizations (NPU, TPU)
- Enterprise multi-tenancy
8. Success Criteria
8.1 Technical Metrics
| Metric | Target | Current |
|---|---|---|
| Latency P50 | <500ms | - |
| Latency P99 | <2s | - |
| Quality Score | >0.8 | - |
| Router Accuracy | >90% | - |
| Memory Efficiency | <4GB (edge) | - |
| Throughput | 20 QPS (edge) | - |
| Forgetting Rate | <5% per 10K interactions | - |
| Test Coverage | >80% | - |
8.2 Business Metrics
| Metric | Target | Notes |
|---|---|---|
| User Satisfaction | >4.0/5.0 | Survey scores |
| Response Relevance | >85% | Human eval |
| Knowledge Retention | >90% | Multi-turn coherence |
| Cost Reduction | >50% | vs. always-big baseline |
9. Conclusion
RuvLLM represents a paradigm shift from static LLMs to adaptive, self-learning systems. By treating:
- LFM2 as the stable cortex (reasoning)
- Ruvector as the living synaptic mesh (memory)
- FastGRNN as the control circuit (routing)
we create intelligence that emerges from the loop, not just the model.
The three learning loops—memory growth, router optimization, and concept compression—enable continuous adaptation without the risks of in-place weight modification.
The intelligence is not in one model anymore. It is in the loop.
Document Version: 1.0
Last Updated: 2025-12-02
Author: RuvLLM Architecture Team