Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'

ruv
2026-02-28 14:39:40 -05:00
7854 changed files with 3522914 additions and 0 deletions

@@ -0,0 +1,544 @@
# Phase 1: Specification
## S - Specification Phase
**Duration**: Weeks 1-2
**Goal**: Define complete requirements, constraints, and success criteria
---
## 1. Product Vision
### 1.1 Mission Statement
**RvLite** is a standalone, WASM-first vector database that brings the capabilities of ruvector-postgres to any environment (browser, Node.js, edge workers, mobile apps) without requiring a PostgreSQL installation.
### 1.2 Target Users
1. **Frontend Developers** - Building AI-powered web apps with in-browser vector search
2. **Edge Developers** - Deploying to serverless/edge environments (Cloudflare Workers, Deno Deploy)
3. **Mobile Developers** - React Native, Capacitor apps with local vector storage
4. **Data Scientists** - Rapid prototyping without infrastructure setup
5. **Embedded Developers** - IoT and embedded devices with limited resources
### 1.3 Use Cases
#### UC-1: In-Browser Semantic Search
```typescript
// User browses documentation site
// All searches happen locally, no backend needed
const db = await RvLite.create();
await db.loadDocuments(docs);
const results = await db.searchSimilar(queryEmbedding);
```
#### UC-2: Edge AI Search
```typescript
// Cloudflare Worker handles product search
// Vector DB runs at the edge, globally distributed
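export default {
  async fetch(request: Request) {
    // Read the search text from the URL, e.g. /search?q=shoes
    const query = new URL(request.url).searchParams.get('q') ?? '';
    const db = await RvLite.create();
    // searchProducts is an application-defined helper that returns a Response
    return searchProducts(db, query);
  }
}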
```
#### UC-3: Knowledge Graph Exploration
```typescript
// Interactive graph visualization in browser
// SPARQL + Cypher queries run client-side
const db = await RvLite.create();
await db.cypher('MATCH (a)-[r]->(b) RETURN a, r, b');
await db.sparql('SELECT ?s ?p ?o WHERE { ?s ?p ?o }');
```
#### UC-4: Self-Learning Agent
```typescript
// AI agent learns from user interactions
// ReasoningBank stores patterns locally
const db = await RvLite.create();
await db.learning.recordTrajectory(state, action, reward);
const nextAction = await db.learning.predictBest(state);
```
---
## 2. Functional Requirements
### 2.1 Core Database Features
#### FR-1: Vector Operations
- **FR-1.1** Support vector types: `vector(n)`, `halfvec(n)`, `binaryvec(n)`, `sparsevec(n)`
- **FR-1.2** Distance metrics: L2, cosine, inner product, L1, Hamming
- **FR-1.3** Vector operations: add, subtract, scale, normalize
- **FR-1.4** SIMD-optimized computations using WASM SIMD
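
The metrics in FR-1.2 can be pinned down with scalar reference implementations, which could also serve as a correctness oracle for the SIMD paths in FR-1.4. This is a minimal sketch, not RvLite's implementation: the function names are hypothetical, and returning a cosine distance of 1 for a zero vector is a convention assumed here.

```typescript
// Scalar reference implementations of the FR-1.2 metrics.
// All names are hypothetical; they are not part of the RvLite API.
type Vec = ArrayLike<number>;

function assertSameLength(a: Vec, b: Vec): void {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
}

// Euclidean (L2) distance.
function l2Distance(a: Vec, b: Vec): number {
  assertSameLength(a, b);
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}

// Manhattan (L1) distance.
function l1Distance(a: Vec, b: Vec): number {
  assertSameLength(a, b);
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += Math.abs(a[i] - b[i]);
  }
  return sum;
}

// Inner (dot) product. Larger means more similar.
function innerProduct(a: Vec, b: Vec): number {
  assertSameLength(a, b);
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

// Cosine distance: 1 - cos(angle between a and b).
// Assumption for this sketch: a zero vector yields distance 1.
function cosineDistance(a: Vec, b: Vec): number {
  const dot = innerProduct(a, b);
  const normA = Math.sqrt(innerProduct(a, a));
  const normB = Math.sqrt(innerProduct(b, b));
  if (normA === 0 || normB === 0) return 1;
  return 1 - dot / (normA * normB);
}

// Hamming distance over bit-packed binary vectors (8 dimensions per byte).
function hammingDistance(a: Uint8Array, b: Uint8Array): number {
  assertSameLength(a, b);
  let count = 0;
  for (let i = 0; i < a.length; i++) {
    let x = a[i] ^ b[i];
    while (x !== 0) {
      count += x & 1;
      x >>= 1;
    }
  }
  return count;
}
```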
#### FR-2: Indexing
- **FR-2.1** HNSW index for approximate nearest neighbor search
- **FR-2.2** Configurable parameters: M (connections), ef_construction, ef_search
- **FR-2.3** Dynamic index updates (insert/delete)
- **FR-2.4** B-Tree index for scalar columns
- **FR-2.5** Triple store indexes (SPO, POS, OSP) for RDF data
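
One way to read FR-2.5: each triple is stored under three orderings so that any lookup with two bound terms hits an index directly. The sketch below assumes string terms and nested in-memory maps; the `TripleIndex` class and its method names are hypothetical and not part of the RvLite API.

```typescript
// Hypothetical in-memory triple index with SPO, POS and OSP orderings.
type Triple = { s: string; p: string; o: string };
type Index = Map<string, Map<string, Set<string>>>;

// Insert (a, b, c) into a three-level index keyed a -> b -> {c}.
function addTo(index: Index, a: string, b: string, c: string): void {
  let second = index.get(a);
  if (!second) {
    second = new Map<string, Set<string>>();
    index.set(a, second);
  }
  let third = second.get(b);
  if (!third) {
    third = new Set<string>();
    second.set(b, third);
  }
  third.add(c);
}

// Read the leaf set for (a, b), or an empty array when nothing matches.
function lookup(index: Index, a: string, b: string): string[] {
  const found = index.get(a)?.get(b);
  return found ? Array.from(found) : [];
}

class TripleIndex {
  private spo: Index = new Map();
  private pos: Index = new Map();
  private osp: Index = new Map();

  // Every triple is written to all three orderings.
  add(t: Triple): void {
    addTo(this.spo, t.s, t.p, t.o);
    addTo(this.pos, t.p, t.o, t.s);
    addTo(this.osp, t.o, t.s, t.p);
  }

  // Pattern (s, p, ?o): served by SPO.
  objects(s: string, p: string): string[] {
    return lookup(this.spo, s, p);
  }

  // Pattern (?s, p, o): served by POS.
  subjects(p: string, o: string): string[] {
    return lookup(this.pos, p, o);
  }

  // Pattern (s, ?p, o): served by OSP.
  predicates(s: string, o: string): string[] {
    return lookup(this.osp, o, s);
  }
}
```

The trade-off in this layout is write amplification: each insert touches three structures in exchange for constant-time access to any two-term pattern.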
#### FR-3: Query Languages
**FR-3.1 SQL Support**
```sql
-- Table creation
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT,
    embedding VECTOR(384)
);

-- Index creation
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

-- Vector search
SELECT *, embedding <=> $1 AS distance
FROM documents
ORDER BY distance
LIMIT 10;

-- Hybrid search
SELECT *
FROM documents
WHERE content ILIKE '%query%'
ORDER BY embedding <=> $1
LIMIT 10;
```
**FR-3.2 SPARQL 1.1 Support**
```sparql
# SELECT queries
SELECT ?subject ?label
WHERE {
  ?subject rdfs:label ?label .
  FILTER(lang(?label) = "en")
}

# CONSTRUCT queries
CONSTRUCT { ?s foaf:knows ?o }
WHERE { ?s :similar_to ?o }

# INSERT/DELETE updates
INSERT DATA {
  <http://example.org/person1> foaf:name "Alice" .
}

# Property paths
SELECT ?person ?friend
WHERE {
  ?person foaf:knows+ ?friend .
}
```
**FR-3.3 Cypher Support**
```cypher
// Pattern matching
MATCH (a:Person)-[:KNOWS]->(b:Person)
WHERE a.age > 30
RETURN a.name, b.name
// Graph creation
CREATE (a:Person {name: 'Alice', embedding: $emb})
CREATE (b:Person {name: 'Bob'})
CREATE (a)-[:KNOWS]->(b)
// Vector-enhanced queries
MATCH (p:Person)
WHERE vector.cosine(p.embedding, $query) > 0.8
RETURN p.name, p.embedding
ORDER BY vector.cosine(p.embedding, $query) DESC
```
#### FR-4: Graph Operations
- **FR-4.1** Graph traversal (BFS, DFS)
- **FR-4.2** Shortest path algorithms (Dijkstra, A*)
- **FR-4.3** Community detection
- **FR-4.4** PageRank and centrality metrics
- **FR-4.5** Vector-enhanced graph search
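
The two traversal orders named in FR-4.1 can be illustrated on a plain adjacency list. This is a sketch under the assumption that nodes are identified by strings and edges are directed; the `Graph` type and function names are illustrative only.

```typescript
// Adjacency list: node id -> ids of nodes reachable by one outgoing edge.
type Graph = Map<string, string[]>;

// Breadth-first traversal: visits nodes in order of hop distance from start.
function bfs(graph: Graph, start: string): string[] {
  const visited = new Set<string>([start]);
  const order: string[] = [];
  const queue: string[] = [start];
  let head = 0; // index-based queue avoids the O(n) cost of shift()
  while (head < queue.length) {
    const node = queue[head++];
    order.push(node);
    for (const next of graph.get(node) ?? []) {
      if (!visited.has(next)) {
        visited.add(next);
        queue.push(next);
      }
    }
  }
  return order;
}

// Depth-first traversal (iterative): follows one branch fully before backtracking.
function dfs(graph: Graph, start: string): string[] {
  const visited = new Set<string>();
  const order: string[] = [];
  const stack: string[] = [start];
  while (stack.length > 0) {
    const node = stack.pop() as string;
    if (visited.has(node)) continue;
    visited.add(node);
    order.push(node);
    const neighbors = graph.get(node) ?? [];
    // Push in reverse so the first-listed neighbor is explored first.
    for (let i = neighbors.length - 1; i >= 0; i--) {
      if (!visited.has(neighbors[i])) stack.push(neighbors[i]);
    }
  }
  return order;
}
```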
#### FR-5: Graph Neural Networks (GNN)
- **FR-5.1** GCN (Graph Convolutional Networks)
- **FR-5.2** GraphSAGE
- **FR-5.3** GAT (Graph Attention Networks)
- **FR-5.4** GIN (Graph Isomorphism Networks)
- **FR-5.5** Node/edge embeddings
- **FR-5.6** Graph classification
#### FR-6: Self-Learning (ReasoningBank)
- **FR-6.1** Trajectory recording (state, action, reward)
- **FR-6.2** Pattern recognition
- **FR-6.3** Memory distillation
- **FR-6.4** Strategy optimization
- **FR-6.5** Verdict judgment
- **FR-6.6** Adaptive learning rates
#### FR-7: Hyperbolic Embeddings
- **FR-7.1** Poincaré disk model
- **FR-7.2** Lorentz/hyperboloid model
- **FR-7.3** Hyperbolic distance metrics
- **FR-7.4** Exponential/logarithmic maps
- **FR-7.5** Hyperbolic neural networks
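
For FR-7.1 and FR-7.3, the standard distance in the Poincaré ball model is `d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))`, defined for points strictly inside the unit ball. A minimal sketch follows; the function names are hypothetical, and rejecting boundary points with an exception is a choice made here rather than specified behavior.

```typescript
// Squared Euclidean norm of a vector.
function squaredNorm(v: ArrayLike<number>): number {
  let sum = 0;
  for (let i = 0; i < v.length; i++) {
    sum += v[i] * v[i];
  }
  return sum;
}

// Geodesic distance between two points of the Poincaré ball (curvature -1).
function poincareDistance(u: ArrayLike<number>, v: ArrayLike<number>): number {
  if (u.length !== v.length) {
    throw new Error("dimension mismatch");
  }
  const nu = squaredNorm(u);
  const nv = squaredNorm(v);
  if (nu >= 1 || nv >= 1) {
    throw new Error("points must lie strictly inside the unit ball");
  }
  let diff = 0;
  for (let i = 0; i < u.length; i++) {
    const d = u[i] - v[i];
    diff += d * d;
  }
  return Math.acosh(1 + (2 * diff) / ((1 - nu) * (1 - nv)));
}
```

Distances grow without bound as points approach the boundary, which is the property that makes this model suited to tree-like hierarchies.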
#### FR-8: Storage & Persistence
**FR-8.1 In-Memory Storage**
- Primary storage: DashMap (concurrent hash maps)
- Fast access: O(1) average-case lookup for primary keys
- Thread-safe concurrent access
**FR-8.2 Persistence Backends**
```typescript
// Browser: IndexedDB
await db.save(); // Saves to IndexedDB
const restored = await RvLite.load(); // Loads from IndexedDB

// Browser: OPFS (Origin Private File System)
await db.saveToOPFS();
await db.loadFromOPFS();

// Node.js/Deno/Bun: File system
await db.saveToFile('database.rvlite');
await RvLite.loadFromFile('database.rvlite');
```
**FR-8.3 Serialization Formats**
- Binary: rkyv (zero-copy deserialization)
- JSON: For debugging and exports
- Apache Arrow: For data exchange
#### FR-9: Transactions (ACID)
- **FR-9.1** Atomic operations (all-or-nothing)
- **FR-9.2** Consistency (integrity constraints)
- **FR-9.3** Isolation (snapshot isolation)
- **FR-9.4** Durability (write-ahead logging)
#### FR-10: Quantization
- **FR-10.1** Binary quantization (1-bit)
- **FR-10.2** Scalar quantization (8-bit)
- **FR-10.3** Product quantization (configurable)
- **FR-10.4** Automatic quantization selection
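
To make FR-10.1 and FR-10.2 concrete, the sketch below shows one common formulation of each: per-vector min/max scalar quantization to 8-bit codes, and sign-based binary quantization packed 8 dimensions per byte. Both are assumptions about how these requirements might be realized; the type and function names are hypothetical.

```typescript
// 8-bit scalar quantization: each component maps to a code in [0, 255].
interface ScalarQuantized {
  codes: Uint8Array;
  min: number;   // value represented by code 0
  scale: number; // value step between adjacent codes
}

function scalarQuantize(v: ArrayLike<number>): ScalarQuantized {
  let min = Infinity;
  let max = -Infinity;
  for (let i = 0; i < v.length; i++) {
    if (v[i] < min) min = v[i];
    if (v[i] > max) max = v[i];
  }
  // Guard against constant (or empty) vectors, where max - min is not positive.
  const scale = max > min ? (max - min) / 255 : 1;
  const codes = new Uint8Array(v.length);
  for (let i = 0; i < v.length; i++) {
    codes[i] = Math.round((v[i] - min) / scale);
  }
  return { codes, min, scale };
}

// Approximate reconstruction; per-component error is at most scale / 2.
function scalarDequantize(q: ScalarQuantized): Float32Array {
  const out = new Float32Array(q.codes.length);
  for (let i = 0; i < q.codes.length; i++) {
    out[i] = q.min + q.codes[i] * q.scale;
  }
  return out;
}

// 1-bit quantization: bit i is set when component i is positive.
function binaryQuantize(v: ArrayLike<number>): Uint8Array {
  const bits = new Uint8Array(Math.ceil(v.length / 8));
  for (let i = 0; i < v.length; i++) {
    if (v[i] > 0) {
      bits[i >> 3] |= 1 << (i & 7);
    }
  }
  return bits;
}
```

Relative to 32-bit floats, the scalar form stores 4x less data per component and the binary form 32x less, at the cost of reconstruction error.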
---
## 3. Non-Functional Requirements
### 3.1 Performance
| Metric | Target | Measurement |
|--------|--------|-------------|
| WASM bundle size | < 6MB gzipped | `gzip -c rvlite_bg.wasm \| wc -c` |
| Initial load time | < 1s | Performance API |
| Query latency (1k vectors) | < 20ms | Benchmark suite |
| Insert throughput | > 10k/s | Benchmark suite |
| Memory usage (100k vectors) | < 200MB | Chrome DevTools |
| HNSW search recall@10 | > 95% | ANN benchmarks |
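
The recall@10 target compares the ids returned by the HNSW index against the exact nearest neighbors from a brute-force scan. A minimal sketch of that measurement, assuming squared L2 distance for the ground truth and ids equal to array positions; `exactTopK` and `recallAtK` are hypothetical helper names, not part of the benchmark suite.

```typescript
// Exact k nearest neighbors by brute force (squared L2 distance).
// Returns vector ids, where an id is the vector's position in `vectors`.
function exactTopK(vectors: number[][], query: number[], k: number): number[] {
  const scored = vectors.map((v, id) => {
    let d = 0;
    for (let i = 0; i < query.length; i++) {
      const diff = v[i] - query[i];
      d += diff * diff;
    }
    return { id, d };
  });
  scored.sort((a, b) => a.d - b.d);
  return scored.slice(0, k).map((s) => s.id);
}

// recall@k = |approximate top-k intersected with exact top-k| / |exact top-k|.
function recallAtK(approxIds: number[], exactIds: number[], k: number): number {
  const truth = new Set<number>(exactIds.slice(0, k));
  if (truth.size === 0) {
    throw new Error("ground truth is empty");
  }
  const seen = new Set<number>();
  let hits = 0;
  for (const id of approxIds.slice(0, k)) {
    // Count each id once, even if the approximate result repeats it.
    if (truth.has(id) && !seen.has(id)) hits++;
    seen.add(id);
  }
  return hits / truth.size;
}
```

Averaging `recallAtK` over a set of held-out queries gives the figure to compare against the 95% target.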
### 3.2 Scalability
| Dimension | Limit | Rationale |
|-----------|-------|-----------|
| Max table size | 10M rows | Memory constraints |
| Max vector dimensions | 4096 | WASM memory limits |
| Max tables | 1000 | Reasonable use case |
| Max indexes per table | 10 | Performance trade-off |
| Max concurrent queries | 100 | WASM thread pool |
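
A quick way to sanity-check these limits against the memory targets is to estimate raw vector storage as rows x dimensions x bytes per component. The sketch below is back-of-the-envelope arithmetic only: it ignores index and metadata overhead, and the 384-dimension figure is borrowed from the SQL example in FR-3.1 rather than stated as a requirement.

```typescript
// Size of the 32-bit WASM linear memory address space (4 GiB).
const WASM32_ADDRESS_SPACE_BYTES = 4 * 1024 * 1024 * 1024;

// Raw storage for the vector payload only (no index, no metadata).
function rawVectorBytes(rows: number, dims: number, bytesPerComponent: number): number {
  return rows * dims * bytesPerComponent;
}

// True when the raw payload alone fits in a wasm32 address space.
function fitsInWasm32(rows: number, dims: number, bytesPerComponent: number): boolean {
  return rawVectorBytes(rows, dims, bytesPerComponent) < WASM32_ADDRESS_SPACE_BYTES;
}

// 100k float32 vectors at 384 dims: 153,600,000 bytes (about 146 MiB).
const hundredK = rawVectorBytes(100000, 384, 4);

// 10M float32 vectors at 384 dims: 15,360,000,000 bytes (about 14.3 GiB).
const tenMillionF32 = rawVectorBytes(10000000, 384, 4);

// 10M binary-quantized vectors at 384 dims (1 bit each): 480,000,000 bytes.
const tenMillionBinary = rawVectorBytes(10000000, 384, 1 / 8);
```

Under these assumptions the 100k-vector case leaves headroom within the 200MB target, while a 10M-row table of 384-dimensional float32 vectors would exceed the 4GB address space on its own, so reaching that row limit would presumably depend on quantization or lower dimensionality.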
### 3.3 Compatibility
**Browser Support**
- Chrome/Edge 91+ (WASM SIMD)
- Firefox 89+ (WASM SIMD)
- Safari 16.4+ (WASM SIMD)
**Runtime Support**
- Node.js 18+
- Deno 1.30+
- Bun 1.0+
- Cloudflare Workers
- Vercel Edge Functions
- Netlify Edge Functions
**Platform Support**
- x86-64 (Intel/AMD)
- ARM64 (Apple Silicon, AWS Graviton)
- WebAssembly (universal)
### 3.4 Security
- **SEC-1** No arbitrary code execution
- **SEC-2** Memory-safe (Rust guarantees)
- **SEC-3** No SQL injection (prepared statements)
- **SEC-4** Sandboxed WASM execution
- **SEC-5** CORS-compliant (browser)
- **SEC-6** No sensitive data in errors
### 3.5 Usability
- **US-1** Zero-config installation: `npm install @rvlite/wasm`
- **US-2** TypeScript-first API with full type definitions
- **US-3** Comprehensive documentation with examples
- **US-4** Error messages with helpful suggestions
- **US-5** Debug logging (optional, configurable)
### 3.6 Maintainability
- **MAIN-1** Test coverage > 90%
- **MAIN-2** CI/CD pipeline (GitHub Actions)
- **MAIN-3** Semantic versioning (semver)
- **MAIN-4** Automated releases
- **MAIN-5** Deprecation warnings (6-month notice)
---
## 4. Constraints
### 4.1 Technical Constraints
**WASM Limitations**
- Single-threaded by default (multi-threading experimental)
- Limited to 4GB memory (32-bit address space)
- No direct file system access (browser)
- No native threads (use Web Workers)
**Rust/WASM Constraints**
- No `std::fs` in `wasm32-unknown-unknown`
- No native threading (`wasm-bindgen-futures` covers async tasks; parallelism requires Web Workers)
- Must use `no_std` or WASM-compatible crates
- Size overhead from Rust std library
### 4.2 Performance Constraints
- WASM typically runs slower than native code (this plan assumes ~2-3x)
- SIMD limited to 128-bit (vs 512-bit AVX-512)
- Garbage collection overhead (JS interop)
- Copy overhead for large data transfers
### 4.3 Resource Constraints
**Development Team**
- 1 developer (8 weeks)
- Community contributions (optional)
**Timeline**
- 8 weeks total
- 2 weeks per major phase
- Beta release by Week 8
**Budget**
- Open source (no monetary budget)
- CI/CD: GitHub Actions (free tier)
- Hosting: npm registry (free)
---
## 5. Success Criteria
### 5.1 Functional Completeness
- [ ] All vector operations working
- [ ] SQL queries execute correctly
- [ ] SPARQL queries pass W3C test suite
- [ ] Cypher queries compatible with Neo4j syntax
- [ ] GNN layers produce correct outputs
- [ ] ReasoningBank learns from trajectories
- [ ] Hyperbolic operations validated
### 5.2 Performance Benchmarks
- [ ] Bundle size < 6MB gzipped
- [ ] Load time < 1s (browser)
- [ ] Query latency < 20ms (1k vectors)
- [ ] HNSW recall@10 > 95%
- [ ] Memory usage < 200MB (100k vectors)
### 5.3 Quality Metrics
- [ ] Test coverage > 90%
- [ ] Zero clippy warnings
- [ ] All examples working
- [ ] Documentation complete
- [ ] API stable (no breaking changes)
### 5.4 Adoption Metrics (Post-Release)
- [ ] 100+ npm downloads/week
- [ ] 10+ GitHub stars
- [ ] 3+ community contributions
- [ ] Featured in blog posts/articles
---
## 6. Out of Scope (v1.0)
### Not Included in Initial Release
- **Multi-user access** - Single-user database only
- **Distributed queries** - No sharding or replication
- **Advanced SQL** - No JOINs, subqueries, CTEs (future)
- **Full-text search** - Basic `LIKE`/`ILIKE` matching only (no Elasticsearch-level search features)
- **Geospatial** - No PostGIS-like features
- **Time series** - No specialized time-series optimizations
- **Streaming queries** - No live query updates
- **Custom UDFs** - No user-defined functions in v1.0
### Future Considerations (v2.0+)
- Multi-threading support (WASM threads)
- Advanced SQL features (JOINs, CTEs)
- Streaming/reactive queries
- Plugin system for extensions
- Custom vector distance metrics
- GPU acceleration (WebGPU)
---
## 7. Dependencies & Licenses
### Rust Crates (MIT/Apache-2.0)
```toml
[dependencies]
wasm-bindgen = "0.2"
serde = { version = "1.0", features = ["derive"] }
serde-wasm-bindgen = "0.6"
js-sys = "0.3"
web-sys = { version = "0.3", features = ["Window", "IdbDatabase"] }
dashmap = "6.0"
parking_lot = "0.12"
simsimd = "5.9"
half = "2.4"
rkyv = "0.8"
once_cell = "1.19"
thiserror = "1.0"
[dev-dependencies]
wasm-bindgen-test = "0.3"
criterion = "0.5"
```
### License
**MIT License** (permissive, compatible with ruvector-postgres)
---
## 8. Risk Analysis
### High Risk
| Risk | Impact | Probability | Mitigation |
|------|--------|-------------|------------|
| WASM size > 10MB | High | Medium | Aggressive tree-shaking, feature gating |
| Performance < 50% of native | High | Medium | WASM SIMD, optimized algorithms |
| Browser compatibility issues | High | Low | Polyfills, fallbacks |
### Medium Risk
| Risk | Impact | Probability | Mitigation |
|------|--------|-------------|------------|
| IndexedDB quota limits | Medium | Medium | OPFS fallback, compression |
| Memory leaks in WASM | Medium | Low | Careful lifetime management |
| Breaking API changes | Medium | Medium | Semver, deprecation warnings |
### Low Risk
| Risk | Impact | Probability | Mitigation |
|------|--------|-------------|------------|
| Dependency vulnerabilities | Low | Low | Dependabot, security audits |
| Documentation outdated | Low | Medium | CI checks, automated validation |
---
## 9. Validation & Acceptance
### 9.1 Validation Methods
**Unit Tests**
```rust
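#[cfg(test)]
mod tests {
    // Bring the function under test into scope (assumed to live in the parent module).
    use super::*;

    #[test]
    fn test_vector_cosine_distance() {
        // Orthogonal vectors have cosine similarity 0, so cosine distance is 1.
        let a = vec![1.0, 0.0, 0.0];
        let b = vec![0.0, 1.0, 0.0];
        let dist = cosine_distance(&a, &b);
        assert!((dist - 1.0).abs() < 0.001);
    }
}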
```
**Integration Tests**
```typescript
import { RvLite } from '@rvlite/wasm';

describe('Vector Search', () => {
  it('should find similar vectors', async () => {
    const db = await RvLite.create();
    await db.sql('CREATE TABLE docs (id INT, vec VECTOR(3))');
    await db.sql('INSERT INTO docs VALUES (1, $1)', [[1, 0, 0]]);
    const results = await db.sql('SELECT * FROM docs ORDER BY vec <=> $1', [[1, 0, 0]]);
    expect(results[0].id).toBe(1);
  });
});
```
**Benchmark Tests**
```rust
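use criterion::{black_box, criterion_group, criterion_main, Criterion};

// `build_hnsw_index` and `random_vector` are helpers assumed to be provided
// by the benchmark harness.
fn bench_hnsw_search(c: &mut Criterion) {
    let index = build_hnsw_index(1000);
    let query = random_vector(384);
    c.bench_function("hnsw_search_1k", |b| {
        b.iter(|| index.search(black_box(&query), 10))
    });
}

// Register the benchmark and generate the `main` entry point.
criterion_group!(benches, bench_hnsw_search);
criterion_main!(benches);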
```
### 9.2 Acceptance Criteria
**Must Have**
- [ ] All functional requirements implemented
- [ ] Performance benchmarks met
- [ ] Test coverage > 90%
- [ ] Documentation complete
- [ ] Examples working in browser, Node.js, Deno
**Should Have**
- [ ] TypeScript types accurate
- [ ] Error messages helpful
- [ ] Debug logging available
- [ ] Migration guide from ruvector-postgres
**Could Have**
- [ ] Interactive playground
- [ ] Video tutorials
- [ ] Community forum
---
## 10. Glossary
| Term | Definition |
|------|------------|
| **WASM** | WebAssembly - binary instruction format for a stack-based virtual machine |
| **HNSW** | Hierarchical Navigable Small World - graph-based ANN algorithm |
| **ANN** | Approximate Nearest Neighbor - fast similarity search |
| **SIMD** | Single Instruction Multiple Data - parallel computation |
| **GNN** | Graph Neural Network - neural networks for graph data |
| **SPARQL** | SPARQL Protocol and RDF Query Language - W3C query language for RDF data |
| **Cypher** | Neo4j's graph query language |
| **ReasoningBank** | Self-learning framework for AI agents |
| **RDF** | Resource Description Framework - semantic web standard |
| **Triple Store** | Database for storing RDF triples (subject-predicate-object) |
| **OPFS** | Origin Private File System - browser file storage API |
| **IndexedDB** | Browser-based NoSQL database |
---
**Next**: [02_API_SPECIFICATION.md](./02_API_SPECIFICATION.md) - Complete API design