Merge commit 'd803bfe2b1fe7f5e219e50ac20d6801a0a58ac75' as 'vendor/ruvector'

This commit is contained in:
ruv
2026-02-28 14:39:40 -05:00
7854 changed files with 3522914 additions and 0 deletions

# RuVector Postgres v2 - Architecture Overview
<!-- Last reviewed: 2025-12-25 -->
## What We're Building
Most databases, including vector databases, are **performance-first systems**. They optimize for speed, recall, and throughput, then bolt on monitoring. Structural safety is assumed, not measured.
RuVector does something different.
We give the system a **continuous, internal measure of its own structural integrity**, and the ability to **change its own behavior based on that signal**.
This puts RuVector in a very small class of systems.
---
## Why This Actually Matters
### 1. From Symptom Monitoring to Causal Monitoring
Everyone else watches outputs: latency, errors, recall.
We watch **connectivity and dependence**, which are upstream causes.
By the time latency spikes, the graph has already weakened. We detect that weakening while everything still looks healthy.
> **This is the difference between a smoke alarm and a structural stress sensor.**
### 2. Mincut Is a Leading Indicator, Not a Metric
Mincut answers a question no metric answers:
> *"How close is this system to splitting?"*
Not how slow it is. Not how many errors. **How close it is to losing coherence.**
That is a different axis of observability.
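To make the signal concrete, here is a minimal, illustrative sketch — not the engine's algorithm (production computes on a contracted graph with incremental updates) — that finds a global minimum cut by brute force and returns the witness edges:

```python
from itertools import combinations

def global_mincut(nodes, edges):
    """Brute-force global min cut over all bipartitions.

    edges: iterable of (u, v, weight). Returns (cut_value, witness_edges),
    where the witness edges are the links that would fail if the system split.
    Only viable for tiny graphs; shown purely to illustrate the signal.
    """
    nodes = list(nodes)
    best_value, best_witness = float("inf"), []
    for r in range(1, len(nodes)):
        for side in map(set, combinations(nodes, r)):
            value = sum(w for u, v, w in edges if (u in side) != (v in side))
            if value < best_value:
                best_value = value
                best_witness = [(u, v) for u, v, _ in edges
                                if (u in side) != (v in side)]
    return best_value, best_witness

# Two well-connected clusters joined by a single weak edge:
edges = [("a", "b", 1), ("b", "c", 1), ("a", "c", 1),
         ("d", "e", 1), ("e", "f", 1), ("d", "f", 1),
         ("c", "d", 1)]
lam, witness = global_mincut("abcdef", edges)
# lam == 1: the system is one edge away from splitting, and the
# witness names that edge: [("c", "d")]
```

Latency on this toy graph could look perfectly healthy while lambda is already 1 — which is exactly the point.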
### 3. An Algorithm Becomes a Control Signal
Most people use graph algorithms for analysis. We use mincut to **gate behavior**.
That makes it a **control plane**, not analytics.
Very few production systems have mathematically grounded control loops.
### 4. Failure Mode Changes Class
| Without Integrity Control | With Integrity Control |
|---------------------------|------------------------|
| Fast → stressed → cascading failure → manual recovery | Fast → stressed → scope reduction → graceful degradation → automatic recovery |
Changing failure mode is what separates hobby systems from infrastructure.
### 5. Explainable Operations
The **witness edges** are huge.
When something slows down or freezes, we can say: *"Here are the exact links that would have failed next."*
That is gold in production, audits, and regulated environments.
---
## Why Nobody Else Has Done This
Not because it's impossible. Because:
1. **Most systems don't model themselves as graphs** — we do
2. **Mincut was too expensive dynamically** — we use contracted graphs (~1000 nodes, not millions)
3. **Ops culture reacts, it doesn't preempt** — we preempt
4. **Survivability isn't a KPI until after outages** — we measure it continuously
---
## The Honest Framing
Will this get applause from model benchmarks or social media? No.
Will this make systems boringly reliable and therefore indispensable? Yes.
Those are the ideas that end up everywhere.
**We're not making vector search faster. We're making vector infrastructure survivable.**
---
## What This Is, Concretely
RuVector Postgres v2 is a **PostgreSQL extension** (built with pgrx) that provides:
- **100% pgvector compatibility** — drop-in replacement: change the extension name and existing queries run unchanged
- **Architecture separation** — PostgreSQL handles ACID/joins, RuVector handles vectors/graphs/learning
- **Dynamic mincut integrity gating** — the control plane described above
- **Self-learning pipeline** — GNN-based query optimization that improves over time
- **Tiered storage** — automatic hot/warm/cool/cold management with compression
- **Graph engine with Cypher** — property graphs with SQL joins
---
## Architecture Principles
### Separation of Concerns
```
+------------------------------------------------------------------+
| PostgreSQL Frontend |
| (SQL Parsing, Planning, ACID, Transactions, Joins, Aggregates) |
+------------------------------------------------------------------+
|
v
+------------------------------------------------------------------+
| Extension Boundary (pgrx) |
| - Type definitions (vector, sparsevec, halfvec) |
| - Operator overloads (<->, <=>, <#>) |
| - Index access method hooks |
| - Background worker registration |
+------------------------------------------------------------------+
|
v
+------------------------------------------------------------------+
| RuVector Engine (Rust) |
| - HNSW/IVFFlat indexing |
| - SIMD distance calculations |
| - Graph storage & Cypher execution |
| - GNN training & inference |
| - Compression & tiering |
| - Mincut integrity control |
+------------------------------------------------------------------+
```
### Core Design Decisions
| Decision | Rationale |
|----------|-----------|
| **pgrx for extension** | Safe Rust bindings, modern build system, well-maintained |
| **Background worker pattern** | Long-lived engine, avoid per-query initialization |
| **Shared memory IPC** | Bounded request queue with explicit payload limits (see [02-background-workers](02-background-workers.md)) |
| **WAL as source of truth** | Leverage Postgres replication, durability guarantees |
| **Contracted mincut graph** | Never compute on the full similarity graph; operate on the contracted operational graph |
| **Hybrid consistency** | Synchronous hot tier, async background ops (see [10-consistency-replication](10-consistency-replication.md)) |
---
## System Architecture
### High-Level Components
```
+-----------------------+
| Client Application |
+-----------+-----------+
|
+-----------v-----------+
| PostgreSQL |
| +-----------------+ |
| | Query Executor | |
| +--------+--------+ |
| | |
| +--------v--------+ |
| | RuVector SQL | |
| | Surface Layer | |
| +--------+--------+ |
+-----------|----------+
|
+--------------------+--------------------+
| |
+----------v----------+ +-----------v-----------+
| Index AM Hooks | | Background Workers |
| (HNSW, IVFFlat) | | (Maintenance, GNN) |
+----------+----------+ +-----------+-----------+
| |
+--------------------+--------------------+
|
+-----------v-----------+
| Shared Memory |
| Communication |
+-----------+-----------+
|
+-----------v-----------+
| RuVector Engine |
| +-------+ +-------+ |
| | Index | | Graph | |
| +-------+ +-------+ |
| +-------+ +-------+ |
| | GNN | | Tier | |
| +-------+ +-------+ |
| +------------------+|
| | Integrity Ctrl ||
| +------------------+|
+-----------------------+
```
### Component Responsibilities
#### 1. SQL Surface Layer
- **pgvector type compatibility**: `vector(n)`, operators `<->`, `<#>`, `<=>`
- **Extended types**: `sparsevec`, `halfvec`, `binaryvec`
- **Function catalog**: `ruvector_*` functions for advanced features
- **Views**: `ruvector_nodes`, `ruvector_edges`, `ruvector_hyperedges`
#### 2. Index Access Methods
- **ruhnsw**: HNSW index with configurable M, ef_construction
- **ruivfflat**: IVF-Flat index with automatic centroid updates
- **Scan hooks**: Route queries to RuVector engine
- **Build hooks**: Incremental and bulk index construction
#### 3. Background Workers
- **Engine Worker**: Long-lived RuVector engine instance
- **Maintenance Worker**: Tiering, compaction, statistics
- **GNN Training Worker**: Periodic model updates
- **Integrity Worker**: Mincut sampling and state updates
#### 4. RuVector Engine
- **Index Manager**: HNSW/IVFFlat in-memory structures
- **Graph Store**: Property graph with Cypher support
- **GNN Pipeline**: Training data capture, model inference
- **Tier Manager**: Hot/warm/cool/cold classification
- **Integrity Controller**: Mincut-based operation gating
---
## Feature Matrix
### Phase 1: pgvector Compatibility (Foundation)
| Feature | Status | Description |
|---------|--------|-------------|
| `vector(n)` type | Core | Dense vector storage |
| `<->` operator | Core | L2 (Euclidean) distance |
| `<=>` operator | Core | Cosine distance |
| `<#>` operator | Core | Negative inner product |
| HNSW index | Core | `CREATE INDEX ... USING hnsw` |
| IVFFlat index | Core | `CREATE INDEX ... USING ivfflat` |
| `vector_l2_ops` | Core | Operator class for L2 |
| `vector_cosine_ops` | Core | Operator class for cosine |
| `vector_ip_ops` | Core | Operator class for inner product |
### Phase 2: Tiered Storage & Compression
| Feature | Status | Description |
|---------|--------|-------------|
| `ruvector_set_tiers()` | v2 | Configure tier thresholds |
| `ruvector_compact()` | v2 | Trigger manual compaction |
| Access frequency tracking | v2 | Background counter updates |
| Automatic tier promotion/demotion | v2 | Policy-based migration |
| SQ8/PQ compression | v2 | Transparent quantization |
### Phase 3: Graph Engine & Cypher
| Feature | Status | Description |
|---------|--------|-------------|
| `ruvector_cypher()` | v2 | Execute Cypher queries |
| `ruvector_nodes` view | v2 | Graph nodes as relations |
| `ruvector_edges` view | v2 | Graph edges as relations |
| `ruvector_hyperedges` view | v2 | Hyperedge support |
| SQL-graph joins | v2 | Mix Cypher with SQL |
### Phase 4: Integrity Control Plane
| Feature | Status | Description |
|---------|--------|-------------|
| `ruvector_integrity_sample()` | v2 | Sample contracted graph |
| `ruvector_integrity_policy_set()` | v2 | Configure policies |
| `ruvector_integrity_gate()` | v2 | Check operation permission |
| Integrity states | v2 | normal/stress/critical |
| Signed audit events | v2 | Cryptographic audit trail |
---
## Data Flow Patterns
### Vector Search (Read Path)
```
1. Client: SELECT ... ORDER BY embedding <-> $query LIMIT k
2. PostgreSQL Planner:
- Recognizes index on embedding column
- Generates Index Scan plan using ruhnsw
3. Index AM (amgettuple):
- Submits search request to shared memory queue
- Engine worker receives request
4. RuVector Engine:
- Checks integrity gate (normal state: proceed)
- Executes HNSW greedy search
- Applies post-filter if needed
- Returns top-k with distances
5. Index AM:
- Fetches results from shared memory
- Returns TIDs to executor
6. PostgreSQL Executor:
- Fetches heap tuples
- Applies remaining WHERE clauses
- Returns to client
```
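Step 4's greedy search can be pictured as a best-first walk over a proximity graph. The sketch below is a single-layer, pure-Python illustration (real HNSW is multi-layer and the engine implements it in Rust); graph layout and distance function are toy assumptions:

```python
import heapq

def greedy_search(graph, dist, entry, k, ef):
    """Best-first k-NN over a proximity graph (single-layer HNSW sketch).

    graph: node -> list of neighbor nodes; dist: node -> distance to query.
    Keeps the ef best nodes seen; stops once the frontier cannot improve them.
    """
    visited = {entry}
    candidates = [(dist(entry), entry)]   # min-heap: closest frontier first
    results = [(-dist(entry), entry)]     # max-heap: worst of the ef best on top
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -results[0][0] and len(results) >= ef:
            break                          # frontier can no longer improve results
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(nb)
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > ef:
                    heapq.heappop(results)
    return [n for _, n in sorted((-d, n) for d, n in results)][:k]

# 1-D toy: points 0..9, each linked to its neighbors within distance 2
graph = {i: [j for j in range(max(0, i - 2), min(10, i + 3)) if j != i]
         for i in range(10)}
top2 = greedy_search(graph, dist=lambda n: abs(n - 7.3), entry=0, k=2, ef=4)
# top2 == [7, 8]
```

The `ef` parameter plays the same role as `ef_search` in the GUCs below: a larger beam trades latency for recall.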
### Vector Insert (Write Path)
```
1. Client: INSERT INTO items (embedding) VALUES ($vec)
2. PostgreSQL Executor:
- Assigns TID, writes heap tuple
- Generates WAL record
3. Index AM (aminsert):
- Checks integrity gate (normal: proceed, stress: throttle)
- Submits insert to engine queue
4. RuVector Engine:
- Integrates vector into HNSW graph
- Updates tier counters
- Writes to hot tier
5. WAL Writer:
- Persists operation for durability
6. Replication (if configured):
- Streams WAL to replicas
- Replicas apply via engine
```
### Integrity Gating
```
1. Background Worker (periodic):
- Samples contracted operational graph
- Computes lambda_cut (minimum cut value) on contracted graph
- Optionally computes lambda2 (algebraic connectivity) as drift signal
- Updates integrity state in shared memory
2. Any Operation:
- Reads current integrity state
- normal (lambda > T_high): allow all
- stress (T_low < lambda < T_high): throttle bulk ops
- critical (lambda < T_low): freeze mutations
3. On State Change:
- Logs signed integrity event
- Notifies waiting operations
- Adjusts background worker priorities
```
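A minimal sketch of the gate's decision logic. Threshold values and operation class names here are illustrative placeholders; the real policies and hysteresis are specified in [04-integrity-events](04-integrity-events.md):

```python
class IntegrityGate:
    """Three-state gate driven by the sampled cut value lambda_cut."""

    BULK_OPS = {"bulk_insert", "bulk_delete", "reindex"}   # illustrative classes

    def __init__(self, t_low=4.0, t_high=8.0):             # illustrative thresholds
        self.t_low, self.t_high = t_low, t_high
        self.state = "normal"

    def update(self, lambda_cut):
        """Called by the background integrity worker after each sample."""
        if lambda_cut <= self.t_low:
            self.state = "critical"
        elif lambda_cut <= self.t_high:
            self.state = "stress"
        else:
            self.state = "normal"
        return self.state

    def allow(self, op):
        """Called on the hot path before executing an operation."""
        if self.state == "normal":
            return True
        if self.state == "stress":
            return op not in self.BULK_OPS   # throttle bulk work only
        return op == "read"                  # critical: freeze all mutations

gate = IntegrityGate()
gate.update(10.0)   # normal: everything allowed
gate.update(5.0)    # stress: bulk operations refused, point writes continue
gate.update(2.0)    # critical: reads only
```

Note that the hot path only reads a precomputed state; the expensive sampling stays in the background worker.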
---
## Deployment Modes
### Mode 1: Single Postgres Embedded
```
+--------------------------------------------+
| PostgreSQL Instance |
| +--------------------------------------+ |
| | RuVector Extension | |
| | +--------+ +---------+ +-------+ | |
| | | Engine | | Workers | | Index | | |
| | +--------+ +---------+ +-------+ | |
| +--------------------------------------+ |
| |
| +--------------------------------------+ |
| | Data Directory | |
| | vectors/ graphs/ indexes/ wal/ | |
| +--------------------------------------+ |
+--------------------------------------------+
```
**Use case**: Development, small-medium deployments (< 100M vectors)
### Mode 2: Postgres + RuVector Cluster
```
+------------------+ +------------------+
| PostgreSQL 1 | | PostgreSQL 2 |
| (Primary) | | (Replica) |
+--------+---------+ +--------+---------+
| |
| WAL Stream | WAL Apply
| |
+--------v-------------------------v---------+
| RuVector Cluster |
| +-------+ +-------+ +-------+ +------+ |
| | Node1 | | Node2 | | Node3 | | ... | |
| +-------+ +-------+ +-------+ +------+ |
| |
| Distributed HNSW | Sharded Graph | GNN |
+---------------------------------------------+
```
**Use case**: Production, large deployments (100M+ vectors)
### v2 Cluster Mode Clarification
```
+------------------------------------------------------------------+
| CLUSTER DEPLOYMENT DECISION |
+------------------------------------------------------------------+
v2 cluster mode is a SEPARATE SERVICE with a stable RPC API.
The Postgres extension acts as a CLIENT to the cluster.
ARCHITECTURE OPTIONS:
Option A: SIDECAR (per Postgres instance)
• RuVector cluster node co-located with each Postgres
• Pros: Low latency, simple networking
• Cons: Resource contention, harder to scale independently
• Use when: Latency-sensitive, moderate scale
Option B: SHARED SERVICE (separate cluster)
• Dedicated RuVector cluster serving multiple Postgres instances
• Pros: Independent scaling, resource isolation
• Cons: Network latency, requires service discovery
• Use when: Large scale, multi-tenant
PROTOCOL:
• gRPC with protobuf serialization
• mTLS for authentication
• Connection pooling in extension
PARTITION ASSIGNMENT:
• Consistent hashing for shard routing
• Automatic rebalancing on node join/leave
• Partition map cached in extension shared memory
PARTITION MAP VERSIONING AND FENCING:
• partition_map_version: monotonic counter incremented on any change
• lease_epoch: obtained from cluster leader, prevents split-brain
• Extension rejects stale map updates unless epoch matches current
• On leader failover:
1. New leader increments epoch
2. Extensions must re-fetch map with new epoch
3. Stale-epoch operations return ESTALE, client retries
v2 RECOMMENDATION:
Start with Mode 1 (embedded). Add cluster mode only when:
• Dataset exceeds single-node memory
• Need independent scaling of compute/storage
• Multi-region deployment required
+------------------------------------------------------------------+
```
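The routing and fencing rules above can be sketched together. This is a toy model, not the extension's code: node names, vnode count, and the `ESTALE` shape are assumptions for illustration:

```python
import bisect
import hashlib

def _h(key: str) -> int:
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

class PartitionMap:
    """Consistent-hash ring carrying a version and a lease epoch (sketch)."""

    def __init__(self, nodes, version, epoch, vnodes=64):
        self.version, self.epoch = version, epoch
        ring = sorted((_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._keys = [h for h, _ in ring]
        self._nodes = [n for _, n in ring]

    def route(self, shard_key: str) -> str:
        """First ring entry clockwise of the key's hash owns the shard."""
        i = bisect.bisect(self._keys, _h(shard_key)) % len(self._nodes)
        return self._nodes[i]

class CachedMap:
    """The extension's cached map; rejects stale updates (ESTALE semantics)."""

    def __init__(self, pmap):
        self.pmap = pmap

    def install(self, new_map):
        if new_map.epoch < self.pmap.epoch:
            raise RuntimeError("ESTALE: map signed by an old leader epoch")
        if new_map.epoch == self.pmap.epoch and new_map.version <= self.pmap.version:
            return False          # not newer: ignore
        self.pmap = new_map
        return True

cache = CachedMap(PartitionMap(["n1", "n2", "n3"], version=7, epoch=3))
owner = cache.pmap.route("collection:items:shard:12")        # deterministic
cache.install(PartitionMap(["n1", "n2"], version=8, epoch=3))  # accepted
# installing a map with epoch=2 would raise ESTALE and force a re-fetch
```

Virtual nodes keep the key space evenly spread, so a node join or leave only remaps the slices adjacent to its vnodes.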
---
## Consistency Contract
### Heap-Engine Relationship
```
+------------------------------------------------------------------+
| CONSISTENCY CONTRACT |
+------------------------------------------------------------------+
| |
| PostgreSQL Heap is AUTHORITATIVE for: |
| • Row existence and visibility (MVCC xmin/xmax) |
| • Transaction commit status |
| • Data integrity constraints |
| |
| RuVector Engine Index is EVENTUALLY CONSISTENT: |
| • Bounded lag window (configurable, default 100ms) |
| • Never returns invisible tuples (heap recheck) |
| • Never resurrects deleted vectors |
| |
| v2 HYBRID MODEL: |
| • SYNCHRONOUS: Hot tier mutations, primary HNSW inserts |
| • ASYNCHRONOUS: Compaction, tier moves, graph maintenance |
| |
+------------------------------------------------------------------+
```
See [10-consistency-replication.md](10-consistency-replication.md) for full specification.
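The "never returns invisible tuples" guarantee amounts to a recheck filter at the boundary: the engine proposes candidates, the heap disposes. A toy sketch, where `heap_visible` stands in for the executor's MVCC visibility check:

```python
def recheck_visibility(engine_hits, heap_visible):
    """Drop candidates the (authoritative) heap says are not visible.

    engine_hits: (tid, distance) pairs from the eventually consistent index;
    heap_visible: MVCC visibility predicate (an assumption in this sketch).
    Deleted-but-not-yet-pruned vectors are filtered out, never resurrected.
    """
    return [(tid, d) for tid, d in engine_hits if heap_visible(tid)]

live = {101, 103}
hits = [(101, 0.12), (102, 0.15), (103, 0.19)]  # 102 was deleted moments ago
visible = recheck_visibility(hits, live.__contains__)
# visible == [(101, 0.12), (103, 0.19)]
```

Because filtering can shrink the result set, a real scan over-fetches from the engine so that k visible tuples survive the recheck.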
---
## Performance Targets
| Metric | Target | Notes |
|--------|--------|-------|
| Query latency (p50) | < 5ms | 1M vectors, top-10 |
| Query latency (p99) | < 20ms | 1M vectors, top-10 |
| Insert throughput | > 10K/sec | Bulk mode |
| Index build | < 30min | 10M 768-dim vectors |
| Recall@10 | > 95% | HNSW default params |
| Compression ratio | 4-32x | Tier-dependent |
| Memory overhead | < 2x | Compared to pgvector |
### Benchmark Specification
Performance targets must be validated against a defined benchmark suite:
```
+------------------------------------------------------------------+
| BENCHMARK SPECIFICATION |
+------------------------------------------------------------------+
VECTOR CONFIGURATIONS:
• Dimensions: 768 (typical text embeddings), 1536 (large embedding models)
• Row counts: 1M, 10M, 100M
• Data type: float32
QUERY PATTERNS:
• Pure vector search (no filter)
• Vector + metadata filter (10% selectivity)
• Vector + metadata filter (1% selectivity)
• Batch query (100 queries)
HARDWARE BASELINE:
• CPU: 8 cores (AMD EPYC or Intel Xeon)
• RAM: 64GB
• Storage: NVMe SSD (3GB/s read)
• Single node, no replication
CONCURRENCY:
• Single thread baseline
• 8 concurrent queries (parallel)
• 32 concurrent queries (stress)
RECALL MEASUREMENT:
• Brute-force baseline on 10K sampled queries
• Report recall@1, recall@10, recall@100
• Calculate 95th percentile recall
INDEX CONFIGURATIONS:
• HNSW: M=16, ef_construction=200, ef_search=100
• IVFFlat: nlist=sqrt(N), nprobe=10
TIER-SPECIFIC TARGETS:
• Hot tier: exact float32, recall > 98%
• Warm tier: exact or float16, recall > 96%
• Cool tier: approximate + rerank, recall > 94%
• Cold tier: approximate only, recall > 90%
+------------------------------------------------------------------+
```
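The recall figures above are measured against a brute-force baseline; the calculation itself is simple and worth pinning down:

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k that the index actually returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

# Index returned 9 of the true top-10:
r = recall_at_k([1, 2, 3, 4, 5, 6, 7, 8, 9, 42], list(range(1, 11)), k=10)
# r == 0.9
```

The benchmark averages this over the 10K sampled queries and additionally reports the 95th percentile, since mean recall can hide badly served query regions.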
---
## Security Considerations
### Integrity Event Signing
All integrity state changes are cryptographically signed:
```rust
struct IntegrityEvent {
timestamp: DateTime<Utc>,
event_type: IntegrityEventType,
previous_state: IntegrityState,
new_state: IntegrityState,
lambda_cut: f64,
witness_edges: Vec<EdgeId>,
signature: Ed25519Signature,
}
```
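To show how a tamper-evident event log fits together, here is a toy that chains events with SHA-256 hashes. This is a stand-in, not the real scheme: the actual design signs each event with Ed25519 (per the struct above), which additionally proves *who* wrote the event, not just that history was not edited:

```python
import hashlib
import json

def append_event(log, event):
    """Append an integrity event, chaining it to the previous entry's hash."""
    prev = log[-1]["hash"] if log else "0" * 64
    body = dict(event, prev_hash=prev)
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append({**body, "hash": digest})

def verify_chain(log):
    """Recompute every hash; any edit to an earlier event breaks the chain."""
    prev = "0" * 64
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if body.get("prev_hash") != prev:
            return False
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if digest != entry["hash"]:
            return False
        prev = entry["hash"]
    return True

log = []
append_event(log, {"type": "state_change", "from": "normal",
                   "to": "stress", "lambda_cut": 5.2})
append_event(log, {"type": "state_change", "from": "stress",
                   "to": "normal", "lambda_cut": 9.1})
assert verify_chain(log)
log[0]["lambda_cut"] = 99.0      # tamper with history...
assert not verify_chain(log)     # ...and verification fails
```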
### Access Control
- Leverages PostgreSQL GRANT/REVOKE
- Separate roles for:
- `ruvector_admin`: Full access
- `ruvector_operator`: Maintenance operations
- `ruvector_user`: Query and insert only
### Audit Trail
- All administrative operations logged
- Integrity events stored in `ruvector_integrity_events`
- Optional export to external SIEM
---
## Implementation Roadmap
### Phase 1: Foundation (Weeks 1-4)
- [ ] Extension skeleton with pgrx
- [ ] Collection metadata tables
- [ ] Basic HNSW integration
- [ ] pgvector compatibility tests
- [ ] Recall/performance benchmarks
### Phase 2: Tiered Storage (Weeks 5-8)
- [ ] Access counter infrastructure
- [ ] Tier policy table
- [ ] Background compactor
- [ ] Compression integration
- [ ] Tier report functions
### Phase 3: Graph & Cypher (Weeks 9-12)
- [ ] Graph storage schema
- [ ] Cypher parser integration
- [ ] Relational bridge views
- [ ] SQL-graph join helpers
- [ ] Graph maintenance
### Phase 4: Integrity Control (Weeks 13-16)
- [ ] Contracted graph construction
- [ ] Lambda cut computation
- [ ] Policy application layer
- [ ] Signed audit events
- [ ] Control plane testing
---
## Dependencies
### Rust Crates
| Crate | Purpose |
|-------|---------|
| `pgrx` | PostgreSQL extension framework |
| `parking_lot` | Fast synchronization primitives |
| `crossbeam` | Lock-free data structures |
| `serde` | Serialization |
| `ed25519-dalek` | Signature verification |
### PostgreSQL Features
| Feature | Minimum Version |
|---------|-----------------|
| Background workers | 9.4+ |
| Custom access methods | 9.6+ |
| Parallel query | 9.6+ |
| Logical replication | 10+ |
| Partitioning | 10+ (native) |
---
## Related Documents
| Document | Description |
|----------|-------------|
| [01-sql-schema.md](01-sql-schema.md) | Complete SQL schema |
| [02-background-workers.md](02-background-workers.md) | Worker specifications with IPC contract |
| [03-index-access-methods.md](03-index-access-methods.md) | Index AM details |
| [04-integrity-events.md](04-integrity-events.md) | Event schema, policies, hysteresis, operation classes |
| [05-phase1-pgvector-compat.md](05-phase1-pgvector-compat.md) | Phase 1 specification with incremental AM path |
| [06-phase2-tiered-storage.md](06-phase2-tiered-storage.md) | Phase 2 specification with tier exactness modes |
| [07-phase3-graph-cypher.md](07-phase3-graph-cypher.md) | Phase 3 specification with SQL join keys |
| [08-phase4-integrity-control.md](08-phase4-integrity-control.md) | Phase 4 specification (mincut + λ₂) |
| [09-migration-guide.md](09-migration-guide.md) | pgvector migration |
| [10-consistency-replication.md](10-consistency-replication.md) | Consistency contract, MVCC, WAL, recovery |

---
# RuVector Postgres v2 - Migration Guide
## Overview
This guide provides step-by-step instructions for migrating from pgvector to RuVector Postgres v2. The migration is designed to be **non-disruptive** with zero data loss and minimal downtime.
---
## Migration Approaches
### Approach 1: In-Place Extension Swap (Recommended)
Swap the extension while keeping data in place. Fastest with zero data copy.
**Downtime**: < 5 minutes
**Risk**: Low
### Approach 2: Parallel Run with Gradual Cutover
Run both extensions simultaneously, gradually shifting traffic.
**Downtime**: Zero
**Risk**: Very Low
### Approach 3: Full Data Migration
Export and re-import all data. Use when changing schema significantly.
**Downtime**: Proportional to data size
**Risk**: Medium
---
## Pre-Migration Checklist
### 1. Verify Compatibility
```sql
-- Check pgvector version
SELECT extversion FROM pg_extension WHERE extname = 'vector';
-- Check PostgreSQL version (RuVector requires 14+)
SELECT version();
-- Count vectors and indexes
SELECT
relname AS table_name,
pg_size_pretty(pg_relation_size(c.oid)) AS size,
c.reltuples::bigint AS approx_rows  -- planner estimate; run ANALYZE first for accuracy
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
AND EXISTS (
SELECT 1 FROM pg_attribute a
JOIN pg_type t ON a.atttypid = t.oid
WHERE a.attrelid = c.oid AND t.typname = 'vector'
);
-- List vector indexes
SELECT
i.relname AS index_name,
t.relname AS table_name,
am.amname AS index_type,
pg_size_pretty(pg_relation_size(i.oid)) AS size
FROM pg_index ix
JOIN pg_class i ON ix.indexrelid = i.oid
JOIN pg_class t ON ix.indrelid = t.oid
JOIN pg_am am ON i.relam = am.oid
WHERE am.amname IN ('hnsw', 'ivfflat');
```
### 2. Backup
```bash
# Full database backup
pg_dump -Fc -f backup_before_migration.dump mydb
# Or just schema with vector data
pg_dump -Fc --table='*embedding*' -f vector_tables.dump mydb
```
### 3. Test Environment
```bash
# Restore to test environment
createdb mydb_test
pg_restore -d mydb_test backup_before_migration.dump
# Install RuVector extension for testing
psql mydb_test -c "CREATE EXTENSION ruvector"
```
---
## Approach 1: In-Place Extension Swap
### Step 1: Install RuVector Extension
```bash
# Install RuVector package
# Option A: From source
cd ruvector-postgres
cargo pgrx install --release
# Option B: From package (when available)
apt install postgresql-16-ruvector
```
### Step 2: Stop Application Writes
```sql
-- Optional: Put tables in read-only mode
BEGIN;
LOCK TABLE items IN EXCLUSIVE MODE;
-- Keep transaction open to block writes
```
### Step 3: Drop pgvector Indexes
```sql
-- Save index definitions for recreation
SELECT indexdef
FROM pg_indexes
WHERE indexname IN (
SELECT i.relname
FROM pg_index ix
JOIN pg_class i ON ix.indexrelid = i.oid
JOIN pg_am am ON i.relam = am.oid
WHERE am.amname IN ('hnsw', 'ivfflat')
);
-- Drop indexes (the DDL was captured by the query above)
DO $$
DECLARE
idx RECORD;
BEGIN
FOR idx IN
SELECT i.relname AS index_name
FROM pg_index ix
JOIN pg_class i ON ix.indexrelid = i.oid
JOIN pg_am am ON i.relam = am.oid
WHERE am.amname IN ('hnsw', 'ivfflat')
LOOP
EXECUTE format('DROP INDEX IF EXISTS %I', idx.index_name);
END LOOP;
END $$;
```
### Step 4: Swap Extensions
```sql
-- Drop pgvector
DROP EXTENSION vector CASCADE;
-- Create RuVector
CREATE EXTENSION ruvector;
```
### Step 5: Recreate Indexes
```sql
-- Recreate HNSW index (same syntax)
CREATE INDEX idx_items_embedding ON items
USING hnsw (embedding vector_l2_ops)
WITH (m = 16, ef_construction = 64);
-- Index syntax is identical to pgvector; query-time behavior
-- (e.g. SET ruvector.ef_search) is tuned via GUCs, not WITH options.
```
### Step 6: Verify
```sql
-- Check extension
SELECT * FROM pg_extension WHERE extname = 'ruvector';
-- Test query
EXPLAIN ANALYZE
SELECT id, embedding <-> '[0.1, 0.2, ...]' AS distance
FROM items
ORDER BY embedding <-> '[0.1, 0.2, ...]'
LIMIT 10;
-- Compare recall (optional)
-- Run same query with and without index
SET enable_indexscan = off;
-- Query without index (exact)
SET enable_indexscan = on;
-- Query with index (approximate)
```
### Step 7: Resume Application
```sql
-- Release lock
ROLLBACK; -- If you started a transaction for locking
```
---
## Approach 2: Parallel Run
### Step 1: Install RuVector (Different Schema)
```sql
-- Create schema for RuVector
CREATE SCHEMA ruvector_new;
-- Install RuVector in new schema
CREATE EXTENSION ruvector WITH SCHEMA ruvector_new;
```
### Step 2: Create Shadow Tables
```sql
-- Create shadow table with same structure
CREATE TABLE ruvector_new.items AS
SELECT * FROM items WHERE false;
-- Add vector column using RuVector type
ALTER TABLE ruvector_new.items
ALTER COLUMN embedding TYPE ruvector_new.vector(768);
-- Copy data
INSERT INTO ruvector_new.items
SELECT * FROM items;
-- Create index
CREATE INDEX ON ruvector_new.items
USING hnsw (embedding ruvector_new.vector_l2_ops)
WITH (m = 16, ef_construction = 64);
```
### Step 3: Set Up Triggers for Sync
```sql
-- Sync inserts
CREATE OR REPLACE FUNCTION sync_to_ruvector()
RETURNS TRIGGER AS $$
BEGIN
INSERT INTO ruvector_new.items VALUES (NEW.*);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_sync_insert
AFTER INSERT ON items
FOR EACH ROW EXECUTE FUNCTION sync_to_ruvector();
-- Sync updates (delete-and-reinsert keeps the shadow row current;
-- assumes id is the primary key)
CREATE OR REPLACE FUNCTION sync_to_ruvector_update()
RETURNS TRIGGER AS $$
BEGIN
    DELETE FROM ruvector_new.items WHERE id = OLD.id;
    INSERT INTO ruvector_new.items VALUES (NEW.*);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_sync_update
AFTER UPDATE ON items
FOR EACH ROW EXECUTE FUNCTION sync_to_ruvector_update();
-- Sync deletes
CREATE OR REPLACE FUNCTION sync_to_ruvector_delete()
RETURNS TRIGGER AS $$
BEGIN
    DELETE FROM ruvector_new.items WHERE id = OLD.id;
    RETURN OLD;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_sync_delete
AFTER DELETE ON items
FOR EACH ROW EXECUTE FUNCTION sync_to_ruvector_delete();
```
### Step 4: Gradual Cutover
```python
# Application code with gradual cutover
import random
def search_embeddings(query_vector, use_ruvector_pct=0):
"""
Gradually shift traffic to RuVector.
Start with 0%, increase to 100% over time.
"""
if random.random() * 100 < use_ruvector_pct:
# Use RuVector
return db.execute("""
SELECT id, embedding <-> %s AS distance
FROM ruvector_new.items
ORDER BY embedding <-> %s
LIMIT 10
""", [query_vector, query_vector])
else:
# Use pgvector
return db.execute("""
SELECT id, embedding <-> %s AS distance
FROM items
ORDER BY embedding <-> %s
LIMIT 10
""", [query_vector, query_vector])
```
### Step 5: Complete Migration
Once 100% traffic on RuVector with no issues:
```sql
-- Swap the RuVector table into place
ALTER TABLE items RENAME TO items_pgvector_backup;
ALTER TABLE ruvector_new.items SET SCHEMA public;
-- Clean up sync triggers (CASCADE drops the triggers with their functions)
DROP FUNCTION sync_to_ruvector() CASCADE;
DROP FUNCTION sync_to_ruvector_update() CASCADE;
DROP FUNCTION sync_to_ruvector_delete() CASCADE;
-- Drop the backup table before the extension that defines its vector type
DROP TABLE items_pgvector_backup;
DROP EXTENSION vector CASCADE;
```
---
## Approach 3: Full Data Migration
### Step 1: Export Data
```sql
-- Export to CSV
\copy (SELECT id, embedding::text, metadata FROM items) TO 'items_export.csv' CSV;
-- Or to binary format
\copy items TO 'items_export.bin' BINARY;
```
### Step 2: Switch Extensions
```sql
DROP EXTENSION vector CASCADE;
CREATE EXTENSION ruvector;
```
### Step 3: Recreate Tables
```sql
-- Recreate with RuVector type
CREATE TABLE items (
id SERIAL PRIMARY KEY,
embedding vector(768),
metadata JSONB
);
-- Import data
\copy items FROM 'items_export.csv' CSV;
-- Advance the id sequence past the imported keys
SELECT setval(pg_get_serial_sequence('items', 'id'), (SELECT max(id) FROM items));
-- Create index
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);
```
---
## Query Compatibility Reference
### Identical Syntax (No Changes Needed)
```sql
-- Vector type declaration
CREATE TABLE items (embedding vector(768));
-- Distance operators
SELECT * FROM items ORDER BY embedding <-> query LIMIT 10; -- L2
SELECT * FROM items ORDER BY embedding <=> query LIMIT 10; -- Cosine
SELECT * FROM items ORDER BY embedding <#> query LIMIT 10; -- Inner product
-- Index creation
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);
CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops);
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);
-- Operator classes
vector_l2_ops
vector_cosine_ops
vector_ip_ops
-- Utility functions
SELECT vector_dims(embedding) FROM items LIMIT 1;
SELECT vector_norm(embedding) FROM items LIMIT 1;
```
### Extended Syntax (RuVector Only)
```sql
-- New distance operators
SELECT * FROM items ORDER BY embedding <+> query LIMIT 10; -- L1/Manhattan
-- Collection registration
SELECT ruvector_register_collection(
'my_embeddings',
'public',
'items',
'embedding',
768,
'l2'
);
-- Advanced search options
SELECT * FROM ruvector_search(
'my_embeddings',
query_vector,
10, -- k
100, -- ef_search
FALSE, -- use_gnn
'{"category": "electronics"}' -- filter
);
-- Tiered storage
SELECT ruvector_set_tiers('my_embeddings', 24, 168, 720);
SELECT ruvector_tier_report('my_embeddings');
-- Graph integration
SELECT ruvector_graph_create('knowledge_graph');
SELECT ruvector_cypher('knowledge_graph', 'MATCH (n) RETURN n LIMIT 10');
-- Integrity monitoring
SELECT ruvector_integrity_status('my_embeddings');
```
---
## GUC Parameter Mapping
| pgvector | RuVector | Notes |
|----------|----------|-------|
| `ivfflat.probes` | `ruvector.probes` | Same behavior |
| `hnsw.ef_search` | `ruvector.ef_search` | Same behavior |
| N/A | `ruvector.use_simd` | Enable/disable SIMD |
| N/A | `ruvector.max_index_memory` | Memory limit |
```sql
-- Set runtime parameters (same syntax)
SET ruvector.ef_search = 100;
SET ruvector.probes = 10;
```
---
## Common Migration Issues
### Issue 1: Type Mismatch After Migration
```sql
-- Error: operator does not exist: ruvector.vector <-> public.vector
-- Solution: Ensure all tables use the new type
SELECT
c.relname AS table_name,
a.attname AS column_name,
t.typname AS type_name,
n.nspname AS type_schema
FROM pg_attribute a
JOIN pg_class c ON a.attrelid = c.oid
JOIN pg_type t ON a.atttypid = t.oid
JOIN pg_namespace n ON t.typnamespace = n.oid
WHERE t.typname = 'vector';
-- Fix by recreating column
ALTER TABLE items ALTER COLUMN embedding TYPE ruvector.vector(768);
```
### Issue 2: Index Not Using RuVector AM
```sql
-- Check which AM is being used
SELECT
i.relname AS index_name,
am.amname AS access_method
FROM pg_index ix
JOIN pg_class i ON ix.indexrelid = i.oid
JOIN pg_am am ON i.relam = am.oid;
-- Rebuild index with correct AM
DROP INDEX old_index;
CREATE INDEX new_index ON items USING hnsw (embedding vector_l2_ops);
```
### Issue 3: Different Recall/Performance
```sql
-- RuVector may have different default parameters
-- Adjust ef_search for recall
SET ruvector.ef_search = 200; -- Higher for better recall
-- Check actual ef being used
EXPLAIN (ANALYZE, VERBOSE)
SELECT * FROM items ORDER BY embedding <-> query LIMIT 10;
```
### Issue 4: Extension Dependencies
```sql
-- Check what depends on vector extension
SELECT
dependent.relname AS dependent_object,
dependent.relkind AS object_type
FROM pg_depend d
JOIN pg_extension e ON d.refobjid = e.oid
JOIN pg_class dependent ON d.objid = dependent.oid
WHERE e.extname = 'vector';
-- May need to drop dependent objects first
```
---
## Rollback Procedure
If migration fails, rollback to pgvector:
```bash
# Restore from backup
pg_restore -d mydb --clean backup_before_migration.dump
# Or manually:
```
```sql
-- Drop RuVector
DROP EXTENSION ruvector CASCADE;
-- Reinstall pgvector
CREATE EXTENSION vector;
-- Restore schema (from saved DDL)
-- Recreate indexes (from saved DDL)
```
---
## Performance Validation
### Compare Query Performance
```python
import time
import psycopg2
import numpy as np
def benchmark_extension(conn, query_vector, n_queries=100):
"""Benchmark query latency"""
latencies = []
for _ in range(n_queries):
start = time.time()
with conn.cursor() as cur:
cur.execute("""
SELECT id, embedding <-> %s AS distance
FROM items
ORDER BY embedding <-> %s
LIMIT 10
""", [query_vector, query_vector])
cur.fetchall()
latencies.append((time.time() - start) * 1000)
return {
'p50': np.percentile(latencies, 50),
'p95': np.percentile(latencies, 95),
'p99': np.percentile(latencies, 99),
'mean': np.mean(latencies),
}
# Run before migration (pgvector)
pgvector_results = benchmark_extension(conn, query_vec)
# Run after migration (RuVector)
ruvector_results = benchmark_extension(conn, query_vec)
print(f"pgvector p50: {pgvector_results['p50']:.2f}ms")
print(f"RuVector p50: {ruvector_results['p50']:.2f}ms")
```
### Compare Recall
```python
def measure_recall(conn, query_vectors, k=10):
"""Measure recall@k against brute force"""
recalls = []
for query in query_vectors:
# Index scan result
with conn.cursor() as cur:
cur.execute("""
SELECT id FROM items
ORDER BY embedding <-> %s
LIMIT %s
""", [query, k])
index_results = set(row[0] for row in cur.fetchall())
# Brute force (disable index)
with conn.cursor() as cur:
cur.execute("SET enable_indexscan = off")
cur.execute("""
SELECT id FROM items
ORDER BY embedding <-> %s
LIMIT %s
""", [query, k])
exact_results = set(row[0] for row in cur.fetchall())
cur.execute("SET enable_indexscan = on")
recall = len(index_results & exact_results) / k
recalls.append(recall)
return np.mean(recalls)
```
---
## Post-Migration Steps
### 1. Register Collections (Optional but Recommended)
```sql
-- Register for RuVector-specific features
SELECT ruvector_register_collection(
'items_embeddings',
'public',
'items',
'embedding',
768,
'l2'
);
```
### 2. Enable Tiered Storage (Optional)
```sql
-- Configure tiers
SELECT ruvector_set_tiers('items_embeddings', 24, 168, 720);
```
### 3. Set Up Integrity Monitoring (Optional)
```sql
-- Enable integrity monitoring
SELECT ruvector_integrity_policy_set('items_embeddings', 'default', '{
"threshold_high": 0.8,
"threshold_low": 0.3
}'::jsonb);
```
### 4. Update Application Code
```python
# Minimal changes needed for basic operations
# No change needed:
cursor.execute("SELECT * FROM items ORDER BY embedding <-> %s LIMIT 10", [vec])
# Optional: Use new features
cursor.execute("SELECT * FROM ruvector_search('items_embeddings', %s, 10)", [vec])
```
---
## Support
- GitHub Issues: https://github.com/ruvnet/ruvector/issues
- Documentation: https://ruvector.dev/docs
- Migration Support: migration@ruvector.dev

# RuVector Postgres v2 - Consistency and Replication Model
## Overview
This document specifies the consistency contract between PostgreSQL heap tuples and the RuVector engine, MVCC interaction, WAL and logical decoding strategy, crash recovery, replay order, and idempotency guarantees.
---
## Core Consistency Contract
### Authoritative Source of Truth
```
+------------------------------------------------------------------+
| CONSISTENCY HIERARCHY |
+------------------------------------------------------------------+
| |
| 1. PostgreSQL Heap is AUTHORITATIVE for: |
| - Row existence |
| - Visibility rules (MVCC xmin/xmax) |
| - Transaction commit status |
| - Data integrity constraints |
| |
| 2. RuVector Engine Index is EVENTUALLY CONSISTENT: |
| - Bounded lag window (configurable, default 100ms) |
| - Reconciled on demand |
| - Never returns invisible tuples |
| - Never resurrects deleted embeddings |
| |
+------------------------------------------------------------------+
```
### Consistency Guarantees
| Property | Guarantee | Enforcement |
|----------|-----------|-------------|
| **No phantom reads** | Index never returns invisible tuples | Heap visibility check on every result |
| **No zombie vectors** | Deleted vectors never return | Delete markers + tombstone cleanup |
| **No stale updates** | Updated vectors show new values | Version-aware index entries |
| **Bounded staleness** | Max lag from commit to searchable | Configurable, default 100ms |
| **Crash consistency** | Recoverable to last WAL checkpoint | WAL-based recovery |
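The bounded-staleness guarantee above can be sketched from the client's perspective (a minimal illustration only; `FakeEngine`, `reconcile`, and the lag value are stand-ins, not the RuVector API — the real lag window is enforced inside the engine):

```python
MAX_LAG_MS = 100  # default bounded-staleness window from the table above

def search_with_staleness_bound(engine, query, k, current_lag_ms):
    """Serve from the index while lag is within bound; otherwise
    force an on-demand reconcile first so results honor the contract."""
    if current_lag_ms > MAX_LAG_MS:
        engine.reconcile()  # "reconciled on demand"
    return engine.search(query, k)

class FakeEngine:
    def __init__(self):
        self.reconciled = False
    def reconcile(self):
        self.reconciled = True
    def search(self, query, k):
        return ["result"] * k

e = FakeEngine()
search_with_staleness_bound(e, [0.1], 3, current_lag_ms=250)
print(e.reconciled)  # True: lag exceeded the 100ms window
```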
---
## Consistency Mechanisms
### Option A: Synchronous Index Maintenance
```
INSERT/UPDATE Transaction:
+------------------------------------------------------------------+
| |
| 1. BEGIN |
| 2. Write heap tuple |
| 3. Call engine (synchronous) |
| └─ If engine rejects → ROLLBACK |
| 4. Append to WAL |
| 5. COMMIT |
| |
+------------------------------------------------------------------+
Pros:
- Strongest consistency
- Simple mental model
- No reconciliation needed
Cons:
- Higher latency per operation
- Engine failure blocks writes
- Reduces write throughput
```
### Option B: Asynchronous Maintenance with Reconciliation
```
INSERT/UPDATE Transaction:
+------------------------------------------------------------------+
| |
| 1. BEGIN |
| 2. Write heap tuple |
| 3. Write to change log table OR trigger logical decoding |
| 4. Append to WAL |
| 5. COMMIT |
| |
| Background (continuous): |
| 6. Engine reads change log / logical replication stream |
| 7. Applies changes to index |
| 8. Index scan checks heap visibility for every result |
| |
+------------------------------------------------------------------+
Pros:
- Lower write latency
- Engine failure doesn't block writes
- Higher throughput
Cons:
- Bounded staleness window
- Requires visibility rechecks
- More complex recovery
```
### v2 Hybrid Model (Recommended)
```
+------------------------------------------------------------------+
| v2 HYBRID CONSISTENCY MODEL |
+------------------------------------------------------------------+
| |
| SYNCHRONOUS (Hot Tier): |
| - Primary HNSW index mutations |
| - Hot tier inserts/updates |
| - Visibility-critical operations |
| |
| ASYNCHRONOUS (Background): |
| - Compaction and tier moves |
| - Graph edge maintenance |
| - GNN training data capture |
| - Cold tier updates |
| - Index optimization/rewiring |
| |
+------------------------------------------------------------------+
```
---
## Implementation Details
### Visibility Check Protocol
```rust
/// Check heap visibility for index results
pub fn check_visibility(
snapshot: &Snapshot,
results: &[IndexResult],
) -> Vec<IndexResult> {
results.iter()
.filter(|r| {
// Fetch heap tuple header
let htup = heap_fetch_tuple_header(r.tid);
// Check MVCC visibility
htup.map_or(false, |h| {
heap_tuple_satisfies_snapshot(h, snapshot)
})
})
.cloned()
.collect()
}
/// Index scan must always recheck heap
impl IndexScan {
fn next(&mut self) -> Option<HeapTuple> {
loop {
// Get next candidate from index
let candidate = self.index.next()?;
// CRITICAL: Always verify against heap
if let Some(tuple) = self.heap_fetch_visible(candidate.tid) {
return Some(tuple);
}
// Invisible tuple, try next
}
}
}
```
### Incremental Candidate Paging API
The engine must support incremental candidate paging so the executor can skip MVCC-invisible rows and request more until k visible results are produced.
```rust
/// Search request with cursor support for incremental paging
#[derive(Debug)]
pub struct SearchRequest {
pub collection_id: i32,
pub query: Vec<f32>,
pub want_k: usize, // Desired visible results
pub cursor: Option<Cursor>, // Resume from previous batch
pub max_candidates: usize, // Max to return per batch (default: want_k * 2)
}
/// Search response with cursor for pagination
#[derive(Debug)]
pub struct SearchResponse {
pub candidates: Vec<Candidate>,
pub cursor: Option<Cursor>, // None if exhausted
pub total_scanned: usize,
}
/// Cursor token for resuming search
#[derive(Debug, Clone)]
pub struct Cursor {
pub ef_search_position: usize,
pub last_distance: f32,
pub visited_count: usize,
}
/// Engine returns batches with cursor tokens
impl Engine {
pub fn search_batch(&self, req: SearchRequest) -> SearchResponse {
let start_pos = req.cursor.map(|c| c.ef_search_position).unwrap_or(0);
// Continue HNSW search from cursor position
let (candidates, next_pos, exhausted) = self.hnsw.search_continue(
&req.query,
req.max_candidates,
start_pos,
);
        // Compute cursor and totals before moving `candidates` into the
        // response (referencing `candidates` after the move would not compile)
        let total_scanned = start_pos + candidates.len();
        let cursor = if exhausted {
            None
        } else {
            Some(Cursor {
                ef_search_position: next_pos,
                last_distance: candidates.last().map(|c| c.distance).unwrap_or(f32::MAX),
                visited_count: total_scanned,
            })
        };
        SearchResponse { candidates, cursor, total_scanned }
}
}
/// Executor uses incremental paging
fn execute_vector_search(query: &[f32], k: usize, snapshot: &Snapshot) -> Vec<HeapTuple> {
let mut results = Vec::with_capacity(k);
let mut cursor = None;
loop {
// Request batch from engine
let response = engine.search_batch(SearchRequest {
collection_id,
query: query.to_vec(),
want_k: k - results.len(),
cursor,
max_candidates: (k - results.len()) * 2, // Over-fetch
});
// Check visibility and collect visible tuples
for candidate in response.candidates {
if let Some(tuple) = heap_fetch_visible(candidate.tid, snapshot) {
results.push(tuple);
if results.len() >= k {
return results;
}
}
}
// Check if exhausted
match response.cursor {
Some(c) => cursor = Some(c),
None => break, // No more candidates
}
}
results
}
```
### Change Log Table (Async Mode)
```sql
-- Change log for async reconciliation
CREATE TABLE ruvector._change_log (
    id BIGSERIAL PRIMARY KEY,
    collection_id INTEGER NOT NULL,
    operation CHAR(1) NOT NULL CHECK (operation IN ('I', 'U', 'D')),
    tuple_tid TID NOT NULL,
    vector_data BYTEA,  -- NULL for deletes
    tx_id BIGINT NOT NULL,  -- txid_current(); "xmin" would clash with the system column
    committed BOOLEAN DEFAULT FALSE,
    applied BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMPTZ NOT NULL DEFAULT clock_timestamp()
);
CREATE INDEX idx_change_log_pending
    ON ruvector._change_log(collection_id, id)
    WHERE NOT applied;
-- Trigger to capture changes
CREATE FUNCTION ruvector._log_change() RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO ruvector._change_log (collection_id, operation, tuple_tid, vector_data, tx_id)
        SELECT id, 'I', NEW.ctid, NEW.embedding, txid_current()
        FROM ruvector.collections WHERE table_name = TG_TABLE_NAME;
    ELSIF TG_OP = 'UPDATE' THEN
        INSERT INTO ruvector._change_log (collection_id, operation, tuple_tid, vector_data, tx_id)
        SELECT id, 'U', NEW.ctid, NEW.embedding, txid_current()
        FROM ruvector.collections WHERE table_name = TG_TABLE_NAME;
    ELSIF TG_OP = 'DELETE' THEN
        INSERT INTO ruvector._change_log (collection_id, operation, tuple_tid, vector_data, tx_id)
        SELECT id, 'D', OLD.ctid, NULL, txid_current()
        FROM ruvector.collections WHERE table_name = TG_TABLE_NAME;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;
```
### Logical Decoding (Alternative)
```rust
/// Logical decoding output plugin for RuVector
pub struct RuVectorOutputPlugin;
impl OutputPlugin for RuVectorOutputPlugin {
fn begin_txn(&mut self, xid: TransactionId) {
self.current_xid = Some(xid);
self.changes.clear();
}
fn change(&mut self, relation: &Relation, change: &Change) {
// Only process tables with vector columns
if !self.is_vector_table(relation) {
return;
}
match change {
Change::Insert(new) => {
self.changes.push(VectorChange::Insert {
tid: new.tid,
vector: extract_vector(new),
});
}
Change::Update(old, new) => {
self.changes.push(VectorChange::Update {
old_tid: old.tid,
new_tid: new.tid,
vector: extract_vector(new),
});
}
Change::Delete(old) => {
self.changes.push(VectorChange::Delete {
tid: old.tid,
});
}
}
}
fn commit_txn(&mut self, xid: TransactionId, commit_lsn: XLogRecPtr) {
// Apply all changes atomically
self.engine.apply_changes(&self.changes, commit_lsn);
}
}
```
---
## MVCC Interaction
### Transaction Visibility Rules
```rust
/// Snapshot-aware index search
pub fn search_with_snapshot(
collection_id: i32,
query: &[f32],
k: usize,
snapshot: &Snapshot,
) -> Vec<SearchResult> {
// Get more candidates than k to account for invisible tuples
let over_fetch_factor = 2.0;
let candidates = engine.search(
collection_id,
query,
(k as f32 * over_fetch_factor) as usize,
);
// Filter by visibility
let visible: Vec<_> = candidates.into_iter()
.filter(|c| is_visible(c.tid, snapshot))
.take(k)
.collect();
// If we don't have enough, fetch more
if visible.len() < k {
// Recursive fetch with larger over_fetch
return search_with_larger_pool(...);
}
visible
}
/// Check tuple visibility against snapshot
fn is_visible(tid: TupleId, snapshot: &Snapshot) -> bool {
let htup = unsafe { heap_fetch_tuple(tid) };
match htup {
Some(tuple) => {
// HeapTupleSatisfiesVisibility equivalent
let xmin = tuple.t_xmin;
let xmax = tuple.t_xmax;
// Inserted by committed transaction visible to us
            let xmin_visible = xmin < snapshot.xmax &&
                !snapshot.xip.contains(&xmin) &&
                pg_xact_status(xmin) == XACT_STATUS_COMMITTED;
// Not deleted, or deleted by transaction not visible to us
let not_deleted = xmax == InvalidTransactionId ||
snapshot.xmax <= xmax ||
snapshot.xip.contains(&xmax) ||
pg_xact_status(xmax) != XACT_STATUS_COMMITTED;
xmin_visible && not_deleted
}
None => false, // Tuple vacuumed away
}
}
```
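The rules above condense to a small, checkable predicate (a Python sketch for illustration; the snapshot is modeled as a dict and commit status as a set of committed XIDs, which are stand-ins for the real structures):

```python
# Minimal MVCC visibility sketch. Assumptions: xmin/xmax are plain ints,
# commit status is a lookup in a `committed` set, xmax == 0 means never deleted.
INVALID_XID = 0

def is_visible(xmin, xmax, snapshot, committed):
    """Return True if a tuple (xmin, xmax) is visible under `snapshot`."""
    # Inserting txn must be committed, below the snapshot's xmax horizon,
    # and not in the snapshot's in-progress list.
    xmin_visible = (
        xmin < snapshot["xmax"]
        and xmin not in snapshot["xip"]
        and xmin in committed
    )
    # Tuple is "not deleted" if there is no deleter, or the deleting txn
    # is not visible to us (future, in progress, or aborted).
    not_deleted = (
        xmax == INVALID_XID
        or xmax >= snapshot["xmax"]
        or xmax in snapshot["xip"]
        or xmax not in committed
    )
    return xmin_visible and not_deleted

snap = {"xmin": 100, "xmax": 110, "xip": {105}}
committed = {95, 102, 104} - {104}  # 104 aborted
committed = {95, 102, 105}          # 105 committed after snapshot was taken

print(is_visible(102, 0, snap, committed))    # True: visible insert, no delete
print(is_visible(105, 0, snap, committed))    # False: in progress at snapshot
print(is_visible(102, 104, snap, committed))  # True: deleter 104 aborted
```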
### HOT Update Handling
```rust
/// Handle Heap-Only Tuple updates
pub fn handle_hot_update(old_tid: TupleId, new_tid: TupleId, new_vector: &[f32]) {
// HOT updates may change ctid without changing embedding
if vectors_equal(get_vector(old_tid), new_vector) {
// Only ctid changed, update TID mapping
engine.update_tid_mapping(old_tid, new_tid);
} else {
// Vector changed, full update needed
engine.delete(old_tid);
engine.insert(new_tid, new_vector);
}
}
```
---
## WAL and Recovery
### WAL Record Types
```rust
/// Custom WAL record types for RuVector
#[repr(u8)]
pub enum RuVectorWalRecord {
/// Vector inserted into index
IndexInsert = 0x10,
/// Vector deleted from index
IndexDelete = 0x11,
/// Index page split
IndexSplit = 0x12,
/// HNSW edge added
HnswEdgeAdd = 0x20,
/// HNSW edge removed
HnswEdgeRemove = 0x21,
/// Tier change
TierChange = 0x30,
/// Integrity state change
IntegrityChange = 0x40,
}
impl RuVectorWalRecord {
/// Write WAL record
pub fn write(&self, data: &[u8]) -> XLogRecPtr {
unsafe {
let rdata = XLogRecData {
data: data.as_ptr() as *mut c_char,
len: data.len() as u32,
next: std::ptr::null_mut(),
};
XLogInsert(RM_RUVECTOR_ID, self.to_u8(), &rdata)
}
}
}
```
### Crash Recovery
```rust
/// Redo function for crash recovery
pub extern "C" fn ruvector_redo(record: *mut XLogReaderState) {
let info = unsafe { (*record).decoded_record.as_ref() };
match RuVectorWalRecord::from_u8(info.xl_info) {
Some(RuVectorWalRecord::IndexInsert) => {
let insert_data: IndexInsertData = deserialize(info.data);
engine.redo_insert(insert_data);
}
Some(RuVectorWalRecord::IndexDelete) => {
let delete_data: IndexDeleteData = deserialize(info.data);
engine.redo_delete(delete_data);
}
Some(RuVectorWalRecord::HnswEdgeAdd) => {
let edge_data: HnswEdgeData = deserialize(info.data);
engine.redo_edge_add(edge_data);
}
// ... other record types
_ => {
pgrx::warning!("Unknown RuVector WAL record type");
}
}
}
/// Startup recovery sequence
pub fn startup_recovery() {
pgrx::log!("RuVector: Starting crash recovery");
// 1. Load last consistent checkpoint
let checkpoint = load_checkpoint();
// 2. Rebuild in-memory structures
engine.load_from_checkpoint(&checkpoint);
// 3. Replay WAL from checkpoint
let wal_reader = WalReader::from_lsn(checkpoint.redo_lsn);
for record in wal_reader {
ruvector_redo(&record);
}
// 4. Reconcile with heap if needed
if checkpoint.needs_reconciliation {
reconcile_with_heap();
}
pgrx::log!("RuVector: Recovery complete");
}
```
### Replay Order Guarantees
```
WAL Replay Order Contract:
+------------------------------------------------------------------+
| |
| 1. WAL records replayed in LSN order (guaranteed by PostgreSQL) |
| |
| 2. Within a transaction: |
| - Heap insert before index insert |
| - Index delete before heap delete (for visibility) |
| |
| 3. Cross-transaction: |
| - Commit order preserved |
| - Visibility respects commit timestamps |
| |
| 4. Recovery invariant: |
| - After recovery, index matches committed heap state |
| - No uncommitted changes in index |
| |
+------------------------------------------------------------------+
```
---
## Idempotency and Ordering Rules
**CRITICAL**: If WAL is truth, these invariants prevent "eventual corruption".
### Explicit Replay Rules
```
+------------------------------------------------------------------+
| ENGINE REPLAY INVARIANTS |
+------------------------------------------------------------------+
RULE 1: Apply operations in LSN order
- Each operation carries its source LSN
- Engine rejects out-of-order operations
- Crash recovery replays from last checkpoint LSN
RULE 2: Store last applied LSN per collection
- Persisted in ruvector.collection_state.last_applied_lsn
- Updated atomically after each operation
- Skip operations with LSN <= last_applied_lsn
RULE 3: Delete wins over insert for same TID
- If TID inserted then deleted, final state is deleted
- Replay order handles this naturally if LSN-ordered
- Edge case: TID reuse after VACUUM requires checking xmin
RULE 4: Update = Delete + Insert
- Updates decompose to delete old, insert new
- Both carry same transaction LSN
- Applied atomically
RULE 5: Rollback handling
- Uncommitted operations not in WAL (crash safe)
- For explicit ROLLBACK during runtime:
- Synchronous mode: engine notified, reverts in-memory state
- Async mode: change log entry marked rollback, skipped on apply
+------------------------------------------------------------------+
```
### Conflict Resolution
```rust
/// Handle conflicts during replay
pub fn apply_with_conflict_resolution(
&mut self,
op: WalOperation,
) -> Result<(), ReplayError> {
// Check LSN ordering
let last_lsn = self.lsn_tracker.get(op.collection_id);
if op.lsn <= last_lsn {
// Already applied, skip (idempotent)
return Ok(());
}
match op.kind {
OpKind::Insert { tid, vector } => {
if self.index.contains_tid(tid) {
// TID exists - check if this is TID reuse after VACUUM
let existing_lsn = self.index.get_lsn(tid);
if op.lsn > existing_lsn {
// Newer insert wins - delete old, insert new
self.index.delete(tid);
self.index.insert(tid, &vector, op.lsn);
}
// else: stale insert, skip
} else {
self.index.insert(tid, &vector, op.lsn);
}
}
OpKind::Delete { tid } => {
// Delete always wins if LSN is newer
if self.index.contains_tid(tid) {
let existing_lsn = self.index.get_lsn(tid);
if op.lsn > existing_lsn {
self.index.delete(tid);
}
}
// If not present, already deleted - idempotent
}
OpKind::Update { old_tid, new_tid, vector } => {
// Atomic delete + insert
self.index.delete(old_tid);
self.index.insert(new_tid, &vector, op.lsn);
}
}
self.lsn_tracker.update(op.collection_id, op.lsn);
Ok(())
}
```
### Idempotent Operations
```rust
/// All engine operations must be idempotent for safe replay
impl Engine {
/// Idempotent insert - safe to replay
pub fn redo_insert(&mut self, data: IndexInsertData) {
// Check if already exists
if self.index.contains_tid(data.tid) {
// Already inserted, skip
return;
}
// Insert with LSN tracking
self.index.insert_with_lsn(data.tid, &data.vector, data.lsn);
}
/// Idempotent delete - safe to replay
pub fn redo_delete(&mut self, data: IndexDeleteData) {
// Check if already deleted
if !self.index.contains_tid(data.tid) {
// Already deleted, skip
return;
}
// Delete with tombstone
self.index.delete_with_lsn(data.tid, data.lsn);
}
/// Idempotent edge add - safe to replay
pub fn redo_edge_add(&mut self, data: HnswEdgeData) {
// HNSW edges are idempotent by nature
self.hnsw.add_edge(data.from, data.to, data.lsn);
}
}
```
### LSN-Based Deduplication
```rust
/// Track applied LSN per collection
pub struct LsnTracker {
applied_lsn: HashMap<i32, XLogRecPtr>,
}
impl LsnTracker {
/// Check if operation should be applied
pub fn should_apply(&self, collection_id: i32, lsn: XLogRecPtr) -> bool {
match self.applied_lsn.get(&collection_id) {
Some(&last_lsn) => lsn > last_lsn,
None => true,
}
}
/// Mark operation as applied
pub fn mark_applied(&mut self, collection_id: i32, lsn: XLogRecPtr) {
self.applied_lsn.insert(collection_id, lsn);
}
}
```
---
## Replication Strategies
### Physical Replication (Streaming)
```
Primary → Standby streaming with RuVector:
Primary:
1. Write heap + index changes
2. Generate WAL records
3. Stream to standby
Standby:
1. Receive WAL stream
2. Apply heap changes (PostgreSQL)
3. Apply index changes (RuVector redo)
4. Engine state matches primary
```
### Logical Replication
```
Publisher → Subscriber with RuVector:
Publisher:
1. Changes captured via logical decoding
2. RuVector output plugin extracts vector changes
3. Publishes to replication slot
Subscriber:
1. Receives logical changes
2. Applies to local heap
3. Local RuVector engine indexes changes
4. Independent index structures
```
---
## Configuration
```sql
-- Consistency configuration
ALTER SYSTEM SET ruvector.consistency_mode = 'hybrid'; -- 'sync', 'async', 'hybrid'
ALTER SYSTEM SET ruvector.max_lag_ms = 100; -- Max staleness window
ALTER SYSTEM SET ruvector.visibility_recheck = true; -- Always recheck heap
ALTER SYSTEM SET ruvector.wal_level = 'logical'; -- For logical replication
-- Recovery configuration
ALTER SYSTEM SET ruvector.checkpoint_interval = 300; -- Checkpoint every 5 min
ALTER SYSTEM SET ruvector.wal_buffer_size = '64MB'; -- WAL buffer
ALTER SYSTEM SET ruvector.recovery_target_timeline = 'latest';
```
---
## Monitoring
```sql
-- Consistency lag monitoring
SELECT
c.name AS collection,
s.last_heap_lsn,
s.last_index_lsn,
pg_wal_lsn_diff(s.last_heap_lsn, s.last_index_lsn) AS lag_bytes,
s.lag_ms,
s.pending_changes
FROM ruvector.consistency_status s
JOIN ruvector.collections c ON s.collection_id = c.id;
-- Visibility recheck statistics
SELECT
collection_name,
total_searches,
visibility_rechecks,
invisible_filtered,
(invisible_filtered::float / NULLIF(visibility_rechecks, 0) * 100)::numeric(5,2) AS invisible_pct
FROM ruvector.visibility_stats
ORDER BY invisible_pct DESC;
-- WAL replay status
SELECT
pg_last_wal_receive_lsn() AS receive_lsn,
pg_last_wal_replay_lsn() AS replay_lsn,
ruvector_last_applied_lsn() AS ruvector_lsn,
pg_wal_lsn_diff(pg_last_wal_replay_lsn(), ruvector_last_applied_lsn()) AS ruvector_lag_bytes;
```
---
## Testing Requirements
### Unit Tests
- Visibility check correctness
- Idempotent operation replay
- LSN tracking accuracy
- MVCC snapshot handling
### Integration Tests
- Crash recovery scenarios
- Concurrent transaction visibility
- Replication lag handling
- HOT update handling
### Chaos Tests
- Primary failover
- Network partition during replication
- Partial WAL replay
- Checkpoint corruption recovery
---
## Summary
The v2 consistency model ensures:
1. **Heap is authoritative** - All visibility decisions defer to PostgreSQL heap
2. **Bounded staleness** - Index catches up within configurable lag window
3. **Crash safe** - WAL-based recovery with idempotent replay
4. **Replication compatible** - Works with streaming and logical replication
5. **MVCC aware** - Respects transaction isolation guarantees

# RuVector Postgres v2 - Hybrid Search (BM25 + Vector)
## Why Hybrid Search Matters
Vector search finds semantically similar content. Keyword search finds exact matches.
Neither is sufficient alone:
- **Vector-only** misses exact keyword matches (product SKUs, error codes, names)
- **Keyword-only** misses semantic similarity ("car" vs "automobile")
Every production RAG system needs both. pgvector doesn't have this. We do.
---
## Design Goals
1. **Single query, both signals** — No application-level fusion
2. **Configurable blending** — RRF, linear, learned weights
3. **Integrity-aware** — Hybrid index participates in contracted graph
4. **PostgreSQL-native** — Leverages `tsvector` and GIN indexes
---
## Architecture
```
+------------------+
| Hybrid Query |
| "error 500 fix" |
+--------+---------+
|
+---------------+---------------+
| |
+--------v--------+ +---------v---------+
| Vector Branch | | Keyword Branch |
| (HNSW/IVF) | | (GIN/tsvector) |
+--------+--------+ +---------+---------+
| |
| top-100 by cosine | top-100 by BM25
| |
+---------------+---------------+
|
+--------v--------+
| Fusion Layer |
| (RRF / Linear) |
+--------+--------+
|
+--------v--------+
| Final top-k |
+--------+--------+
|
+--------v--------+
| Optional Rerank |
+-----------------+
```
---
## SQL Interface
### Basic Hybrid Search
```sql
-- Simple hybrid search with default RRF fusion
SELECT * FROM ruvector_hybrid_search(
'documents', -- collection name
query_text := 'database connection timeout error',
query_vector := $embedding,
k := 10
);
-- Returns: id, content, vector_score, keyword_score, hybrid_score
```
### Configurable Fusion
```sql
-- RRF (Reciprocal Rank Fusion) - default, robust
SELECT * FROM ruvector_hybrid_search(
'documents',
query_text := 'postgres replication lag',
query_vector := $embedding,
k := 20,
fusion := 'rrf',
rrf_k := 60 -- RRF constant (default 60)
);
-- Linear blend with alpha
SELECT * FROM ruvector_hybrid_search(
'documents',
query_text := 'postgres replication lag',
query_vector := $embedding,
k := 20,
fusion := 'linear',
alpha := 0.7 -- 0.7 * vector + 0.3 * keyword
);
-- Learned fusion weights (from query patterns)
SELECT * FROM ruvector_hybrid_search(
'documents',
query_text := 'postgres replication lag',
query_vector := $embedding,
k := 20,
fusion := 'learned' -- Uses GNN-trained weights
);
```
### Operator Syntax (Advanced)
```sql
-- Using hybrid operator in ORDER BY
SELECT id, content,
ruvector_hybrid_score(
embedding <=> $query_vec,
ts_rank_cd(fts, plainto_tsquery($query_text)),
alpha := 0.6
) AS score
FROM documents
WHERE fts @@ plainto_tsquery($query_text) -- Pre-filter
OR embedding <=> $query_vec < 0.5 -- Or similar vectors
ORDER BY score DESC
LIMIT 10;
```
---
## Schema Requirements
### Collection with Hybrid Support
```sql
-- Create table with both vector and FTS columns
CREATE TABLE documents (
id BIGSERIAL PRIMARY KEY,
content TEXT NOT NULL,
embedding vector(1536) NOT NULL,
fts tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED,
metadata JSONB DEFAULT '{}'::jsonb,
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Vector index
CREATE INDEX idx_documents_embedding
ON documents USING ruhnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 100);
-- FTS index
CREATE INDEX idx_documents_fts
ON documents USING gin (fts);
-- Register for hybrid search
SELECT ruvector_register_hybrid(
collection := 'documents',
vector_column := 'embedding',
fts_column := 'fts',
text_column := 'content' -- For BM25 stats
);
```
### Hybrid Registration Table
```sql
-- Internal: tracks hybrid-enabled collections
CREATE TABLE ruvector.hybrid_collections (
id SERIAL PRIMARY KEY,
collection_id INTEGER NOT NULL REFERENCES ruvector.collections(id),
vector_column TEXT NOT NULL,
fts_column TEXT NOT NULL,
text_column TEXT NOT NULL,
-- BM25 parameters (computed from corpus)
avg_doc_length REAL,
doc_count BIGINT,
k1 REAL DEFAULT 1.2,
b REAL DEFAULT 0.75,
-- Fusion settings
default_fusion TEXT DEFAULT 'rrf',
default_alpha REAL DEFAULT 0.5,
learned_weights JSONB,
-- Stats
last_stats_update TIMESTAMPTZ,
created_at TIMESTAMPTZ DEFAULT NOW()
);
```
---
## BM25 Implementation
### Why Not Just ts_rank?
PostgreSQL's `ts_rank` is not true BM25. It lacks:
- BM25-style document length normalization (only a crude optional divisor via rank flags)
- IDF weighting computed across the corpus
- Term frequency saturation (BM25's `k1`)
We implement proper BM25 in the engine.
### BM25 Scoring
```rust
// src/hybrid/bm25.rs
/// BM25 scorer with corpus statistics
pub struct BM25Scorer {
k1: f32, // Term frequency saturation (default 1.2)
b: f32, // Length normalization (default 0.75)
avg_doc_len: f32, // Average document length
doc_count: u64, // Total documents
idf_cache: HashMap<String, f32>, // Cached IDF values
}
impl BM25Scorer {
/// Compute IDF for a term
fn idf(&self, doc_freq: u64) -> f32 {
let n = self.doc_count as f32;
let df = doc_freq as f32;
((n - df + 0.5) / (df + 0.5) + 1.0).ln()
}
/// Score a document for a query
pub fn score(&self, doc: &Document, query_terms: &[String]) -> f32 {
let doc_len = doc.term_count as f32;
let len_norm = 1.0 - self.b + self.b * (doc_len / self.avg_doc_len);
query_terms.iter()
.filter_map(|term| {
let tf = doc.term_freq(term)? as f32;
let idf = self.idf_cache.get(term)?;
// BM25 formula
let numerator = tf * (self.k1 + 1.0);
let denominator = tf + self.k1 * len_norm;
Some(idf * numerator / denominator)
})
.sum()
}
}
```
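To see how the formula behaves, here is a worked example of a single term's contribution (a Python sketch mirroring the formula above; the corpus numbers are made up, and only `k1 = 1.2` and `b = 0.75` come from the defaults):

```python
import math

def bm25_term(tf, doc_freq, doc_len, avg_doc_len, n_docs, k1=1.2, b=0.75):
    """One term's BM25 contribution: idf * tf*(k1+1) / (tf + k1*len_norm)."""
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
    len_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * len_norm)

# Rare term (in 5 of 10,000 docs) vs common term (in 6,000 docs):
rare = bm25_term(tf=2, doc_freq=5, doc_len=120, avg_doc_len=100, n_docs=10_000)
common = bm25_term(tf=2, doc_freq=6_000, doc_len=120, avg_doc_len=100, n_docs=10_000)
print(rare > common)  # True: IDF rewards rare terms

# Term-frequency saturation: 10x the tf does NOT give 10x the score
print(bm25_term(20, 5, 120, 100, 10_000) / rare)  # ≈ 1.58, not 10
```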
### Corpus Statistics Update
```sql
-- Update BM25 statistics (run periodically or after bulk inserts)
SELECT ruvector_hybrid_update_stats('documents');
-- Stats stored in hybrid_collections table
-- Computed via background worker or on-demand
```
```rust
// Background worker updates corpus stats
pub fn update_bm25_stats(collection_id: i32) -> Result<(), Error> {
Spi::run(|client| {
// Get average document length
let avg_len: f64 = client.select(
"SELECT AVG(LENGTH(content)) FROM documents",
None, &[]
)?.first().unwrap().get(1)?;
// Get document count
let doc_count: i64 = client.select(
"SELECT COUNT(*) FROM documents",
None, &[]
)?.first().unwrap().get(1)?;
// Update term frequencies (using tsvector stats)
// ... compute IDF cache ...
client.update(
"UPDATE ruvector.hybrid_collections
SET avg_doc_length = $1, doc_count = $2, last_stats_update = NOW()
WHERE collection_id = $3",
None,
&[avg_len.into(), doc_count.into(), collection_id.into()]
)
})
}
```
---
## Fusion Algorithms
### Reciprocal Rank Fusion (RRF)
Default and most robust. Works without score calibration.
```rust
// src/hybrid/fusion.rs
/// RRF fusion: score = sum(1 / (k + rank_i))
pub fn rrf_fusion(
vector_results: &[(DocId, f32)], // (id, distance)
keyword_results: &[(DocId, f32)], // (id, bm25_score)
k: usize, // RRF constant (default 60)
limit: usize,
) -> Vec<(DocId, f32)> {
let mut scores: HashMap<DocId, f32> = HashMap::new();
// Vector ranking (lower distance = higher rank)
    for (rank, (doc_id, _)) in vector_results.iter().enumerate() {
        *scores.entry(*doc_id).or_default() += 1.0 / ((k + rank + 1) as f32);
    }
    // Keyword ranking (higher BM25 = higher rank)
    for (rank, (doc_id, _)) in keyword_results.iter().enumerate() {
        *scores.entry(*doc_id).or_default() += 1.0 / ((k + rank + 1) as f32);
    }
// Sort by fused score
let mut results: Vec<_> = scores.into_iter().collect();
results.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
results.truncate(limit);
results
}
```
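A compact illustration of the same fusion (Python, with illustrative doc IDs; `k = 60` as in the default above — note that scores never need calibration, only ranks matter):

```python
def rrf(vector_ranked, keyword_ranked, k=60, limit=10):
    """Fuse two ranked ID lists: score = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranked in (vector_ranked, keyword_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:limit]

# Docs 3 and 7 appear in BOTH lists and outrank docs that top only one list.
vector_hits  = [3, 7, 12, 9]   # by ascending distance
keyword_hits = [5, 7, 3, 20]   # by descending BM25
print(rrf(vector_hits, keyword_hits, limit=3))  # [3, 7, 5]
```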
### Linear Fusion
Simple weighted combination. Requires score normalization.
```rust
/// Linear fusion: score = alpha * vec_score + (1 - alpha) * kw_score
pub fn linear_fusion(
vector_results: &[(DocId, f32)],
keyword_results: &[(DocId, f32)],
alpha: f32,
limit: usize,
) -> Vec<(DocId, f32)> {
// Normalize vector scores (convert distance to similarity)
let vec_scores = normalize_to_similarity(vector_results);
// Normalize BM25 scores to [0, 1]
let kw_scores = min_max_normalize(keyword_results);
// Combine
let mut combined: HashMap<DocId, f32> = HashMap::new();
for (doc_id, score) in vec_scores {
*combined.entry(doc_id).or_default() += alpha * score;
}
for (doc_id, score) in kw_scores {
*combined.entry(doc_id).or_default() += (1.0 - alpha) * score;
}
let mut results: Vec<_> = combined.into_iter().collect();
results.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
results.truncate(limit);
results
}
```
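The `normalize_to_similarity` and `min_max_normalize` helpers are referenced but not shown; one plausible sketch (Python for illustration — the distance-to-similarity mapping assumes cosine distances already in [0, 1]):

```python
def normalize_to_similarity(results):
    """Map (id, cosine_distance) to (id, similarity).
    Assumes distances lie in [0, 1]; other metrics would need rescaling."""
    return [(doc_id, 1.0 - d) for doc_id, d in results]

def min_max_normalize(results):
    """Rescale (id, score) so scores span [0, 1]; constant lists map to 0."""
    scores = [s for _, s in results]
    lo, hi = min(scores), max(scores)
    span = hi - lo
    return [(doc_id, (s - lo) / span if span else 0.0) for doc_id, s in results]

print(min_max_normalize([(1, 2.0), (2, 6.0), (3, 4.0)]))
# [(1, 0.0), (2, 1.0), (3, 0.5)]
```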
### Learned Fusion
Uses query characteristics to select weights dynamically.
```rust
/// Learned fusion using GNN-predicted weights
pub fn learned_fusion(
query_embedding: &[f32],
query_terms: &[String],
vector_results: &[(DocId, f32)],
keyword_results: &[(DocId, f32)],
model: &FusionModel,
limit: usize,
) -> Vec<(DocId, f32)> {
// Query features
let features = QueryFeatures {
embedding_norm: l2_norm(query_embedding),
term_count: query_terms.len(),
avg_term_idf: compute_avg_idf(query_terms),
has_exact_match: detect_exact_match_intent(query_terms),
query_type: classify_query_type(query_terms), // navigational, informational, etc.
};
// Predict optimal alpha for this query
let alpha = model.predict_alpha(&features);
linear_fusion(vector_results, keyword_results, alpha, limit)
}
```
---
## Integrity Integration
Hybrid search participates in the integrity control plane.
### Contracted Graph Nodes
```sql
-- Hybrid index adds nodes to contracted graph
INSERT INTO ruvector.contracted_graph (collection_id, node_type, node_id, node_name, health_score)
SELECT
c.id,
'hybrid_index',
h.id,
'hybrid_' || c.name,
CASE
WHEN h.last_stats_update > NOW() - INTERVAL '1 day' THEN 1.0
WHEN h.last_stats_update > NOW() - INTERVAL '7 days' THEN 0.7
ELSE 0.3 -- Stale stats degrade health
END
FROM ruvector.hybrid_collections h
JOIN ruvector.collections c ON h.collection_id = c.id;
```
### Integrity-Aware Hybrid Search
```rust
/// Hybrid search with integrity gating
pub fn hybrid_search_with_integrity(
collection_id: i32,
query: &HybridQuery,
) -> Result<Vec<HybridResult>, Error> {
// Check integrity gate
let gate = check_integrity_gate(collection_id, "hybrid_search");
match gate.state {
IntegrityState::Normal => {
// Full hybrid: both branches
execute_full_hybrid(query)
}
IntegrityState::Stress => {
// Degrade gracefully: prefer faster branch
if query.alpha > 0.5 {
// Vector-heavy query: use vector only
execute_vector_only(query)
} else {
// Keyword-heavy query: use keyword only
execute_keyword_only(query)
}
}
IntegrityState::Critical => {
// Minimal: keyword only (cheapest)
execute_keyword_only(query)
}
}
}
```
---
## Performance Optimization
### Pre-filtering Strategy
```sql
-- Hybrid search with pre-filter (faster for selective filters)
SELECT * FROM ruvector_hybrid_search(
'documents',
query_text := 'error handling',
query_vector := $embedding,
k := 10,
filter := 'category = ''backend'' AND created_at > NOW() - INTERVAL ''30 days'''
);
```
```rust
// Execution strategy selection
fn choose_strategy(filter_selectivity: f32, corpus_size: u64) -> HybridStrategy {
if filter_selectivity < 0.01 {
// Very selective: pre-filter, then hybrid on small set
HybridStrategy::PreFilter
} else if filter_selectivity < 0.1 && corpus_size > 1_000_000 {
// Moderately selective, large corpus: hybrid first, post-filter
HybridStrategy::PostFilter
} else {
// Not selective: full hybrid
HybridStrategy::Full
}
}
```
### Parallel Execution
```rust
/// Execute vector and keyword branches in parallel
pub async fn parallel_hybrid(query: &HybridQuery) -> HybridResults {
let (vector_results, keyword_results) = tokio::join!(
execute_vector_branch(&query.embedding, query.prefetch_k),
execute_keyword_branch(&query.text, query.prefetch_k),
);
fuse_results(vector_results, keyword_results, query.fusion, query.k)
}
```
### Caching
```rust
/// Cache BM25 scores for repeated terms
pub struct HybridCache {
term_doc_scores: LruCache<(String, DocId), f32>,
idf_cache: HashMap<String, f32>,
ttl: Duration,
}
```
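Only the struct is specified above; the IDF side is plain memoization. A minimal sketch of that lookup path (type and method names here are illustrative; the real cache also enforces the TTL and LRU eviction on term-document scores):

```rust
use std::collections::HashMap;

/// Memoized IDF lookup: compute once per term, reuse thereafter.
struct IdfCache {
    idf: HashMap<String, f32>,
}

impl IdfCache {
    fn new() -> Self {
        IdfCache { idf: HashMap::new() }
    }

    /// Return the cached IDF for `term`, computing and storing it on first use.
    fn get_or_compute<F: Fn(&str) -> f32>(&mut self, term: &str, compute: F) -> f32 {
        if let Some(v) = self.idf.get(term) {
            return *v;
        }
        let v = compute(term);
        self.idf.insert(term.to_string(), v);
        v
    }
}
```

Because IDF only changes when corpus stats refresh, invalidating this map on `stats_refresh_interval` is sufficient; no per-entry TTL is needed on the IDF side.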
---
## Configuration
### GUC Parameters
```sql
-- Default fusion method
SET ruvector.hybrid_fusion = 'rrf'; -- 'rrf', 'linear', 'learned'
-- Default alpha for linear fusion
SET ruvector.hybrid_alpha = 0.5;
-- RRF constant
SET ruvector.hybrid_rrf_k = 60;
-- Prefetch size for each branch
SET ruvector.hybrid_prefetch_k = 100;
-- Enable parallel branch execution
SET ruvector.hybrid_parallel = true;
```
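The `ruvector.hybrid_rrf_k` default of 60 is the `k` in the reciprocal-rank formula: each document scores the sum of `1 / (k + rank)` across the ranked lists it appears in. A dependency-free sketch (the function name is illustrative, not the engine's API):

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank_d),
/// where rank is the 1-based position of d in each ranked list.
fn rrf_fuse(lists: &[Vec<u64>], k: f32, limit: usize) -> Vec<(u64, f32)> {
    let mut scores: HashMap<u64, f32> = HashMap::new();
    for list in lists {
        for (i, doc_id) in list.iter().enumerate() {
            *scores.entry(*doc_id).or_default() += 1.0 / (k + (i + 1) as f32);
        }
    }
    let mut out: Vec<_> = scores.into_iter().collect();
    out.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
    out.truncate(limit);
    out
}
```

Larger `k` flattens the advantage of top ranks; 60 is the commonly used value and rarely needs tuning, which is why RRF is the default fusion method.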
### Per-Collection Settings
```sql
SELECT ruvector_hybrid_configure('documents', '{
"default_fusion": "learned",
"prefetch_k": 200,
"bm25_k1": 1.5,
"bm25_b": 0.8,
"stats_refresh_interval": "1 hour"
}'::jsonb);
```
---
## Monitoring
```sql
-- Hybrid search statistics
SELECT * FROM ruvector_hybrid_stats('documents');
-- Returns:
-- {
-- "total_searches": 15234,
-- "avg_vector_latency_ms": 4.2,
-- "avg_keyword_latency_ms": 2.1,
-- "avg_fusion_latency_ms": 0.3,
-- "cache_hit_rate": 0.67,
-- "last_stats_update": "2024-01-15T10:30:00Z",
-- "corpus_size": 1250000,
-- "avg_doc_length": 542
-- }
```
---
## Testing Requirements
### Correctness Tests
- BM25 scoring matches reference implementation
- RRF fusion produces expected rankings
- Linear fusion respects alpha parameter
- Learned fusion adapts to query type
### Performance Tests
- Hybrid search < 2x single-branch latency
- Parallel execution shows speedup
- Cache hit rate > 50% for repeated queries
### Integration Tests
- Integrity degradation triggers graceful fallback
- Stats update doesn't block queries
- Large corpus (10M+ docs) scales
---
## Example: RAG Application
```sql
-- Complete RAG retrieval with hybrid search
WITH retrieved AS (
SELECT
id,
content,
hybrid_score,
metadata
FROM ruvector_hybrid_search(
'knowledge_base',
query_text := $user_question,
query_vector := $question_embedding,
k := 5,
fusion := 'rrf',
filter := 'status = ''published'''
)
)
SELECT
string_agg(content, E'\n\n---\n\n') AS context,
array_agg(id) AS source_ids
FROM retrieved;
-- Pass context to LLM for answer generation
```

# RuVector Postgres v2 - Multi-Tenancy Model
## Why Multi-Tenancy Matters
Every SaaS application needs tenant isolation. Without native support, teams build:
- Separate databases per tenant (operational nightmare)
- Manual partition schemes (error-prone)
- Application-level filtering (security risk)
RuVector provides **first-class multi-tenancy** with:
- Tenant-isolated search (data never leaks)
- Per-tenant integrity monitoring (one bad tenant doesn't sink others)
- Efficient shared infrastructure (cost-effective)
- Row-level security integration (PostgreSQL-native)
---
## Design Goals
1. **Zero data leakage** — Tenant A never sees Tenant B's vectors
2. **Per-tenant integrity** — Stress in one tenant doesn't affect others
3. **Fair resource allocation** — No noisy neighbor problems
4. **Transparent to queries** — SET tenant, then normal SQL
5. **Efficient storage** — Shared indexes where safe, isolated where needed
---
## Architecture
```
+------------------------------------------------------------------+
| Application |
| SET ruvector.tenant_id = 'acme-corp'; |
| SELECT * FROM embeddings ORDER BY vec <-> $q LIMIT 10; |
+------------------------------------------------------------------+
|
+------------------------------------------------------------------+
| Tenant Context Layer |
| - Validates tenant_id |
| - Injects tenant filter into all operations |
| - Routes to tenant-specific resources |
+------------------------------------------------------------------+
|
+---------------+---------------+
| |
+--------v--------+ +---------v---------+
| Shared Index | | Tenant Indexes |
| (small tenants)| | (large tenants) |
+--------+--------+ +---------+---------+
| |
+---------------+---------------+
|
+------------------------------------------------------------------+
| Per-Tenant Integrity |
| - Separate contracted graphs |
| - Independent state machines |
| - Isolated throttling policies |
+------------------------------------------------------------------+
```
---
## SQL Interface
### Setting Tenant Context
```sql
-- Set tenant for session (required before any operation)
SET ruvector.tenant_id = 'acme-corp';
-- Or per-transaction
BEGIN;
SET LOCAL ruvector.tenant_id = 'acme-corp';
-- ... operations ...
COMMIT;
-- Verify current tenant
SELECT current_setting('ruvector.tenant_id');
```
### Tenant-Transparent Operations
```sql
-- Once tenant is set, all operations are automatically scoped
SET ruvector.tenant_id = 'acme-corp';
-- Insert only sees/affects acme-corp data
INSERT INTO embeddings (content, vec) VALUES ('doc', $embedding);
-- Search only returns acme-corp results
SELECT * FROM embeddings ORDER BY vec <-> $query LIMIT 10;
-- Delete only affects acme-corp
DELETE FROM embeddings WHERE id = 123;
```
### Admin Operations (Cross-Tenant)
```sql
-- Superuser can query across tenants
SET ruvector.tenant_id = '*'; -- Wildcard (admin only)
-- View all tenants
SELECT * FROM ruvector_tenants();
-- View tenant stats
SELECT * FROM ruvector_tenant_stats('acme-corp');
-- Migrate tenant to dedicated index
SELECT ruvector_tenant_isolate('acme-corp');
```
---
## Schema Design
### Tenant Registry
```sql
CREATE TABLE ruvector.tenants (
id TEXT PRIMARY KEY,
display_name TEXT,
-- Resource limits
max_vectors BIGINT DEFAULT 1000000,
max_collections INTEGER DEFAULT 10,
max_qps INTEGER DEFAULT 100,
-- Isolation level
isolation_level TEXT DEFAULT 'shared' CHECK (isolation_level IN (
'shared', -- Shared index with tenant filter
'partition', -- Dedicated partition in shared index
'dedicated' -- Separate physical index
)),
-- Integrity settings
integrity_enabled BOOLEAN DEFAULT true,
integrity_policy_id INTEGER REFERENCES ruvector.integrity_policies(id),
-- Metadata
metadata JSONB DEFAULT '{}'::jsonb,
created_at TIMESTAMPTZ DEFAULT NOW(),
suspended_at TIMESTAMPTZ, -- Non-null = suspended
-- Stats (updated by background worker)
vector_count BIGINT DEFAULT 0,
storage_bytes BIGINT DEFAULT 0,
last_access TIMESTAMPTZ
);
CREATE INDEX idx_tenants_isolation ON ruvector.tenants(isolation_level);
CREATE INDEX idx_tenants_suspended ON ruvector.tenants(suspended_at) WHERE suspended_at IS NOT NULL;
```
### Tenant-Aware Collections
```sql
-- Collections can be tenant-specific or shared
CREATE TABLE ruvector.collections (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
tenant_id TEXT REFERENCES ruvector.tenants(id), -- NULL = shared
-- ... other columns from 01-sql-schema.md ...
    UNIQUE NULLS NOT DISTINCT (name, tenant_id) -- Same name allowed per tenant; NULLS NOT DISTINCT (PG 15+) keeps shared (tenant_id IS NULL) names unique too
);
-- Tenant-scoped view
CREATE VIEW ruvector.my_collections AS
SELECT * FROM ruvector.collections
WHERE tenant_id = current_setting('ruvector.tenant_id', true)
OR tenant_id IS NULL; -- Shared collections visible to all
```
### Tenant Column in Data Tables
```sql
-- User tables include tenant_id column
CREATE TABLE embeddings (
id BIGSERIAL PRIMARY KEY,
tenant_id TEXT NOT NULL DEFAULT current_setting('ruvector.tenant_id'),
content TEXT,
vec vector(1536),
created_at TIMESTAMPTZ DEFAULT NOW(),
CONSTRAINT fk_tenant FOREIGN KEY (tenant_id)
REFERENCES ruvector.tenants(id) ON DELETE CASCADE
);
-- Partial index per tenant (for dedicated isolation)
CREATE INDEX idx_embeddings_vec_tenant_acme
ON embeddings USING ruhnsw (vec vector_cosine_ops)
WHERE tenant_id = 'acme-corp';
-- Or composite index for shared isolation
CREATE INDEX idx_embeddings_vec_shared
ON embeddings USING ruhnsw (vec vector_cosine_ops);
-- Engine internally filters by tenant_id
```
---
## Row-Level Security Integration
### RLS Policies
```sql
-- Enable RLS on data tables
ALTER TABLE embeddings ENABLE ROW LEVEL SECURITY;
-- Tenant isolation policy
CREATE POLICY tenant_isolation ON embeddings
USING (tenant_id = current_setting('ruvector.tenant_id', true))
WITH CHECK (tenant_id = current_setting('ruvector.tenant_id', true));
-- Admin bypass policy
CREATE POLICY admin_access ON embeddings
FOR ALL
TO ruvector_admin
USING (true)
WITH CHECK (true);
```
### Automatic Policy Creation
```sql
-- Helper function to set up RLS for a table
CREATE FUNCTION ruvector_enable_tenant_rls(
p_table_name TEXT,
p_tenant_column TEXT DEFAULT 'tenant_id'
) RETURNS void AS $$
BEGIN
-- Enable RLS
EXECUTE format('ALTER TABLE %I ENABLE ROW LEVEL SECURITY', p_table_name);
-- Create isolation policy
EXECUTE format(
'CREATE POLICY tenant_isolation ON %I
USING (%I = current_setting(''ruvector.tenant_id'', true))
WITH CHECK (%I = current_setting(''ruvector.tenant_id'', true))',
p_table_name, p_tenant_column, p_tenant_column
);
-- Create admin bypass
EXECUTE format(
'CREATE POLICY admin_bypass ON %I FOR ALL TO ruvector_admin USING (true)',
p_table_name
);
END;
$$ LANGUAGE plpgsql;
-- Usage
SELECT ruvector_enable_tenant_rls('embeddings');
SELECT ruvector_enable_tenant_rls('documents');
```
---
## Isolation Levels
### Shared (Default)
All tenants share one index. Engine filters by tenant_id.
```
Pros:
+ Most memory-efficient
+ Fastest for small tenants
+ Simple management
Cons:
- Some cross-tenant cache pollution
- Shared integrity state
Best for: < 100K vectors per tenant
```
### Partition
Tenants get dedicated partitions within shared index structure.
```
Pros:
+ Better cache isolation
+ Per-partition integrity
+ Easy promotion to dedicated
Cons:
- Some overhead per partition
- Still shares top-level structure
Best for: 100K - 10M vectors per tenant
```
### Dedicated
Tenant gets completely separate physical index.
```
Pros:
+ Complete isolation
+ Independent scaling
+ Custom index parameters
Cons:
- Higher memory overhead
- More management complexity
Best for: > 10M vectors, enterprise tenants, compliance requirements
```
### Automatic Promotion
```sql
-- Configure auto-promotion thresholds
SELECT ruvector_tenant_set_policy('{
"auto_promote_to_partition": 100000, -- vectors
"auto_promote_to_dedicated": 10000000,
"check_interval": "1 hour"
}'::jsonb);
```
```rust
// Background worker checks and promotes
pub fn check_tenant_promotion(tenant_id: &str) -> Option<IsolationLevel> {
let stats = get_tenant_stats(tenant_id)?;
let policy = get_promotion_policy()?;
if stats.vector_count > policy.dedicated_threshold {
Some(IsolationLevel::Dedicated)
} else if stats.vector_count > policy.partition_threshold {
Some(IsolationLevel::Partition)
} else {
None
}
}
```
---
## Per-Tenant Integrity
### Separate Contracted Graphs
```sql
-- Each tenant gets its own contracted graph
CREATE TABLE ruvector.tenant_contracted_graph (
tenant_id TEXT NOT NULL REFERENCES ruvector.tenants(id),
collection_id INTEGER NOT NULL,
node_type TEXT NOT NULL,
node_id BIGINT NOT NULL,
-- ... same as contracted_graph ...
PRIMARY KEY (tenant_id, collection_id, node_type, node_id)
);
```
### Independent State Machines
```rust
// Per-tenant integrity state
pub struct TenantIntegrityState {
tenant_id: String,
state: IntegrityState,
lambda_cut: f32,
consecutive_samples: u32,
last_transition: Instant,
cooldown_until: Option<Instant>,
}
// Tenant stress doesn't affect other tenants
pub fn check_tenant_gate(tenant_id: &str, operation: &str) -> GateResult {
let state = get_tenant_integrity_state(tenant_id);
apply_policy(state, operation)
}
```
### Tenant-Specific Policies
```sql
-- Each tenant can have custom thresholds
INSERT INTO ruvector.integrity_policies (tenant_id, name, threshold_high, threshold_low)
VALUES
('acme-corp', 'enterprise', 0.6, 0.3), -- Stricter
('startup-xyz', 'standard', 0.4, 0.15); -- Default
```
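Applying those per-tenant thresholds with hysteresis might look like the sketch below, assuming higher `lambda_cut` means healthier (consistent with the policy rows above): recovery from Stress requires clearing `threshold_high`, not merely `threshold_low`, so the state doesn't flap around a single boundary. The function name and exact transition rules are illustrative; the production state machine also debounces on consecutive samples and respects cooldowns.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum IntegrityState { Normal, Stress, Critical }

/// Map a mincut sample to the next state using per-tenant thresholds.
/// Entering Stress happens as soon as lambda_cut drops below `high`;
/// leaving Stress requires climbing back above `high` (hysteresis).
fn next_state(current: IntegrityState, lambda_cut: f32, high: f32, low: f32) -> IntegrityState {
    match current {
        IntegrityState::Normal if lambda_cut < low => IntegrityState::Critical,
        IntegrityState::Normal if lambda_cut < high => IntegrityState::Stress,
        IntegrityState::Stress if lambda_cut < low => IntegrityState::Critical,
        IntegrityState::Stress if lambda_cut >= high => IntegrityState::Normal,
        IntegrityState::Critical if lambda_cut >= high => IntegrityState::Normal,
        IntegrityState::Critical if lambda_cut >= low => IntegrityState::Stress,
        s => s, // inside the hysteresis band: hold the current state
    }
}
```

With the 'enterprise' policy (0.6 / 0.3), a tenant at lambda_cut 0.55 that is already in Stress stays there; it must recover past 0.6 before full fan-out resumes.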
---
## Resource Quotas
### Quota Enforcement
```sql
-- Quota table
CREATE TABLE ruvector.tenant_quotas (
tenant_id TEXT PRIMARY KEY REFERENCES ruvector.tenants(id),
max_vectors BIGINT NOT NULL DEFAULT 1000000,
max_storage_gb REAL NOT NULL DEFAULT 10.0,
max_qps INTEGER NOT NULL DEFAULT 100,
max_concurrent INTEGER NOT NULL DEFAULT 10,
-- Current usage (updated by triggers/workers)
current_vectors BIGINT DEFAULT 0,
current_storage_gb REAL DEFAULT 0,
-- Rate limiting state
request_count INTEGER DEFAULT 0,
window_start TIMESTAMPTZ DEFAULT NOW()
);
-- Check quota before insert
CREATE FUNCTION ruvector_check_quota() RETURNS TRIGGER AS $$
DECLARE
    v_quota RECORD;
BEGIN
    SELECT * INTO v_quota
    FROM ruvector.tenant_quotas
    WHERE tenant_id = NEW.tenant_id;
    IF NOT FOUND THEN
        RAISE EXCEPTION 'Tenant % has no quota row', NEW.tenant_id;
    END IF;
    IF v_quota.current_vectors >= v_quota.max_vectors THEN
        RAISE EXCEPTION 'Tenant % has exceeded vector quota', NEW.tenant_id;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER check_quota_before_insert
BEFORE INSERT ON embeddings
FOR EACH ROW EXECUTE FUNCTION ruvector_check_quota();
```
### Rate Limiting
```rust
// Token bucket rate limiter per tenant
pub struct TenantRateLimiter {
buckets: DashMap<String, TokenBucket>,
}
impl TenantRateLimiter {
pub fn check(&self, tenant_id: &str, tokens: u32) -> RateLimitResult {
let bucket = self.buckets.entry(tenant_id.to_string())
.or_insert_with(|| TokenBucket::new(
get_tenant_qps_limit(tenant_id),
));
if bucket.try_acquire(tokens) {
RateLimitResult::Allowed
} else {
RateLimitResult::Limited {
retry_after_ms: bucket.time_to_refill(tokens),
}
}
}
}
```
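`TokenBucket` is referenced but not defined. A minimal refill-on-demand sketch; the clock is passed in explicitly here to keep the logic testable, whereas a production limiter would read a monotonic clock internally:

```rust
/// Token bucket: capacity = burst size, refilled continuously at `rate` tokens/sec.
/// Using capacity == rate == the tenant's QPS limit gives a one-second burst window.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    rate: f64,
    last_refill_secs: f64,
}

impl TokenBucket {
    fn new(qps: u32) -> Self {
        TokenBucket {
            capacity: qps as f64,
            tokens: qps as f64,
            rate: qps as f64,
            last_refill_secs: 0.0,
        }
    }

    /// Try to take `tokens` at time `now_secs`; returns true if allowed.
    fn try_acquire(&mut self, tokens: u32, now_secs: f64) -> bool {
        let elapsed = (now_secs - self.last_refill_secs).max(0.0);
        self.tokens = (self.tokens + elapsed * self.rate).min(self.capacity);
        self.last_refill_secs = now_secs;
        if self.tokens >= tokens as f64 {
            self.tokens -= tokens as f64;
            true
        } else {
            false
        }
    }
}
```

Refilling lazily at acquire time means no background timer is needed per tenant, which matters when thousands of buckets live in the `DashMap` above.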
### Fair Scheduling
```rust
// Weighted fair queue for search requests
pub struct FairScheduler {
queues: HashMap<String, VecDeque<SearchRequest>>,
weights: HashMap<String, f32>, // Based on tier/quota
}
impl FairScheduler {
pub fn next(&mut self) -> Option<SearchRequest> {
// Weighted round-robin across tenants
// Prevents one tenant from monopolizing resources
let total_weight: f32 = self.weights.values().sum();
for (tenant_id, queue) in &mut self.queues {
let weight = self.weights.get(tenant_id).unwrap_or(&1.0);
let share = weight / total_weight;
// Probability of selecting this tenant's request
if rand::random::<f32>() < share {
if let Some(req) = queue.pop_front() {
return Some(req);
}
}
}
// Fallback: any available request
self.queues.values_mut()
.find_map(|q| q.pop_front())
}
}
```
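The probabilistic selection above can starve tenants on unlucky draws. A deterministic alternative is smooth weighted round-robin (the scheme nginx uses for upstream balancing), sketched below with illustrative names:

```rust
use std::collections::HashMap;

/// Smooth weighted round-robin: on each pick, every tenant's running
/// weight grows by its static weight; the maximum is selected and then
/// decremented by the total. Proportional and deterministic over any window.
struct SmoothWrr {
    weights: HashMap<String, f32>,
    current: HashMap<String, f32>,
}

impl SmoothWrr {
    fn new(weights: HashMap<String, f32>) -> Self {
        let current = weights.keys().map(|k| (k.clone(), 0.0)).collect();
        SmoothWrr { weights, current }
    }

    fn next(&mut self) -> Option<String> {
        let total: f32 = self.weights.values().sum();
        if total <= 0.0 {
            return None;
        }
        for (k, c) in self.current.iter_mut() {
            *c += self.weights[k];
        }
        let best = self
            .current
            .iter()
            .max_by(|a, b| a.1.partial_cmp(b.1).unwrap_or(std::cmp::Ordering::Equal))
            .map(|(k, _)| k.clone())?;
        *self.current.get_mut(&best).unwrap() -= total;
        Some(best)
    }
}
```

With weights 2:1 the pick sequence interleaves (a, b, a, a, b, a, ...) rather than bunching one tenant's requests, which keeps per-tenant tail latency smoother than random selection.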
---
## Tenant Lifecycle
### Create Tenant
```sql
SELECT ruvector_tenant_create('new-customer', '{
"display_name": "New Customer Inc.",
"max_vectors": 5000000,
"max_qps": 200,
"isolation_level": "shared",
"integrity_enabled": true
}'::jsonb);
```
### Suspend Tenant
```sql
-- Suspend (stops all operations, keeps data)
SELECT ruvector_tenant_suspend('bad-actor');
-- Resume
SELECT ruvector_tenant_resume('bad-actor');
```
### Delete Tenant
```sql
-- Soft delete (marks for cleanup)
SELECT ruvector_tenant_delete('churned-customer');
-- Hard delete (immediate, for compliance)
SELECT ruvector_tenant_delete('churned-customer', hard := true);
```
### Migrate Isolation Level
```sql
-- Promote to dedicated (online, no downtime)
SELECT ruvector_tenant_migrate('enterprise-customer', 'dedicated');
-- Status check
SELECT * FROM ruvector_tenant_migration_status('enterprise-customer');
```
---
## Shared Memory Layout
```rust
// Per-tenant state in shared memory
#[repr(C)]
pub struct TenantSharedState {
tenant_id_hash: u64, // Fast lookup key
integrity_state: u8, // 0=normal, 1=stress, 2=critical
lambda_cut: f32, // Current mincut value
request_count: AtomicU32, // For rate limiting
last_request_epoch: AtomicU64, // Rate limit window
flags: AtomicU32, // Suspended, migrating, etc.
}
// Tenant lookup table
pub struct TenantRegistry {
states: [TenantSharedState; MAX_TENANTS], // Fixed array in shmem
index: HashMap<String, usize>, // Heap-based lookup
}
```
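`tenant_id_hash` keeps variable-length strings out of the fixed-size shared-memory array. Any stable 64-bit hash works; FNV-1a is a dependency-free choice (the specific hash function is an assumption here, not something the design mandates):

```rust
/// FNV-1a 64-bit: a stable, allocation-free hash suitable for tenant IDs.
/// Must be identical across backends, so avoid RandomState-seeded hashers.
fn tenant_id_hash(tenant_id: &str) -> u64 {
    const FNV_OFFSET: u64 = 0xcbf29ce484222325;
    const FNV_PRIME: u64 = 0x100000001b3;
    let mut h = FNV_OFFSET;
    for b in tenant_id.as_bytes() {
        h ^= *b as u64;
        h = h.wrapping_mul(FNV_PRIME);
    }
    h
}
```

The stability requirement is the important part: every backend process must compute the same hash for the same tenant, so Rust's default `SipHash` with per-process random keys cannot be used for this field.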
---
## Monitoring
### Per-Tenant Metrics
```sql
-- Tenant dashboard
SELECT
t.id,
t.display_name,
t.isolation_level,
tq.current_vectors,
tq.max_vectors,
ROUND(100.0 * tq.current_vectors / tq.max_vectors, 1) AS usage_pct,
ts.integrity_state,
ts.lambda_cut,
ts.avg_search_latency_ms,
ts.searches_last_hour
FROM ruvector.tenants t
JOIN ruvector.tenant_quotas tq ON t.id = tq.tenant_id
JOIN ruvector.tenant_stats ts ON t.id = ts.tenant_id
ORDER BY tq.current_vectors DESC;
```
### Prometheus Metrics
```
# Per-tenant metrics
ruvector_tenant_vectors{tenant="acme-corp"} 1234567
ruvector_tenant_integrity_state{tenant="acme-corp"} 1
ruvector_tenant_lambda_cut{tenant="acme-corp"} 0.72
ruvector_tenant_search_latency_p99{tenant="acme-corp"} 15.2
ruvector_tenant_qps{tenant="acme-corp"} 45.3
ruvector_tenant_quota_usage{tenant="acme-corp",resource="vectors"} 0.62
```
---
## Security Considerations
### Tenant ID Validation
```rust
// Validate tenant_id before any operation
pub fn validate_tenant_context() -> Result<String, Error> {
let tenant_id = get_guc("ruvector.tenant_id")?;
// Check not empty
if tenant_id.is_empty() {
return Err(Error::NoTenantContext);
}
// Check tenant exists and not suspended
let tenant = get_tenant(&tenant_id)?;
if tenant.suspended_at.is_some() {
return Err(Error::TenantSuspended);
}
Ok(tenant_id)
}
```
### Audit Logging
```sql
-- Tenant operations audit log
CREATE TABLE ruvector.tenant_audit_log (
id BIGSERIAL PRIMARY KEY,
tenant_id TEXT NOT NULL,
operation TEXT NOT NULL, -- search, insert, delete, etc.
user_id TEXT, -- Application user
details JSONB,
ip_address INET,
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Enabled via GUC
SET ruvector.audit_enabled = true;
```
### Cross-Tenant Prevention
```rust
// Engine-level enforcement (defense in depth)
pub fn execute_search(request: &SearchRequest) -> Result<SearchResults, Error> {
let context_tenant = validate_tenant_context()?;
// Double-check request matches context
if let Some(req_tenant) = &request.tenant_id {
if req_tenant != &context_tenant {
// Log security event
log_security_event("tenant_mismatch", &context_tenant, req_tenant);
return Err(Error::TenantMismatch);
}
}
// Execute with tenant filter
execute_search_internal(request, &context_tenant)
}
```
---
## Testing Requirements
### Isolation Tests
- Tenant A cannot see Tenant B's data
- Tenant A's stress doesn't affect Tenant B's operations
- Suspended tenant cannot perform any operations
### Performance Tests
- Shared isolation: < 5% overhead vs single-tenant
- Dedicated isolation: equivalent to single-tenant
- Rate limiting adds < 1ms latency
### Scale Tests
- 1000+ tenants on shared infrastructure
- 100+ tenants with dedicated isolation
- Tenant migration under load
---
## Example: SaaS Application
```python
# Application code
class VectorService:
def __init__(self, db_pool):
self.pool = db_pool
def search(self, tenant_id: str, query_vec: list, k: int = 10):
with self.pool.connection() as conn:
# Set tenant context
conn.execute("SELECT set_config('ruvector.tenant_id', %s, false)", [tenant_id])  # SET cannot take bind parameters
# Search (automatically scoped to tenant)
results = conn.execute("""
SELECT id, content, vec <-> %s AS distance
FROM embeddings
ORDER BY vec <-> %s
LIMIT %s
""", [query_vec, query_vec, k])
return results.fetchall()
def insert(self, tenant_id: str, content: str, vec: list):
with self.pool.connection() as conn:
conn.execute("SELECT set_config('ruvector.tenant_id', %s, false)", [tenant_id])  # SET cannot take bind parameters
# Insert (tenant_id auto-populated from context)
conn.execute("""
INSERT INTO embeddings (content, vec)
VALUES (%s, %s)
""", [content, vec])
```
