Squashed 'vendor/ruvector/' content from commit b64c2172

git-subtree-dir: vendor/ruvector git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
2026-02-28 14:39:40 -05:00
commit d803bfe2b1
7854 changed files with 3522914 additions and 0 deletions
--- a/crates/ruvector-postgres/docs/ivfflat_access_method.md
+++ b/crates/ruvector-postgres/docs/ivfflat_access_method.md
@@ -0,0 +1,304 @@
+# IVFFlat Index Access Method
+
+## Overview
+
+The IVFFlat (Inverted File with Flat, i.e. uncompressed, vector storage) index is a PostgreSQL index access method for approximate nearest neighbor (ANN) search. It partitions the vector space into clusters with k-means and answers a query by probing only the clusters whose centroids are nearest to the query vector.
+
+## Architecture
+
+### Storage Layout
+
+The IVFFlat index uses PostgreSQL's page-based storage with the following structure:
+
+```
+┌─────────────────┬──────────────────────┬─────────────────────┐
+│  Page 0         │  Pages 1-N           │  Pages N+1-M        │
+│  (Metadata)     │  (Centroids)         │  (Inverted Lists)   │
+└─────────────────┴──────────────────────┴─────────────────────┘
+```
+
+#### Page 0: Metadata Page
+```rust
+struct IvfFlatMetaPage {
+    magic: u32,              // 0x49564646 ("IVFF")
+    lists: u32,              // Number of clusters
+    probes: u32,             // Default probes for search
+    dimensions: u32,         // Vector dimensions
+    trained: u32,            // 0=untrained, 1=trained
+    vector_count: u64,       // Total vectors indexed
+    metric: u32,             // Distance metric (0=L2, 1=IP, 2=Cosine, 3=L1)
+    centroid_start_page: u32,// First centroid page
+    lists_start_page: u32,   // First inverted list page
+    reserved: [u32; 16],     // Future expansion
+}
+```
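+
+A minimal Rust sketch of how the metadata page might be validated. It assumes the struct is laid out with `#[repr(C)]`, which is not stated above; the constant `IVFFLAT_MAGIC` and the helpers `meta_page_size` and `is_valid_meta` are hypothetical names used only for illustration.
+
+```rust
+use std::mem::size_of;
+
+/// Mirror of the metadata struct above; `#[repr(C)]` is an assumption.
+#[repr(C)]
+#[derive(Debug, Clone, Copy)]
+pub struct IvfFlatMetaPage {
+    pub magic: u32,
+    pub lists: u32,
+    pub probes: u32,
+    pub dimensions: u32,
+    pub trained: u32,
+    pub vector_count: u64,
+    pub metric: u32,
+    pub centroid_start_page: u32,
+    pub lists_start_page: u32,
+    pub reserved: [u32; 16],
+}
+
+/// ASCII "IVFF" read as a big-endian u32, i.e. 0x49564646.
+pub const IVFFLAT_MAGIC: u32 = u32::from_be_bytes(*b"IVFF");
+
+/// Size of the struct in bytes under the assumed `#[repr(C)]` layout.
+pub fn meta_page_size() -> usize {
+    size_of::<IvfFlatMetaPage>()
+}
+
+/// Hypothetical sanity check before trusting a metadata page read from disk.
+pub fn is_valid_meta(meta: &IvfFlatMetaPage) -> bool {
+    meta.magic == IVFFLAT_MAGIC
+        && meta.lists > 0
+        && meta.dimensions > 0
+        && meta.trained <= 1
+        && meta.metric <= 3
+}
+```
+
+Under `#[repr(C)]` on a typical 64-bit target, 4 bytes of padding are inserted before `vector_count` so that the `u64` is 8-byte aligned, which brings the struct to 112 bytes.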
+
+#### Pages 1-N: Centroid Pages
+Each centroid entry contains:
+- Cluster ID
+- Inverted list page reference
+- Vector count in cluster
+- Centroid vector data (dimensions × 4 bytes)
+
+#### Pages N+1-M: Inverted List Pages
+Each vector entry contains:
+- Heap tuple ID (block number + offset)
+- Vector data (dimensions × 4 bytes)
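+
+To get a feel for how densely the inverted lists pack, the sketch below estimates the number of entries per page. It assumes PostgreSQL's default 8 KB block size, a 24-byte page header, a 6-byte heap tuple ID, and no per-entry alignment or line-pointer overhead; the actual on-disk format of this index may differ, and all names are illustrative.
+
+```rust
+/// Default PostgreSQL block size in bytes (assumed; it is a build-time option).
+pub const BLOCK_SIZE: usize = 8192;
+/// Size of the standard page header in bytes.
+pub const PAGE_HEADER_SIZE: usize = 24;
+/// Heap tuple ID: 4-byte block number + 2-byte offset.
+pub const TID_SIZE: usize = 6;
+
+/// Bytes needed for one inverted-list entry (tuple ID + f32 vector data).
+pub fn entry_size(dimensions: usize) -> usize {
+    TID_SIZE + dimensions * 4
+}
+
+/// Upper bound on entries per page, ignoring alignment and line pointers.
+pub fn entries_per_page(dimensions: usize) -> usize {
+    (BLOCK_SIZE - PAGE_HEADER_SIZE) / entry_size(dimensions)
+}
+```
+
+For 1536-dimensional vectors an entry takes 6,150 bytes, so under these assumptions only one entry fits on a page, while 128-dimensional vectors pack about 15 per page.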
+
+## Index Building
+
+### 1. Training Phase
+
+The index must be trained before use. Training runs as part of index creation:
+
+```sql
+-- Create index with training
+CREATE INDEX ON items USING ruivfflat (embedding vector_l2_ops)
+  WITH (lists = 100);
+```
+
+Training process:
+1. **Sample Collection**: Up to 50,000 random vectors sampled from the heap
+2. **K-means++ Initialization**: Intelligent centroid seeding for better convergence
+3. **K-means Clustering**: 10 iterations of Lloyd's algorithm
+4. **Centroid Storage**: Trained centroids written to index pages
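+
+The training steps above can be sketched in plain Rust as follows. This is an illustrative, in-memory version and not the crate's actual code: the `XorShift` generator stands in for a real random source, squared L2 distance is assumed, and `samples` is assumed to be non-empty with `k >= 1`.
+
+```rust
+/// Tiny deterministic PRNG (xorshift64*), a stand-in for a real RNG.
+pub struct XorShift(u64);
+
+impl XorShift {
+    pub fn new(seed: u64) -> Self {
+        XorShift(seed.max(1))
+    }
+
+    /// Uniform f32 in [0, 1).
+    pub fn next_f32(&mut self) -> f32 {
+        self.0 ^= self.0 >> 12;
+        self.0 ^= self.0 << 25;
+        self.0 ^= self.0 >> 27;
+        let bits = self.0.wrapping_mul(0x2545_F491_4F6C_DD1D);
+        (bits >> 40) as f32 / (1u64 << 24) as f32
+    }
+}
+
+fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
+    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
+}
+
+/// Index of, and squared distance to, the nearest centroid.
+fn nearest(v: &[f32], centroids: &[Vec<f32>]) -> (usize, f32) {
+    let mut best = (0, f32::INFINITY);
+    for (i, c) in centroids.iter().enumerate() {
+        let d = l2_sq(v, c);
+        if d < best.1 {
+            best = (i, d);
+        }
+    }
+    best
+}
+
+/// K-means++ seeding: each new centroid is drawn with probability
+/// proportional to its squared distance from the nearest chosen centroid.
+pub fn kmeans_pp_init(samples: &[Vec<f32>], k: usize, rng: &mut XorShift) -> Vec<Vec<f32>> {
+    let first = (rng.next_f32() * samples.len() as f32) as usize;
+    let mut centroids = vec![samples[first.min(samples.len() - 1)].clone()];
+    while centroids.len() < k {
+        let weights: Vec<f32> = samples.iter().map(|s| nearest(s, &centroids).1).collect();
+        let total: f32 = weights.iter().sum();
+        let mut target = rng.next_f32() * total;
+        let mut chosen = samples.len() - 1;
+        for (i, w) in weights.iter().enumerate() {
+            if target < *w {
+                chosen = i;
+                break;
+            }
+            target -= w;
+        }
+        centroids.push(samples[chosen].clone());
+    }
+    centroids
+}
+
+/// Lloyd's algorithm: alternate nearest-centroid assignment and mean update.
+pub fn kmeans(samples: &[Vec<f32>], k: usize, iterations: usize, seed: u64) -> Vec<Vec<f32>> {
+    let dim = samples[0].len();
+    let mut rng = XorShift::new(seed);
+    let mut centroids = kmeans_pp_init(samples, k, &mut rng);
+    for _ in 0..iterations {
+        let mut sums = vec![vec![0.0f32; dim]; k];
+        let mut counts = vec![0usize; k];
+        for s in samples {
+            let (c, _) = nearest(s, &centroids);
+            counts[c] += 1;
+            for (acc, x) in sums[c].iter_mut().zip(s) {
+                *acc += x;
+            }
+        }
+        for c in 0..k {
+            // Empty clusters keep their previous centroid.
+            if counts[c] > 0 {
+                for (dst, sum) in centroids[c].iter_mut().zip(&sums[c]) {
+                    *dst = sum / counts[c] as f32;
+                }
+            }
+        }
+    }
+    centroids
+}
+```
+
+Calling `kmeans(&samples, 100, 10, seed)` would correspond to `lists = 100` with the 10 iterations described above. Seeding with k-means++ spreads the initial centroids apart, which tends to reduce the number of Lloyd iterations needed, although Lloyd's algorithm can still settle in a local minimum.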
+
+### 2. Vector Assignment
+
+After training, all vectors are assigned to their nearest centroid:
+- Calculate distance to each centroid
+- Assign to nearest centroid's inverted list
+- Store in inverted list pages
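+
+A small sketch of the assignment step, using in-memory vectors instead of index pages. The names `Entry`, `nearest_centroid`, and `assign_to_lists` are hypothetical, a `u64` row id stands in for the heap tuple ID, and squared L2 distance is assumed.
+
+```rust
+/// (row id, vector); the row id stands in for a heap tuple ID.
+pub type Entry = (u64, Vec<f32>);
+
+/// Squared Euclidean distance; enough to decide which centroid is nearest.
+fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
+    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
+}
+
+/// Index of the centroid nearest to `v` (ties go to the lower index).
+pub fn nearest_centroid(v: &[f32], centroids: &[Vec<f32>]) -> usize {
+    let mut best = 0;
+    let mut best_dist = f32::INFINITY;
+    for (i, c) in centroids.iter().enumerate() {
+        let d = l2_sq(v, c);
+        if d < best_dist {
+            best = i;
+            best_dist = d;
+        }
+    }
+    best
+}
+
+/// Build one inverted list per centroid.
+pub fn assign_to_lists(vectors: &[Entry], centroids: &[Vec<f32>]) -> Vec<Vec<Entry>> {
+    let mut lists: Vec<Vec<Entry>> = vec![Vec::new(); centroids.len()];
+    for (id, v) in vectors {
+        lists[nearest_centroid(v, centroids)].push((*id, v.clone()));
+    }
+    lists
+}
+```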
+
+## Search Process
+
+### Query Execution
+
+```sql
+SELECT * FROM items
+ORDER BY embedding <-> '[1,2,3,...]'
+LIMIT 10;
+```
+
+Search algorithm:
+1. **Find Nearest Centroids**: Calculate distance from query to all centroids
+2. **Probe Selection**: Select `probes` nearest centroids
+3. **List Scanning**: Scan inverted lists for selected centroids
+4. **Re-ranking**: Calculate exact distances to all candidates
+5. **Top-K Selection**: Return k nearest vectors
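+
+The five steps above can be sketched as an in-memory search routine. This is a simplified illustration under stated assumptions (L2 distance, everything held in `Vec`s, a `u64` row id instead of a heap tuple ID); `IvfIndex` and its fields are hypothetical names, not the crate's types.
+
+```rust
+pub struct IvfIndex {
+    pub centroids: Vec<Vec<f32>>,
+    /// `lists[i]` holds the (row id, vector) pairs assigned to `centroids[i]`.
+    pub lists: Vec<Vec<(u64, Vec<f32>)>>,
+}
+
+fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
+    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
+}
+
+impl IvfIndex {
+    /// Return up to `k` (row id, distance) pairs, nearest first.
+    pub fn search(&self, query: &[f32], probes: usize, k: usize) -> Vec<(u64, f32)> {
+        // Steps 1-2: rank all centroids by distance, keep the `probes` nearest.
+        let mut order: Vec<(usize, f32)> = self
+            .centroids
+            .iter()
+            .enumerate()
+            .map(|(i, c)| (i, l2_sq(query, c)))
+            .collect();
+        order.sort_by(|a, b| a.1.total_cmp(&b.1));
+
+        // Steps 3-4: scan the probed lists and compute exact distances.
+        let mut candidates: Vec<(u64, f32)> = order
+            .iter()
+            .take(probes)
+            .flat_map(|&(i, _)| self.lists[i].iter())
+            .map(|(id, v)| (*id, l2_sq(query, v).sqrt()))
+            .collect();
+
+        // Step 5: keep the k nearest candidates.
+        candidates.sort_by(|a, b| a.1.total_cmp(&b.1));
+        candidates.truncate(k);
+        candidates
+    }
+}
+```
+
+Because only the probed lists are scanned, a true nearest neighbor that sits in an unprobed list is missed. This is the source of the recall loss at low `probes` values.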
+
+### Performance Tuning
+
+#### Lists Parameter
+
+Controls the number of clusters:
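+- **Small values (10-50)**: Faster build, slower search (each list is large), higher recall at a fixed `probes`
+- **Medium values (100-200)**: Balanced performance
+- **Large values (500-1000)**: Slower build, faster search (each list is small), lower recall at a fixed `probes` unless `probes` is raised as well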
+
+Rule of thumb: `lists = sqrt(total_vectors)`
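+
+As a quick illustration of the rule of thumb, a hypothetical helper (not part of the extension) could compute a starting value like this:
+
+```rust
+/// Rule-of-thumb list count: the rounded square root of the row count, at least 1.
+pub fn suggested_lists(total_vectors: u64) -> u32 {
+    ((total_vectors as f64).sqrt().round() as u32).max(1)
+}
+```
+
+For 1,000,000 vectors this gives `lists = 1000`; treat the result as a starting point to be checked against measured recall and latency.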
+
+#### Probes Parameter
+
+Controls search accuracy vs speed:
+- **Low probes (1-3)**: Fast search, lower recall
+- **Medium probes (5-10)**: Balanced
+- **High probes (20-50)**: Slower search, higher recall
+
+Set dynamically:
+```sql
+SET ruvector.ivfflat_probes = 10;
+```
+
+## Configuration
+
+### GUC Variables
+
+```sql
+-- Set default probes for IVFFlat searches
+SET ruvector.ivfflat_probes = 10;
+
+-- View current setting
+SHOW ruvector.ivfflat_probes;
+```
+
+### Index Options
+
+```sql
+CREATE INDEX ON table_name USING ruivfflat (column_name opclass)
+  WITH (lists = value, probes = value);
+```
+
+Available options:
+- `lists`: Number of clusters (default: 100)
+- `probes`: Default probes for searches (default: 1)
+
+## Operator Classes
+
+### Vector L2 (Euclidean)
+```sql
+CREATE INDEX ON items USING ruivfflat (embedding vector_l2_ops)
+  WITH (lists = 100);
+```
+
+### Vector Inner Product
+```sql
+CREATE INDEX ON items USING ruivfflat (embedding vector_ip_ops)
+  WITH (lists = 100);
+```
+
+### Vector Cosine
+```sql
+CREATE INDEX ON items USING ruivfflat (embedding vector_cosine_ops)
+  WITH (lists = 100);
+```
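+
+For reference, the distance functions behind these operator classes, together with the L1 metric code listed in the metadata page, can be sketched as below. The numeric codes follow the `metric` field comment; negating the inner product so that smaller values mean closer vectors is an assumption borrowed from pgvector's convention, and `Metric` is a hypothetical type.
+
+```rust
+/// Distance metrics, numbered as in the metadata page's `metric` field.
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub enum Metric {
+    L2 = 0,
+    InnerProduct = 1,
+    Cosine = 2,
+    L1 = 3,
+}
+
+fn dot(x: &[f32], y: &[f32]) -> f32 {
+    x.iter().zip(y).map(|(p, q)| p * q).sum()
+}
+
+impl Metric {
+    /// Decode the stored metric code; unknown codes yield `None`.
+    pub fn from_code(code: u32) -> Option<Metric> {
+        match code {
+            0 => Some(Metric::L2),
+            1 => Some(Metric::InnerProduct),
+            2 => Some(Metric::Cosine),
+            3 => Some(Metric::L1),
+            _ => None,
+        }
+    }
+
+    /// Distance between `a` and `b`; smaller means more similar.
+    pub fn distance(self, a: &[f32], b: &[f32]) -> f32 {
+        match self {
+            Metric::L2 => a
+                .iter()
+                .zip(b)
+                .map(|(p, q)| (p - q) * (p - q))
+                .sum::<f32>()
+                .sqrt(),
+            // Assumption: negated so that ascending order means "closer".
+            Metric::InnerProduct => -dot(a, b),
+            // Undefined (NaN) when either vector has zero norm.
+            Metric::Cosine => 1.0 - dot(a, b) / (dot(a, a).sqrt() * dot(b, b).sqrt()),
+            Metric::L1 => a.iter().zip(b).map(|(p, q)| (p - q).abs()).sum::<f32>(),
+        }
+    }
+}
+```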
+
+## Performance Characteristics
+
+### Time Complexity
+- **Build**: O(n × k × d × iterations) where n=vectors, k=lists, d=dimensions
+- **Insert**: O(k × d) - find nearest centroid
+- **Search**: O(k × d + probes × (n/k) × d) - rank all centroids, then scan the probed lists and re-rank
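+
+To make the search cost concrete, the sketch below counts distance computations per query. It assumes evenly sized lists, which real data rarely gives exactly, and `search_distance_evals` is a hypothetical helper.
+
+```rust
+/// Approximate distance computations for one query: one per centroid,
+/// plus one per vector in the probed lists (lists assumed evenly sized).
+pub fn search_distance_evals(n: u64, lists: u64, probes: u64) -> u64 {
+    let lists = lists.max(1);
+    let probes = probes.min(lists);
+    lists + probes * n / lists
+}
+```
+
+With 1,000,000 vectors, `lists = 1000`, and `probes = 10`, this comes to 1,000 centroid comparisons plus 10,000 vector comparisons, compared with 1,000,000 for an exhaustive scan.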
+
+### Space Complexity
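+- **Index Size**: about n × d × 4 + k × d × 4 bytes of vector and centroid data, i.e. O((n + k) × d)
+- Approximately the same size as the raw vectors plus the centroids, before per-entry and page overhead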
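+
+A worked instance of the size formula, as a small hypothetical helper. It counts only the 4-byte float data and ignores tuple IDs, page headers, and the metadata page.
+
+```rust
+/// Bytes of f32 data held by the index: n vectors plus `lists` centroids.
+pub fn estimated_index_bytes(n: u64, lists: u64, dimensions: u64) -> u64 {
+    n * dimensions * 4 + lists * dimensions * 4
+}
+```
+
+For 1,000,000 vectors of 1536 dimensions with `lists = 100`, this gives 6,144,614,400 bytes (roughly 5.7 GiB), of which the centroids account for only 614,400 bytes.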
+
+### Recall vs Speed Trade-offs
+
+| Probes | Recall | Speed    | Use Case                    |
+|--------|--------|----------|-----------------------------|
+| 1      | 60-70% | Fastest  | Very fast approximate search|
+| 5      | 80-85% | Fast     | Balanced performance        |
+| 10     | 90-95% | Medium   | High recall applications    |
+| 20+    | 95-99% | Slower   | Near-exact search           |
+
+## Examples
+
+### Basic Usage
+
+```sql
+-- Create table
+CREATE TABLE documents (
+    id serial PRIMARY KEY,
+    content text,
+    embedding vector(1536)
+);
+
+-- Insert vectors
+INSERT INTO documents (content, embedding)
+VALUES
+    ('First document', '[0.1, 0.2, ...]'),
+    ('Second document', '[0.3, 0.4, ...]');
+
+-- Create IVFFlat index
+CREATE INDEX ON documents USING ruivfflat (embedding vector_l2_ops)
+  WITH (lists = 100);
+
+-- Search
+SELECT id, content, embedding <-> '[0.5, 0.6, ...]' AS distance
+FROM documents
+ORDER BY embedding <-> '[0.5, 0.6, ...]'
+LIMIT 10;
+```
+
+### Advanced Configuration
+
+```sql
+-- Large dataset with many lists
+CREATE INDEX ON large_table USING ruivfflat (embedding vector_cosine_ops)
+  WITH (lists = 1000);
+
+-- High-recall search
+SET ruvector.ivfflat_probes = 20;
+SELECT * FROM large_table
+ORDER BY embedding <=> '[...]'
+LIMIT 100;
+```
+
+### Index Statistics
+
+```sql
+-- Get index information
+SELECT * FROM ruvector_ivfflat_stats('documents_embedding_idx');
+
+-- Returns:
+--  lists | probes | dimensions | trained | vector_count | metric
+-- -------+--------+------------+---------+--------------+-----------
+--    100 |      1 |       1536 | true    |      1000000 | euclidean
+```
+
+## Comparison with HNSW
+
+| Feature            | IVFFlat               | HNSW                |
+|--------------------|-----------------------|---------------------|
+| Build Time         | Fast (minutes)        | Slow (hours)        |
+| Search Speed       | Fast                  | Faster              |
+| Recall             | 80-95%                | 95-99%              |
+| Memory             | Low                   | High                |
+| Incremental Insert | Fast                  | Medium              |
+| Best For           | Large static datasets | High-recall queries |
+
+## Maintenance
+
+### Rebuilding Index
+
+After significant data changes, rebuild the index so that the centroids are retrained on the current data:
+
+```sql
+REINDEX INDEX documents_embedding_idx;
+```
+
+### Monitoring
+
+```sql
+-- Check index size
+SELECT pg_size_pretty(pg_relation_size('documents_embedding_idx'));
+
+-- Check if trained
+SELECT * FROM ruvector_ivfflat_stats('documents_embedding_idx');
+```
+
+## Implementation Details
+
+### Zero-Copy Vector Access
+
+The implementation uses zero-copy techniques:
+- Read vector data directly from heap tuples
+- No intermediate buffer allocation
+- Compare directly with centroids in-place
+
+### Memory Management
+
+- Uses PostgreSQL's palloc/pfree memory contexts
+- Automatic cleanup on transaction end
+- No manual memory management required
+
+### Concurrency
+
+- Safe for concurrent reads
+- Index building is single-threaded
+- Inserts are serialized per cluster
+
+## Limitations
+
+1. **Training Required**: Cannot insert before training completes
+2. **Fixed Clusters**: Number of lists cannot change after build
+3. **No In-Place Updates**: Updating an indexed vector requires a delete followed by an insert
+4. **Memory**: All centroids must fit in memory during search
+
+## Future Enhancements
+
+- [ ] Parallel index building
+- [ ] Incremental training for inserts
+- [ ] Product quantization (IVF-PQ)
+- [ ] GPU acceleration
+- [ ] Adaptive probe selection
+- [ ] Cluster rebalancing
+
+## References
+
+1. [pgvector](https://github.com/pgvector/pgvector) - IVFFlat implementation for PostgreSQL
+2. [FAISS](https://github.com/facebookresearch/faiss) - Facebook AI Similarity Search
+3. "Product Quantization for Nearest Neighbor Search" - Jégou et al., 2011
+4. PostgreSQL Index Access Method Documentation