# IVFFlat Index Access Method
## Overview
The IVFFlat (Inverted File with Flat quantization) index is a PostgreSQL access method implementation for approximate nearest neighbor (ANN) search. It partitions the vector space into clusters using k-means clustering, enabling fast similarity search by probing only the most relevant clusters.
## Architecture
### Storage Layout
The IVFFlat index uses PostgreSQL's page-based storage with the following structure:
```
┌────────────────┬─────────────────┬────────────────────┐
│     Page 0     │    Pages 1-N    │    Pages N+1-M     │
│   (Metadata)   │   (Centroids)   │  (Inverted Lists)  │
└────────────────┴─────────────────┴────────────────────┘
```
#### Page 0: Metadata Page
```rust
struct IvfFlatMetaPage {
    magic: u32,               // 0x49564646 ("IVFF")
    lists: u32,               // Number of clusters
    probes: u32,              // Default probes for search
    dimensions: u32,          // Vector dimensions
    trained: u32,             // 0=untrained, 1=trained
    vector_count: u64,        // Total vectors indexed
    metric: u32,              // Distance metric (0=L2, 1=IP, 2=Cosine, 3=L1)
    centroid_start_page: u32, // First centroid page
    lists_start_page: u32,    // First inverted list page
    reserved: [u32; 16],      // Future expansion
}
```
#### Pages 1-N: Centroid Pages
Each centroid entry contains:
- Cluster ID
- Inverted list page reference
- Vector count in cluster
- Centroid vector data (dimensions × 4 bytes)
#### Pages N+1-M: Inverted List Pages
Each vector entry contains:
- Heap tuple ID (block number + offset)
- Vector data (dimensions × 4 bytes)
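The two entry layouts above can be sketched as plain Rust structs. These are illustrative in-memory shapes, not the crate's actual serialized on-page definitions:

```rust
// Illustrative sketches of the entries described above; field names
// are stand-ins, not the crate's real layouts.

#[allow(dead_code)]
struct CentroidEntry {
    cluster_id: u32,      // Cluster ID
    list_start_page: u32, // First page of this cluster's inverted list
    vector_count: u32,    // Number of vectors assigned to the cluster
    centroid: Vec<f32>,   // Centroid vector data (dimensions * 4 bytes)
}

#[allow(dead_code)]
struct ListEntry {
    heap_block: u32,  // Heap tuple ID: block number ...
    heap_offset: u16, // ... plus offset within the block
    vector: Vec<f32>, // Full vector data (dimensions * 4 bytes)
}

/// Payload bytes one list entry occupies, assuming a 6-byte tuple ID
/// and 4-byte floats (page-level headers and alignment ignored).
fn list_entry_bytes(dimensions: usize) -> usize {
    4 + 2 + dimensions * 4
}
```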
## Index Building
### 1. Training Phase
The index must be trained before use:
```sql
-- Create index with training
CREATE INDEX ON items USING ruivfflat (embedding vector_l2_ops)
WITH (lists = 100);
```
Training process:
1. **Sample Collection**: Up to 50,000 random vectors sampled from the heap
2. **K-means++ Initialization**: Intelligent centroid seeding for better convergence
3. **K-means Clustering**: 10 iterations of Lloyd's algorithm
4. **Centroid Storage**: Trained centroids written to index pages
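Steps 1-3 can be sketched in plain Rust. This is a minimal, deterministic stand-in (a farthest-point variant of k-means++ seeding, no random sampling, single Lloyd step shown), not the crate's implementation:

```rust
// Minimal training sketch over an L2 metric. Names are illustrative.

fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Seed k centroids: take the first point, then repeatedly add the
/// point farthest from its nearest already-chosen centroid.
fn seed_centroids(data: &[Vec<f32>], k: usize) -> Vec<Vec<f32>> {
    let mut centroids = vec![data[0].clone()];
    while centroids.len() < k {
        let mut best = (0usize, -1.0f32);
        for (i, v) in data.iter().enumerate() {
            let d = centroids
                .iter()
                .map(|c| l2_sq(v, c))
                .fold(f32::MAX, f32::min);
            if d > best.1 {
                best = (i, d);
            }
        }
        centroids.push(data[best.0].clone());
    }
    centroids
}

/// One Lloyd iteration: assign every vector to its nearest centroid,
/// then move each centroid to the mean of its assigned vectors.
fn lloyd_step(data: &[Vec<f32>], centroids: &mut [Vec<f32>]) {
    let dim = centroids[0].len();
    let mut sums = vec![vec![0.0f32; dim]; centroids.len()];
    let mut counts = vec![0usize; centroids.len()];
    for v in data {
        let (best, _) = centroids
            .iter()
            .enumerate()
            .map(|(i, c)| (i, l2_sq(v, c)))
            .min_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
            .unwrap();
        for (s, x) in sums[best].iter_mut().zip(v) {
            *s += *x;
        }
        counts[best] += 1;
    }
    for (i, c) in centroids.iter_mut().enumerate() {
        if counts[i] > 0 {
            for (cj, s) in c.iter_mut().zip(&sums[i]) {
                *cj = *s / counts[i] as f32;
            }
        }
    }
}
```

The real build repeats the Lloyd step (10 iterations, per the list above) over the sampled vectors before writing centroids to index pages.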
### 2. Vector Assignment
After training, all vectors are assigned to their nearest centroid:
- Calculate distance to each centroid
- Assign to nearest centroid's inverted list
- Store in inverted list pages
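The assignment step reduces to a linear nearest-centroid scan, O(k × d) per vector; a sketch with illustrative names:

```rust
/// Find the index of the nearest centroid for a vector (the insert
/// path after training). Illustrative sketch, L2 metric.
fn assign_to_list(v: &[f32], centroids: &[Vec<f32>]) -> usize {
    let mut best = (0usize, f32::MAX);
    for (i, c) in centroids.iter().enumerate() {
        let d: f32 = v.iter().zip(c).map(|(x, y)| (x - y) * (x - y)).sum();
        if d < best.1 {
            best = (i, d);
        }
    }
    best.0
}
```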
## Search Process
### Query Execution
```sql
SELECT * FROM items
ORDER BY embedding <-> '[1,2,3,...]'
LIMIT 10;
```
Search algorithm:
1. **Find Nearest Centroids**: Calculate distance from query to all centroids
2. **Probe Selection**: Select `probes` nearest centroids
3. **List Scanning**: Scan inverted lists for selected centroids
4. **Re-ranking**: Calculate exact distances to all candidates
5. **Top-K Selection**: Return k nearest vectors
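The five steps above can be sketched over in-memory data: rank centroids by distance to the query, probe the nearest lists, and keep the k best candidates. Here `lists[i]` stands in for the inverted list of centroid `i`; names are illustrative, not the crate's API:

```rust
fn dist_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Sketch of IVFFlat search: returns (tuple-id stand-in, distance)
/// pairs for the k nearest candidates found in the probed lists.
fn ivfflat_search(
    query: &[f32],
    centroids: &[Vec<f32>],
    lists: &[Vec<(u64, Vec<f32>)>],
    probes: usize,
    k: usize,
) -> Vec<(u64, f32)> {
    // Steps 1-2: rank all centroids, keep the `probes` nearest.
    let mut order: Vec<usize> = (0..centroids.len()).collect();
    order.sort_by(|&a, &b| {
        dist_sq(query, &centroids[a])
            .partial_cmp(&dist_sq(query, &centroids[b]))
            .unwrap()
    });
    // Steps 3-4: scan the selected lists with exact distances.
    let mut candidates: Vec<(u64, f32)> = Vec::new();
    for &c in order.iter().take(probes) {
        for (tid, v) in &lists[c] {
            candidates.push((*tid, dist_sq(query, v)));
        }
    }
    // Step 5: top-k selection.
    candidates.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
    candidates.truncate(k);
    candidates
}
```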
### Performance Tuning
#### Lists Parameter
Controls the number of clusters:
- **Small values (10-50)**: Fast build; each list is large, so every probe scans many vectors (slower searches, but each probe covers more of the data)
- **Medium values (100-200)**: Balanced build time, search speed, and recall
- **Large values (500-1000)**: Slower build; each list is small, so probed lists scan quickly, but more probes are needed to maintain recall
Rule of thumb: `lists = sqrt(total_vectors)`
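The rule of thumb is a one-liner; a hypothetical helper:

```rust
/// `lists = sqrt(total_vectors)`, rounded to the nearest integer.
/// Hypothetical helper, not a crate API.
fn choose_lists(total_vectors: u64) -> u32 {
    (total_vectors as f64).sqrt().round() as u32
}
```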
#### Probes Parameter
Controls search accuracy vs speed:
- **Low probes (1-3)**: Fast search, lower recall
- **Medium probes (5-10)**: Balanced
- **High probes (20-50)**: Slower search, higher recall
Set dynamically:
```sql
SET ruvector.ivfflat_probes = 10;
```
## Configuration
### GUC Variables
```sql
-- Set default probes for IVFFlat searches
SET ruvector.ivfflat_probes = 10;
-- View current setting
SHOW ruvector.ivfflat_probes;
```
### Index Options
```sql
CREATE INDEX ON table USING ruivfflat (column opclass)
WITH (lists = value, probes = value);
```
Available options:
- `lists`: Number of clusters (default: 100)
- `probes`: Default probes for searches (default: 1)
## Operator Classes
### Vector L2 (Euclidean)
```sql
CREATE INDEX ON items USING ruivfflat (embedding vector_l2_ops)
WITH (lists = 100);
```
### Vector Inner Product
```sql
CREATE INDEX ON items USING ruivfflat (embedding vector_ip_ops)
WITH (lists = 100);
```
### Vector Cosine
```sql
CREATE INDEX ON items USING ruivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```
## Performance Characteristics
### Time Complexity
- **Build**: O(n × k × d × iterations) where n=vectors, k=lists, d=dimensions
- **Insert**: O(k × d) - find nearest centroid
- **Search**: O(probes × (n/k) × d) - probe lists and re-rank
### Space Complexity
- **Index Size**: O(n × d × 4 + k × d × 4)
- Approximately the same size as the raw vectors, plus the centroids
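The space formula can be written as a back-of-the-envelope estimator (entry headers and page overhead ignored; a hypothetical helper, not a crate API):

```rust
/// Estimated index bytes: n vectors plus k centroids, each holding
/// d lanes of 4-byte floats. Ignores tuple IDs and page headers.
fn estimated_index_bytes(n: u64, k: u64, d: u64) -> u64 {
    n * d * 4 + k * d * 4
}
```

For example, one million 1536-dimensional vectors in 100 lists come to roughly 6.1 GB before overhead.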
### Recall vs Speed Trade-offs
| Probes | Recall | Speed   | Use Case                     |
|--------|--------|---------|------------------------------|
| 1      | 60-70% | Fastest | Very fast approximate search |
| 5      | 80-85% | Fast    | Balanced performance         |
| 10     | 90-95% | Medium  | High recall applications     |
| 20+    | 95-99% | Slower  | Near-exact search            |
## Examples
### Basic Usage
```sql
-- Create table
CREATE TABLE documents (
id serial PRIMARY KEY,
content text,
embedding vector(1536)
);
-- Insert vectors
INSERT INTO documents (content, embedding)
VALUES
('First document', '[0.1, 0.2, ...]'),
('Second document', '[0.3, 0.4, ...]');
-- Create IVFFlat index
CREATE INDEX ON documents USING ruivfflat (embedding vector_l2_ops)
WITH (lists = 100);
-- Search
SELECT id, content, embedding <-> '[0.5, 0.6, ...]' AS distance
FROM documents
ORDER BY embedding <-> '[0.5, 0.6, ...]'
LIMIT 10;
```
### Advanced Configuration
```sql
-- Large dataset with many lists
CREATE INDEX ON large_table USING ruivfflat (embedding vector_cosine_ops)
WITH (lists = 1000);
-- High-recall search
SET ruvector.ivfflat_probes = 20;
SELECT * FROM large_table
ORDER BY embedding <=> '[...]'
LIMIT 100;
```
### Index Statistics
```sql
-- Get index information
SELECT * FROM ruvector_ivfflat_stats('documents_embedding_idx');
-- Returns:
-- lists | probes | dimensions | trained | vector_count | metric
-- ------+--------+------------+---------+--------------+----------
--   100 |      1 |       1536 | true    |      1000000 | euclidean
```
## Comparison with HNSW
| Feature            | IVFFlat               | HNSW                |
|--------------------|-----------------------|---------------------|
| Build Time         | Fast (minutes)        | Slow (hours)        |
| Search Speed       | Fast                  | Faster              |
| Recall             | 80-95%                | 95-99%              |
| Memory             | Low                   | High                |
| Incremental Insert | Fast                  | Medium              |
| Best For           | Large static datasets | High-recall queries |
## Maintenance
### Rebuilding Index
After significant data changes, rebuild for better clustering:
```sql
REINDEX INDEX documents_embedding_idx;
```
### Monitoring
```sql
-- Check index size
SELECT pg_size_pretty(pg_relation_size('documents_embedding_idx'));
-- Check if trained
SELECT * FROM ruvector_ivfflat_stats('documents_embedding_idx');
```
## Implementation Details
### Zero-Copy Vector Access
The implementation uses zero-copy techniques:
- Read vector data directly from heap tuples
- No intermediate buffer allocation
- Compare directly with centroids in-place
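As an illustration of the idea (not the crate's actual code path), a distance can be computed straight from a little-endian byte buffer, decoding one lane at a time instead of materializing a `Vec<f32>` copy:

```rust
/// Squared L2 distance computed directly over a raw little-endian
/// byte buffer, with no intermediate allocation. Illustrative sketch;
/// assumes `raw.len() == query.len() * 4`.
fn l2_sq_from_bytes(raw: &[u8], query: &[f32]) -> f32 {
    raw.chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .zip(query)
        .map(|(x, y)| (x - y) * (x - y))
        .sum()
}
```

The real implementation goes further by reading lanes in place from the heap tuple rather than decoding byte-by-byte, but the allocation-free shape is the same.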
### Memory Management
- Uses PostgreSQL's palloc/pfree memory contexts
- Automatic cleanup on transaction end
- No manual memory management required
### Concurrency
- Safe for concurrent reads
- Index building is single-threaded
- Inserts are serialized per cluster
## Limitations
1. **Training Required**: Cannot insert before training completes
2. **Fixed Clusters**: Number of lists cannot change after build
3. **No Updates**: Update requires delete + insert
4. **Memory**: All centroids must fit in memory during search
## Future Enhancements
- [ ] Parallel index building
- [ ] Incremental training for inserts
- [ ] Product quantization (IVF-PQ)
- [ ] GPU acceleration
- [ ] Adaptive probe selection
- [ ] Cluster rebalancing
## References
1. [pgvector](https://github.com/pgvector/pgvector) - Original IVFFlat implementation
2. [FAISS](https://github.com/facebookresearch/faiss) - Facebook AI Similarity Search
3. "Product Quantization for Nearest Neighbor Search" - Jégou et al., 2011
4. PostgreSQL Index Access Method Documentation