IVFFlat Index Access Method
Overview
The IVFFlat (Inverted File with Flat quantization) index is a PostgreSQL access method implementation for approximate nearest neighbor (ANN) search. It partitions the vector space into clusters using k-means clustering, enabling fast similarity search by probing only the most relevant clusters.
Architecture
Storage Layout
The IVFFlat index uses PostgreSQL's page-based storage with the following structure:
┌─────────────────┬──────────────────────┬─────────────────────┐
│ Page 0 │ Pages 1-N │ Pages N+1-M │
│ (Metadata) │ (Centroids) │ (Inverted Lists) │
└─────────────────┴──────────────────────┴─────────────────────┘
Page 0: Metadata Page
struct IvfFlatMetaPage {
magic: u32, // 0x49564646 ("IVFF")
lists: u32, // Number of clusters
probes: u32, // Default probes for search
dimensions: u32, // Vector dimensions
trained: u32, // 0=untrained, 1=trained
vector_count: u64, // Total vectors indexed
metric: u32, // Distance metric (0=L2, 1=IP, 2=Cosine, 3=L1)
centroid_start_page: u32,// First centroid page
lists_start_page: u32, // First inverted list page
reserved: [u32; 16], // Future expansion
}
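The layout above can be mirrored in plain Rust to sanity-check the constants. This is a hedged sketch, not the extension's actual code: it assumes the on-disk struct uses `#[repr(C)]`, and simply verifies that the magic number 0x49564646 really spells "IVFF" in big-endian ASCII.

```rust
use std::mem::size_of;

// Hypothetical mirror of the on-disk metadata layout, assuming #[repr(C)].
#[repr(C)]
struct IvfFlatMetaPage {
    magic: u32,
    lists: u32,
    probes: u32,
    dimensions: u32,
    trained: u32,
    vector_count: u64,
    metric: u32,
    centroid_start_page: u32,
    lists_start_page: u32,
    reserved: [u32; 16],
}

// "IVFF" spelled out as big-endian ASCII bytes.
const IVFFLAT_MAGIC: u32 = 0x49564646;

fn magic_matches() -> bool {
    u32::from_be_bytes(*b"IVFF") == IVFFLAT_MAGIC
}

fn main() {
    println!("meta page struct: {} bytes", size_of::<IvfFlatMetaPage>());
    println!("magic ok: {}", magic_matches());
}
```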
Pages 1-N: Centroid Pages
Each centroid entry contains:
- Cluster ID
- Inverted list page reference
- Vector count in cluster
- Centroid vector data (dimensions × 4 bytes)
Pages N+1-M: Inverted List Pages
Each vector entry contains:
- Heap tuple ID (block number + offset)
- Vector data (dimensions × 4 bytes)
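From the entry layout above you can estimate how many vectors fit on one inverted-list page. The constants below are assumptions for illustration, not values from the source: 8192-byte PostgreSQL pages, a 24-byte page header, and a 6-byte heap TID (4-byte block number plus 2-byte offset).

```rust
// Rough capacity estimate for an inverted-list page.
// Assumed constants (not taken from the source):
const PAGE_SIZE: usize = 8192;   // default PostgreSQL block size
const PAGE_HEADER: usize = 24;   // approximate page header overhead
const HEAP_TID: usize = 6;       // 4-byte block number + 2-byte offset

fn entry_size(dimensions: usize) -> usize {
    HEAP_TID + dimensions * 4 // TID plus f32 vector data
}

fn entries_per_page(dimensions: usize) -> usize {
    (PAGE_SIZE - PAGE_HEADER) / entry_size(dimensions)
}

fn main() {
    for d in [128, 768, 1536] {
        println!(
            "d={d}: {} bytes/entry, {} entries/page",
            entry_size(d),
            entries_per_page(d)
        );
    }
}
```

At 1536 dimensions a single entry is over 6 KiB, so only one entry fits per page; high-dimensional indexes are therefore dominated by vector payload, not page overhead.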
Index Building
1. Training Phase
The index must be trained before use:
-- Create index with training
CREATE INDEX ON items USING ruivfflat (embedding vector_l2_ops)
WITH (lists = 100);
Training process:
- Sample Collection: Up to 50,000 random vectors sampled from the heap
- K-means++ Initialization: Intelligent centroid seeding for better convergence
- K-means Clustering: 10 iterations of Lloyd's algorithm
- Centroid Storage: Trained centroids written to index pages
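The clustering step can be sketched as plain Lloyd's iterations over in-memory f32 vectors. This is a minimal illustration, not the extension's implementation: it seeds centroids with the first k samples for brevity, whereas the real training uses k-means++ seeding, and it omits sampling and page I/O.

```rust
// Minimal Lloyd's k-means over f32 vectors, sketching the training loop.

fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

fn nearest(v: &[f32], centroids: &[Vec<f32>]) -> usize {
    centroids
        .iter()
        .enumerate()
        .min_by(|(_, a), (_, b)| l2_sq(v, a).partial_cmp(&l2_sq(v, b)).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}

fn train(samples: &[Vec<f32>], k: usize, iterations: usize) -> Vec<Vec<f32>> {
    let dim = samples[0].len();
    // Naive seeding with the first k samples (real code: k-means++).
    let mut centroids: Vec<Vec<f32>> = samples[..k].to_vec();
    for _ in 0..iterations {
        // Assignment step: accumulate each sample under its nearest centroid.
        let mut sums = vec![vec![0.0f32; dim]; k];
        let mut counts = vec![0usize; k];
        for s in samples {
            let c = nearest(s, &centroids);
            for (acc, x) in sums[c].iter_mut().zip(s) {
                *acc += *x;
            }
            counts[c] += 1;
        }
        // Update step: move each centroid to the mean of its bucket.
        for c in 0..k {
            if counts[c] > 0 {
                for x in sums[c].iter_mut() {
                    *x /= counts[c] as f32;
                }
                centroids[c] = sums[c].clone();
            }
        }
    }
    centroids
}

fn main() {
    // Two well-separated 2-D clusters.
    let samples = vec![
        vec![0.0f32, 0.0],
        vec![10.0, 10.0],
        vec![0.2, 0.0],
        vec![10.2, 10.0],
    ];
    let centroids = train(&samples, 2, 10);
    println!("{:?}", centroids);
}
```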
2. Vector Assignment
After training, all vectors are assigned to their nearest centroid:
- Calculate distance to each centroid
- Assign to nearest centroid's inverted list
- Store in inverted list pages
Search Process
Query Execution
SELECT * FROM items
ORDER BY embedding <-> '[1,2,3,...]'
LIMIT 10;
Search algorithm:
- Find Nearest Centroids: Calculate distance from query to all centroids
- Probe Selection: Select the probes nearest centroids
- List Scanning: Scan the inverted lists of the selected centroids
- Re-ranking: Calculate exact distances to all candidates
- Top-K Selection: Return k nearest vectors
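The five steps above can be sketched end to end with simple in-memory structures. This is an illustrative model, not the extension's code: inverted lists are plain `Vec`s of (heap TID, vector) pairs rather than index pages, and a full sort stands in for a proper top-k heap.

```rust
// Sketch of probe-based search: rank centroids by distance to the query,
// scan the inverted lists of the `probes` nearest ones, keep the k best.

fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// `lists[i]` holds (heap_tid, vector) pairs assigned to centroid i.
fn search(
    query: &[f32],
    centroids: &[Vec<f32>],
    lists: &[Vec<(u64, Vec<f32>)>],
    probes: usize,
    k: usize,
) -> Vec<u64> {
    // Steps 1-2: rank all centroids, keep the `probes` nearest.
    let mut order: Vec<usize> = (0..centroids.len()).collect();
    order.sort_by(|&a, &b| {
        l2_sq(query, &centroids[a])
            .partial_cmp(&l2_sq(query, &centroids[b]))
            .unwrap()
    });
    // Steps 3-4: scan the selected lists, computing exact distances.
    let mut candidates: Vec<(f32, u64)> = Vec::new();
    for &c in order.iter().take(probes) {
        for (tid, v) in &lists[c] {
            candidates.push((l2_sq(query, v), *tid));
        }
    }
    // Step 5: top-k selection (a heap would avoid the full sort).
    candidates.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap());
    candidates.into_iter().take(k).map(|(_, tid)| tid).collect()
}

fn main() {
    let centroids = vec![vec![0.0f32], vec![10.0]];
    let lists = vec![
        vec![(1u64, vec![0.5f32]), (2, vec![1.5])],
        vec![(3, vec![9.5])],
    ];
    println!("{:?}", search(&[0.6], &centroids, &lists, 1, 2));
}
```

With `probes = 1` only the first list is scanned, which is why vectors near a cluster boundary can be missed; raising `probes` widens the scan at proportional cost.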
Performance Tuning
Lists Parameter
Controls the number of clusters:
- Small values (10-50): Fast build; each probe scans a large list, so search is slower
- Medium values (100-200): Balanced performance
- Large values (500-1000): Slower build; each probe scans a small list, so search is faster, but probes may need to be raised to keep recall high
Rule of thumb: lists = sqrt(total_vectors)
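Applying the rule of thumb is a one-liner; the helper name below is illustrative, not part of the extension:

```rust
// The sqrt rule of thumb from above: lists ≈ sqrt(total_vectors).
fn suggested_lists(total_vectors: u64) -> u32 {
    (total_vectors as f64).sqrt().round() as u32
}

fn main() {
    for n in [10_000u64, 1_000_000, 100_000_000] {
        println!("{n} vectors -> lists = {}", suggested_lists(n));
    }
}
```

For example, 1,000,000 vectors suggests lists = 1000.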
Probes Parameter
Controls search accuracy vs speed:
- Low probes (1-3): Fast search, lower recall
- Medium probes (5-10): Balanced
- High probes (20-50): Slower search, higher recall
Set dynamically:
SET ruvector.ivfflat_probes = 10;
Configuration
GUC Variables
-- Set default probes for IVFFlat searches
SET ruvector.ivfflat_probes = 10;
-- View current setting
SHOW ruvector.ivfflat_probes;
Index Options
CREATE INDEX ON table USING ruivfflat (column opclass)
WITH (lists = value, probes = value);
Available options:
- lists: Number of clusters (default: 100)
- probes: Default probes for searches (default: 1)
Operator Classes
Vector L2 (Euclidean)
CREATE INDEX ON items USING ruivfflat (embedding vector_l2_ops)
WITH (lists = 100);
Vector Inner Product
CREATE INDEX ON items USING ruivfflat (embedding vector_ip_ops)
WITH (lists = 100);
Vector Cosine
CREATE INDEX ON items USING ruivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
Performance Characteristics
Time Complexity
- Build: O(n × k × d × iterations) where n=vectors, k=lists, d=dimensions
- Insert: O(k × d) - find nearest centroid
- Search: O(probes × (n/k) × d) - probe lists and re-rank
Space Complexity
- Index Size: O(n × d × 4 + k × d × 4) bytes
- Approximately the size of the raw vectors plus the centroids
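Plugging numbers into the size formula above (f32 storage; page headers and per-entry TIDs ignored for simplicity):

```rust
// Back-of-envelope index size: n*d*4 bytes of vectors plus
// k*d*4 bytes of centroids, ignoring page and tuple overhead.
fn index_bytes(n: u64, d: u64, k: u64) -> u64 {
    n * d * 4 + k * d * 4
}

fn main() {
    // 1M vectors at 1536 dims with 100 lists.
    let bytes = index_bytes(1_000_000, 1536, 100);
    println!("~{:.2} GiB", bytes as f64 / (1024.0 * 1024.0 * 1024.0));
}
```

For 1M 1536-dimensional vectors with 100 lists, the centroids add only ~600 KiB to ~6.1 GB of vector data.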
Recall vs Speed Trade-offs
| Probes | Recall | Speed | Use Case |
|---|---|---|---|
| 1 | 60-70% | Fastest | Very fast approximate search |
| 5 | 80-85% | Fast | Balanced performance |
| 10 | 90-95% | Medium | High recall applications |
| 20+ | 95-99% | Slower | Near-exact search |
Examples
Basic Usage
-- Create table
CREATE TABLE documents (
id serial PRIMARY KEY,
content text,
embedding vector(1536)
);
-- Insert vectors
INSERT INTO documents (content, embedding)
VALUES
('First document', '[0.1, 0.2, ...]'),
('Second document', '[0.3, 0.4, ...]');
-- Create IVFFlat index
CREATE INDEX ON documents USING ruivfflat (embedding vector_l2_ops)
WITH (lists = 100);
-- Search
SELECT id, content, embedding <-> '[0.5, 0.6, ...]' AS distance
FROM documents
ORDER BY embedding <-> '[0.5, 0.6, ...]'
LIMIT 10;
Advanced Configuration
-- Large dataset with many lists
CREATE INDEX ON large_table USING ruivfflat (embedding vector_cosine_ops)
WITH (lists = 1000);
-- High-recall search
SET ruvector.ivfflat_probes = 20;
SELECT * FROM large_table
ORDER BY embedding <=> '[...]'
LIMIT 100;
Index Statistics
-- Get index information
SELECT * FROM ruvector_ivfflat_stats('documents_embedding_idx');
-- Returns:
-- lists | probes | dimensions | trained | vector_count | metric
-- ------+--------+------------+---------+--------------+-----------
-- 100 | 1 | 1536 | true | 1000000 | euclidean
Comparison with HNSW
| Feature | IVFFlat | HNSW |
|---|---|---|
| Build Time | Fast (minutes) | Slow (hours) |
| Search Speed | Fast | Faster |
| Recall | 80-95% | 95-99% |
| Memory | Low | High |
| Incremental Insert | Fast | Medium |
| Best For | Large static datasets | High-recall queries |
Maintenance
Rebuilding Index
After significant data changes, rebuild for better clustering:
REINDEX INDEX documents_embedding_idx;
Monitoring
-- Check index size
SELECT pg_size_pretty(pg_relation_size('documents_embedding_idx'));
-- Check if trained
SELECT * FROM ruvector_ivfflat_stats('documents_embedding_idx');
Implementation Details
Zero-Copy Vector Access
The implementation uses zero-copy techniques:
- Read vector data directly from heap tuples
- No intermediate buffer allocation
- Compare directly with centroids in-place
Memory Management
- Uses PostgreSQL's palloc/pfree memory contexts
- Automatic cleanup on transaction end
- No manual memory management required
Concurrency
- Safe for concurrent reads
- Index building is single-threaded
- Inserts are serialized per cluster
Limitations
- Training Required: Cannot insert before training completes
- Fixed Clusters: Number of lists cannot change after build
- No Updates: Update requires delete + insert
- Memory: All centroids must fit in memory during search
Future Enhancements
- Parallel index building
- Incremental training for inserts
- Product quantization (IVF-PQ)
- GPU acceleration
- Adaptive probe selection
- Cluster rebalancing