# IVFFlat Index Access Method

## Overview

The IVFFlat (Inverted File with Flat storage) index is a PostgreSQL access method for approximate nearest neighbor (ANN) search. It partitions the vector space into clusters using k-means and stores the vectors of each cluster uncompressed ("flat") in an inverted list. A search probes only the clusters whose centroids are nearest to the query, which avoids scanning every vector.

## Architecture

### Storage Layout

The IVFFlat index uses PostgreSQL's page-based storage with the following structure:

```
┌─────────────────┬──────────────────────┬─────────────────────┐
│ Page 0          │ Pages 1-N            │ Pages N+1-M         │
│ (Metadata)      │ (Centroids)          │ (Inverted Lists)    │
└─────────────────┴──────────────────────┴─────────────────────┘
```

#### Page 0: Metadata Page
```rust
struct IvfFlatMetaPage {
    magic: u32,               // 0x49564646 ("IVFF")
    lists: u32,               // Number of clusters
    probes: u32,              // Default probes for search
    dimensions: u32,          // Vector dimensions
    trained: u32,             // 0=untrained, 1=trained
    vector_count: u64,        // Total vectors indexed
    metric: u32,              // Distance metric (0=L2, 1=IP, 2=Cosine, 3=L1)
    centroid_start_page: u32, // First centroid page
    lists_start_page: u32,    // First inverted list page
    reserved: [u32; 16],      // Future expansion
}
```

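As a small illustration of how the `magic` and `metric` fields might be interpreted when the metadata page is read back, here is a minimal sketch. The names `IVFFLAT_MAGIC`, `Metric`, `from_code`, and `is_valid_magic` are hypothetical and not taken from the extension; only the numeric values come from the struct comments above.

```rust
/// Magic number from the metadata page; its big-endian bytes spell "IVFF".
const IVFFLAT_MAGIC: u32 = 0x4956_4646;

/// Distance metric codes as listed in the `metric` field comment.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Metric {
    L2,
    InnerProduct,
    Cosine,
    L1,
}

impl Metric {
    /// Decode the stored code; unknown codes are rejected rather than guessed.
    fn from_code(code: u32) -> Option<Metric> {
        match code {
            0 => Some(Metric::L2),
            1 => Some(Metric::InnerProduct),
            2 => Some(Metric::Cosine),
            3 => Some(Metric::L1),
            _ => None,
        }
    }
}

/// Hypothetical sanity check to run before trusting the rest of page 0.
fn is_valid_magic(magic: u32) -> bool {
    magic == IVFFLAT_MAGIC
}
```

Returning `Option` from `from_code` keeps a corrupted or newer-format page from being silently treated as L2.
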
#### Pages 1-N: Centroid Pages
Each centroid entry contains:
- Cluster ID
- Inverted list page reference
- Vector count in cluster
- Centroid vector data (dimensions × 4 bytes)

#### Pages N+1-M: Inverted List Pages
Each vector entry contains:
- Heap tuple ID (block number + offset)
- Vector data (dimensions × 4 bytes)

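The entry layout above determines how many vectors fit on one page. The following back-of-the-envelope sketch assumes PostgreSQL's default 8 KiB block size, a 24-byte page header, and a 6-byte heap tuple ID; it ignores line pointers, alignment padding, and any special space, so the result is an upper bound rather than the extension's actual packing. The function names are hypothetical.

```rust
/// PostgreSQL's default block size in bytes (assumed; it is a build-time option).
const PAGE_SIZE: usize = 8192;
/// Size of the standard page header in bytes (assumed).
const PAGE_HEADER: usize = 24;
/// Heap tuple ID: 4-byte block number plus 2-byte offset.
const TID_SIZE: usize = 6;

/// Bytes taken by one inverted list entry: tuple ID plus `dimensions` f32 values.
fn list_entry_size(dimensions: usize) -> usize {
    TID_SIZE + dimensions * 4
}

/// Upper bound on entries per page under the assumptions stated above.
fn max_entries_per_page(dimensions: usize) -> usize {
    (PAGE_SIZE - PAGE_HEADER) / list_entry_size(dimensions)
}
```

Under these assumptions a 1536-dimensional entry takes 6150 bytes, so at most one fits per page, while 128-dimensional entries (518 bytes each) fit at most 15 per page.
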
## Index Building

### 1. Training Phase

The index must be trained before use:

```sql
-- Create index with training
CREATE INDEX ON items USING ruivfflat (embedding vector_l2_ops)
WITH (lists = 100);
```

Training process:
1. **Sample Collection**: Up to 50,000 random vectors are sampled from the heap
2. **K-means++ Initialization**: Intelligent centroid seeding for better convergence
3. **K-means Clustering**: 10 iterations of Lloyd's algorithm
4. **Centroid Storage**: Trained centroids are written to index pages

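Step 3 of the training process can be sketched as follows. This is a minimal in-memory illustration, not the extension's code: `lloyd`, `nearest`, and `squared_l2` are hypothetical names, vectors are plain `Vec<f32>`, and the initial centroids (step 2, k-means++ seeding) are assumed to be supplied by the caller.

```rust
/// Squared Euclidean distance between two equal-length vectors.
fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Index of the centroid closest to `v` (0 if `centroids` is empty).
fn nearest(v: &[f32], centroids: &[Vec<f32>]) -> usize {
    let mut best = 0;
    let mut best_dist = f32::INFINITY;
    for (i, c) in centroids.iter().enumerate() {
        let d = squared_l2(v, c);
        if d < best_dist {
            best = i;
            best_dist = d;
        }
    }
    best
}

/// Run a fixed number of Lloyd iterations starting from the given centroids.
fn lloyd(samples: &[Vec<f32>], mut centroids: Vec<Vec<f32>>, iterations: usize) -> Vec<Vec<f32>> {
    let dim = centroids.first().map_or(0, |c| c.len());
    for _ in 0..iterations {
        let mut sums = vec![vec![0.0f32; dim]; centroids.len()];
        let mut counts = vec![0usize; centroids.len()];
        // Assignment step: each sample is added to its nearest centroid's running sum.
        for s in samples {
            let c = nearest(s, &centroids);
            counts[c] += 1;
            for (acc, x) in sums[c].iter_mut().zip(s) {
                *acc += *x;
            }
        }
        // Update step: move each non-empty centroid to the mean of its members.
        // Empty clusters keep their previous centroid.
        for (c, centroid) in centroids.iter_mut().enumerate() {
            if counts[c] > 0 {
                for (j, value) in centroid.iter_mut().enumerate() {
                    *value = sums[c][j] / counts[c] as f32;
                }
            }
        }
    }
    centroids
}
```

With four samples forming two groups around x = 0 and x = 10 and one seed in each group, `lloyd(&samples, seeds, 10)` reaches the two group means after the first iteration, and the remaining iterations leave them unchanged.
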
### 2. Vector Assignment

After training, all vectors are assigned to their nearest centroid:
- Calculate distance to each centroid
- Assign to nearest centroid's inverted list
- Store in inverted list pages

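The assignment step can be pictured with the short sketch below. It is an in-memory simplification under stated assumptions: each inverted list holds indices into `vectors` where the real index would store heap tuple IDs together with the vector data on pages, and `build_inverted_lists` is a hypothetical name.

```rust
/// Squared Euclidean distance between two equal-length vectors.
fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Build one inverted list per centroid; each list holds indices into `vectors`.
fn build_inverted_lists(vectors: &[Vec<f32>], centroids: &[Vec<f32>]) -> Vec<Vec<usize>> {
    let mut lists = vec![Vec::new(); centroids.len()];
    for (id, v) in vectors.iter().enumerate() {
        // Calculate the distance to each centroid and keep the closest one.
        let nearest = centroids
            .iter()
            .enumerate()
            .map(|(c, centroid)| (c, squared_l2(v, centroid)))
            .min_by(|a, b| a.1.total_cmp(&b.1))
            .map(|(c, _)| c);
        // With no centroids there is nowhere to assign the vector.
        if let Some(c) = nearest {
            lists[c].push(id);
        }
    }
    lists
}
```

Every vector costs one distance computation per centroid, which is where the O(k × d) insert cost comes from.
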
## Search Process

### Query Execution

```sql
SELECT * FROM items
ORDER BY embedding <-> '[1,2,3,...]'
LIMIT 10;
```

Search algorithm:
1. **Find Nearest Centroids**: Calculate distance from query to all centroids
2. **Probe Selection**: Select `probes` nearest centroids
3. **List Scanning**: Scan inverted lists for selected centroids
4. **Re-ranking**: Calculate exact distances to all candidates
5. **Top-K Selection**: Return k nearest vectors

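The five steps above can be sketched over plain in-memory data. This is an illustration under assumptions, not the access method's scan code: `ivfflat_search` is a hypothetical name, `lists[c]` holds indices into `vectors` for cluster `c`, and squared L2 distance is used because it yields the same ordering as L2.

```rust
/// Squared Euclidean distance between two equal-length vectors.
fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Approximate k-nearest-neighbour search.
/// Returns `(vector_id, squared_distance)` pairs, nearest first.
fn ivfflat_search(
    query: &[f32],
    centroids: &[Vec<f32>],
    lists: &[Vec<usize>],
    vectors: &[Vec<f32>],
    probes: usize,
    k: usize,
) -> Vec<(usize, f32)> {
    // Steps 1-2: rank all centroids by distance and keep the `probes` nearest.
    let mut ranked: Vec<(usize, f32)> = centroids
        .iter()
        .enumerate()
        .map(|(c, centroid)| (c, squared_l2(query, centroid)))
        .collect();
    ranked.sort_by(|a, b| a.1.total_cmp(&b.1));
    ranked.truncate(probes);

    // Steps 3-4: scan the selected lists and compute exact distances.
    let mut candidates: Vec<(usize, f32)> = Vec::new();
    for (c, _) in &ranked {
        for &id in &lists[*c] {
            candidates.push((id, squared_l2(query, &vectors[id])));
        }
    }

    // Step 5: keep the k nearest candidates.
    candidates.sort_by(|a, b| a.1.total_cmp(&b.1));
    candidates.truncate(k);
    candidates
}
```

Vectors in lists that are not probed are never considered, which is why a true neighbour that sits just across a cluster boundary can be missed when `probes` is small.
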
### Performance Tuning

#### Lists Parameter

Controls the number of clusters. For a fixed `probes` value:
- **Small values (10-50)**: Faster build, slower search (each list is long), higher recall
- **Medium values (100-200)**: Balanced performance
- **Large values (500-1000)**: Slower build, faster search (each list is short), lower recall unless `probes` is raised as well

Rule of thumb: `lists = sqrt(total_vectors)`

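The rule of thumb is easy to compute directly. The helper names below are hypothetical, and the average list length assumes vectors spread evenly across clusters, which real data only approximates.

```rust
/// Rule-of-thumb cluster count: the rounded square root of the row count, at least 1.
fn suggested_lists(total_vectors: u64) -> u32 {
    let root = (total_vectors as f64).sqrt().round();
    (root as u32).max(1)
}

/// Average inverted list length, assuming vectors spread evenly across lists.
fn average_list_len(total_vectors: u64, lists: u32) -> f64 {
    total_vectors as f64 / lists as f64
}
```

For one million vectors this suggests 1000 lists of roughly 1000 vectors each.
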
#### Probes Parameter

Controls search accuracy vs speed:
- **Low probes (1-3)**: Fast search, lower recall
- **Medium probes (5-10)**: Balanced
- **High probes (20-50)**: Slower search, higher recall

Set dynamically:
```sql
SET ruvector.ivfflat_probes = 10;
```

## Configuration

### GUC Variables

```sql
-- Set default probes for IVFFlat searches
SET ruvector.ivfflat_probes = 10;

-- View current setting
SHOW ruvector.ivfflat_probes;
```

### Index Options

```sql
CREATE INDEX ON table_name USING ruivfflat (column_name opclass)
WITH (lists = value, probes = value);
```

Available options:
- `lists`: Number of clusters (default: 100)
- `probes`: Default probes for searches (default: 1)

## Operator Classes

### Vector L2 (Euclidean)
```sql
CREATE INDEX ON items USING ruivfflat (embedding vector_l2_ops)
WITH (lists = 100);
```

### Vector Inner Product
```sql
CREATE INDEX ON items USING ruivfflat (embedding vector_ip_ops)
WITH (lists = 100);
```

### Vector Cosine
```sql
CREATE INDEX ON items USING ruivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```

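For reference, the three metrics behind these operator classes can be written out as below. These are the textbook definitions with hypothetical function names, not the extension's (likely SIMD-optimized) kernels. How the inner-product class turns a similarity into an ascending sort key is not stated in this document; negating the inner product is a common convention and is only an assumption here.

```rust
/// Euclidean (L2) distance.
fn l2_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}

/// Plain inner product; larger means more similar.
fn inner_product(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Cosine distance: 1 - cos(angle between a and b). NaN if either vector is all zeros.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let norm_a = inner_product(a, a).sqrt();
    let norm_b = inner_product(b, b).sqrt();
    1.0 - inner_product(a, b) / (norm_a * norm_b)
}
```

Cosine distance ignores vector length, so `[2, 0]` and `[5, 0]` are at distance 0 from each other even though their L2 distance is 3.
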
## Performance Characteristics

### Time Complexity
- **Build**: O(s × k × d × iterations + n × k × d) where s=training samples (at most 50,000), n=vectors, k=lists, d=dimensions
- **Insert**: O(k × d) - find nearest centroid
- **Search**: O(k × d + probes × (n/k) × d) - rank centroids, then probe lists and re-rank

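To make the search cost concrete, the sketch below counts distance computations for one query: one per centroid (step 1 of the search algorithm) plus one per vector in the probed lists. It assumes evenly sized lists, and `search_distance_computations` is a hypothetical helper, not part of the extension.

```rust
/// Distance computations for one IVFFlat query, assuming evenly sized lists:
/// one per centroid, plus one per vector in each probed list.
fn search_distance_computations(n: u64, lists: u64, probes: u64) -> u64 {
    let per_list = n / lists.max(1);
    // Probing more lists than exist is the same as probing all of them.
    lists + probes.min(lists) * per_list
}
```

With n = 1,000,000, `lists = 1000`, and `probes = 10`, a query needs about 11,000 distance computations instead of the 1,000,000 required by an exhaustive scan.
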
### Space Complexity
- **Index Size**: O(n × d × 4 + k × d × 4) bytes
- Approximately the same size as the raw vectors plus the centroids

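As a worked example of the size formula, the sketch below counts only the `f32` payload of vectors and centroids. Heap tuple IDs, per-entry metadata, and page overhead are deliberately left out, so a real index will be somewhat larger; `estimated_index_bytes` is a hypothetical name.

```rust
/// Bytes needed for the raw f32 payload: n vectors plus `lists` centroids,
/// each with `dimensions` 4-byte components.
fn estimated_index_bytes(n: u64, lists: u64, dimensions: u64) -> u64 {
    (n + lists) * dimensions * 4
}
```

For 1,000,000 vectors of 1536 dimensions and `lists = 100`, this gives 6,144,614,400 bytes (about 5.7 GiB), of which the centroids account for only 614,400 bytes.
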
### Recall vs Speed Trade-offs

| Probes | Recall | Speed   | Use Case                     |
|--------|--------|---------|------------------------------|
| 1      | 60-70% | Fastest | Very fast approximate search |
| 5      | 80-85% | Fast    | Balanced performance         |
| 10     | 90-95% | Medium  | High recall applications     |
| 20+    | 95-99% | Slower  | Near-exact search            |

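Recall figures like those in the table are usually obtained by comparing the index's result ids against an exact (sequential scan) result for the same query. A minimal sketch of that measurement follows; `recall_at_k` is a hypothetical helper, and the id slices are assumed to come from running the same query with and without the index.

```rust
use std::collections::HashSet;

/// Fraction of the exact top-k ids that also appear in the approximate result.
fn recall_at_k(approx: &[u64], exact: &[u64]) -> f64 {
    if exact.is_empty() {
        // Nothing to find, so nothing was missed.
        return 1.0;
    }
    let found: HashSet<&u64> = approx.iter().collect();
    let hits = exact.iter().filter(|id| found.contains(id)).count();
    hits as f64 / exact.len() as f64
}
```

Averaging this value over a representative set of queries at several `probes` settings shows where a given dataset actually falls relative to the table.
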
## Examples

### Basic Usage

```sql
-- Create table
CREATE TABLE documents (
    id serial PRIMARY KEY,
    content text,
    embedding vector(1536)
);

-- Insert vectors
INSERT INTO documents (content, embedding)
VALUES
    ('First document', '[0.1, 0.2, ...]'),
    ('Second document', '[0.3, 0.4, ...]');

-- Create IVFFlat index
CREATE INDEX ON documents USING ruivfflat (embedding vector_l2_ops)
WITH (lists = 100);

-- Search
SELECT id, content, embedding <-> '[0.5, 0.6, ...]' AS distance
FROM documents
ORDER BY embedding <-> '[0.5, 0.6, ...]'
LIMIT 10;
```

### Advanced Configuration

```sql
-- Large dataset with many lists
CREATE INDEX ON large_table USING ruivfflat (embedding vector_cosine_ops)
WITH (lists = 1000);

-- High-recall search
SET ruvector.ivfflat_probes = 20;
SELECT * FROM large_table
ORDER BY embedding <=> '[...]'
LIMIT 100;
```

### Index Statistics

```sql
-- Get index information
SELECT * FROM ruvector_ivfflat_stats('documents_embedding_idx');

-- Returns:
--  lists | probes | dimensions | trained | vector_count | metric
-- -------+--------+------------+---------+--------------+-----------
--    100 |      1 |       1536 | true    |      1000000 | euclidean
```

## Comparison with HNSW

| Feature            | IVFFlat               | HNSW                |
|--------------------|-----------------------|---------------------|
| Build Time         | Fast (minutes)        | Slow (hours)        |
| Search Speed       | Fast                  | Faster              |
| Recall             | 80-95%                | 95-99%              |
| Memory             | Low                   | High                |
| Incremental Insert | Fast                  | Medium              |
| Best For           | Large static datasets | High-recall queries |

## Maintenance

### Rebuilding Index

Centroids are fixed at build time, so after significant data changes the existing clusters may no longer match the data. Rebuild the index to retrain them:

```sql
REINDEX INDEX documents_embedding_idx;
```

### Monitoring

```sql
-- Check index size
SELECT pg_size_pretty(pg_relation_size('documents_embedding_idx'));

-- Check if trained
SELECT * FROM ruvector_ivfflat_stats('documents_embedding_idx');
```

## Implementation Details

### Zero-Copy Vector Access

The implementation uses zero-copy techniques:
- Read vector data directly from heap tuples
- No intermediate buffer allocation
- Compare directly with centroids in-place

### Memory Management

- Uses PostgreSQL's palloc/pfree memory contexts
- Automatic cleanup on transaction end
- No manual memory management required

### Concurrency

- Safe for concurrent reads
- Index building is single-threaded
- Inserts are serialized per cluster

## Limitations

1. **Training Required**: Cannot insert before training completes
2. **Fixed Clusters**: Number of lists cannot change after build
3. **No Updates**: Update requires delete + insert
4. **Memory**: All centroids must fit in memory during search

## Future Enhancements

- [ ] Parallel index building
- [ ] Incremental training for inserts
- [ ] Product quantization (IVF-PQ)
- [ ] GPU acceleration
- [ ] Adaptive probe selection
- [ ] Cluster rebalancing

## References

1. [pgvector](https://github.com/pgvector/pgvector) - Original IVFFlat implementation
2. [FAISS](https://github.com/facebookresearch/faiss) - Facebook AI Similarity Search
3. "Product Quantization for Nearest Neighbor Search" - Jégou et al., 2011
4. PostgreSQL Index Access Method Documentation