wifi-densepose/crates/ruvector-postgres/docs/ivfflat_access_method.md

IVFFlat Index Access Method

Overview

The IVFFlat (Inverted File with Flat storage) index is a PostgreSQL access method implementation for approximate nearest neighbor (ANN) search. It partitions the vector space into clusters using k-means clustering, enabling fast similarity search by probing only the most relevant clusters. "Flat" means vectors are stored uncompressed, so distances computed within a probed list are exact.

Architecture

Storage Layout

The IVFFlat index uses PostgreSQL's page-based storage with the following structure:

┌─────────────────┬──────────────────────┬─────────────────────┐
│  Page 0         │  Pages 1-N           │  Pages N+1-M        │
│  (Metadata)     │  (Centroids)         │  (Inverted Lists)   │
└─────────────────┴──────────────────────┴─────────────────────┘

Page 0: Metadata Page

struct IvfFlatMetaPage {
    magic: u32,              // 0x49564646 ("IVFF")
    lists: u32,              // Number of clusters
    probes: u32,             // Default probes for search
    dimensions: u32,         // Vector dimensions
    trained: u32,            // 0=untrained, 1=trained
    vector_count: u64,       // Total vectors indexed
    metric: u32,             // Distance metric (0=L2, 1=IP, 2=Cosine, 3=L1)
    centroid_start_page: u32, // First centroid page
    lists_start_page: u32,   // First inverted list page
    reserved: [u32; 16],     // Future expansion
}

Pages 1-N: Centroid Pages

Each centroid entry contains:

  • Cluster ID
  • Inverted list page reference
  • Vector count in cluster
  • Centroid vector data (dimensions × 4 bytes)

Pages N+1-M: Inverted List Pages

Each vector entry contains:

  • Heap tuple ID (block number + offset)
  • Vector data (dimensions × 4 bytes)
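
The two entry kinds above can be pictured as Rust types. This is an illustrative sketch only; the field names and exact widths are assumptions, not the crate's actual definitions:

```rust
/// One entry on a centroid page (illustrative layout).
#[allow(dead_code)]
struct CentroidEntry {
    cluster_id: u32,      // Cluster ID
    list_start_page: u32, // First inverted-list page for this cluster
    vector_count: u32,    // Vectors currently assigned to this cluster
    centroid: Vec<f32>,   // Centroid vector data (dimensions × 4 bytes)
}

/// One entry on an inverted-list page (illustrative layout).
#[allow(dead_code)]
struct ListEntry {
    heap_block: u32,  // Heap tuple ID: block number...
    heap_offset: u16, // ...and offset within the block
    vector: Vec<f32>, // Vector data (dimensions × 4 bytes)
}

/// On-disk size of one inverted-list entry, assuming a 6-byte packed
/// tuple ID and no alignment padding.
fn list_entry_bytes(dims: usize) -> usize {
    4 + 2 + dims * 4
}
```

For a 1536-dimension vector this gives 6,150 bytes per list entry, which is why the index is roughly the size of the raw vectors.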

Index Building

1. Training Phase

The index must be trained before use:

-- Create index with training
CREATE INDEX ON items USING ruivfflat (embedding vector_l2_ops)
  WITH (lists = 100);

Training process:

  1. Sample Collection: Up to 50,000 random vectors sampled from the heap
  2. K-means++ Initialization: Intelligent centroid seeding for better convergence
  3. K-means Clustering: 10 iterations of Lloyd's algorithm
  4. Centroid Storage: Trained centroids written to index pages
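
The training steps above can be sketched in Rust. This is a minimal model, not the crate's implementation: k-means++ seeding is elided (the first k samples are used as seeds), and the L2 metric is assumed:

```rust
/// Squared L2 distance; ordering by it matches ordering by true L2 distance.
fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Lloyd's algorithm: alternate assignment and centroid-update steps.
fn train_centroids(samples: &[Vec<f32>], k: usize, iters: usize) -> Vec<Vec<f32>> {
    let dims = samples[0].len();
    // Seeding: take the first k samples (real training would use k-means++).
    let mut centroids: Vec<Vec<f32>> = samples.iter().take(k).cloned().collect();
    for _ in 0..iters {
        let mut sums = vec![vec![0.0f32; dims]; k];
        let mut counts = vec![0usize; k];
        // Assignment step: each sample joins its nearest centroid's cluster.
        for s in samples {
            let nearest = (0..k)
                .min_by(|&a, &b| {
                    squared_l2(s, &centroids[a])
                        .partial_cmp(&squared_l2(s, &centroids[b]))
                        .unwrap()
                })
                .unwrap();
            for d in 0..dims {
                sums[nearest][d] += s[d];
            }
            counts[nearest] += 1;
        }
        // Update step: move each centroid to the mean of its members.
        for c in 0..k {
            if counts[c] > 0 {
                for d in 0..dims {
                    centroids[c][d] = sums[c][d] / counts[c] as f32;
                }
            }
        }
    }
    centroids
}
```

With the index defaults described above, k is the lists option and iters is 10.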

2. Vector Assignment

After training, all vectors are assigned to their nearest centroid:

  • Calculate distance to each centroid
  • Assign to nearest centroid's inverted list
  • Store in inverted list pages
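
The assignment step reduces to an argmin over centroids. A minimal sketch, assuming the L2 metric (the `nearest_centroid` name is illustrative, not the crate's API):

```rust
/// Squared L2 distance; ordering by it matches ordering by true L2 distance.
fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Index of the centroid nearest to `v`, i.e. which inverted list
/// the vector is appended to.
fn nearest_centroid(v: &[f32], centroids: &[Vec<f32>]) -> usize {
    (0..centroids.len())
        .min_by(|&a, &b| {
            squared_l2(v, &centroids[a])
                .partial_cmp(&squared_l2(v, &centroids[b]))
                .unwrap()
        })
        .expect("a trained index has at least one centroid")
}
```

This is also the per-insert cost quoted under Performance Characteristics: O(k × d) distance work per new vector.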

Search Process

Query Execution

SELECT * FROM items
ORDER BY embedding <-> '[1,2,3,...]'
LIMIT 10;

Search algorithm:

  1. Find Nearest Centroids: Calculate distance from query to all centroids
  2. Probe Selection: Select the `probes` nearest centroids
  3. List Scanning: Scan inverted lists for selected centroids
  4. Re-ranking: Calculate exact distances to all candidates
  5. Top-K Selection: Return k nearest vectors
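
The five steps above can be sketched end to end. This models the inverted lists as in-memory vectors rather than index pages, and assumes the L2 metric; it is a sketch of the algorithm, not the crate's scan code:

```rust
/// Squared L2 distance; ordering by it matches ordering by true L2 distance.
fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Returns up to k (tuple id, distance) pairs. `lists[c]` holds the
/// entries assigned to centroid c.
fn search(
    query: &[f32],
    centroids: &[Vec<f32>],
    lists: &[Vec<(u64, Vec<f32>)>],
    probes: usize,
    k: usize,
) -> Vec<(u64, f32)> {
    // Steps 1-2: rank centroids by distance to the query, keep `probes` of them.
    let mut order: Vec<usize> = (0..centroids.len()).collect();
    order.sort_by(|&a, &b| {
        squared_l2(query, &centroids[a])
            .partial_cmp(&squared_l2(query, &centroids[b]))
            .unwrap()
    });
    // Steps 3-4: scan the selected lists, computing exact distances.
    let mut candidates: Vec<(u64, f32)> = Vec::new();
    for &c in order.iter().take(probes) {
        for (id, v) in &lists[c] {
            candidates.push((*id, squared_l2(query, v)));
        }
    }
    // Step 5: top-k selection by distance.
    candidates.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
    candidates.truncate(k);
    candidates
}
```

Note the approximation: vectors whose true nearest neighbors live in an unprobed list are missed, which is exactly what the probes parameter trades against speed.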

Performance Tuning

Lists Parameter

Controls the number of clusters:

  • Small values (10-50): Faster build, slower search (each list is large), higher recall per probe
  • Medium values (100-200): Balanced performance
  • Large values (500-1000): Slower build, faster search, lower recall unless probes is raised to match

Rule of thumb: lists = sqrt(total_vectors)
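
The rule of thumb as a one-liner (the `suggested_lists` helper is illustrative, not part of the extension):

```rust
/// lists = sqrt(total_vectors), rounded to the nearest integer.
fn suggested_lists(total_vectors: u64) -> u32 {
    (total_vectors as f64).sqrt().round() as u32
}
```

For example, 1,000,000 vectors suggests lists = 1000, and 250,000 vectors suggests lists = 500.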

Probes Parameter

Controls search accuracy vs speed:

  • Low probes (1-3): Fast search, lower recall
  • Medium probes (5-10): Balanced
  • High probes (20-50): Slower search, higher recall

Set dynamically:

SET ruvector.ivfflat_probes = 10;

Configuration

GUC Variables

-- Set default probes for IVFFlat searches
SET ruvector.ivfflat_probes = 10;

-- View current setting
SHOW ruvector.ivfflat_probes;

Index Options

CREATE INDEX ON table USING ruivfflat (column opclass)
  WITH (lists = value, probes = value);

Available options:

  • lists: Number of clusters (default: 100)
  • probes: Default probes for searches (default: 1)

Operator Classes

Vector L2 (Euclidean)

CREATE INDEX ON items USING ruivfflat (embedding vector_l2_ops)
  WITH (lists = 100);

Vector Inner Product

CREATE INDEX ON items USING ruivfflat (embedding vector_ip_ops)
  WITH (lists = 100);

Vector Cosine

CREATE INDEX ON items USING ruivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);

Performance Characteristics

Time Complexity

  • Build: O(n × k × d × iterations) where n=vectors, k=lists, d=dimensions
  • Insert: O(k × d) - find nearest centroid
  • Search: O(probes × (n/k) × d) - probe lists and re-rank

Space Complexity

  • Index Size: ≈ 4 × d × (n + k) bytes (4-byte floats)
  • Approximately the same size as the raw vectors plus the centroids
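
A quick back-of-the-envelope sizing helper based on the formula above; it ignores page headers and tuple IDs, so it slightly underestimates:

```rust
/// Rough index size in bytes: n stored vectors plus k centroids,
/// each d dimensions of 4-byte floats.
fn approx_index_bytes(n: u64, k: u64, dims: u64) -> u64 {
    (n + k) * dims * 4
}
```

For example, 1,000,000 vectors of 1536 dimensions with lists = 1000 comes to about 6.15 GB.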

Recall vs Speed Trade-offs

| Probes | Recall | Speed   | Use Case                     |
|--------|--------|---------|------------------------------|
| 1      | 60-70% | Fastest | Very fast approximate search |
| 5      | 80-85% | Fast    | Balanced performance         |
| 10     | 90-95% | Medium  | High recall applications     |
| 20+    | 95-99% | Slower  | Near-exact search            |

Examples

Basic Usage

-- Create table
CREATE TABLE documents (
    id serial PRIMARY KEY,
    content text,
    embedding vector(1536)
);

-- Insert vectors
INSERT INTO documents (content, embedding)
VALUES
    ('First document', '[0.1, 0.2, ...]'),
    ('Second document', '[0.3, 0.4, ...]');

-- Create IVFFlat index
CREATE INDEX ON documents USING ruivfflat (embedding vector_l2_ops)
  WITH (lists = 100);

-- Search
SELECT id, content, embedding <-> '[0.5, 0.6, ...]' AS distance
FROM documents
ORDER BY embedding <-> '[0.5, 0.6, ...]'
LIMIT 10;

Advanced Configuration

-- Large dataset with many lists
CREATE INDEX ON large_table USING ruivfflat (embedding vector_cosine_ops)
  WITH (lists = 1000);

-- High-recall search
SET ruvector.ivfflat_probes = 20;
SELECT * FROM large_table
ORDER BY embedding <=> '[...]'
LIMIT 100;

Index Statistics

-- Get index information
SELECT * FROM ruvector_ivfflat_stats('documents_embedding_idx');

-- Returns:
-- lists | probes | dimensions | trained | vector_count | metric
-- ------+--------+------------+---------+--------------+----------
-- 100   | 1      | 1536       | true    | 1000000      | euclidean

Comparison with HNSW

| Feature            | IVFFlat               | HNSW                |
|--------------------|-----------------------|---------------------|
| Build Time         | Fast (minutes)        | Slow (hours)        |
| Search Speed       | Fast                  | Faster              |
| Recall             | 80-95%                | 95-99%              |
| Memory             | Low                   | High                |
| Incremental Insert | Fast                  | Medium              |
| Best For           | Large static datasets | High-recall queries |

Maintenance

Rebuilding Index

After significant data changes, rebuild for better clustering:

REINDEX INDEX documents_embedding_idx;

Monitoring

-- Check index size
SELECT pg_size_pretty(pg_relation_size('documents_embedding_idx'));

-- Check if trained
SELECT * FROM ruvector_ivfflat_stats('documents_embedding_idx');

Implementation Details

Zero-Copy Vector Access

The implementation uses zero-copy techniques:

  • Read vector data directly from heap tuples
  • No intermediate buffer allocation
  • Compare directly with centroids in-place

Memory Management

  • Uses PostgreSQL's palloc/pfree memory contexts
  • Automatic cleanup on transaction end
  • No manual memory management required

Concurrency

  • Safe for concurrent reads
  • Index building is single-threaded
  • Inserts are serialized per cluster

Limitations

  1. Training Required: Cannot insert before training completes
  2. Fixed Clusters: Number of lists cannot change after build
  3. No Updates: Update requires delete + insert
  4. Memory: All centroids must fit in memory during search

Future Enhancements

  • Parallel index building
  • Incremental training for inserts
  • Product quantization (IVF-PQ)
  • GPU acceleration
  • Adaptive probe selection
  • Cluster rebalancing

References

  1. pgvector - Original IVFFlat implementation
  2. FAISS - Facebook AI Similarity Search
  3. "Product Quantization for Nearest Neighbor Search" - Jégou et al., 2011
  4. PostgreSQL Index Access Method Documentation