# HNSW Index Implementation ## Overview This document describes the HNSW (Hierarchical Navigable Small World) index implementation as a PostgreSQL Access Method for the RuVector extension. ## What is HNSW? HNSW is a graph-based algorithm for approximate nearest neighbor (ANN) search in high-dimensional spaces. It provides: - **Logarithmic search complexity**: O(log N) average case - **High recall**: >95% recall achievable with proper parameters - **Incremental updates**: Supports efficient insertions and deletions - **Multi-layer graph structure**: Hierarchical organization for fast traversal ## Architecture ### Page-Based Storage The HNSW index stores data in PostgreSQL pages for durability and memory management: ``` Page 0 (Metadata): ├─ Magic number: 0x484E5357 ("HNSW") ├─ Version: 1 ├─ Dimensions: Vector dimensionality ├─ Parameters: m, m0, ef_construction ├─ Entry point: Block number of top-level node ├─ Max layer: Highest layer in the graph └─ Metric: Distance metric (L2/Cosine/IP) Page 1+ (Node Pages): ├─ Node Header: │ ├─ Page type: HNSW_PAGE_NODE │ ├─ Max layer: Highest layer for this node │ └─ Item pointer: TID of heap tuple ├─ Vector data: [f32; dimensions] ├─ Layer 0 neighbors: [BlockNumber; m0] └─ Layer 1+ neighbors: [[BlockNumber; m]; max_layer] ``` ### Access Method Callbacks The implementation provides all required PostgreSQL index AM callbacks: 1. **`ambuild`** - Builds index from table data 2. **`ambuildempty`** - Creates empty index structure 3. **`aminsert`** - Inserts a single vector 4. **`ambulkdelete`** - Bulk deletion support 5. **`amvacuumcleanup`** - Vacuum cleanup operations 6. **`amcostestimate`** - Query cost estimation 7. **`amgettuple`** - Sequential tuple retrieval 8. **`amgetbitmap`** - Bitmap scan support 9. **`amcanreturn`** - Index-only scan capability 10. **`amoptions`** - Index option parsing ## Usage ### Creating an HNSW Index ```sql -- Basic index creation (L2 distance, default parameters) CREATE INDEX ON items USING hnsw (embedding hnsw_l2_ops); -- With custom parameters CREATE INDEX ON items USING hnsw (embedding hnsw_l2_ops) WITH (m = 32, ef_construction = 128); -- Cosine distance CREATE INDEX ON items USING hnsw (embedding hnsw_cosine_ops); -- Inner product CREATE INDEX ON items USING hnsw (embedding hnsw_ip_ops); ``` ### Querying ```sql -- Find 10 nearest neighbors using L2 distance SELECT id, embedding <-> ARRAY[0.1, 0.2, 0.3]::real[] AS distance FROM items ORDER BY embedding <-> ARRAY[0.1, 0.2, 0.3]::real[] LIMIT 10; -- Find 10 nearest neighbors using cosine distance SELECT id, embedding <=> ARRAY[0.1, 0.2, 0.3]::real[] AS distance FROM items ORDER BY embedding <=> ARRAY[0.1, 0.2, 0.3]::real[] LIMIT 10; -- Find vectors with largest inner product SELECT id, embedding <#> ARRAY[0.1, 0.2, 0.3]::real[] AS neg_ip FROM items ORDER BY embedding <#> ARRAY[0.1, 0.2, 0.3]::real[] LIMIT 10; ``` ## Parameters ### Index Build Parameters | Parameter | Type | Default | Range | Description | |-----------|------|---------|-------|-------------| | `m` | integer | 16 | 2-128 | Maximum connections per layer | | `ef_construction` | integer | 64 | 4-1000 | Size of dynamic candidate list during build | | `metric` | string | 'l2' | l2/cosine/ip | Distance metric | **Parameter Tuning Guidelines:** - **`m`**: Higher values improve recall but increase memory usage - Low (8-16): Fast build, lower memory, good for small datasets - Medium (16-32): Balanced performance - High (32-64): Better recall, slower build, more memory - **`ef_construction`**: Higher values improve index quality but slow down build - Low (32-64): Fast build, may sacrifice recall - Medium (64-128): Balanced - High (128-500): Best quality, slow build ### Query-Time Parameters | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `ruvector.ef_search` | integer | 40 | Size of dynamic candidate list during search | **Setting ef_search:** ```sql -- Global setting (postgresql.conf or ALTER SYSTEM) ALTER SYSTEM SET ruvector.ef_search = 100; -- Session setting (per-connection) SET ruvector.ef_search = 100; -- Query with increased recall SET LOCAL ruvector.ef_search = 200; SELECT ... ORDER BY embedding <-> query LIMIT 10; ``` ## Distance Metrics ### L2 (Euclidean) Distance - **Operator**: `<->` - **Formula**: `√(Σ(a[i] - b[i])²)` - **Use case**: General-purpose distance - **Range**: [0, ∞) ```sql CREATE INDEX ON items USING hnsw (embedding hnsw_l2_ops); SELECT * FROM items ORDER BY embedding <-> query_vector LIMIT 10; ``` ### Cosine Distance - **Operator**: `<=>` - **Formula**: `1 - (a·b)/(||a||·||b||)` - **Use case**: Direction similarity (text embeddings) - **Range**: [0, 2] ```sql CREATE INDEX ON items USING hnsw (embedding hnsw_cosine_ops); SELECT * FROM items ORDER BY embedding <=> query_vector LIMIT 10; ``` ### Inner Product - **Operator**: `<#>` - **Formula**: `-Σ(a[i] * b[i])` - **Use case**: Maximum similarity (normalized vectors) - **Range**: (-∞, ∞) ```sql CREATE INDEX ON items USING hnsw (embedding hnsw_ip_ops); SELECT * FROM items ORDER BY embedding <#> query_vector LIMIT 10; ``` ## Performance ### Build Performance - **Time Complexity**: O(N log N) with high probability - **Space Complexity**: O(N * M * L) where L is average layer count - **Typical Build Rate**: 1000-10000 vectors/sec (depends on dimensions) ### Query Performance - **Time Complexity**: O(ef_search * log N) - **Typical Query Time**: - <1ms for 100K vectors (128D) - <5ms for 1M vectors (128D) - <10ms for 10M vectors (128D) ### Memory Usage ``` Memory per vector ≈ dimensions * 4 bytes + m * 8 bytes * average_layers Average layers ≈ log₂(N) / log₂(m) Example (1M vectors, 128D, m=16): - Vector data: 1M * 128 * 4 = 512 MB - Graph edges: 1M * 16 * 8 * 4 = 512 MB - Total: ~1 GB ``` ## Operator Classes ### hnsw_l2_ops For L2 (Euclidean) distance on `real[]` vectors. ```sql CREATE OPERATOR CLASS hnsw_l2_ops FOR TYPE real[] USING hnsw FAMILY hnsw_l2_ops AS OPERATOR 1 <-> (real[], real[]) FOR ORDER BY float_ops, FUNCTION 1 l2_distance_arr(real[], real[]); ``` ### hnsw_cosine_ops For cosine distance on `real[]` vectors. ```sql CREATE OPERATOR CLASS hnsw_cosine_ops FOR TYPE real[] USING hnsw FAMILY hnsw_cosine_ops AS OPERATOR 1 <=> (real[], real[]) FOR ORDER BY float_ops, FUNCTION 1 cosine_distance_arr(real[], real[]); ``` ### hnsw_ip_ops For inner product on `real[]` vectors. ```sql CREATE OPERATOR CLASS hnsw_ip_ops FOR TYPE real[] USING hnsw FAMILY hnsw_ip_ops AS OPERATOR 1 <#> (real[], real[]) FOR ORDER BY float_ops, FUNCTION 1 neg_inner_product_arr(real[], real[]); ``` ## Monitoring and Maintenance ### Index Statistics ```sql -- View memory usage SELECT ruvector_memory_stats(); -- Check index size SELECT pg_size_pretty(pg_relation_size('items_embedding_idx')); -- View index definition SELECT indexdef FROM pg_indexes WHERE indexname = 'items_embedding_idx'; ``` ### Index Maintenance ```sql -- Perform maintenance (optimize connections, rebuild degraded nodes) SELECT ruvector_index_maintenance('items_embedding_idx'); -- Vacuum to reclaim space after deletes VACUUM items; -- Rebuild index if heavily modified REINDEX INDEX items_embedding_idx; ``` ### Query Plan Analysis ```sql -- Analyze query execution EXPLAIN (ANALYZE, BUFFERS) SELECT id, embedding <-> query AS distance FROM items ORDER BY embedding <-> query LIMIT 10; ``` ## Best Practices ### 1. Index Creation - Build indexes on stable data when possible - Use higher `ef_construction` for better quality - Consider using `maintenance_work_mem` for large builds: ```sql SET maintenance_work_mem = '2GB'; CREATE INDEX ...; ``` ### 2. Query Optimization - Adjust `ef_search` based on recall requirements - Use prepared statements for repeated queries - Consider query result caching for common queries ### 3. Data Management - Normalize vectors for cosine similarity - Batch inserts when possible - Schedule index maintenance during low-traffic periods ### 4. Monitoring - Track index size growth - Monitor query performance metrics - Set up alerts for memory usage ## Limitations ### Current Version - **Single column only**: Multi-column indexes not supported - **No parallel scans**: Query parallelism not yet implemented - **No index-only scans**: Must access heap tuples - **Array type only**: Custom vector type support coming soon ### PostgreSQL Version Requirements - PostgreSQL 14+ - pgrx 0.12+ ## Troubleshooting ### Index Build Fails **Problem**: Out of memory during index build **Solution**: Increase `maintenance_work_mem` or reduce `ef_construction` ```sql SET maintenance_work_mem = '4GB'; ``` ### Slow Queries **Problem**: Queries are slower than expected **Solution**: Increase `ef_search` or rebuild index with higher `m` ```sql SET ruvector.ef_search = 100; ``` ### Low Recall **Problem**: Not finding correct nearest neighbors **Solution**: Increase `ef_search` or rebuild with higher `ef_construction` ```sql REINDEX INDEX items_embedding_idx; ``` ## Comparison with Other Methods | Feature | HNSW | IVFFlat | Brute Force | |---------|------|---------|-------------| | Search Time | O(log N) | O(√N) | O(N) | | Build Time | O(N log N) | O(N) | O(1) | | Memory | High | Medium | Low | | Recall | >95% | >90% | 100% | | Updates | Good | Poor | Excellent | ## Future Enhancements - [ ] Parallel index scans - [ ] Custom vector type support - [ ] Index-only scans - [ ] Dynamic parameter tuning - [ ] Graph compression - [ ] Multi-column indexes - [ ] Distributed HNSW ## References 1. Malkov, Y. A., & Yashunin, D. A. (2018). "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs." IEEE transactions on pattern analysis and machine intelligence. 2. PostgreSQL Index Access Method documentation: https://www.postgresql.org/docs/current/indexam.html 3. pgrx documentation: https://github.com/pgcentralfoundation/pgrx ## License MIT License - See LICENSE file for details.