Squashed 'vendor/ruvector/' content from commit b64c2172

git-subtree-dir: vendor/ruvector git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
2026-02-28 14:39:40 -05:00
commit d803bfe2b1
7854 changed files with 3522914 additions and 0 deletions
--- a/docs/hnsw/HNSW_INDEX.md
+++ b/docs/hnsw/HNSW_INDEX.md
@@ -0,0 +1,386 @@
+# HNSW Index Implementation
+
+## Overview
+
+This document describes the HNSW (Hierarchical Navigable Small World) index implementation as a PostgreSQL Access Method for the RuVector extension.
+
+## What is HNSW?
+
+HNSW is a graph-based algorithm for approximate nearest neighbor (ANN) search in high-dimensional spaces. It provides:
+
+- **Logarithmic search complexity**: O(log N) average case
+- **High recall**: >95% recall achievable with proper parameters
+- **Incremental updates**: Supports efficient insertions and deletions
+- **Multi-layer graph structure**: Hierarchical organization for fast traversal
+
+## Architecture
+
+### Page-Based Storage
+
+The HNSW index stores data in PostgreSQL pages for durability and memory management:
+
+```
+Page 0 (Metadata):
+├─ Magic number: 0x484E5357 ("HNSW")
+├─ Version: 1
+├─ Dimensions: Vector dimensionality
+├─ Parameters: m, m0, ef_construction
+├─ Entry point: Block number of top-level node
+├─ Max layer: Highest layer in the graph
+└─ Metric: Distance metric (L2/Cosine/IP)
+
+Page 1+ (Node Pages):
+├─ Node Header:
+│  ├─ Page type: HNSW_PAGE_NODE
+│  ├─ Max layer: Highest layer for this node
+│  └─ Item pointer: TID of heap tuple
+├─ Vector data: [f32; dimensions]
+├─ Layer 0 neighbors: [BlockNumber; m0]
+└─ Layer 1+ neighbors: [[BlockNumber; m]; max_layer]
+```
+
+### Access Method Callbacks
+
+The implementation provides all required PostgreSQL index AM callbacks:
+
+1. **`ambuild`** - Builds index from table data
+2. **`ambuildempty`** - Creates empty index structure
+3. **`aminsert`** - Inserts a single vector
+4. **`ambulkdelete`** - Bulk deletion support
+5. **`amvacuumcleanup`** - Vacuum cleanup operations
+6. **`amcostestimate`** - Query cost estimation
+7. **`amgettuple`** - Sequential tuple retrieval
+8. **`amgetbitmap`** - Bitmap scan support
+9. **`amcanreturn`** - Index-only scan capability
+10. **`amoptions`** - Index option parsing
+
+## Usage
+
+### Creating an HNSW Index
+
+```sql
+-- Basic index creation (L2 distance, default parameters)
+CREATE INDEX ON items USING hnsw (embedding hnsw_l2_ops);
+
+-- With custom parameters
+CREATE INDEX ON items USING hnsw (embedding hnsw_l2_ops)
+    WITH (m = 32, ef_construction = 128);
+
+-- Cosine distance
+CREATE INDEX ON items USING hnsw (embedding hnsw_cosine_ops);
+
+-- Inner product
+CREATE INDEX ON items USING hnsw (embedding hnsw_ip_ops);
+```
+
+### Querying
+
+```sql
+-- Find 10 nearest neighbors using L2 distance
+SELECT id, embedding <-> ARRAY[0.1, 0.2, 0.3]::real[] AS distance
+FROM items
+ORDER BY embedding <-> ARRAY[0.1, 0.2, 0.3]::real[]
+LIMIT 10;
+
+-- Find 10 nearest neighbors using cosine distance
+SELECT id, embedding <=> ARRAY[0.1, 0.2, 0.3]::real[] AS distance
+FROM items
+ORDER BY embedding <=> ARRAY[0.1, 0.2, 0.3]::real[]
+LIMIT 10;
+
+-- Find vectors with largest inner product
+SELECT id, embedding <#> ARRAY[0.1, 0.2, 0.3]::real[] AS neg_ip
+FROM items
+ORDER BY embedding <#> ARRAY[0.1, 0.2, 0.3]::real[]
+LIMIT 10;
+```
+
+## Parameters
+
+### Index Build Parameters
+
+| Parameter | Type | Default | Range | Description |
+|-----------|------|---------|-------|-------------|
+| `m` | integer | 16 | 2-128 | Maximum connections per layer |
+| `ef_construction` | integer | 64 | 4-1000 | Size of dynamic candidate list during build |
+| `metric` | string | 'l2' | l2/cosine/ip | Distance metric |
+
+**Parameter Tuning Guidelines:**
+
+- **`m`**: Higher values improve recall but increase memory usage
+  - Low (8-16): Fast build, lower memory, good for small datasets
+  - Medium (16-32): Balanced performance
+  - High (32-64): Better recall, slower build, more memory
+
+- **`ef_construction`**: Higher values improve index quality but slow down build
+  - Low (32-64): Fast build, may sacrifice recall
+  - Medium (64-128): Balanced
+  - High (128-500): Best quality, slow build
+
+### Query-Time Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `ruvector.ef_search` | integer | 40 | Size of dynamic candidate list during search |
+
+**Setting ef_search:**
+
+```sql
+-- Global setting (postgresql.conf or ALTER SYSTEM)
+ALTER SYSTEM SET ruvector.ef_search = 100;
+
+-- Session setting (per-connection)
+SET ruvector.ef_search = 100;
+
+-- Query with increased recall
+SET LOCAL ruvector.ef_search = 200;
+SELECT ... ORDER BY embedding <-> query LIMIT 10;
+```
+
+## Distance Metrics
+
+### L2 (Euclidean) Distance
+
+- **Operator**: `<->`
+- **Formula**: `√(Σ(a[i] - b[i])²)`
+- **Use case**: General-purpose distance
+- **Range**: [0, ∞)
+
+```sql
+CREATE INDEX ON items USING hnsw (embedding hnsw_l2_ops);
+SELECT * FROM items ORDER BY embedding <-> query_vector LIMIT 10;
+```
+
+### Cosine Distance
+
+- **Operator**: `<=>`
+- **Formula**: `1 - (a·b)/(||a||·||b||)`
+- **Use case**: Direction similarity (text embeddings)
+- **Range**: [0, 2]
+
+```sql
+CREATE INDEX ON items USING hnsw (embedding hnsw_cosine_ops);
+SELECT * FROM items ORDER BY embedding <=> query_vector LIMIT 10;
+```
+
+### Inner Product
+
+- **Operator**: `<#>`
+- **Formula**: `-Σ(a[i] * b[i])`
+- **Use case**: Maximum similarity (normalized vectors)
+- **Range**: (-∞, ∞)
+
+```sql
+CREATE INDEX ON items USING hnsw (embedding hnsw_ip_ops);
+SELECT * FROM items ORDER BY embedding <#> query_vector LIMIT 10;
+```
+
+## Performance
+
+### Build Performance
+
+- **Time Complexity**: O(N log N) with high probability
+- **Space Complexity**: O(N * M * L) where L is average layer count
+- **Typical Build Rate**: 1000-10000 vectors/sec (depends on dimensions)
+
+### Query Performance
+
+- **Time Complexity**: O(ef_search * log N)
+- **Typical Query Time**:
+  - <1ms for 100K vectors (128D)
+  - <5ms for 1M vectors (128D)
+  - <10ms for 10M vectors (128D)
+
+### Memory Usage
+
+```
+Memory per vector ≈ dimensions * 4 bytes + m * 8 bytes * average_layers
+Average layers ≈ log₂(N) / log₂(m)
+
+Example (1M vectors, 128D, m=16):
+- Vector data: 1M * 128 * 4 = 512 MB
+- Graph edges: 1M * 16 * 8 * 4 = 512 MB
+- Total: ~1 GB
+```
+
+## Operator Classes
+
+### hnsw_l2_ops
+
+For L2 (Euclidean) distance on `real[]` vectors.
+
+```sql
+CREATE OPERATOR CLASS hnsw_l2_ops
+    FOR TYPE real[] USING hnsw
+    FAMILY hnsw_l2_ops AS
+    OPERATOR 1 <-> (real[], real[]) FOR ORDER BY float_ops,
+    FUNCTION 1 l2_distance_arr(real[], real[]);
+```
+
+### hnsw_cosine_ops
+
+For cosine distance on `real[]` vectors.
+
+```sql
+CREATE OPERATOR CLASS hnsw_cosine_ops
+    FOR TYPE real[] USING hnsw
+    FAMILY hnsw_cosine_ops AS
+    OPERATOR 1 <=> (real[], real[]) FOR ORDER BY float_ops,
+    FUNCTION 1 cosine_distance_arr(real[], real[]);
+```
+
+### hnsw_ip_ops
+
+For inner product on `real[]` vectors.
+
+```sql
+CREATE OPERATOR CLASS hnsw_ip_ops
+    FOR TYPE real[] USING hnsw
+    FAMILY hnsw_ip_ops AS
+    OPERATOR 1 <#> (real[], real[]) FOR ORDER BY float_ops,
+    FUNCTION 1 neg_inner_product_arr(real[], real[]);
+```
+
+## Monitoring and Maintenance
+
+### Index Statistics
+
+```sql
+-- View memory usage
+SELECT ruvector_memory_stats();
+
+-- Check index size
+SELECT pg_size_pretty(pg_relation_size('items_embedding_idx'));
+
+-- View index definition
+SELECT indexdef FROM pg_indexes WHERE indexname = 'items_embedding_idx';
+```
+
+### Index Maintenance
+
+```sql
+-- Perform maintenance (optimize connections, rebuild degraded nodes)
+SELECT ruvector_index_maintenance('items_embedding_idx');
+
+-- Vacuum to reclaim space after deletes
+VACUUM items;
+
+-- Rebuild index if heavily modified
+REINDEX INDEX items_embedding_idx;
+```
+
+### Query Plan Analysis
+
+```sql
+-- Analyze query execution
+EXPLAIN (ANALYZE, BUFFERS)
+SELECT id, embedding <-> query AS distance
+FROM items
+ORDER BY embedding <-> query
+LIMIT 10;
+```
+
+## Best Practices
+
+### 1. Index Creation
+
+- Build indexes on stable data when possible
+- Use higher `ef_construction` for better quality
+- Consider using `maintenance_work_mem` for large builds:
+  ```sql
+  SET maintenance_work_mem = '2GB';
+  CREATE INDEX ...;
+  ```
+
+### 2. Query Optimization
+
+- Adjust `ef_search` based on recall requirements
+- Use prepared statements for repeated queries
+- Consider query result caching for common queries
+
+### 3. Data Management
+
+- Normalize vectors for cosine similarity
+- Batch inserts when possible
+- Schedule index maintenance during low-traffic periods
+
+### 4. Monitoring
+
+- Track index size growth
+- Monitor query performance metrics
+- Set up alerts for memory usage
+
+## Limitations
+
+### Current Version
+
+- **Single column only**: Multi-column indexes not supported
+- **No parallel scans**: Query parallelism not yet implemented
+- **No index-only scans**: Must access heap tuples
+- **Array type only**: Custom vector type support coming soon
+
+### PostgreSQL Version Requirements
+
+- PostgreSQL 14+
+- pgrx 0.12+
+
+## Troubleshooting
+
+### Index Build Fails
+
+**Problem**: Out of memory during index build
+**Solution**: Increase `maintenance_work_mem` or reduce `ef_construction`
+
+```sql
+SET maintenance_work_mem = '4GB';
+```
+
+### Slow Queries
+
+**Problem**: Queries are slower than expected
+**Solution**: Increase `ef_search` or rebuild index with higher `m`
+
+```sql
+SET ruvector.ef_search = 100;
+```
+
+### Low Recall
+
+**Problem**: Not finding correct nearest neighbors
+**Solution**: Increase `ef_search` or rebuild with higher `ef_construction`
+
+```sql
+REINDEX INDEX items_embedding_idx;
+```
+
+## Comparison with Other Methods
+
+| Feature | HNSW | IVFFlat | Brute Force |
+|---------|------|---------|-------------|
+| Search Time | O(log N) | O(√N) | O(N) |
+| Build Time | O(N log N) | O(N) | O(1) |
+| Memory | High | Medium | Low |
+| Recall | >95% | >90% | 100% |
+| Updates | Good | Poor | Excellent |
+
+## Future Enhancements
+
+- [ ] Parallel index scans
+- [ ] Custom vector type support
+- [ ] Index-only scans
+- [ ] Dynamic parameter tuning
+- [ ] Graph compression
+- [ ] Multi-column indexes
+- [ ] Distributed HNSW
+
+## References
+
+1. Malkov, Y. A., & Yashunin, D. A. (2018). "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs." IEEE transactions on pattern analysis and machine intelligence.
+
+2. PostgreSQL Index Access Method documentation: https://www.postgresql.org/docs/current/indexam.html
+
+3. pgrx documentation: https://github.com/pgcentralfoundation/pgrx
+
+## License
+
+MIT License - See LICENSE file for details.