Squashed 'vendor/ruvector/' content from commit b64c2172

git-subtree-dir: vendor/ruvector
git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
This commit is contained in:
ruv
2026-02-28 14:39:40 -05:00
commit d803bfe2b1
7854 changed files with 3522914 additions and 0 deletions

View File

@@ -0,0 +1,219 @@
# AgenticDB Embedding Limitation - Fix Summary
## What Was Changed
The AgenticDB module has been updated to make it **crystal clear** that it uses placeholder hash-based embeddings, NOT real semantic embeddings.
## Changes Made
### 1. Updated `/workspaces/ruvector/crates/ruvector-core/src/agenticdb.rs`
#### Module-Level Documentation
Added prominent warning at the top of the file:
```rust
//! # ⚠️ CRITICAL WARNING: PLACEHOLDER EMBEDDINGS
//!
//! **THIS MODULE USES HASH-BASED PLACEHOLDER EMBEDDINGS - NOT REAL SEMANTIC EMBEDDINGS**
//!
//! The `generate_text_embedding()` function creates embeddings using a simple hash function
//! that does NOT understand semantic meaning. Similarity is based on character overlap, NOT meaning.
//!
//! **For Production Use:**
//! - Integrate a real embedding model (sentence-transformers, OpenAI, Anthropic, Cohere)
//! - Use ONNX Runtime, candle, or Python bindings for inference
//! - See `/examples/onnx-embeddings` for a production-ready integration example
```
#### Function Documentation
Expanded `generate_text_embedding()` documentation with:
- Clear examples of what won't work
- Four integration options (ONNX, Candle, API, Python)
- Code examples for each option
- Explicit warning that this is NOT semantic search
### 2. Updated `/workspaces/ruvector/crates/ruvector-core/src/lib.rs`
#### Module Documentation
Updated experimental features section to warn users:
```rust
## Experimental/Incomplete Features - READ BEFORE USE
//!
//! - **AgenticDB**: ⚠️⚠️⚠️ **CRITICAL WARNING** ⚠️⚠️⚠️
//! - Uses PLACEHOLDER hash-based embeddings, NOT real semantic embeddings
//! - "dog" and "cat" will NOT be similar (different characters)
//! - "dog" and "god" WILL be similar (same characters) - **This is wrong!**
//! - **MUST integrate real embedding model for production** (ONNX, Candle, or API)
```
#### Compile-Time Warning
Added a deprecation notice that appears during compilation:
```rust
#[cfg(feature = "storage")]
const _: () = {
#[deprecated(
since = "0.1.0",
note = "AgenticDB uses placeholder hash-based embeddings. For semantic search, integrate a real embedding model (ONNX, Candle, or API). See /examples/onnx-embeddings for production setup."
)]
const AGENTICDB_EMBEDDING_WARNING: () = ();
let _ = AGENTICDB_EMBEDDING_WARNING;
};
```
### 3. Updated `/workspaces/ruvector/docs/guides/AGENTICDB_API.md`
Added prominent warning at the top of the documentation:
```markdown
## ⚠️ CRITICAL LIMITATION: Placeholder Embeddings
**THIS MODULE USES HASH-BASED PLACEHOLDER EMBEDDINGS - NOT REAL SEMANTIC EMBEDDINGS**
### What This Means
- ❌ "dog" and "cat" will NOT be similar (different characters)
- ❌ "happy" and "joyful" will NOT be similar (different characters)
- ❌ "car" and "automobile" will NOT be similar (different characters)
- ✅ "dog" and "god" WILL be similar (same characters) - **This is wrong!**
```
Added warnings to all semantic search functions:
- `retrieve_similar_episodes()`
- `search_skills()`
- Updated future enhancements section
### 4. Created `/workspaces/ruvector/docs/guides/AGENTICDB_EMBEDDINGS_WARNING.md`
Comprehensive guide covering:
- What the limitation means
- Why it exists
- Four integration options with complete code examples:
1. **ONNX Runtime** (Recommended) - Full implementation example
2. **Candle** (Pure Rust) - Native inference example
3. **API-based** (OpenAI, Cohere) - Cloud API examples
4. **Python Bindings** - sentence-transformers integration
- Step-by-step integration instructions
- Performance comparison table
- Verification tests
## What Users See Now
### During Compilation
Users will see a deprecation warning (when using storage feature):
```
warning: use of deprecated constant `_::AGENTICDB_EMBEDDING_WARNING`:
AgenticDB uses placeholder hash-based embeddings. For semantic search,
integrate a real embedding model (ONNX, Candle, or API).
See /examples/onnx-embeddings for production setup.
```
### In Documentation
- Module-level warnings in rustdoc
- Function-level warnings on semantic search functions
- Clear examples of what won't work
- Complete integration guide
### In Code Comments
Every semantic search function now has explicit warnings about the limitation.
## Why This Approach
1. **Honesty First**: Users must understand this is a placeholder before using it
2. **Actionable Guidance**: Four clear paths to integrate real embeddings
3. **Gradual Warnings**:
- Compile-time: Subtle deprecation notice
- Documentation: Prominent warnings
- Runtime: Clear in function docs
4. **Preserve Functionality**: The placeholder still works for testing and API validation
## What Users Need to Do
### For Testing/Development
- Current implementation is fine for:
- API structure testing
- Performance benchmarking (vector operations)
- Development without external dependencies
### For Production
Users MUST choose one of four integration options:
1. **ONNX Runtime** (Recommended ⭐)
- Best balance of performance and compatibility
- ~5-20ms latency per embedding
- Models: all-MiniLM-L6-v2, all-mpnet-base-v2
- See `/examples/onnx-embeddings`
2. **Candle** (Pure Rust)
- No external runtime needed
- ~10-30ms latency
- Full control over model
3. **API-based** (OpenAI, Cohere, Anthropic)
- Fastest to prototype
- ~100-300ms latency (network)
- $0.02-$0.13 per 1M tokens
4. **Python Bindings** (sentence-transformers)
- Leverage existing ML ecosystem
- ~5-20ms latency
- Maximum flexibility
## Verification
Build succeeds with warnings:
```bash
cargo build -p ruvector-core
# Shows deprecation warning for AgenticDB
cargo doc -p ruvector-core --no-deps
# Generates documentation with all warnings visible
```
Tests still pass:
```bash
cargo test -p ruvector-core agenticdb
# All 15+ tests pass (using hash embeddings for testing only)
```
## Files Modified
1. `/workspaces/ruvector/crates/ruvector-core/src/agenticdb.rs` - Module and function warnings
2. `/workspaces/ruvector/crates/ruvector-core/src/lib.rs` - Experimental features warning
3. `/workspaces/ruvector/docs/guides/AGENTICDB_API.md` - Documentation warnings
4. `/workspaces/ruvector/docs/guides/AGENTICDB_EMBEDDINGS_WARNING.md` - New comprehensive guide
## Files Created
1. `/workspaces/ruvector/docs/guides/AGENTICDB_EMBEDDINGS_WARNING.md` - Complete integration guide
2. `/workspaces/ruvector/docs/guides/AGENTICDB_EMBEDDING_FIX_SUMMARY.md` - This summary
## Future Improvements
### Planned (Next Steps)
1. Add feature flag `real-embeddings` that requires integration at compile time
2. Add runtime warning when using placeholder embeddings
3. Create example implementations for all four integration options
4. Add semantic similarity tests that verify real embeddings
### Example: Feature Flag (Future)
```rust
#[cfg(all(feature = "storage", not(feature = "real-embeddings")))]
compile_error!(
"AgenticDB requires 'real-embeddings' feature for production use. \
Current placeholder embeddings do NOT provide semantic search. \
Enable with: cargo build --features real-embeddings"
);
```
## Conclusion
The AgenticDB module is now **honest and transparent** about its limitations:
**Clear warnings** at every level (compile-time, docs, code)
**Actionable guidance** with four integration paths and code examples
**Preserved functionality** for testing and development
**Production-ready paths** clearly documented
Users can no longer accidentally use placeholder embeddings for semantic search without being warned multiple times.