Squashed 'vendor/ruvector/' content from commit b64c2172
git-subtree-dir: vendor/ruvector git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
443
npm/packages/ruvector-extensions/docs/EMBEDDINGS.md
Normal file
@@ -0,0 +1,443 @@
# Embeddings Integration Module

Comprehensive embeddings integration for ruvector-extensions, supporting multiple providers with a unified interface.

## Features

✨ **Multi-Provider Support**
- OpenAI (text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002)
- Cohere (embed-english-v3.0, embed-multilingual-v3.0)
- Anthropic/Voyage (voyage-2)
- HuggingFace (local models via transformers.js)

⚡ **Automatic Batch Processing**
- Intelligent batching based on provider limits
- Automatic retry logic with exponential backoff
- Progress tracking for large datasets

🔒 **Type-Safe & Production-Ready**
- Full TypeScript support
- Comprehensive error handling
- JSDoc documentation
- Configurable retry strategies

## Installation

```bash
npm install ruvector-extensions

# Install provider SDKs (optional - based on what you use)
npm install openai                # For OpenAI
npm install cohere-ai             # For Cohere
npm install @anthropic-ai/sdk     # For Anthropic
npm install @xenova/transformers  # For local HuggingFace models
```

## Quick Start

### OpenAI Embeddings

```typescript
import { OpenAIEmbeddings } from 'ruvector-extensions';

const openai = new OpenAIEmbeddings({
  apiKey: process.env.OPENAI_API_KEY,
  model: 'text-embedding-3-small', // 1536 dimensions
});

// Embed single text
const embedding = await openai.embedText('Hello, world!');

// Embed multiple texts (automatic batching)
const result = await openai.embedTexts([
  'Machine learning is fascinating',
  'Deep learning uses neural networks',
  'Natural language processing is important',
]);

console.log('Embeddings:', result.embeddings.length);
console.log('Tokens used:', result.totalTokens);
```

### Custom Dimensions (OpenAI)

```typescript
const openai = new OpenAIEmbeddings({
  apiKey: process.env.OPENAI_API_KEY,
  model: 'text-embedding-3-large',
  dimensions: 1024, // Reduce from 3072 to 1024
});

const embedding = await openai.embedText('Custom dimension embedding');
console.log('Dimension:', embedding.length); // 1024
```
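
To make the storage effect of a smaller `dimensions` value concrete, the arithmetic can be sketched as follows. This is a back-of-the-envelope estimate that assumes vectors are stored as 32-bit floats (4 bytes per component) and ignores index and metadata overhead; `estimateStorageBytes` is a hypothetical helper, not part of ruvector-extensions.

```typescript
// Hypothetical helper: raw vector storage, assuming float32 components
// and ignoring any index or metadata overhead.
function estimateStorageBytes(vectorCount: number, dimension: number): number {
  const BYTES_PER_FLOAT32 = 4;
  return vectorCount * dimension * BYTES_PER_FLOAT32;
}

const fullSize = estimateStorageBytes(1000000, 3072);    // 12288000000 bytes
const reducedSize = estimateStorageBytes(1000000, 1024); // 4096000000 bytes

console.log(`3072 dims: ${fullSize} bytes`);
console.log(`1024 dims: ${reducedSize} bytes`);
console.log(`Reduction factor: ${fullSize / reducedSize}`); // 3
```

Under these assumptions, one million vectors shrink from roughly 12.3 GB to 4.1 GB. Whether the smaller vectors keep enough retrieval quality is something to measure on your own data.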

### Cohere Embeddings

```typescript
import { CohereEmbeddings } from 'ruvector-extensions';

// For document storage
const documentEmbedder = new CohereEmbeddings({
  apiKey: process.env.COHERE_API_KEY,
  model: 'embed-english-v3.0',
  inputType: 'search_document',
});

// For search queries
const queryEmbedder = new CohereEmbeddings({
  apiKey: process.env.COHERE_API_KEY,
  model: 'embed-english-v3.0',
  inputType: 'search_query',
});

const docs = await documentEmbedder.embedTexts([
  'The Eiffel Tower is in Paris',
  'The Statue of Liberty is in New York',
]);

const query = await queryEmbedder.embedText('famous landmarks in France');
```
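
Once a query vector and a set of document vectors are available, they are usually compared with cosine similarity. The sketch below shows that comparison on toy 3-dimensional vectors. It assumes the embeddings are plain `number[]` arrays; `cosineSimilarity` and `rankBySimilarity` are illustrative names, not functions exported by ruvector-extensions.

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; define its similarity as 0.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank document vectors against a query vector, best match first.
function rankBySimilarity(
  query: number[],
  docs: number[][]
): { index: number; score: number }[] {
  return docs
    .map((doc, index) => ({ index, score: cosineSimilarity(query, doc) }))
    .sort((x, y) => y.score - x.score);
}

// Toy vectors stand in for real embeddings.
const ranked = rankBySimilarity([1, 0, 0], [[0, 1, 0], [2, 0, 0], [1, 1, 0]]);
console.log(ranked.map((r) => r.index)); // [1, 2, 0]
```

In practice the vector database performs this ranking for you; the sketch is only meant to show what "closest" means for embeddings.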

### Anthropic/Voyage Embeddings

```typescript
import { AnthropicEmbeddings } from 'ruvector-extensions';

const anthropic = new AnthropicEmbeddings({
  apiKey: process.env.VOYAGE_API_KEY,
  model: 'voyage-2',
  inputType: 'document',
});

const result = await anthropic.embedTexts([
  'Anthropic develops Claude AI',
  'Voyage AI provides embedding models',
]);
```

### Local HuggingFace Embeddings

```typescript
import { HuggingFaceEmbeddings } from 'ruvector-extensions';

// No API key needed - runs locally!
const hf = new HuggingFaceEmbeddings({
  model: 'Xenova/all-MiniLM-L6-v2',
  normalize: true,
  batchSize: 32,
});

const result = await hf.embedTexts([
  'Local embeddings are fast',
  'No API calls required',
  'Privacy-friendly solution',
]);
```
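
The `normalize` option most likely scales each vector to unit length (L2 normalization); this is an assumption, since the option's exact behavior is not spelled out here. A minimal sketch of that operation, with `l2Normalize` as a hypothetical helper name:

```typescript
// Scale a vector to unit length (L2 norm of 1).
function l2Normalize(vector: number[]): number[] {
  const norm = Math.sqrt(vector.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) return vector.slice(); // leave the zero vector unchanged
  return vector.map((x) => x / norm);
}

const unit = l2Normalize([3, 4]);
console.log(unit); // [0.6, 0.8]
```

For unit-length vectors, the dot product equals the cosine similarity, which is why normalized embeddings are convenient for similarity search.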

## VectorDB Integration

### Insert Documents

```typescript
import { VectorDB } from 'ruvector';
import { OpenAIEmbeddings, embedAndInsert } from 'ruvector-extensions';

const openai = new OpenAIEmbeddings({
  apiKey: process.env.OPENAI_API_KEY,
});

const db = new VectorDB({ dimension: openai.getDimension() });

const documents = [
  {
    id: 'doc1',
    text: 'Machine learning enables computers to learn from data',
    metadata: { category: 'AI', author: 'John Doe' },
  },
  {
    id: 'doc2',
    text: 'Deep learning uses neural networks',
    metadata: { category: 'AI', author: 'Jane Smith' },
  },
];

const ids = await embedAndInsert(db, openai, documents, {
  overwrite: true,
  onProgress: (current, total) => {
    console.log(`Progress: ${current}/${total}`);
  },
});

console.log('Inserted IDs:', ids);
```

### Search Documents

```typescript
import { embedAndSearch } from 'ruvector-extensions';

const results = await embedAndSearch(
  db,
  openai,
  'What is deep learning?',
  {
    topK: 5,
    threshold: 0.7,
    filter: { category: 'AI' },
  }
);

console.log('Search results:', results);
```
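
The interaction between `topK` and `threshold` can be sketched as below. The semantics here are an assumption: results scoring below `threshold` are dropped, then the `topK` highest-scoring results are kept, with a higher score meaning a closer match. The `ScoredResult` shape and `applyTopKAndThreshold` are hypothetical and only illustrate the idea; the library may apply these options differently.

```typescript
// Hypothetical result shape for illustration only.
interface ScoredResult {
  id: string;
  score: number; // assumed: higher means more similar
}

// Assumed semantics: filter by minimum score, then keep the best `topK`.
function applyTopKAndThreshold(
  results: ScoredResult[],
  topK: number,
  threshold: number
): ScoredResult[] {
  return results
    .filter((r) => r.score >= threshold)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

const candidates: ScoredResult[] = [
  { id: 'doc1', score: 0.65 },
  { id: 'doc2', score: 0.91 },
  { id: 'doc3', score: 0.78 },
];

const kept = applyTopKAndThreshold(candidates, 5, 0.7);
console.log(kept.map((r) => r.id)); // ['doc2', 'doc3']
```

Note that with a threshold set, a search can return fewer than `topK` results, or none at all.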

## Advanced Features

### Custom Retry Configuration

```typescript
const openai = new OpenAIEmbeddings({
  apiKey: process.env.OPENAI_API_KEY,
  retryConfig: {
    maxRetries: 5,
    initialDelay: 2000,    // 2 seconds
    maxDelay: 30000,       // 30 seconds
    backoffMultiplier: 2,  // Exponential backoff
  },
});
```
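
To see what delays such a configuration would produce, the sketch below computes a retry schedule. It assumes the common formula `initialDelay * backoffMultiplier^attempt`, capped at `maxDelay`, with no jitter; the library's actual schedule may differ. The `RetryConfig` interface is redeclared locally from the fields shown above, and `backoffDelays` is a hypothetical helper.

```typescript
// Local declaration mirroring the retry fields used in this document.
interface RetryConfig {
  maxRetries: number;
  initialDelay: number; // milliseconds
  maxDelay: number; // milliseconds
  backoffMultiplier: number;
}

// Assumed formula: delay before retry n (0-based) is
// initialDelay * backoffMultiplier^n, capped at maxDelay.
function backoffDelays(config: RetryConfig): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < config.maxRetries; attempt++) {
    const raw = config.initialDelay * Math.pow(config.backoffMultiplier, attempt);
    delays.push(Math.min(raw, config.maxDelay));
  }
  return delays;
}

const schedule = backoffDelays({
  maxRetries: 5,
  initialDelay: 2000,
  maxDelay: 30000,
  backoffMultiplier: 2,
});
console.log(schedule); // [2000, 4000, 8000, 16000, 30000]
```

Under these assumptions the fifth retry would wait 32 seconds uncapped, so `maxDelay` trims it to 30 seconds, and the worst case adds 60 seconds of waiting in total.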

### Batch Processing Large Datasets

```typescript
// Automatically handles batching based on provider limits
const largeDataset = Array.from({ length: 10000 }, (_, i) =>
  `Document ${i}: Sample text for embedding`
);

const result = await openai.embedTexts(largeDataset);
console.log(`Processed ${result.embeddings.length} documents`);
console.log(`Total tokens: ${result.totalTokens}`);
```
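
Conceptually, automatic batching amounts to splitting the input into chunks no larger than the provider's limit. The following is a minimal sketch of that idea, not the library's implementation; `chunk` is a hypothetical helper, and 2048 is the OpenAI batch limit quoted in the provider comparison table.

```typescript
// Split an array into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const texts = Array.from({ length: 10000 }, (_, i) => `Document ${i}`);
const batches = chunk(texts, 2048);

console.log(batches.length); // 5
console.log(batches[4].length); // 1808 (the final, partial batch)
```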

### Error Handling

```typescript
try {
  const result = await openai.embedTexts(['Test text']);
  console.log('Success!');
} catch (error) {
  // `error` is `unknown` under strict TypeScript, so narrow it before reading fields
  const err = error as Error & { retryable?: boolean };
  if (err.retryable) {
    console.log('Temporary error - can retry');
  } else {
    console.log('Permanent error - fix required');
  }
  console.error('Error:', err.message);
}
```

### Progress Tracking

```typescript
const progressBar = (current: number, total: number) => {
  const percentage = Math.round((current / total) * 100);
  console.log(`[${percentage}%] ${current}/${total}`);
};

await embedAndInsert(db, openai, documents, {
  onProgress: progressBar,
});
```

## Provider Comparison

| Provider | Dimension | Max Batch | API Required | Local |
|----------|-----------|-----------|--------------|-------|
| OpenAI text-embedding-3-small | 1536 (configurable) | 2048 | ✅ | ❌ |
| OpenAI text-embedding-3-large | 3072 (configurable) | 2048 | ✅ | ❌ |
| Cohere embed-v3.0 | 1024 | 96 | ✅ | ❌ |
| Anthropic/Voyage | 1024 | 128 | ✅ | ❌ |
| HuggingFace (local) | 384 (model-dependent) | Configurable | ❌ | ✅ |
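
The "Max Batch" column translates directly into how many API requests a dataset needs. The sketch below works through that arithmetic for 10,000 texts, using the batch sizes from the table. It assumes one request per batch and ignores any per-request token limits; `requestsNeeded` is a hypothetical helper.

```typescript
// Max batch sizes copied from the comparison table.
const maxBatchSize: Record<string, number> = {
  'OpenAI text-embedding-3-small': 2048,
  'Cohere embed-v3.0': 96,
  'Anthropic/Voyage': 128,
};

// Number of requests needed to embed `textCount` texts, one batch per request.
function requestsNeeded(textCount: number, batchSize: number): number {
  return Math.ceil(textCount / batchSize);
}

for (const provider of Object.keys(maxBatchSize)) {
  console.log(`${provider}: ${requestsNeeded(10000, maxBatchSize[provider])} requests`);
}
// OpenAI text-embedding-3-small: 5 requests
// Cohere embed-v3.0: 105 requests
// Anthropic/Voyage: 79 requests
```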

## API Reference

### `EmbeddingProvider` (Abstract Base Class)

```typescript
abstract class EmbeddingProvider {
  // Get maximum batch size
  abstract getMaxBatchSize(): number;

  // Get embedding dimension
  abstract getDimension(): number;

  // Embed single text
  async embedText(text: string): Promise<number[]>;

  // Embed multiple texts
  abstract embedTexts(texts: string[]): Promise<BatchEmbeddingResult>;
}
```
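
Because every provider shares this interface, a deterministic stand-in can be useful in unit tests that should not hit the network. The sketch below is self-contained: it redeclares the abstract class locally, and it assumes `BatchEmbeddingResult` has the shape `{ embeddings: number[][]; totalTokens: number }`, inferred from the fields used in the examples in this document (the real type may differ). `FakeEmbeddings` and the default `embedText` body are hypothetical.

```typescript
// Assumed result shape, inferred from the fields used in this document.
interface BatchEmbeddingResult {
  embeddings: number[][];
  totalTokens: number;
}

// Local copy of the abstract class shape, so the sketch runs on its own.
abstract class EmbeddingProvider {
  abstract getMaxBatchSize(): number;
  abstract getDimension(): number;
  abstract embedTexts(texts: string[]): Promise<BatchEmbeddingResult>;

  // Assumed default: embed a single text via the batch method.
  async embedText(text: string): Promise<number[]> {
    const result = await this.embedTexts([text]);
    return result.embeddings[0];
  }
}

// Hypothetical deterministic provider for tests: no network, no API key.
class FakeEmbeddings extends EmbeddingProvider {
  getMaxBatchSize(): number {
    return 16;
  }

  getDimension(): number {
    return 4;
  }

  async embedTexts(texts: string[]): Promise<BatchEmbeddingResult> {
    const embeddings = texts.map((text) => {
      // Fold character codes into a fixed-size vector; same text, same vector.
      const vector = new Array<number>(this.getDimension()).fill(0);
      for (let i = 0; i < text.length; i++) {
        vector[i % vector.length] += text.charCodeAt(i) / 255;
      }
      return vector;
    });
    // Crude token estimate: count whitespace-separated words.
    const totalTokens = texts.reduce(
      (count, t) => count + t.split(/\s+/).filter(Boolean).length,
      0
    );
    return { embeddings, totalTokens };
  }
}

const fake = new FakeEmbeddings();
fake.embedText('hello world').then((vector) => {
  console.log(vector.length); // 4
});
```

The vectors carry no semantic meaning; the point is only that the same text always maps to the same vector of the declared dimension.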

### `OpenAIEmbeddingsConfig`

```typescript
interface OpenAIEmbeddingsConfig {
  apiKey: string;
  model?: string; // Default: 'text-embedding-3-small'
  dimensions?: number; // Only for text-embedding-3-* models
  organization?: string;
  baseURL?: string;
  retryConfig?: Partial<RetryConfig>;
}
```

### `CohereEmbeddingsConfig`

```typescript
interface CohereEmbeddingsConfig {
  apiKey: string;
  model?: string; // Default: 'embed-english-v3.0'
  inputType?: 'search_document' | 'search_query' | 'classification' | 'clustering';
  truncate?: 'NONE' | 'START' | 'END';
  retryConfig?: Partial<RetryConfig>;
}
```

### `AnthropicEmbeddingsConfig`

```typescript
interface AnthropicEmbeddingsConfig {
  apiKey: string; // Voyage API key
  model?: string; // Default: 'voyage-2'
  inputType?: 'document' | 'query';
  retryConfig?: Partial<RetryConfig>;
}
```

### `HuggingFaceEmbeddingsConfig`

```typescript
interface HuggingFaceEmbeddingsConfig {
  model?: string; // Default: 'Xenova/all-MiniLM-L6-v2'
  device?: 'cpu' | 'cuda';
  normalize?: boolean; // Default: true
  batchSize?: number; // Default: 32
  retryConfig?: Partial<RetryConfig>;
}
```

### `embedAndInsert`

```typescript
async function embedAndInsert(
  db: VectorDB,
  provider: EmbeddingProvider,
  documents: DocumentToEmbed[],
  options?: {
    overwrite?: boolean;
    onProgress?: (current: number, total: number) => void;
  }
): Promise<string[]>;
```

### `embedAndSearch`

```typescript
async function embedAndSearch(
  db: VectorDB,
  provider: EmbeddingProvider,
  query: string,
  options?: {
    topK?: number;
    threshold?: number;
    filter?: Record<string, unknown>;
  }
): Promise<any[]>;
```

## Best Practices

1. **Choose the Right Provider**
   - OpenAI: Best general-purpose, flexible dimensions
   - Cohere: Optimized for search, separate document/query embeddings
   - Anthropic/Voyage: High quality, good for semantic search
   - HuggingFace: Privacy-focused, no API costs, offline support

2. **Batch Processing**
   - Let the library handle batching automatically
   - Use progress callbacks for large datasets
   - Consider memory usage for very large datasets

3. **Error Handling**
   - Configure retry logic for production environments
   - Handle rate limits gracefully
   - Log errors with context for debugging

4. **Performance**
   - Use custom dimensions (OpenAI) to reduce storage
   - Cache embeddings when possible
   - Consider local models for high-volume use cases

5. **Security**
   - Store API keys in environment variables
   - Never commit API keys to version control
   - Use key rotation for production systems
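
One simple way to act on the caching advice is to avoid embedding the same text twice within a batch. The sketch below deduplicates the input before it is sent to a provider and maps the results back afterwards. `dedupeTexts` and `expandEmbeddings` are hypothetical helpers written for illustration, and the embeddings are assumed to be plain `number[]` arrays.

```typescript
// Collapse duplicate texts so each distinct string is embedded only once.
// `indexOf[i]` is the position in `unique` of the i-th original text.
function dedupeTexts(texts: string[]): { unique: string[]; indexOf: number[] } {
  const seen = new Map<string, number>();
  const unique: string[] = [];
  const indexOf = texts.map((text) => {
    let idx = seen.get(text);
    if (idx === undefined) {
      idx = unique.length;
      seen.set(text, idx);
      unique.push(text);
    }
    return idx;
  });
  return { unique, indexOf };
}

// Expand the embeddings of the unique texts back to the original order.
function expandEmbeddings(uniqueEmbeddings: number[][], indexOf: number[]): number[][] {
  return indexOf.map((i) => uniqueEmbeddings[i]);
}

const deduped = dedupeTexts(['cat', 'dog', 'cat']);
console.log(deduped.unique); // ['cat', 'dog']
console.log(deduped.indexOf); // [0, 1, 0]

// Stand-in vectors for the two unique texts.
const expanded = expandEmbeddings([[1, 0], [0, 1]], deduped.indexOf);
console.log(expanded); // [[1, 0], [0, 1], [1, 0]]
```

The same `Map`-based idea extends to a longer-lived cache keyed by model name plus text, at the cost of keeping the vectors in memory.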

## Examples

See [src/examples/embeddings-example.ts](../src/examples/embeddings-example.ts) for comprehensive examples including:

- Basic usage for all providers
- Batch processing
- Error handling
- VectorDB integration
- Progress tracking
- Provider comparison

## Troubleshooting

### "Module not found" errors

Make sure you've installed the SDK for the provider you are using:

```bash
npm install openai                # For OpenAI
npm install cohere-ai             # For Cohere
npm install @anthropic-ai/sdk     # For Anthropic
npm install @xenova/transformers  # For HuggingFace
```

### Rate limit errors

Configure retry logic with longer delays:

```typescript
const provider = new OpenAIEmbeddings({
  apiKey: '...',
  retryConfig: {
    maxRetries: 5,
    initialDelay: 5000,
    maxDelay: 60000,
  },
});
```

### Dimension mismatches

Ensure the VectorDB dimension matches the provider's embedding dimension:

```typescript
const db = new VectorDB({
  dimension: provider.getDimension()
});
```
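
If vectors reach the database from more than one code path, a small guard can catch a mismatch early with a readable message. The helper below is a hypothetical sketch, not part of ruvector-extensions, and assumes embeddings are plain `number[]` arrays.

```typescript
// Throw a descriptive error when a vector does not have the expected length.
function assertDimension(vector: number[], expected: number): void {
  if (vector.length !== expected) {
    throw new Error(
      `Embedding has ${vector.length} dimensions, but the database expects ${expected}`
    );
  }
}

assertDimension([0.1, 0.2, 0.3], 3); // passes silently

try {
  assertDimension([0.1, 0.2], 3);
} catch (error) {
  console.error((error as Error).message);
  // Embedding has 2 dimensions, but the database expects 3
}
```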

## License

MIT © ruv.io Team

## Support

- GitHub Issues: https://github.com/ruvnet/ruvector/issues
- Documentation: https://github.com/ruvnet/ruvector
- Email: info@ruv.io