Squashed 'vendor/ruvector/' content from commit b64c2172

git-subtree-dir: vendor/ruvector
git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
ruv
2026-02-28 14:39:40 -05:00
commit d803bfe2b1
7854 changed files with 3522914 additions and 0 deletions

@@ -0,0 +1,239 @@
# bioRxiv and medRxiv Preprint API Clients
This module provides async clients for fetching preprints from **bioRxiv.org** (life sciences) and **medRxiv.org** (medical sciences), converting them to `SemanticVector` format for RuVector discovery.
## Features
- **Free API access** - No authentication required
- **Rate limiting** - Automatic 1 req/sec rate limiting (conservative)
- **Pagination support** - Handles large result sets automatically
- **Retry logic** - Built-in retry for transient failures
- **Domain separation** - bioRxiv → `Domain::Research`, medRxiv → `Domain::Medical`
- **Rich metadata** - DOI, authors, categories, publication status
## API Details
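- **Endpoint pattern**: `https://api.biorxiv.org/details/[server]/[interval]/[cursor]`
- **Servers**: `biorxiv` or `medrxiv`
- **Interval**: Start and end dates separated by a slash, e.g. `2024-01-01/2024-12-31`
- **Cursor**: Starting position used for cursor-based pagination
- **Response**: JSON object containing a `collection` array of preprint records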
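As a rough illustration of how the endpoint pattern above is filled in, the sketch below assembles a request URL from its parts. `details_url` is a hypothetical helper written for this README, not a function exported by the crate, and it assumes dates are already formatted as `YYYY-MM-DD`.

```rust
/// Builds a URL following the pattern
/// `https://api.biorxiv.org/details/[server]/[interval]/[cursor]`.
///
/// Hypothetical helper for illustration only; the clients may build
/// their URLs differently. `start` and `end` are `YYYY-MM-DD` strings
/// that together form the `[interval]` segment.
fn details_url(server: &str, start: &str, end: &str, cursor: usize) -> String {
    format!(
        "https://api.biorxiv.org/details/{}/{}/{}/{}",
        server, start, end, cursor
    )
}
```

Calling `details_url("medrxiv", "2024-01-01", "2024-01-07", 0)` would target the first page of medRxiv records for the first week of 2024.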
## BiorxivClient (Life Sciences)
### Methods
```rust
use chrono::NaiveDate;
use ruvector_data_framework::BiorxivClient;
let client = BiorxivClient::new();
// Get recent preprints (last N days)
let recent = client.search_recent(7, 100).await?;
// Search by date range
let start = NaiveDate::from_ymd_opt(2024, 1, 1).unwrap();
let end = NaiveDate::from_ymd_opt(2024, 12, 31).unwrap();
let papers = client.search_by_date_range(start, end, Some(200)).await?;
// Search by category
let neuro = client.search_by_category("neuroscience", 100).await?;
```
### Categories
- `neuroscience` - Neural systems and computation
- `genomics` - Genome sequencing and analysis
- `bioinformatics` - Computational biology
- `cancer-biology` - Oncology research
- `immunology` - Immune system studies
- `microbiology` - Microorganisms
- `molecular-biology` - Molecular mechanisms
- `cell-biology` - Cellular processes
- `biochemistry` - Chemical processes
- `evolutionary-biology` - Evolution and phylogenetics
- `ecology` - Ecosystems and populations
- `genetics` - Heredity and variation
- `developmental-biology` - Organism development
- `synthetic-biology` - Engineered biological systems
- `systems-biology` - System-level understanding
## MedrxivClient (Medical Sciences)
### Methods
```rust
use ruvector_data_framework::MedrxivClient;
let client = MedrxivClient::new();
// Get recent medical preprints
let recent = client.search_recent(7, 100).await?;
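// Search by date range
let start = chrono::NaiveDate::from_ymd_opt(2024, 1, 1).unwrap();
let end = chrono::NaiveDate::from_ymd_opt(2024, 12, 31).unwrap();
let papers = client.search_by_date_range(start, end, Some(200)).await?;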
// Search COVID-19 related papers
let covid = client.search_covid(100).await?;
// Search clinical research
let clinical = client.search_clinical(50).await?;
```
### Specialized Searches
- **COVID-19**: Filters for "covid", "sars-cov-2", "coronavirus", "pandemic" keywords
- **Clinical Research**: Filters for "clinical", "trial", "patient", "treatment", "therapy", "diagnosis"
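The keyword filters described above can be pictured as a case-insensitive substring match over a preprint's title and abstract. The sketch below is an assumption about how such a filter might look; `matches_any_keyword` and `COVID_KEYWORDS` are hypothetical names, and the real clients may match against different fields.

```rust
/// Keywords listed above for the COVID-19 search.
const COVID_KEYWORDS: [&str; 4] = ["covid", "sars-cov-2", "coronavirus", "pandemic"];

/// Returns `true` if the title or abstract contains any of the keywords,
/// ignoring case. Hypothetical helper, for illustration only.
fn matches_any_keyword(title: &str, abstract_text: &str, keywords: &[&str]) -> bool {
    // Lowercase once so every keyword comparison is case-insensitive.
    let haystack = format!("{} {}", title, abstract_text).to_lowercase();
    keywords
        .iter()
        .any(|kw| haystack.contains(kw.to_lowercase().as_str()))
}
```

The same function would cover the clinical search by passing a slice with the clinical keywords instead.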
## SemanticVector Output
Both clients convert preprints to `SemanticVector` with:
```rust
// Illustrative shape only; field values are placeholders, not valid Rust literals.
SemanticVector {
    id: "doi:10.1101/2024.01.01.123456",
    embedding: Vec<f32>,        // Generated from title + abstract
    domain: Domain::Research,   // or Domain::Medical for medRxiv
    timestamp: DateTime<Utc>,   // Preprint publication date
    metadata: {
        "doi": "10.1101/2024.01.01.123456",
        "title": "Paper title",
        "abstract": "Full abstract text",
        "authors": "John Doe; Jane Smith",
        "category": "Neuroscience",
        "server": "biorxiv",
        "published_status": "preprint", // or the journal name once published
        "corresponding_author": "John Doe",
        "institution": "MIT",
        "version": "1",
        "type": "new results",
        "source": "biorxiv"             // or "medrxiv"
    }
}
```
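Two of the fields shown above are packed strings: `id` carries a `doi:` prefix and `authors` joins names with semicolons. A minimal sketch of unpacking them follows; both helpers are hypothetical and rely only on the formats shown in the example, which may not hold for every record.

```rust
/// Splits a semicolon-separated author string such as
/// "John Doe; Jane Smith" into trimmed, non-empty names.
/// Hypothetical helper based on the format shown above.
fn split_authors(authors: &str) -> Vec<&str> {
    authors
        .split(';')
        .map(str::trim)
        .filter(|name| !name.is_empty())
        .collect()
}

/// Returns the bare DOI from an id of the form "doi:<DOI>",
/// or `None` if the prefix is missing.
fn doi_from_id(id: &str) -> Option<&str> {
    id.strip_prefix("doi:")
}
```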
## Example Usage
See `examples/biorxiv_discovery.rs` for a complete example:
```bash
cargo run --example biorxiv_discovery
```
## Rate Limiting
- **Default**: 1 request per second (conservative)
- **Configurable**: Modify the `BIORXIV_RATE_LIMIT_MS` constant if needed
- **Retry logic**: 3 retries with exponential backoff
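To make "3 retries with exponential backoff" concrete, here is a minimal synchronous sketch using only the standard library. It is not the crate's implementation: the clients are async, the names `backoff_delay` and `retry_with_backoff` are hypothetical, and the doubling schedule (base, 2x, 4x) is an assumption about what "exponential" means here.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base, 2*base, 4*base, ...
/// Assumes a doubling schedule; the crate's actual schedule may differ.
fn backoff_delay(base: Duration, attempt: u32) -> Duration {
    base * 2u32.pow(attempt)
}

/// Runs `op` once, then retries up to `max_retries` more times on failure,
/// sleeping with exponential backoff between attempts.
/// Returns the first success or the last error.
fn retry_with_backoff<T, E>(
    max_retries: u32,
    base: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of retries: surface the most recent error.
            Err(err) if attempt >= max_retries => return Err(err),
            Err(_) => {
                sleep(backoff_delay(base, attempt));
                attempt += 1;
            }
        }
    }
}
```

With `max_retries = 3`, an operation that always fails is attempted four times in total. An async variant would follow the same loop but await a timer instead of blocking the thread.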
## Pagination
Both clients handle pagination automatically:
- Fetches up to the specified `limit`
- Uses cursor-based pagination
- Safety limit of 10,000 records per query
- Handles empty result sets gracefully
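The pagination behavior listed above can be sketched as a loop that advances a cursor until the limit is reached or a page comes back empty. This is an illustration under assumptions, not the crate's code: `fetch_all` and `MAX_RECORDS` are hypothetical names, the cursor is assumed to be a record offset, and `fetch_page` stands in for the HTTP request.

```rust
/// Safety cap on records per query, matching the 10,000 figure above.
const MAX_RECORDS: usize = 10_000;

/// Collects records page by page until `limit` (capped at `MAX_RECORDS`)
/// is reached or a page comes back empty.
///
/// `fetch_page` receives the cursor, assumed here to be the offset of the
/// first record wanted, and returns that page's records.
fn fetch_all<T>(limit: usize, mut fetch_page: impl FnMut(usize) -> Vec<T>) -> Vec<T> {
    let limit = limit.min(MAX_RECORDS);
    let mut records = Vec::new();
    let mut cursor = 0;
    while records.len() < limit {
        let page = fetch_page(cursor);
        if page.is_empty() {
            break; // No more results; an empty set is not an error.
        }
        cursor += page.len();
        records.extend(page);
    }
    // The last page may overshoot the requested limit.
    records.truncate(limit);
    records
}
```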
## Integration with RuVector
Use the generated `SemanticVector`s with:
1. **Vector similarity search**: Find related preprints using an HNSW index
2. **Graph coherence analysis**: Detect emerging research trends
3. **Cross-domain discovery**: Find connections between life sciences and medical research
4. **Time-series analysis**: Track research evolution over time
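For intuition about what similarity search over embeddings computes, the sketch below does a brute-force cosine-similarity scan. It is deliberately not HNSW, which is an approximate index that avoids scanning every vector, and it does not use RuVector's API; `cosine_similarity` and `nearest` are hypothetical helpers operating on plain `f32` slices.

```rust
/// Cosine similarity between two vectors.
/// Returns 0.0 if the dimensions differ or either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Index of the candidate most similar to `query`, found by scanning
/// every candidate. Returns `None` when there are no candidates.
fn nearest(query: &[f32], candidates: &[Vec<f32>]) -> Option<usize> {
    candidates
        .iter()
        .enumerate()
        .map(|(i, c)| (i, cosine_similarity(query, c)))
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(i, _)| i)
}
```

A linear scan like this is only practical for small collections; an index such as HNSW trades exactness for speed on large ones.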
## Error Handling
The clients include comprehensive error handling:
- **Network errors**: Automatic retry with exponential backoff
- **Rate limiting**: Built-in delays between requests
- **Parsing errors**: Graceful handling of malformed responses
- **Empty results**: Returns empty vector instead of error
## Testing
Run the unit tests:
```bash
# Run unit tests (integration tests are ignored by default)
cargo test --lib biorxiv_client::tests
# Run integration tests (requires network access)
cargo test --lib biorxiv_client::tests -- --ignored
```
Unit tests cover:
- Client creation
- Embedding dimension configuration
- Record to vector conversion
- Date parsing
- Domain assignment
- Metadata extraction
Integration tests (ignored by default):
- Search recent papers
- Search by category
- COVID-19 search
- Clinical research search
## Dependencies
- `reqwest` - Async HTTP client
- `serde` / `serde_json` - JSON parsing
- `chrono` - Date/time handling
- `tokio` - Async runtime
- `urlencoding` - URL encoding for queries
- `SimpleEmbedder` - Text to vector embedding
## Custom Embedding Dimension
```rust
// Default 384 dimensions
let client = BiorxivClient::new();
// Custom dimension
let client = BiorxivClient::with_embedding_dim(512);
```
## Best Practices
1. **Respect rate limits**: The clients enforce conservative rate limiting
2. **Use date ranges**: For large datasets, query by date ranges
3. **Filter locally**: Use category filters for more specific searches
4. **Handle errors**: Network requests can fail, so handle errors explicitly
5. **Cache results**: Consider caching SemanticVectors for repeated use
6. **Batch processing**: Process results in batches for better performance
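As one way to read the batch-processing advice, the sketch below walks a slice of results in fixed-size chunks. `process_in_batches` is a hypothetical helper; the batch size and what happens inside `handle` (indexing, persisting, and so on) are left to the caller.

```rust
/// Calls `handle` on consecutive batches of at most `batch_size` items
/// and returns how many batches were processed.
/// A `batch_size` of 0 is treated as 1 to avoid a panic in `chunks`.
fn process_in_batches<T>(
    items: &[T],
    batch_size: usize,
    mut handle: impl FnMut(&[T]),
) -> usize {
    let mut batches = 0;
    for batch in items.chunks(batch_size.max(1)) {
        handle(batch);
        batches += 1;
    }
    batches
}
```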
## Publication Status
The `published_status` metadata field indicates:
- `"preprint"` - Not yet published in a journal
- Journal name - Accepted and published (e.g., "Nature Medicine")
This helps distinguish between preliminary and peer-reviewed research.
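A small sketch of using this field to separate preprints from published work follows. It assumes the metadata is available as a `HashMap<String, String>` and treats a missing field as "preprint"; both are assumptions for illustration, and `is_published` is a hypothetical helper.

```rust
use std::collections::HashMap;

/// Returns `true` when `published_status` holds something other than
/// "preprint", i.e. a journal name.
/// Assumption: a missing field is treated as an unpublished preprint.
fn is_published(metadata: &HashMap<String, String>) -> bool {
    metadata
        .get("published_status")
        .map_or(false, |status| status.as_str() != "preprint")
}
```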
## Cross-Domain Analysis
Combine bioRxiv and medRxiv for comprehensive analysis:
```rust
let biorxiv = BiorxivClient::new();
let medrxiv = MedrxivClient::new();
let bio_papers = biorxiv.search_recent(7, 100).await?;
let med_papers = medrxiv.search_recent(7, 100).await?;
let mut all_papers = bio_papers;
all_papers.extend(med_papers);
// Use RuVector's discovery engine to find cross-domain patterns
```
## Resources
- **bioRxiv**: https://www.biorxiv.org/
- **medRxiv**: https://www.medrxiv.org/
- **API Docs**: https://api.biorxiv.org/
- **RuVector**: https://github.com/ruvnet/ruvector