bioRxiv and medRxiv Preprint API Clients
This module provides async clients for fetching preprints from bioRxiv.org (life sciences) and medRxiv.org (medical sciences), converting them to SemanticVector format for RuVector discovery.
Features
- Free API access - No authentication required
- Rate limiting - Automatic 1 req/sec rate limiting (conservative)
- Pagination support - Handles large result sets automatically
- Retry logic - Built-in retry for transient failures
- Domain separation - bioRxiv → Domain::Research, medRxiv → Domain::Medical
- Rich metadata - DOI, authors, categories, publication status
API Details
- Base URL: https://api.biorxiv.org/details/[server]/[interval]/[cursor]
- Servers: biorxiv or medrxiv
- Interval: Date range like 2024-01-01/2024-12-31
- Response: JSON with collection array
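The URL pattern above can be assembled with plain string formatting. The following is only a sketch: the function name `details_url` is hypothetical and not part of the crate, and it assumes the cursor is a zero-based record offset.

```rust
/// Builds a details-endpoint URL following the pattern
/// https://api.biorxiv.org/details/[server]/[interval]/[cursor].
/// `server` is expected to be "biorxiv" or "medrxiv"; dates are YYYY-MM-DD strings.
/// Hypothetical helper for illustration, not the crate's own API.
fn details_url(server: &str, start: &str, end: &str, cursor: u32) -> String {
    format!(
        "https://api.biorxiv.org/details/{}/{}/{}/{}",
        server, start, end, cursor
    )
}

fn main() {
    // First page of medRxiv preprints for 2024 (cursor assumed to start at 0).
    let url = details_url("medrxiv", "2024-01-01", "2024-12-31", 0);
    println!("{}", url);
}
```

The interval is simply the start and end dates joined by a slash, so it occupies two path segments.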
BiorxivClient (Life Sciences)
Methods
use chrono::NaiveDate;
use ruvector_data_framework::BiorxivClient;
let client = BiorxivClient::new();
// Get recent preprints (last N days)
let recent = client.search_recent(7, 100).await?;
// Search by date range
let start = NaiveDate::from_ymd_opt(2024, 1, 1).unwrap();
let end = NaiveDate::from_ymd_opt(2024, 12, 31).unwrap();
let papers = client.search_by_date_range(start, end, Some(200)).await?;
// Search by category
let neuro = client.search_by_category("neuroscience", 100).await?;
Categories
- neuroscience - Neural systems and computation
- genomics - Genome sequencing and analysis
- bioinformatics - Computational biology
- cancer-biology - Oncology research
- immunology - Immune system studies
- microbiology - Microorganisms
- molecular-biology - Molecular mechanisms
- cell-biology - Cellular processes
- biochemistry - Chemical processes
- evolutionary-biology - Evolution and phylogenetics
- ecology - Ecosystems and populations
- genetics - Heredity and variation
- developmental-biology - Organism development
- synthetic-biology - Engineered biological systems
- systems-biology - System-level understanding
MedrxivClient (Medical Sciences)
Methods
use ruvector_data_framework::MedrxivClient;
let client = MedrxivClient::new();
// Get recent medical preprints
let recent = client.search_recent(7, 100).await?;
// Search by date range (start and end are chrono::NaiveDate values, built as in the bioRxiv example)
let papers = client.search_by_date_range(start, end, Some(200)).await?;
// Search COVID-19 related papers
let covid = client.search_covid(100).await?;
// Search clinical research
let clinical = client.search_clinical(50).await?;
Specialized Searches
- COVID-19: Filters for "covid", "sars-cov-2", "coronavirus", "pandemic" keywords
- Clinical Research: Filters for "clinical", "trial", "patient", "treatment", "therapy", "diagnosis"
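A keyword filter of this kind might look like the sketch below. The source lists only the keywords, so the matching strategy (case-insensitive substring search over the title and abstract text) is an assumption, and the names `COVID_KEYWORDS` and `matches_any_keyword` are hypothetical.

```rust
/// Keywords listed above for the COVID-19 search.
const COVID_KEYWORDS: [&str; 4] = ["covid", "sars-cov-2", "coronavirus", "pandemic"];

/// Returns true if `text` contains any keyword, ignoring case.
/// Assumption: keywords are lowercase and matched as plain substrings.
fn matches_any_keyword(text: &str, keywords: &[&str]) -> bool {
    let lowered = text.to_lowercase();
    keywords.iter().any(|kw| lowered.contains(*kw))
}

fn main() {
    let title = "Long-term outcomes in a SARS-CoV-2 cohort";
    println!("covid-related: {}", matches_any_keyword(title, &COVID_KEYWORDS));
}
```

Substring matching is deliberately broad: "covid" also matches "COVID-19" and "post-covid", at the cost of occasional false positives.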
SemanticVector Output
Both clients convert preprints to SemanticVector with:
SemanticVector {
id: "doi:10.1101/2024.01.01.123456",
embedding: Vec<f32>, // Generated from title + abstract
domain: Domain::Research, // or Domain::Medical for medRxiv
timestamp: DateTime<Utc>, // Preprint publication date
metadata: {
"doi": "10.1101/2024.01.01.123456",
"title": "Paper title",
"abstract": "Full abstract text",
"authors": "John Doe; Jane Smith",
"category": "Neuroscience",
"server": "biorxiv",
"published_status": "preprint", // or the journal name
"corresponding_author": "John Doe",
"institution": "MIT",
"version": "1",
"type": "new results",
"source": "biorxiv" // or "medrxiv"
}
}
Example Usage
See examples/biorxiv_discovery.rs for a complete example:
cargo run --example biorxiv_discovery
Rate Limiting
- Default: 1 request per second (conservative)
- Configurable: Modify BIORXIV_RATE_LIMIT_MS constant if needed
- Retry logic: 3 retries with exponential backoff
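Retrying with exponential backoff can be sketched as follows. This is not the crate's implementation: the helpers `backoff_delay` and `retry_with_backoff` are hypothetical, the delay is assumed to double on each attempt from a caller-supplied base, and the sketch blocks with `std::thread::sleep` for simplicity, whereas the real clients are async.

```rust
use std::time::Duration;

/// Number of retries after the initial attempt (3, as stated above).
const MAX_RETRIES: u32 = 3;

/// Delay before retry number `attempt` (0-based): base, 2*base, 4*base, ...
/// Assumption: the delay doubles on every attempt.
fn backoff_delay(attempt: u32, base: Duration) -> Duration {
    base * 2u32.pow(attempt)
}

/// Calls `op` until it succeeds or MAX_RETRIES retries have been used,
/// sleeping with exponential backoff between attempts.
fn retry_with_backoff<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    base: Duration,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of retries: surface the last error to the caller.
            Err(e) if attempt >= MAX_RETRIES => return Err(e),
            Err(_) => {
                std::thread::sleep(backoff_delay(attempt, base));
                attempt += 1;
            }
        }
    }
}

fn main() {
    let mut calls = 0;
    // Simulated request that fails twice before succeeding.
    let result: Result<&str, &str> = retry_with_backoff(
        || {
            calls += 1;
            if calls < 3 { Err("transient failure") } else { Ok("response body") }
        },
        Duration::from_millis(10),
    );
    println!("{:?} after {} calls", result, calls);
}
```

With 3 retries the operation runs at most four times in total: the initial attempt plus three retries.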
Pagination
Both clients handle pagination automatically:
- Fetches up to the specified limit
- Uses cursor-based pagination
- Safety limit of 10,000 records per query
- Handles empty result sets gracefully
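The pagination behaviour listed above can be illustrated with a small loop over a mock page source. Everything here is a sketch under stated assumptions: the `Page` struct, the `fetch_all` function, and the idea that each response reports a total record count are hypothetical stand-ins, and records are plain strings rather than parsed preprints.

```rust
/// One page of results plus the total count reported for the query.
/// Hypothetical shape; the real response type is not shown in this document.
struct Page {
    records: Vec<String>,
    total: usize,
}

/// Safety cap on records per query, as stated above.
const SAFETY_LIMIT: usize = 10_000;

/// Collects up to `limit` records by advancing a cursor page by page.
/// `fetch_page` receives the current cursor (a record offset).
fn fetch_all(limit: usize, mut fetch_page: impl FnMut(usize) -> Page) -> Vec<String> {
    let cap = limit.min(SAFETY_LIMIT);
    let mut out = Vec::new();
    let mut cursor = 0;
    while out.len() < cap {
        let Page { records, total } = fetch_page(cursor);
        if records.is_empty() {
            break; // empty result set or no further pages
        }
        cursor += records.len();
        out.extend(records);
        if cursor >= total {
            break; // every record for this query has been seen
        }
    }
    out.truncate(cap);
    out
}

fn main() {
    // Mock server holding 250 records, served 100 per page (page size is an assumption).
    let total = 250;
    let mock = |cursor: usize| Page {
        records: (cursor..(cursor + 100).min(total))
            .map(|i| format!("record-{}", i))
            .collect(),
        total,
    };
    let records = fetch_all(120, mock);
    println!("fetched {} records", records.len());
}
```

An empty first page returns an empty vector rather than an error, which matches the behaviour described for empty result sets.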
Integration with RuVector
Use the generated SemanticVectors with:
- Vector similarity search: Find related preprints using HNSW index
- Graph coherence analysis: Detect emerging research trends
- Cross-domain discovery: Find connections between life sciences and medical research
- Time-series analysis: Track research evolution over time
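As a rough illustration of what similarity search compares, the sketch below computes cosine similarity between two embedding vectors. It is a brute-force, standard-library-only example; it does not use RuVector's HNSW index, and the function name `cosine_similarity` is an assumption rather than a crate API.

```rust
/// Cosine similarity between two embeddings of equal length.
/// Returns None for mismatched lengths, empty input, or zero-length vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

fn main() {
    // Toy 3-dimensional embeddings; real ones default to 384 dimensions.
    let bio = [0.2_f32, 0.7, 0.1];
    let med = [0.25_f32, 0.6, 0.05];
    println!("similarity: {:?}", cosine_similarity(&bio, &med));
}
```

An approximate index such as HNSW avoids comparing a query against every stored vector, which is what makes search practical at scale.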
Error Handling
The clients handle the following cases:
- Network errors: Automatic retry with exponential backoff
- Rate limiting: Built-in delays between requests
- Parsing errors: Graceful handling of malformed responses
- Empty results: Returns empty vector instead of error
Testing
Run the unit tests:
# Run all tests (excluding integration tests)
cargo test --lib biorxiv_client::tests
# Run integration tests (requires network access)
cargo test --lib biorxiv_client::tests -- --ignored
Unit tests cover:
- Client creation
- Embedding dimension configuration
- Record to vector conversion
- Date parsing
- Domain assignment
- Metadata extraction
Integration tests (ignored by default):
- Search recent papers
- Search by category
- COVID-19 search
- Clinical research search
Dependencies
- reqwest - Async HTTP client
- serde / serde_json - JSON parsing
- chrono - Date/time handling
- tokio - Async runtime
- urlencoding - URL encoding for queries
- SimpleEmbedder - Text to vector embedding
Custom Embedding Dimension
// Default 384 dimensions
let client = BiorxivClient::new();
// Custom dimension
let client = BiorxivClient::with_embedding_dim(512);
Best Practices
- Respect rate limits: The clients enforce conservative rate limiting
- Use date ranges: For large datasets, query by date ranges
- Filter locally: Use category filters for more specific searches
- Handle errors: Network requests can fail, use proper error handling
- Cache results: Consider caching SemanticVectors for repeated use
- Batch processing: Process results in batches for better performance
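The batch-processing advice can be sketched with the standard library's slice chunking. The helper name `process_in_batches` and the batch size used in `main` are illustrative assumptions; the items stand in for fetched SemanticVectors.

```rust
/// Passes `items` to `handle` in slices of at most `batch_size` elements
/// and returns the number of batches processed.
fn process_in_batches<T>(
    items: &[T],
    batch_size: usize,
    mut handle: impl FnMut(&[T]),
) -> usize {
    let mut batches = 0;
    // `chunks` panics on a size of 0, so clamp to at least 1.
    for batch in items.chunks(batch_size.max(1)) {
        handle(batch);
        batches += 1;
    }
    batches
}

fn main() {
    // Stand-in for a list of fetched vectors.
    let ids: Vec<u32> = (0..250).collect();
    let batches = process_in_batches(&ids, 100, |batch| {
        println!("processing {} items", batch.len());
    });
    println!("{} batches", batches);
}
```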
Publication Status
The published_status metadata field indicates:
- "preprint" - Not yet published in a journal
- Journal name - Accepted and published (e.g., "Nature Medicine")
This helps distinguish between preliminary and peer-reviewed research.
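A check on this field might look like the sketch below. It assumes the metadata is available as a string-to-string map, which this document does not confirm, and it treats a missing field as unpublished; the function name `is_published` is hypothetical.

```rust
use std::collections::HashMap;

/// Returns true when `published_status` holds a journal name rather than "preprint".
/// Assumption: a missing field is treated as not yet published.
fn is_published(metadata: &HashMap<String, String>) -> bool {
    match metadata.get("published_status") {
        Some(status) => status.as_str() != "preprint",
        None => false,
    }
}

fn main() {
    let mut metadata = HashMap::new();
    metadata.insert("published_status".to_string(), "Nature Medicine".to_string());
    println!("published: {}", is_published(&metadata));
}
```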
Cross-Domain Analysis
Combine bioRxiv and medRxiv for comprehensive analysis:
let biorxiv = BiorxivClient::new();
let medrxiv = MedrxivClient::new();
let bio_papers = biorxiv.search_recent(7, 100).await?;
let med_papers = medrxiv.search_recent(7, 100).await?;
let mut all_papers = bio_papers;
all_papers.extend(med_papers);
// Use RuVector's discovery engine to find cross-domain patterns
Resources
- bioRxiv: https://www.biorxiv.org/
- medRxiv: https://www.medrxiv.org/
- API Docs: https://api.biorxiv.org/
- RuVector: https://github.com/ruvnet/ruvector