git-subtree-dir: vendor/ruvector git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
447 lines
11 KiB
Markdown
447 lines
11 KiB
Markdown
# Genomics and DNA Data API Clients
|
|
|
|
Comprehensive genomics data integration for RuVector's discovery framework, enabling cross-domain pattern detection between genomics, climate, medical, and economic data.
|
|
|
|
## Overview
|
|
|
|
The genomics clients module (`genomics_clients.rs`) provides four specialized API clients for accessing the world's largest genomics databases:
|
|
|
|
1. **NcbiClient** - NCBI Entrez APIs (genes, proteins, nucleotides, SNPs)
|
|
2. **UniProtClient** - UniProt protein knowledge base
|
|
3. **EnsemblClient** - Ensembl genomic annotations
|
|
4. **GwasClient** - GWAS Catalog (genome-wide association studies)
|
|
|
|
All data is automatically converted to `SemanticVector` format with `Domain::Genomics` for seamless integration with RuVector's vector database and coherence analysis.
|
|
|
|
## Features
|
|
|
|
- ✅ **Rate limiting** with exponential backoff (NCBI: 3 req/s without key, 10 req/s with key)
|
|
- ✅ **Retry logic** with configurable attempts
|
|
- ✅ **NCBI API key support** for higher rate limits
|
|
- ✅ **Automatic embedding generation** using SimpleEmbedder (384 dimensions)
|
|
- ✅ **Semantic vector conversion** with rich metadata
|
|
- ✅ **Cross-domain discovery** enabled (Genomics ↔ Climate, Medical, Economic)
|
|
- ✅ **Unit tests** for all clients
|
|
|
|
## Installation
|
|
|
|
The genomics clients are included in the `ruvector-data-framework` crate:
|
|
|
|
```toml
|
|
[dependencies]
|
|
ruvector-data-framework = "0.1.0"
|
|
```
|
|
|
|
## Quick Start
|
|
|
|
```rust
|
|
use ruvector_data_framework::{
|
|
NcbiClient, UniProtClient, EnsemblClient, GwasClient,
|
|
NativeDiscoveryEngine, NativeEngineConfig,
|
|
};
|
|
|
|
#[tokio::main]
|
|
async fn main() -> Result<(), Box<dyn std::error::Error>> {
|
|
// Initialize discovery engine
|
|
let mut engine = NativeDiscoveryEngine::new(NativeEngineConfig::default());
|
|
|
|
// 1. Search for genes related to climate adaptation
|
|
let ncbi = NcbiClient::new(None)?;
|
|
let heat_shock_genes = ncbi.search_genes("heat shock protein", Some("human")).await?;
|
|
|
|
for gene in heat_shock_genes {
|
|
engine.add_vector(gene);
|
|
}
|
|
|
|
// 2. Search for disease-associated proteins
|
|
let uniprot = UniProtClient::new()?;
|
|
let apoe_proteins = uniprot.search_proteins("APOE", 10).await?;
|
|
|
|
for protein in apoe_proteins {
|
|
engine.add_vector(protein);
|
|
}
|
|
|
|
// 3. Get genetic variants
|
|
let ensembl = EnsemblClient::new()?;
|
|
if let Some(gene) = ensembl.get_gene_info("ENSG00000157764").await? {
|
|
engine.add_vector(gene);
|
|
let variants = ensembl.get_variants("ENSG00000157764").await?;
|
|
for variant in variants {
|
|
engine.add_vector(variant);
|
|
}
|
|
}
|
|
|
|
// 4. Search GWAS for disease associations
|
|
let gwas = GwasClient::new()?;
|
|
let diabetes_assocs = gwas.search_associations("diabetes").await?;
|
|
|
|
for assoc in diabetes_assocs {
|
|
engine.add_vector(assoc);
|
|
}
|
|
|
|
// Detect cross-domain patterns
|
|
let patterns = engine.detect_patterns();
|
|
println!("Discovered {} patterns", patterns.len());
|
|
|
|
Ok(())
|
|
}
|
|
```
|
|
|
|
## API Clients
|
|
|
|
### 1. NcbiClient - NCBI Entrez APIs
|
|
|
|
Access genes, proteins, nucleotides, and SNPs from NCBI databases.
|
|
|
|
#### Initialization
|
|
|
|
```rust
|
|
// Without API key (3 requests/second)
|
|
let client = NcbiClient::new(None)?;
|
|
|
|
// With API key (10 requests/second) - recommended
|
|
let client = NcbiClient::new(Some("YOUR_API_KEY".to_string()))?;
|
|
```
|
|
|
|
Get your API key at: https://www.ncbi.nlm.nih.gov/account/
|
|
|
|
#### Methods
|
|
|
|
```rust
|
|
// Search gene database
|
|
let genes = client.search_genes("BRCA1", Some("human")).await?;
|
|
|
|
// Get specific gene by ID
|
|
let gene = client.get_gene("672").await?;
|
|
|
|
// Search proteins
|
|
let proteins = client.search_proteins("kinase").await?;
|
|
|
|
// Search nucleotide sequences
|
|
let sequences = client.search_nucleotide("mitochondrial genome").await?;
|
|
|
|
// Get SNP information by rsID
|
|
let snp = client.get_snp("rs429358").await?; // APOE4 variant
|
|
```
|
|
|
|
#### Vector Format
|
|
|
|
```rust
|
|
SemanticVector {
|
|
id: "GENE:672",
|
|
domain: Domain::Genomics,
|
|
embedding: [384-dimensional vector],
|
|
metadata: {
|
|
"gene_id": "672",
|
|
"symbol": "BRCA1",
|
|
"description": "BRCA1 DNA repair associated",
|
|
"organism": "Homo sapiens",
|
|
"common_name": "human",
|
|
"chromosome": "17",
|
|
"location": "17q21.31",
|
|
"source": "ncbi_gene"
|
|
}
|
|
}
|
|
```
|
|
|
|
### 2. UniProtClient - Protein Database
|
|
|
|
Access comprehensive protein information including function, structure, and pathways.
|
|
|
|
#### Initialization
|
|
|
|
```rust
|
|
let client = UniProtClient::new()?;
|
|
```
|
|
|
|
#### Methods
|
|
|
|
```rust
|
|
// Search proteins
|
|
let proteins = client.search_proteins("p53", 100).await?;
|
|
|
|
// Get protein by accession
|
|
let protein = client.get_protein("P04637").await?; // TP53
|
|
|
|
// Search by organism
|
|
let human_proteins = client.search_by_organism("human").await?;
|
|
|
|
// Search by function (GO term)
|
|
let kinases = client.search_by_function("kinase").await?;
|
|
```
|
|
|
|
#### Vector Format
|
|
|
|
```rust
|
|
SemanticVector {
|
|
id: "UNIPROT:P04637",
|
|
domain: Domain::Genomics,
|
|
embedding: [384-dimensional vector],
|
|
metadata: {
|
|
"accession": "P04637",
|
|
"protein_name": "Cellular tumor antigen p53",
|
|
"organism": "Homo sapiens",
|
|
"genes": "TP53",
|
|
"function": "Acts as a tumor suppressor...",
|
|
"source": "uniprot"
|
|
}
|
|
}
|
|
```
|
|
|
|
### 3. EnsemblClient - Genomic Annotations
|
|
|
|
Access gene information, variants, and homology across species.
|
|
|
|
#### Initialization
|
|
|
|
```rust
|
|
let client = EnsemblClient::new()?;
|
|
```
|
|
|
|
#### Methods
|
|
|
|
```rust
|
|
// Get gene information
|
|
let gene = client.get_gene_info("ENSG00000157764").await?; // BRAF
|
|
|
|
// Get genetic variants for a gene
|
|
let variants = client.get_variants("ENSG00000157764").await?;
|
|
|
|
// Get homologous genes across species
|
|
let homologs = client.get_homologs("ENSG00000157764").await?;
|
|
```
|
|
|
|
#### Vector Format
|
|
|
|
```rust
|
|
SemanticVector {
|
|
id: "ENSEMBL:ENSG00000157764",
|
|
domain: Domain::Genomics,
|
|
embedding: [384-dimensional vector],
|
|
metadata: {
|
|
"ensembl_id": "ENSG00000157764",
|
|
"symbol": "BRAF",
|
|
"description": "B-Raf proto-oncogene, serine/threonine kinase",
|
|
"species": "homo_sapiens",
|
|
"biotype": "protein_coding",
|
|
"chromosome": "7",
|
|
"start": "140719327",
|
|
"end": "140924929",
|
|
"source": "ensembl"
|
|
}
|
|
}
|
|
```
|
|
|
|
### 4. GwasClient - GWAS Catalog
|
|
|
|
Access genome-wide association studies linking genes to diseases and traits.
|
|
|
|
#### Initialization
|
|
|
|
```rust
|
|
let client = GwasClient::new()?;
|
|
```
|
|
|
|
#### Methods
|
|
|
|
```rust
|
|
// Search trait-gene associations
|
|
let associations = client.search_associations("diabetes").await?;
|
|
|
|
// Get study details
|
|
let study = client.get_study("GCST001937").await?;
|
|
|
|
// Search associations by gene
|
|
let gene_assocs = client.search_by_gene("APOE").await?;
|
|
```
|
|
|
|
#### Vector Format
|
|
|
|
```rust
|
|
SemanticVector {
|
|
id: "GWAS:7_140753336_5.0e-8",
|
|
domain: Domain::Genomics,
|
|
embedding: [384-dimensional vector],
|
|
metadata: {
|
|
"trait": "Type 2 diabetes",
|
|
"genes": "BRAF, KIAA1549",
|
|
"risk_allele": "rs7578597-T",
|
|
"pvalue": "5.0e-8",
|
|
"chromosome": "7",
|
|
"position": "140753336",
|
|
"source": "gwas_catalog"
|
|
}
|
|
}
|
|
```
|
|
|
|
## Rate Limits
|
|
|
|
| API | Default Rate | With API Key | Notes |
|
|
|-----|-------------|--------------|-------|
|
|
| NCBI | 3 req/sec | 10 req/sec | API key recommended for production |
|
|
| UniProt | 10 req/sec | - | Conservative limit |
|
|
| Ensembl | 15 req/sec | - | Per their guidelines |
|
|
| GWAS | 10 req/sec | - | Conservative limit |
|
|
|
|
All clients implement:
|
|
- Automatic rate limiting with delays
|
|
- Exponential backoff on 429 errors
|
|
- Configurable retry attempts (default: 3)
|
|
|
|
## Cross-Domain Discovery Examples
|
|
|
|
### 1. Climate ↔ Genomics
|
|
|
|
Discover how environmental factors correlate with gene expression:
|
|
|
|
```rust
|
|
// Fetch heat shock proteins (climate stress response)
|
|
let hsp_genes = ncbi.search_genes("heat shock protein", Some("human")).await?;
|
|
|
|
// Fetch temperature data from NOAA
|
|
let climate_data = noaa_client.fetch_temperature_data("2020-01-01", "2024-01-01").await?;
|
|
|
|
// Add to discovery engine
|
|
for gene in hsp_genes {
|
|
engine.add_vector(gene);
|
|
}
|
|
for record in climate_data {
|
|
engine.add_vector(record);
|
|
}
|
|
|
|
// Detect cross-domain patterns
|
|
let patterns = engine.detect_patterns();
|
|
// May discover: "Heat shock protein expression correlates with extreme temperature events"
|
|
```
|
|
|
|
### 2. Medical ↔ Genomics
|
|
|
|
Link genetic variants to disease outcomes:
|
|
|
|
```rust
|
|
// Get APOE4 variant (Alzheimer's risk)
|
|
let apoe4 = ncbi.get_snp("rs429358").await?;
|
|
|
|
// Search PubMed for Alzheimer's research
|
|
let papers = pubmed.search_articles("Alzheimer's disease APOE", 100).await?;
|
|
|
|
// Detect gene-disease associations
|
|
let patterns = engine.detect_patterns();
|
|
```
|
|
|
|
### 3. Economic ↔ Genomics
|
|
|
|
Correlate biotech market trends with genomic research:
|
|
|
|
```rust
|
|
// Fetch CRISPR-related genes
|
|
let crispr_genes = ncbi.search_genes("CRISPR", None).await?;
|
|
|
|
// Fetch biotech stock data
|
|
let biotech_stocks = alpha_vantage.fetch_stock("CRSP", "monthly").await?;
|
|
|
|
// Discover market-science correlations
|
|
let patterns = engine.detect_patterns();
|
|
```
|
|
|
|
## Error Handling
|
|
|
|
All clients return `Result<T, FrameworkError>`:
|
|
|
|
```rust
|
|
match ncbi.search_genes("BRCA1", Some("human")).await {
|
|
Ok(genes) => {
|
|
println!("Found {} genes", genes.len());
|
|
for gene in genes {
|
|
engine.add_vector(gene);
|
|
}
|
|
}
|
|
Err(FrameworkError::Network(e)) => {
|
|
eprintln!("Network error: {}", e);
|
|
}
|
|
Err(FrameworkError::Serialization(e)) => {
|
|
eprintln!("JSON parsing error: {}", e);
|
|
}
|
|
Err(e) => {
|
|
eprintln!("Other error: {}", e);
|
|
}
|
|
}
|
|
```
|
|
|
|
## Testing
|
|
|
|
Run the unit tests:
|
|
|
|
```bash
|
|
cargo test --lib genomics
|
|
```
|
|
|
|
Run the example:
|
|
|
|
```bash
|
|
cargo run --example genomics_discovery
|
|
```
|
|
|
|
## Performance Tips
|
|
|
|
1. **Use NCBI API key** for production workloads (10x rate limit)
|
|
2. **Batch operations** when possible (e.g., fetch 200 genes at once)
|
|
3. **Cache results** to avoid redundant API calls
|
|
4. **Use async/await** for concurrent requests across different APIs
|
|
|
|
```rust
|
|
// Concurrent fetching
|
|
let (genes, proteins, variants) = tokio::join!(
|
|
ncbi.search_genes("BRCA1", Some("human")),
|
|
uniprot.search_proteins("BRCA1", 10),
|
|
ensembl.get_variants("ENSG00000012048")
|
|
);
|
|
```
|
|
|
|
## Real-World Use Cases
|
|
|
|
### 1. Pharmacogenomics
|
|
|
|
Discover drug-gene interactions:
|
|
- Fetch CYP450 genes from NCBI
|
|
- Get protein structures from UniProt
|
|
- Find drug adverse events from FDA
|
|
- Detect patterns linking gene variants to drug response
|
|
|
|
### 2. Climate Adaptation Research
|
|
|
|
Study genetic adaptation to climate change:
|
|
- Fetch stress response genes (heat shock, cold tolerance)
|
|
- Get climate data (temperature, precipitation)
|
|
- Find GWAS associations for environmental traits
|
|
- Discover gene-environment correlations
|
|
|
|
### 3. Disease Risk Assessment
|
|
|
|
Build genetic risk profiles:
|
|
- Get disease-associated SNPs from GWAS
|
|
- Fetch gene function from UniProt
|
|
- Find variants from Ensembl
|
|
- Compute polygenic risk scores
|
|
|
|
## Contributing
|
|
|
|
When adding new genomics data sources:
|
|
|
|
1. Follow the existing client pattern (rate limiting, retry logic)
|
|
2. Convert to `SemanticVector` with `Domain::Genomics`
|
|
3. Include rich metadata for discovery
|
|
4. Add unit tests
|
|
5. Update this documentation
|
|
|
|
## References
|
|
|
|
- [NCBI Entrez API](https://www.ncbi.nlm.nih.gov/books/NBK25501/)
|
|
- [UniProt REST API](https://www.uniprot.org/help/api)
|
|
- [Ensembl REST API](https://rest.ensembl.org/)
|
|
- [GWAS Catalog API](https://www.ebi.ac.uk/gwas/rest/docs/api)
|
|
|
|
## License
|
|
|
|
Part of the RuVector project. See root LICENSE file.
|