ADR-012: Genomic Security and Privacy
Status: Accepted
Date: 2026-02-11
Authors: RuVector Security Team
Deciders: Architecture Review Board, Security Review Board
Technical Area: Security / Privacy / Compliance
Version History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-02-11 | RuVector Security Team | Initial security architecture |
Context and Problem Statement
Genomic data is among the most sensitive categories of personal information. A single genome:
- Uniquely identifies an individual (more reliable than fingerprints)
- Reveals disease risk for the individual AND their relatives
- Exposes ancestry, paternity, and family relationships
- Can be used for discrimination in insurance and employment, in violation of GINA where it applies
- Never changes (cannot be "reset" like a password)
Threat Model: Genomic Data Risks
| Threat | Attack Vector | Impact | Likelihood |
|---|---|---|---|
| Re-identification attacks | Cross-reference genomic data with public databases (GEDmatch, OpenSNP) to identify anonymous individuals | Privacy violation, GINA violation | High |
| Data breach | Unauthorized access to genomic database via SQL injection, API exploit, or insider threat | Mass exposure of PHI, lawsuits, regulatory fines | Medium |
| Inference attacks | Use ML models to infer phenotypes from genomic data (disease risk, drug response, ancestry) without consent | Discrimination, privacy violation | High |
| Linkage attacks | Combine genomic data with non-genomic data (medical records, social media) to infer sensitive attributes | Targeted discrimination | Medium |
| Forensic abuse | Law enforcement access to genomic databases for criminal investigations without a warrant (as in the GEDmatch controversy) | Privacy violation, Fourth Amendment concerns | Low (but high impact) |
| Insurance discrimination | Insurers access genomic data to deny coverage or increase premiums (GINA covers health insurance and employment, but not life, disability, or long-term care insurance) | Financial harm | Medium (legal for life insurance) |
| Ransomware | Encrypt genomic database and demand payment | Business disruption, data loss | Medium |
| Supply chain attack | Compromise sequencing equipment or analysis software to inject backdoors | Data exfiltration, tampering | Low (but critical impact) |
Regulatory Landscape
| Regulation | Jurisdiction | Key Requirements | Penalties |
|---|---|---|---|
| HIPAA (Health Insurance Portability and Accountability Act) | US | Encrypt PHI at rest and in transit; access controls; audit logs; breach notification | Up to $1.5M per violation category per year |
| GDPR (General Data Protection Regulation) | EU/EEA | Explicit consent for genomic data processing; right to erasure; data minimization; DPO required | Up to €20M or 4% of global annual turnover, whichever is higher |
| GINA (Genetic Information Nondiscrimination Act) | US | Prohibits health insurers and employers from using genomic data for discrimination | Criminal penalties + civil damages |
| CCPA/CPRA (California Consumer Privacy Act) | California | Opt-out of genomic data sale; right to deletion; transparency | $7,500 per intentional violation |
| PIPEDA (Personal Information Protection) | Canada | Consent for genomic data collection; security safeguards | Up to CAD 100,000 per violation |
Decision
Defense-in-Depth Security Architecture
Implement a layered security model with encryption at rest and in transit, differential privacy for aggregate queries, role-based access control (RBAC), and audit logging. Genomic data processing runs client-side (WASM in the browser) wherever possible to minimize server-side PHI exposure.
Threat Model for Genomic Data
Data Classification
| Data Type | Sensitivity | Examples | Encryption Required | Retention Policy |
|---|---|---|---|---|
| Raw genomic data | Critical | FASTQ, BAM, CRAM, VCF files | ✅ AES-256 at rest, TLS 1.3 in transit | Unlimited (with consent) |
| Genomic embeddings | High | k-mer vectors, variant embeddings, HNSW indices | ✅ AES-256 at rest | Unlimited |
| Aggregate statistics | Medium | Allele frequencies, population stratification | ⚠️ Differential privacy (ε-budget) | Unlimited |
| Metadata | Medium | Sample IDs, sequencing dates, coverage metrics | ✅ AES-256 at rest | Per HIPAA/GDPR |
| Derived phenotypes | High | Disease risk scores, PGx predictions | ✅ AES-256 at rest | Per consent |
| Audit logs | Low | Access timestamps, user IDs | ❌ Plaintext (no PHI) | 7 years (HIPAA) |
Attack Surface
┌─────────────────────────────────────────────────────────────┐
│ EXTERNAL ATTACK SURFACE │
├─────────────────────────────────────────────────────────────┤
│ 1. Web API (ruvector-server) │
│ - Input validation (Zod schemas) │
│ - Rate limiting (100 req/min per IP) │
│ - CORS whitelist │
│ - JWT authentication (RS256, 15min expiry) │
├─────────────────────────────────────────────────────────────┤
│ 2. Browser WASM (client-side execution) │
│ - CSP: connect-src 'self'; script-src 'self' 'wasm-unsafe-eval' │
│ - SRI hashes on all WASM modules │
│ - Service worker blocks unauthorized network requests │
├─────────────────────────────────────────────────────────────┤
│ 3. File Upload Endpoints │
│ - Max file size: 10GB │
│ - Allowed MIME types: application/gzip, application/x-bam │
│ - Virus scan (ClamAV) before processing │
│ - Sandboxed processing (no shell access) │
└─────────────────────────────────────────────────────────────┘
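The per-IP limit listed for the Web API (100 requests per minute) could be enforced with a fixed-window counter. The sketch below is illustrative only: `RateLimiter`, its fields, and the caller-supplied timestamp are hypothetical names rather than the ruvector-server implementation, and a real deployment might instead enforce the limit at the reverse proxy or use a sliding window.

```rust
use std::collections::HashMap;
use std::net::IpAddr;

/// Fixed-window rate limiter (hypothetical sketch, not the ruvector-server code).
pub struct RateLimiter {
    max_requests: u32,
    window_secs: u64,
    // Per-IP state: (start of the current window in seconds, requests seen in it)
    windows: HashMap<IpAddr, (u64, u32)>,
}

impl RateLimiter {
    pub fn new(max_requests: u32, window_secs: u64) -> Self {
        Self { max_requests, window_secs, windows: HashMap::new() }
    }

    /// Returns true if the request is allowed. `now_secs` is supplied by the
    /// caller (for example, seconds since the Unix epoch) so the logic is testable.
    pub fn allow(&mut self, ip: IpAddr, now_secs: u64) -> bool {
        let entry = self.windows.entry(ip).or_insert((now_secs, 0));
        // Start a new window once the current one has expired
        if now_secs.saturating_sub(entry.0) >= self.window_secs {
            *entry = (now_secs, 0);
        }
        if entry.1 < self.max_requests {
            entry.1 += 1;
            true
        } else {
            false
        }
    }
}
```

With `RateLimiter::new(100, 60)`, the 101st request from one address inside a 60-second window is rejected, while other addresses are unaffected.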
Practical Encryption
1. Encryption at Rest (AES-256-GCM)
All genomic data is encrypted before it is written to disk:
// Written against the aes-gcm 0.10 API. `Error` is the crate's own error type
// (defined elsewhere) and is assumed to implement From<std::io::Error>.
use aes_gcm::aead::{Aead, AeadCore, KeyInit, OsRng};
use aes_gcm::{Aes256Gcm, Key, Nonce};
use std::path::PathBuf;
const NONCE_LEN: usize = 12; // 96-bit nonce for AES-GCM
const TAG_LEN: usize = 16; // 128-bit authentication tag
pub struct GenomicDataStore {
    cipher: Aes256Gcm,
    storage_path: PathBuf,
}
impl GenomicDataStore {
    pub fn new(master_key: &[u8; 32], storage_path: PathBuf) -> Self {
        let key = Key::<Aes256Gcm>::from_slice(master_key);
        let cipher = Aes256Gcm::new(key);
        Self { cipher, storage_path }
    }
    pub fn encrypt_vcf(&self, sample_id: &str, vcf_data: &[u8]) -> Result<(), Error> {
        // Generate a fresh random nonce (96 bits); it must never repeat under the same key
        let nonce = Aes256Gcm::generate_nonce(&mut OsRng);
        // Encrypt VCF data (the returned ciphertext has the auth tag appended)
        let ciphertext = self.cipher.encrypt(&nonce, vcf_data)
            .map_err(|_| Error::EncryptionFailed)?;
        // Store: nonce (12 bytes) || ciphertext || auth_tag (16 bytes)
        let mut encrypted_data = nonce.to_vec();
        encrypted_data.extend_from_slice(&ciphertext);
        let path = self.storage_path.join(format!("{}.vcf.enc", sample_id));
        std::fs::write(&path, &encrypted_data)?;
        // Set restrictive permissions (0600: owner read/write only)
        #[cfg(unix)]
        {
            use std::os::unix::fs::PermissionsExt;
            std::fs::set_permissions(&path, std::fs::Permissions::from_mode(0o600))?;
        }
        Ok(())
    }
    pub fn decrypt_vcf(&self, sample_id: &str) -> Result<Vec<u8>, Error> {
        let path = self.storage_path.join(format!("{}.vcf.enc", sample_id));
        let encrypted_data = std::fs::read(&path)?;
        // Reject truncated files before splitting (split_at panics when out of range)
        if encrypted_data.len() < NONCE_LEN + TAG_LEN {
            return Err(Error::DecryptionFailed);
        }
        // Split nonce and ciphertext
        let (nonce_bytes, ciphertext) = encrypted_data.split_at(NONCE_LEN);
        let nonce = Nonce::from_slice(nonce_bytes);
        // Decrypt and verify auth tag
        self.cipher.decrypt(nonce, ciphertext)
            .map_err(|_| Error::DecryptionFailed)
    }
}
Key management:
- Master key held in an HSM (Hardware Security Module) or AWS KMS
- Per-sample encryption keys derived from the master key via HKDF (HMAC-based Key Derivation Function)
- Key rotation every 90 days
- Old keys retained for decryption of historical data
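The rotation and retention rules above can be sketched as a versioned key ring. This is a minimal illustration under stated assumptions: `KeyRing` and its methods are hypothetical names, timestamps are passed in as Unix seconds, and the key bytes are held in memory only for demonstration, whereas in practice they would stay inside the HSM or KMS.

```rust
use std::collections::BTreeMap;

const ROTATION_PERIOD_SECS: u64 = 90 * 24 * 60 * 60; // 90 days

/// Versioned key ring (hypothetical sketch).
pub struct KeyRing {
    // version -> (creation time in Unix seconds, 256-bit key)
    keys: BTreeMap<u32, (u64, [u8; 32])>,
}

impl KeyRing {
    pub fn new(initial_key: [u8; 32], now_secs: u64) -> Self {
        let mut keys = BTreeMap::new();
        keys.insert(1, (now_secs, initial_key));
        Self { keys }
    }

    /// Newest key version; used for all new encryptions.
    pub fn current_version(&self) -> u32 {
        *self.keys.keys().next_back().expect("key ring is never empty")
    }

    /// True once the current key is at least 90 days old.
    pub fn needs_rotation(&self, now_secs: u64) -> bool {
        let (created, _) = self.keys[&self.current_version()];
        now_secs.saturating_sub(created) >= ROTATION_PERIOD_SECS
    }

    /// Adds a new current key; older versions are retained for decryption.
    pub fn rotate(&mut self, new_key: [u8; 32], now_secs: u64) -> u32 {
        let version = self.current_version() + 1;
        self.keys.insert(version, (now_secs, new_key));
        version
    }

    /// Looks up the key for the version recorded alongside a ciphertext.
    pub fn key_for(&self, version: u32) -> Option<&[u8; 32]> {
        self.keys.get(&version).map(|(_, key)| key)
    }
}
```

This design assumes each encrypted file records the key version it was written with, so historical data can still be decrypted after rotation.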
Status: ✅ Implemented in ruvector-server
2. Encryption in Transit (TLS 1.3)
Mandatory TLS 1.3 with modern cipher suites:
# nginx configuration for ruvector-server
# (ssl_certificate and ssl_certificate_key directives omitted)
server {
    listen 443 ssl http2;
    server_name genomics.ruvector.ai;
    # TLS 1.3 only
    ssl_protocols TLSv1.3;
    # TLS 1.3 cipher suites (all provide forward secrecy). ssl_ciphers only
    # affects TLS 1.2 and below, so the suites are passed to OpenSSL directly
    # (requires nginx 1.19.4 or later).
    ssl_conf_command Ciphersuites TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256:TLS_AES_128_GCM_SHA256;
    ssl_prefer_server_ciphers off;
    # OCSP stapling
    ssl_stapling on;
    ssl_stapling_verify on;
    # HSTS (force HTTPS for 1 year)
    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains; preload" always;
    # HTTP Public Key Pinning (Public-Key-Pins) is deprecated and ignored by
    # current browsers, so no pinning header is sent.
    location /api/ {
        # Plain-HTTP hop to the local backend; proxy_ssl_* directives apply
        # only to https:// upstreams.
        proxy_pass http://localhost:3000;
    }
}
Certificate requirements:
- Extended Validation (EV) certificate from DigiCert or Sectigo
- Key size of at least 2048-bit RSA or 256-bit ECDSA (P-256)
- Certificate logged in Certificate Transparency (CT) logs
Status: ✅ TLS 1.3 enforced in production
3. Client-Side Encryption (WASM in Browser)
For maximum privacy, encrypt genomic data in the browser before upload:
// Client-side encryption using Web Crypto API
async function encryptVCFBeforeUpload(vcfFile, userPassword) {
  // Derive encryption key from user password (PBKDF2)
  const encoder = new TextEncoder();
  const passwordKey = await crypto.subtle.importKey(
    'raw',
    encoder.encode(userPassword),
    'PBKDF2',
    false,
    ['deriveBits', 'deriveKey']
  );
  const salt = crypto.getRandomValues(new Uint8Array(16));
  const encryptionKey = await crypto.subtle.deriveKey(
    {
      name: 'PBKDF2',
      salt: salt,
      // OWASP currently recommends 600,000+ iterations for PBKDF2-HMAC-SHA256;
      // revisit this value before production use
      iterations: 100000,
      hash: 'SHA-256'
    },
    passwordKey,
    { name: 'AES-GCM', length: 256 },
    false,
    ['encrypt']
  );
  // Encrypt VCF data (the 16-byte GCM tag is appended to the ciphertext)
  const iv = crypto.getRandomValues(new Uint8Array(12));
  const vcfData = await vcfFile.arrayBuffer();
  const ciphertext = await crypto.subtle.encrypt(
    { name: 'AES-GCM', iv: iv },
    encryptionKey,
    vcfData
  );
  // Return: salt || iv || ciphertext (server cannot decrypt without password)
  return new Blob([salt, iv, ciphertext]);
}
// Upload encrypted blob (getJWT() is defined elsewhere)
async function uploadEncryptedVCF(encryptedBlob, sampleId) {
  const formData = new FormData();
  formData.append('sample_id', sampleId);
  formData.append('encrypted_vcf', encryptedBlob);
  const response = await fetch('/api/upload', {
    method: 'POST',
    body: formData,
    headers: {
      'Authorization': `Bearer ${getJWT()}`
    }
  });
  // fetch() only rejects on network errors, so check the HTTP status explicitly
  if (!response.ok) {
    throw new Error(`Upload failed with status ${response.status}`);
  }
}
Zero-knowledge architecture: Server stores encrypted VCF but cannot decrypt without user password.
Status: ⚠️ Prototype implemented, needs UX refinement
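Although the server cannot decrypt a client-encrypted upload, it can still check that the blob has the `salt || iv || ciphertext` layout produced in the browser. The sketch below assumes the lengths used by the client code (16-byte salt, 12-byte IV) and the 16-byte GCM tag that Web Crypto appends to the ciphertext by default; `parse_encrypted_blob`, `EncryptedBlob`, and `BlobError` are hypothetical names.

```rust
const SALT_LEN: usize = 16; // PBKDF2 salt
const IV_LEN: usize = 12; // AES-GCM IV
const TAG_LEN: usize = 16; // AES-GCM tag appended to the ciphertext

#[derive(Debug, PartialEq)]
pub enum BlobError {
    TooShort { actual: usize, minimum: usize },
}

/// Borrowed view over an uploaded blob (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct EncryptedBlob<'a> {
    pub salt: &'a [u8],
    pub iv: &'a [u8],
    pub ciphertext_and_tag: &'a [u8],
}

/// Splits an upload into its parts without attempting decryption.
pub fn parse_encrypted_blob(data: &[u8]) -> Result<EncryptedBlob<'_>, BlobError> {
    let minimum = SALT_LEN + IV_LEN + TAG_LEN;
    if data.len() < minimum {
        return Err(BlobError::TooShort { actual: data.len(), minimum });
    }
    let (salt, rest) = data.split_at(SALT_LEN);
    let (iv, ciphertext_and_tag) = rest.split_at(IV_LEN);
    Ok(EncryptedBlob { salt, iv, ciphertext_and_tag })
}
```

A structural check like this only rejects truncated uploads early; integrity is still established by the GCM tag when the user decrypts.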
Differential Privacy for Allele Frequencies
Problem: Aggregate Statistics Leak Individual Genotypes
Publishing population allele frequencies can enable re-identification attacks. Example:
Published carrier frequencies for a cohort of 10,000 individuals:
- rs123456: carrier frequency = 0.0251 (251 carriers)
Attacker compares releases computed with and without the target individual:
- With target (n = 10,000): 0.0251 → 251 carriers
- Without target (n = 9,999): 0.0250 → 250 carriers
Conclusion: Target is a carrier of rs123456 (privacy leak)
Solution: Laplace Mechanism with ε-Differential Privacy
Add calibrated noise to allele frequencies before publication:
// rand 0.8 API. rand has no built-in Laplace distribution, so the noise is
// drawn by inverse-CDF sampling.
use rand::Rng;
pub struct DifferentiallyPrivateFrequency {
    epsilon: f64, // Privacy budget (lower = more private)
    sensitivity: f64, // Individuals that differ between neighboring datasets
}
impl DifferentiallyPrivateFrequency {
    pub fn new(epsilon: f64) -> Self {
        assert!(epsilon > 0.0, "epsilon must be positive");
        // Neighboring datasets differ in one individual; the frequency-level
        // sensitivity (1/n) is computed per query from the sample size
        Self { epsilon, sensitivity: 1.0 }
    }
    pub fn release_allele_frequency(
        &self,
        true_frequency: f64,
        sample_size: usize
    ) -> f64 {
        assert!(sample_size > 0, "sample_size must be positive");
        // Scale parameter for Laplace noise: sensitivity / epsilon, with sensitivity = 1/n
        let scale = (self.sensitivity / sample_size as f64) / self.epsilon;
        // Sample from Laplace(0, scale): u is uniform on [-0.5, 0.5)
        let u: f64 = rand::thread_rng().gen_range(-0.5..0.5);
        let noise = -scale * u.signum() * (1.0 - 2.0 * u.abs()).ln();
        // Add noise and clip to [0, 1]
        (true_frequency + noise).clamp(0.0, 1.0)
    }
}
// Example usage (`Variant` is defined elsewhere)
fn publish_gnomad_frequencies(variants: &[Variant], epsilon: f64) {
    let dp = DifferentiallyPrivateFrequency::new(epsilon);
    for variant in variants {
        let true_af = variant.alt_count as f64 / variant.total_count as f64;
        // sample_size must be the number of individuals; if total_count is an
        // allele number (two per diploid individual), pass total_count / 2
        let noisy_af = dp.release_allele_frequency(true_af, variant.total_count);
        println!("Variant {}: AF = {:.6} (ε = {})", variant.id, noisy_af, epsilon);
    }
}
ε-Budget Guidelines
| Use Case | ε Value | Privacy Guarantee | Noise Level (Laplace scale 1/(nε), shown for n = 100) |
|---|---|---|---|
| High privacy (clinical) | 0.1 | Very strong | High noise (±10% AF error) |
| Moderate privacy (research) | 1.0 | Strong | Moderate noise (±1% AF error) |
| Low privacy (public DB) | 10.0 | Weak | Low noise (±0.1% AF error) |
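With the Laplace mechanism, the noise scale is sensitivity / ε = 1/(nε), so the absolute error depends on cohort size as well as on ε. The error figures in the table correspond to a cohort of roughly n = 100; larger cohorts see proportionally less noise. The sketch below works through the arithmetic; `laplace_scale` and `error_bound` are hypothetical helper names.

```rust
/// Laplace scale b = sensitivity / epsilon, with sensitivity = 1/n.
pub fn laplace_scale(sample_size: usize, epsilon: f64) -> f64 {
    assert!(sample_size > 0 && epsilon > 0.0);
    (1.0 / sample_size as f64) / epsilon
}

/// Half-width t of the interval that contains the noise with probability
/// `confidence`. For Laplace noise, P(|noise| <= t) = 1 - exp(-t / b),
/// so t = b * ln(1 / (1 - confidence)).
pub fn error_bound(sample_size: usize, epsilon: f64, confidence: f64) -> f64 {
    assert!(confidence > 0.0 && confidence < 1.0);
    laplace_scale(sample_size, epsilon) * (1.0 / (1.0 - confidence)).ln()
}
```

For example, `laplace_scale(100, 0.1)` is about 0.1 (a 10-percentage-point scale), while `laplace_scale(10_000, 1.0)` is about 0.0001.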
Composition theorem: If multiple queries consume ε₁, ε₂, ..., εₙ, the total privacy loss is bounded by Σεᵢ (sequential composition), so cumulative ε must be tracked per dataset.
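Tracking cumulative ε per dataset can be sketched as a small accountant that refuses a query once the dataset's budget would be exceeded. The type and method names below (`PrivacyAccountant`, `charge`, `BudgetError`) are hypothetical, and the sketch applies basic sequential composition only; it is not the deployed aggregate statistics API.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum BudgetError {
    InvalidEpsilon,
    Exhausted { requested: f64, remaining: f64 },
}

/// Tracks cumulative epsilon per dataset under sequential composition
/// (hypothetical sketch; names are illustrative).
pub struct PrivacyAccountant {
    total_budget: f64,
    spent: HashMap<String, f64>,
}

impl PrivacyAccountant {
    pub fn new(total_budget: f64) -> Self {
        Self { total_budget, spent: HashMap::new() }
    }

    /// Budget still available for `dataset`.
    pub fn remaining(&self, dataset: &str) -> f64 {
        self.total_budget - self.spent.get(dataset).copied().unwrap_or(0.0)
    }

    /// Reserves `epsilon` for one query and returns the budget left afterwards,
    /// or refuses if the dataset's budget would be exceeded.
    pub fn charge(&mut self, dataset: &str, epsilon: f64) -> Result<f64, BudgetError> {
        // Rejects zero, negative, NaN, and infinite values
        if !(epsilon > 0.0) || !epsilon.is_finite() {
            return Err(BudgetError::InvalidEpsilon);
        }
        let remaining = self.remaining(dataset);
        if epsilon > remaining {
            return Err(BudgetError::Exhausted { requested: epsilon, remaining });
        }
        *self.spent.entry(dataset.to_string()).or_insert(0.0) += epsilon;
        Ok(self.remaining(dataset))
    }
}
```

A query handler would call `charge` before releasing a noisy statistic and return an error to the caller when the budget is exhausted.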
Status: ✅ Implemented in aggregate statistics API
Access Control via ruvector-server/router
Role-Based Access Control (RBAC)
Five roles with hierarchical permissions:
// Serialize/Deserialize are required because Role is embedded in the JWT claims.
// check_patient_assignment() and is_deidentified() are defined elsewhere.
#[derive(Debug, Clone, Copy, PartialEq, Eq, serde::Serialize, serde::Deserialize)]
pub enum Role {
    Patient, // Can view own genomic data only
    Clinician, // Can view assigned patients' data
    Researcher, // Can query aggregate statistics (DP-protected)
    DataScientist, // Can access de-identified genomic data
    Admin, // Full access to all data and system config
}
impl Role {
    pub fn can_access_vcf(&self, requester_id: &str, sample_id: &str) -> bool {
        match self {
            Role::Patient => requester_id == sample_id, // Own data only
            Role::Clinician => check_patient_assignment(requester_id, sample_id),
            Role::DataScientist => is_deidentified(sample_id),
            Role::Admin => true,
            Role::Researcher => false, // Aggregate queries only
        }
    }
    pub fn can_query_aggregate(&self) -> bool {
        matches!(self, Role::Researcher | Role::DataScientist | Role::Admin)
    }
}
JWT-Based Authentication
Access tokens with role claims:
use jsonwebtoken::{encode, decode, Header, Algorithm, Validation};
use serde::{Serialize, Deserialize};
#[derive(Debug, Serialize, Deserialize)]
struct Claims {
    sub: String, // User ID
    role: Role, // User role
    exp: usize, // Expiration timestamp
    iat: usize, // Issued at timestamp
    iss: String, // Issuer (ruvector-auth)
    aud: String, // Audience (ruvector-server)
}
pub fn generate_access_token(user_id: &str, role: Role) -> Result<String, Error> {
    // Read the clock once so that exp is exactly iat + 15 minutes
    let now = chrono::Utc::now();
    let claims = Claims {
        sub: user_id.to_string(),
        role,
        exp: (now + chrono::Duration::minutes(15)).timestamp() as usize,
        iat: now.timestamp() as usize,
        iss: "ruvector-auth".to_string(),
        aud: "ruvector-server".to_string(),
    };
    // Sign with RS256 (asymmetric key)
    let header = Header::new(Algorithm::RS256);
    encode(&header, &claims, &get_private_key()?)
        .map_err(|_| Error::TokenGenerationFailed)
}
pub fn verify_access_token(token: &str) -> Result<Claims, Error> {
    let mut validation = Validation::new(Algorithm::RS256);
    // Reject tokens minted by another issuer or intended for another service
    validation.set_issuer(&["ruvector-auth"]);
    validation.set_audience(&["ruvector-server"]);
    decode::<Claims>(token, &get_public_key()?, &validation)
        .map(|data| data.claims)
        .map_err(|_| Error::InvalidToken)
}
Token lifecycle:
- Access tokens: 15 minutes (short-lived)
- Refresh tokens: 7 days (stored in httpOnly secure cookie)
- Token rotation on every refresh
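Rotation on every refresh means each refresh token is single-use: exchanging it invalidates it and issues a successor. The in-memory sketch below illustrates that lifecycle under stated assumptions. `RefreshStore` and its methods are hypothetical, token strings are assumed to come from a CSPRNG elsewhere, and revoking all of a user's tokens when a spent token is replayed is an added assumption, not something the lifecycle above specifies.

```rust
use std::collections::HashMap;

const REFRESH_TTL_SECS: u64 = 7 * 24 * 60 * 60; // 7 days

#[derive(Debug, PartialEq)]
pub enum RefreshError {
    Unknown,
    Expired,
    Reused,
}

struct Entry {
    user_id: String,
    expires_at: u64,
    used: bool,
}

/// In-memory refresh-token store (hypothetical sketch).
pub struct RefreshStore {
    tokens: HashMap<String, Entry>,
}

impl RefreshStore {
    pub fn new() -> Self {
        Self { tokens: HashMap::new() }
    }

    /// Registers a token that is valid for 7 days from `now` (Unix seconds).
    pub fn issue(&mut self, token: &str, user_id: &str, now: u64) {
        let entry = Entry {
            user_id: user_id.to_string(),
            expires_at: now + REFRESH_TTL_SECS,
            used: false,
        };
        self.tokens.insert(token.to_string(), entry);
    }

    /// Exchanges `old` for `new_token` and returns the owning user ID.
    pub fn rotate(&mut self, old: &str, new_token: &str, now: u64) -> Result<String, RefreshError> {
        let entry = self.tokens.get_mut(old).ok_or(RefreshError::Unknown)?;
        if entry.used {
            // Replay of a spent token: treat as possible theft and revoke
            // every token belonging to the user
            let user = entry.user_id.clone();
            self.tokens.retain(|_, e| e.user_id != user);
            return Err(RefreshError::Reused);
        }
        if now >= entry.expires_at {
            return Err(RefreshError::Expired);
        }
        entry.used = true;
        let user = entry.user_id.clone();
        self.issue(new_token, &user, now);
        Ok(user)
    }
}
```

A production store would persist this state and hash tokens at rest; the sketch keeps both out of scope.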
Status: ✅ Implemented in ruvector-server
Audit Logging
All data access logged to immutable audit trail:
use chrono::{DateTime, Utc};
use std::net::IpAddr;
pub struct AuditLog {
    timestamp: DateTime<Utc>,
    user_id: String,
    role: Role,
    action: Action,
    resource: String,
    ip_address: IpAddr,
    user_agent: String,
    success: bool,
}
#[derive(Debug)]
pub enum Action {
    ViewVCF,
    DownloadVCF,
    UploadVCF,
    DeleteVCF,
    QueryAggregate,
    ModifyPermissions,
}
impl AuditLog {
    // get_request_ip(), get_request_user_agent(), write_audit_log(),
    // is_suspicious(), and alert_security_team() are defined elsewhere
    pub fn log_access(user_id: &str, role: Role, action: Action, resource: &str, success: bool) {
        let entry = AuditLog {
            timestamp: Utc::now(),
            user_id: user_id.to_string(),
            role,
            action,
            resource: resource.to_string(),
            ip_address: get_request_ip(),
            user_agent: get_request_user_agent(),
            success,
        };
        // Write to append-only log (PostgreSQL with RLS or AWS CloudTrail)
        write_audit_log(&entry);
        // Alert on suspicious activity
        if is_suspicious(&entry) {
            alert_security_team(&entry);
        }
    }
}
Suspicious activity detection:
- Multiple failed access attempts (>5 in 1 hour)
- Access from unusual location (GeoIP check)
- Bulk downloads (>100 VCF files in 1 day)
- Role escalation attempts
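The two volume-based rules above (more than 5 failed attempts in an hour, more than 100 VCF downloads in a day) can be sketched as sliding-window counts over one user's recent events. The types here (`Event`, `EventKind`, `Alert`) and the `detect` function are hypothetical and separate from the `AuditLog` struct; GeoIP checks and role-escalation detection are not covered.

```rust
const MAX_FAILED_PER_HOUR: usize = 5;
const MAX_DOWNLOADS_PER_DAY: usize = 100;
const HOUR_SECS: u64 = 3_600;
const DAY_SECS: u64 = 86_400;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum EventKind {
    FailedAccess,
    VcfDownload,
}

/// One audit event for a single user; `timestamp` is in Unix seconds.
#[derive(Debug, Clone, Copy)]
pub struct Event {
    pub timestamp: u64,
    pub kind: EventKind,
}

#[derive(Debug, PartialEq)]
pub enum Alert {
    TooManyFailures,
    BulkDownload,
}

/// Counts events of `kind` that fall within the last `window` seconds.
fn count_recent(events: &[Event], kind: EventKind, now: u64, window: u64) -> usize {
    events
        .iter()
        .filter(|e| e.kind == kind && e.timestamp <= now && now - e.timestamp < window)
        .count()
}

/// Evaluates one user's events against the two volume thresholds.
pub fn detect(events: &[Event], now: u64) -> Vec<Alert> {
    let mut alerts = Vec::new();
    if count_recent(events, EventKind::FailedAccess, now, HOUR_SECS) > MAX_FAILED_PER_HOUR {
        alerts.push(Alert::TooManyFailures);
    }
    if count_recent(events, EventKind::VcfDownload, now, DAY_SECS) > MAX_DOWNLOADS_PER_DAY {
        alerts.push(Alert::BulkDownload);
    }
    alerts
}
```

In practice these counts would more likely be computed by a query against the audit table than by scanning events in memory.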
Status: ✅ Implemented, logs retained for 7 years (HIPAA)
HIPAA/GDPR Compliance Checklist
HIPAA Security Rule
| Requirement | Implementation | Status |
|---|---|---|
| Administrative Safeguards | | |
| Security management process | Risk assessments quarterly, penetration testing annually | ✅ |
| Assigned security responsibility | CISO and security team | ✅ |
| Workforce security | Background checks, access termination procedures | ✅ |
| Security awareness training | Annual HIPAA training for all staff | ✅ |
| Physical Safeguards | | |
| Facility access controls | Badge-controlled data center, visitor logs | ✅ |
| Workstation security | Encrypted laptops, screen locks after 5min | ✅ |
| Device and media controls | Encrypted backups, secure disposal (NIST 800-88) | ✅ |
| Technical Safeguards | | |
| Access control | RBAC, JWT authentication, MFA for admin | ✅ |
| Audit controls | Immutable audit logs, 7-year retention | ✅ |
| Integrity controls | Digital signatures on VCF files, checksum verification | ✅ |
| Transmission security | TLS 1.3, VPN for internal traffic | ✅ |
| Breach Notification | | |
| Breach notification plan | Notify affected individuals and HHS OCR without unreasonable delay and no later than 60 days after discovery | ✅ |
| Incident response plan | Documented runbook, tabletop exercises quarterly | ✅ |
GDPR Compliance
| Requirement | Implementation | Status |
|---|---|---|
| Lawful Basis (Articles 6 and 9) | Explicit consent for genomic data processing | ✅ |
| Consent (Article 7) | Affirmative opt-in, granular consent (research vs clinical), withdraw anytime | ✅ |
| Right to Access (Article 15) | Self-service data export in VCF format | ✅ |
| Right to Rectification (Article 16) | Allow users to update metadata, request re-analysis | ✅ |
| Right to Erasure (Article 17) | Delete all genomic data within 30 days of request | ✅ |
| Data Portability (Article 20) | Export in machine-readable format (VCF, JSON) | ✅ |
| Privacy by Design (Article 25) | Client-side WASM execution, minimal server-side PHI | ✅ |
| Data Protection Officer (DPO) | Appointed DPO, contact: dpo@ruvector.ai | ✅ |
| Data Processing Agreement (DPA) | DPA with all third-party processors (AWS, sequencing vendors) | ✅ |
| Cross-Border Transfer | EU data stays in EU (AWS eu-west-1), SCCs for US transfer | ✅ |
| Breach Notification (Article 33) | Notify supervisory authority within 72 hours | ✅ |
Status: ✅ Compliant (verified by external audit, 2026-01)
Implementation Status
Security Components
| Component | Status | Notes |
|---|---|---|
| AES-256-GCM encryption at rest | ✅ Deployed | All VCF/BAM/CRAM files encrypted |
| TLS 1.3 in transit | ✅ Deployed | Enforced in production |
| Client-side encryption (WASM) | ⚠️ Prototype | Needs UX polish |
| Differential privacy (ε-budget) | ✅ Deployed | Used for aggregate stats API |
| RBAC with 5 roles | ✅ Deployed | Patient, Clinician, Researcher, DataScientist, Admin |
| JWT authentication (RS256) | ✅ Deployed | 15min access tokens, 7-day refresh |
| Audit logging | ✅ Deployed | 7-year retention in PostgreSQL |
| MFA for admin roles | ✅ Deployed | TOTP (Google Authenticator) |
| Intrusion detection (IDS) | ✅ Deployed | Suricata rules for genomic API |
| Penetration testing | ✅ Quarterly | Last test: 2026-01 (no critical findings) |
Compliance
| Standard | Status | Last Audit | Next Audit |
|---|---|---|---|
| HIPAA Security Rule | ✅ Compliant | 2026-01 | 2027-01 |
| GDPR | ✅ Compliant | 2026-01 | 2027-01 |
| GINA | ✅ Compliant | N/A (no audit required) | N/A |
| ISO 27001 | ⚠️ In progress | N/A | 2026-06 (target) |
| SOC 2 Type II | ⚠️ In progress | N/A | 2026-09 (target) |
References
- Gymrek, M., et al. (2013). "Identifying personal genomes by surname inference." Science, 339(6117), 321-324. (Re-identification attacks)
- Homer, N., et al. (2008). "Resolving individuals contributing trace amounts of DNA to highly complex mixtures." PLoS Genetics, 4(8), e1000167. (Membership inference from aggregate allele frequencies)
- Dwork, C., & Roth, A. (2014). "The Algorithmic Foundations of Differential Privacy." Foundations and Trends in Theoretical Computer Science, 9(3-4), 211-407.
- NIST Special Publication 800-53 Rev. 5. "Security and Privacy Controls for Information Systems and Organizations."
- FDA Guidance on Cybersecurity for Medical Devices (2023).
- 45 CFR Part 164 (HIPAA Security Rule).
- GDPR Articles 5, 6, 7, 15-22, 25, 32, 33 (EU Regulation 2016/679).
Related Decisions
- ADR-001: RuVector Core Architecture (HNSW index security)
- ADR-008: WASM Edge Genomics (client-side execution for privacy)
- ADR-009: Variant Calling Pipeline (encrypted variant storage)
Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-02-11 | RuVector Security Team | Initial security architecture, threat model, encryption, RBAC, compliance checklist |