Squashed 'vendor/ruvector/' content from commit b64c2172

git-subtree-dir: vendor/ruvector git-subtree-split: b64c21726f2bb37286d9ee36a7869fef60cc6900
2026-02-28 14:39:40 -05:00
commit d803bfe2b1
7854 changed files with 3522914 additions and 0 deletions
--- a/examples/exo-ai-2025/research/10-thermodynamic-learning/physics_foundations.md
+++ b/examples/exo-ai-2025/research/10-thermodynamic-learning/physics_foundations.md
@@ -0,0 +1,688 @@
+# Physics Foundations of Thermodynamic Learning
+## Mathematical Foundations and Physical Principles
+
+---
+
+## Table of Contents
+1. [Statistical Mechanics Primer](#1-statistical-mechanics-primer)
+2. [Information Theory and Physics](#2-information-theory-and-physics)
+3. [Landauer's Principle: Detailed Derivation](#3-landauers-principle-detailed-derivation)
+4. [Non-Equilibrium Thermodynamics](#4-non-equilibrium-thermodynamics)
+5. [Stochastic Thermodynamics](#5-stochastic-thermodynamics)
+6. [Free Energy and Variational Inference](#6-free-energy-and-variational-inference)
+7. [Energy-Based Models: Physical Interpretation](#7-energy-based-models-physical-interpretation)
+8. [Thermodynamic Bounds on Computation](#8-thermodynamic-bounds-on-computation)
+
+---
+
+## 1. Statistical Mechanics Primer
+
+### 1.1 Microcanonical Ensemble
+
+For an isolated system with energy E:
+```
+Ω(E) = number of microstates with energy E
+S = k ln Ω(E)  (Boltzmann entropy)
+```
+
+**Physical Meaning**: Entropy measures the logarithm of accessible microstates.
+
+### 1.2 Canonical Ensemble
+
+For a system in thermal contact with reservoir at temperature T:
+```
+p(E_i) = (1/Z) exp(-E_i / kT)
+Z = Σ_i exp(-E_i / kT)  (partition function)
+```
+
+**Thermodynamic quantities**:
+```
+Free Energy:  F = -kT ln Z = ⟨E⟩ - TS
+Entropy:      S = -k Σ_i p_i ln p_i = -k⟨ln p⟩
+Average E:    ⟨E⟩ = Σ_i p_i E_i
+Heat Capacity: C = d⟨E⟩/dT
+```
+
+### 1.3 Boltzmann Distribution
+
+The probability of state with energy E at temperature T:
+```
+p(E) ∝ exp(-E/kT) = exp(-βE)
+```
+
+where β = 1/(kT) is the **inverse temperature** (coldness).
+
+**Key Insight**: Physical systems naturally sample from probability distributions weighted by exp(-energy).
+
+### 1.4 Fluctuation-Dissipation Theorem
+
+Thermal fluctuations and dissipation are related:
+```
+⟨δx(t) δx(0)⟩ = (kT/γ) exp(-γt/m)
+```
+
+**Implication**: Cannot have low-noise system without dissipation. Thermal noise is fundamental at temperature T.
+
+---
+
+## 2. Information Theory and Physics
+
+### 2.1 Shannon Entropy
+
+For discrete probability distribution p(x):
+```
+H[p] = -Σ_x p(x) log₂ p(x)  (bits)
+     = -k Σ_x p(x) ln p(x)   (thermodynamic units)
+```
+
+**Connection to Thermodynamics**: Shannon entropy has same mathematical form as Boltzmann/Gibbs entropy.
+
+### 2.2 Mutual Information
+
+Information shared between variables X and Y:
+```
+I(X; Y) = H[X] + H[Y] - H[X,Y]
+        = Σ p(x,y) log[p(x,y) / (p(x)p(y))]
+```
+
+**Physical Meaning**: Mutual information quantifies correlations—how much knowing X tells you about Y.
+
+### 2.3 Kullback-Leibler Divergence
+
+"Distance" from distribution q to distribution p:
+```
+D_KL[q || p] = Σ q(x) log[q(x)/p(x)]
+             = ⟨log q - log p⟩_q
+```
+
+**Properties**:
+- Always non-negative: D_KL ≥ 0
+- Zero iff q = p almost everywhere
+- Not symmetric: D_KL[q||p] ≠ D_KL[p||q]
+
+**Physical Interpretation**: Excess entropy when using wrong distribution.
+
+### 2.4 Relative Entropy and Free Energy
+
+For canonical ensemble:
+```
+D_KL[q || p_β] = Σ q(x) log[q(x)] - Σ q(x) log[exp(-βE(x))/Z]
+                = -S[q]/k + β⟨E⟩_q + log Z
+                = β(F[q] - F[p])
+```
+
+**Key Insight**: KL divergence to Boltzmann distribution = free energy difference (in units of kT).
+
+---
+
+## 3. Landauer's Principle: Detailed Derivation
+
+### 3.1 Setup: Bit Erasure
+
+Consider a 1-bit memory:
+- **Initial state**: Unknown (0 or 1 with probabilities p₀, p₁)
+- **Final state**: Known (forced to 0)
+
+### 3.2 Information-Theoretic Analysis
+
+Initial entropy:
+```
+S_initial = -k[p₀ ln p₀ + p₁ ln p₁]
+```
+
+Final entropy:
+```
+S_final = 0  (definite state)
+```
+
+Change in information:
+```
+ΔI = S_initial - S_final = -k[p₀ ln p₀ + p₁ ln p₁]
+```
+
+For maximum erasure (p₀ = p₁ = 1/2):
+```
+ΔI = k ln 2
+```
+
+### 3.3 Thermodynamic Analysis
+
+**Second Law**: Total entropy (system + environment) cannot decrease:
+```
+ΔS_total = ΔS_system + ΔS_environment ≥ 0
+```
+
+For isothermal process:
+```
+ΔS_environment = Q/T
+```
+
+where Q is heat dissipated to environment.
+
+**Combining**:
+```
+ΔS_system + Q/T ≥ 0
+-k ln 2 + Q/T ≥ 0
+Q ≥ kT ln 2
+```
+
+**Landauer's Principle**: Erasing 1 bit of information requires dissipating at least kT ln 2 of heat.
+
+### 3.4 Physical Implementation: Szilard Engine
+
+**1-Molecule Gas Engine**:
+1. Single molecule in box (unknown side)
+2. Insert partition (0 information about position)
+3. Measure which side (gain 1 bit)
+4. Attach piston to occupied side
+5. Extract work kT ln 2 via isothermal expansion
+6. Remove partition
+7. **Erase measurement record** → Dissipate kT ln 2
+
+**Cycle**: Extract work using information, pay thermodynamic cost to erase memory.
+
+### 3.5 Generalization: Arbitrary Distribution
+
+For erasing memory in state with probability distribution p(x):
+```
+Q ≥ kT × H[p] = -kT Σ p(x) ln p(x)
+```
+
+**More uncertain initial state → More heat dissipated.**
+
+---
+
+## 4. Non-Equilibrium Thermodynamics
+
+### 4.1 Entropy Production
+
+For a system driven out of equilibrium:
+```
+dS/dt = d_iS/dt + d_eS/dt
+```
+
+- d_iS/dt = internal entropy production (≥ 0)
+- d_eS/dt = entropy flow from environment (can be negative)
+
+**Second Law**: d_iS/dt ≥ 0 always.
+
+### 4.2 Jarzynski Equality
+
+For a system driven from equilibrium at λ=0 to λ=1:
+```
+⟨exp(-βW)⟩ = exp(-βΔF)
+```
+
+Where:
+- W = work performed on system
+- ΔF = free energy difference
+- ⟨⟩ = average over many realizations
+
+**Implication**: Can extract equilibrium free energy from non-equilibrium processes.
+
+### 4.3 Crooks Fluctuation Theorem
+
+Ratio of forward to reverse process probabilities:
+```
+P(W_forward) / P(-W_reverse) = exp(β(W - ΔF))
+```
+
+**Special case (Jarzynski)**: Integrate over W.
+
+### 4.4 Entropy Production Rate
+
+For driven system:
+```
+Σ̇ = (1/T) Σ_i J_i X_i ≥ 0
+```
+
+Where:
+- J_i = thermodynamic flux (current)
+- X_i = thermodynamic force (gradient)
+
+**Examples**:
+- Heat flux: J = heat current, X = ∇(1/T)
+- Particle flux: J = particle current, X = -∇μ
+- Chemical reactions: J = reaction rate, X = -ΔG/T
+
+---
+
+## 5. Stochastic Thermodynamics
+
+### 5.1 Langevin Equation
+
+For a particle in potential V(x) with friction γ and thermal noise:
+```
+m(d²x/dt²) = -γ(dx/dt) - dV/dx + ξ(t)
+```
+
+Where noise satisfies:
+```
+⟨ξ(t)⟩ = 0
+⟨ξ(t)ξ(t')⟩ = 2γkT δ(t-t')  (fluctuation-dissipation)
+```
+
+**Overdamped limit** (low inertia):
+```
+γ(dx/dt) = -dV/dx + ξ(t)
+dx/dt = -(1/γ)dV/dx + √(2D) η(t)
+```
+
+where D = kT/γ (Einstein relation).
+
+### 5.2 Fokker-Planck Equation
+
+Evolution of probability distribution p(x,t):
+```
+∂p/∂t = -∂/∂x[v(x)p] + D ∂²p/∂x²
+```
+
+- First term: deterministic drift
+- Second term: diffusion
+
+**Steady state**: ∂p/∂t = 0 gives Boltzmann distribution.
+
+### 5.3 Stochastic Entropy Production
+
+Along a single trajectory:
+```
+Δs_tot = Δs_system + Δs_environment
+       = ln[p(x_initial)/p(x_final)] + βQ
+```
+
+**Average**: ⟨Δs_tot⟩ ≥ 0 (second law)
+
+### 5.4 Information-Theoretic Formulation
+
+For feedback control (Maxwell's demon):
+```
+⟨Δs_tot⟩ = ⟨Δs_system⟩ + ⟨Δs_environment⟩ - I
+         ≥ 0
+```
+
+Where I = mutual information between system and controller.
+
+**Sagawa-Ueda Generalized Second Law**:
+```
+⟨W⟩ ≥ ΔF - kT × I
+```
+
+Can extract up to kT×I extra work using information.
+
+---
+
+## 6. Free Energy and Variational Inference
+
+### 6.1 Helmholtz Free Energy
+
+For system at temperature T:
+```
+F = ⟨E⟩ - TS = U - TS
+```
+
+**Equilibrium condition**: F is minimized.
+
+**Physical meaning**:
+- U = ⟨E⟩ = average energy (favors low energy states)
+- -TS = entropy contribution (favors high entropy)
+- F balances energy and entropy
+
+### 6.2 Variational Free Energy (Friston)
+
+For generative model p(x,s) and observations s:
+```
+F[q] = E_q[E(x,s)] - H[q(x|s)]
+     = -E_q[log p(x,s)] + E_q[log q(x|s)]
+     = -log p(s) + D_KL[q(x|s) || p(x|s)]
+```
+
+Where:
+- x = hidden states
+- s = sensory observations
+- q(x|s) = approximate posterior (beliefs)
+- p(x|s) = true posterior
+
+**Key Properties**:
+1. F ≥ -log p(s) with equality when q = p
+2. Minimizing F ⟺ maximizing evidence p(s)
+3. F decomposes into energy and entropy
+
+### 6.3 Free Energy Principle
+
+**Biological systems minimize variational free energy:**
+```
+dF/dt ≤ 0
+```
+
+**Mechanisms**:
+1. **Perception**: Update beliefs q to minimize F (∂F/∂q)
+2. **Action**: Change sensory input s to minimize F (∂F/∂s)
+
+**Connection to Thermodynamics**:
+- Variational free energy ↔ Helmholtz free energy
+- Minimizing surprise ↔ Resisting disorder
+- Living systems are non-equilibrium steady states
+
+### 6.4 Active Inference
+
+Expected free energy for policy π:
+```
+G[π] = E_π[F[q]] + D_KL[q(s|π) || p(s)]
+```
+
+**Decomposition**:
+```
+G = Pragmatic value + Epistemic value
+  = E_π[log q(s)] - E_π[log p(s|x)]  (ambiguity)
+  + E_π[H[p(x|s)]]                    (risk)
+```
+
+**Interpretation**:
+- Pragmatic: Achieve preferred outcomes
+- Epistemic: Resolve uncertainty about world
+
+---
+
+## 7. Energy-Based Models: Physical Interpretation
+
+### 7.1 Boltzmann Machines
+
+Probability distribution over binary variables s_i ∈ {0,1}:
+```
+p(s) = (1/Z) exp(-E(s)/T)
+```
+
+Energy function:
+```
+E(s) = -Σ_ij W_ij s_i s_j - Σ_i b_i s_i
+```
+
+**Physical interpretation**:
+- W_ij = coupling strength (interaction energy)
+- b_i = external field (bias)
+- T = temperature (controls randomness)
+
+### 7.2 Hopfield Networks
+
+Symmetric weights, energy function:
+```
+E = -(1/2) Σ_ij W_ij s_i s_j - Σ_i b_i s_i
+```
+
+**Dynamics** (asynchronous update):
+```
+s_i(t+1) = sign(Σ_j W_ij s_j(t) + b_i)
+```
+
+**Energy decreases** (or stays constant) with each update:
+```
+ΔE = E(t+1) - E(t) ≤ 0
+```
+
+**Attractor dynamics**: System settles to local energy minima (memories).
+
+### 7.3 Equilibrium Propagation
+
+**Free phase**:
+```
+τ ds/dt = -∂E(s,y)/∂s
+```
+
+Settles to equilibrium s* where ∂E/∂s = 0.
+
+**Nudged phase**:
+```
+τ ds/dt = -∂E(s,y)/∂s - β(y - y_target)
+```
+
+Gently pushes toward target.
+
+**Learning rule**:
+```
+dW/dt ∝ ⟨s_i s_j⟩_nudged - ⟨s_i s_j⟩_free
+```
+
+**Physical interpretation**:
+- Free phase: Thermodynamic equilibration
+- Nudged phase: Perturbed equilibrium
+- Learning: Adjust weights to make nudge smaller
+
+### 7.4 Connection to Contrastive Divergence
+
+Gradient of log-likelihood for Boltzmann machine:
+```
+∂log p(s_data)/∂W_ij = ⟨s_i s_j⟩_data - ⟨s_i s_j⟩_model
+```
+
+**Positive phase**: ⟨⟩_data from observations
+**Negative phase**: ⟨⟩_model from sampling equilibrium
+
+**Equilibrium propagation** is continuous-time, deterministic version.
+
+---
+
+## 8. Thermodynamic Bounds on Computation
+
+### 8.1 Landauer Bound
+
+Already derived: Erasing n bits dissipates at least:
+```
+Q ≥ n × kT ln 2
+```
+
+### 8.2 Margolus-Levitin Bound
+
+Maximum speed of computation (orthogonal quantum states):
+```
+τ ≥ πℏ / (2E)
+```
+
+Where E is energy of system.
+
+**Interpretation**: Fundamental tradeoff between speed and energy. More energy → faster computation.
+
+### 8.3 Bekenstein Bound
+
+Maximum information in region of space:
+```
+I ≤ 2πRE / (ℏc ln 2)
+```
+
+Where R is radius, E is energy.
+
+**For spherical region**:
+```
+I ≤ (A/4) × (k ln 2 / (ℏG)) ≈ A/(4 L_P²)
+```
+
+Where A is surface area, L_P is Planck length.
+
+**Interpretation**: Holographic bound—information scales with area, not volume.
+
+### 8.4 Lloyd's Bound
+
+Ultimate speed of computation:
+```
+Operations/sec ≤ E / (πℏ) ≈ 10^51 × (E/1kg)
+```
+
+**Example**: 1 kg of matter → 10^51 ops/sec maximum.
+
+### 8.5 Synthesis: Multi-Dimensional Limits
+
+Computation is bounded by:
+
+| Resource | Bound | Limiting Constant |
+|----------|-------|-------------------|
+| Energy per bit erased | E ≥ kT ln 2 | Boltzmann constant k |
+| Speed vs. energy | τ ≥ πℏ/2E | Planck constant ℏ |
+| Information per energy | I ≤ E/(kT ln 2) | kT ln 2 |
+| Ops per energy | N ≤ E/(πℏ) | ℏ |
+| Info per volume | I ≤ A/(4L_P²) | Planck area |
+
+**Key Insight**: All fundamental limits trace back to h, k, c, G—the fundamental constants of physics.
+
+---
+
+## 9. Thermodynamic Cost of Learning
+
+### 9.1 Information-Theoretic View
+
+**Learning**: Extracting model θ from data D.
+
+**Information gained**:
+```
+I(D; θ) = H[θ] - H[θ|D]
+```
+
+**Minimum thermodynamic cost**:
+```
+Q ≥ kT × I(D; θ)
+```
+
+**Interpretation**: Must dissipate heat proportional to information extracted from data.
+
+### 9.2 PAC Learning Bounds
+
+Probably Approximately Correct (PAC) learning requires:
+```
+m ≥ (1/ε²) × [d log(1/ε) + log(1/δ)]
+```
+
+samples, where d = VC dimension.
+
+**Thermodynamic cost**:
+```
+Q ≥ kT × m × (log |X| + log |Y|)
+```
+
+**Implication**: Harder learning problems (larger d, smaller ε) have higher energy cost.
+
+### 9.3 Generalization and Thermodynamics
+
+**Hypothesis**: Thermodynamic cost of learning is related to generalization gap.
+
+**Intuition**:
+- Memorization: High mutual information I(D; θ)
+- Generalization: Low mutual information (compressed representation)
+
+**Possible bound**:
+```
+Generalization gap ∝ I(D; θ) / |D|
+```
+
+**Thermodynamic consequence**:
+- Overparameterized models: High I(D; θ) → High energy cost
+- Regularized models: Low I(D; θ) → Low energy cost
+
+**Prediction**: Energy-efficient learning favors generalizable models.
+
+---
+
+## 10. Mathematical Toolbox
+
+### 10.1 Useful Inequalities
+
+**Jensen's Inequality**: For convex function f:
+```
+f(E[X]) ≤ E[f(X)]
+```
+
+**Gibbs Inequality**: D_KL[p||q] ≥ 0
+
+**Log-Sum Inequality**:
+```
+Σ a_i log(a_i/b_i) ≥ (Σ a_i) log[(Σ a_i)/(Σ b_i)]
+```
+
+### 10.2 Variational Principles
+
+**ELBO (Evidence Lower Bound)**:
+```
+log p(x) ≥ E_q[log p(x,z)] - E_q[log q(z)]
+         = -F[q]
+```
+
+**Variational inference**: Maximize ELBO ⟺ Minimize free energy.
+
+### 10.3 Calculus of Variations
+
+To minimize functional F[q]:
+```
+δF/δq = 0
+```
+
+**Example**: Find q that minimizes F = E_q[E] - TS[q]:
+```
+q(x) = (1/Z) exp(-E(x)/T)  (Boltzmann distribution)
+```
+
+---
+
+## 11. Summary: Key Equations
+
+### Fundamental Constants
+```
+k = 1.381 × 10⁻²³ J/K      (Boltzmann)
+ℏ = 1.055 × 10⁻³⁴ J·s      (Planck)
+c = 3 × 10⁸ m/s             (Speed of light)
+```
+
+### Thermodynamic Relations
+```
+F = U - TS                  (Helmholtz free energy)
+dF = -SdT - PdV             (Fundamental relation)
+S = -k Σ p_i ln p_i         (Entropy)
+p_i = (1/Z) exp(-E_i/kT)    (Boltzmann distribution)
+```
+
+### Information Theory
+```
+H[p] = -Σ p(x) log p(x)                    (Shannon entropy)
+I(X;Y) = H[X] - H[X|Y]                     (Mutual information)
+D_KL[q||p] = Σ q(x) log[q(x)/p(x)]         (KL divergence)
+```
+
+### Landauer and Computation
+```
+E_erase ≥ kT ln 2           (Landauer bound)
+τ_min ≥ πℏ/(2E)            (Margolus-Levitin)
+I_max ≤ 2πRE/(ℏc ln 2)     (Bekenstein)
+```
+
+### Learning Bounds
+```
+E_learn ≥ kT × I(D; θ)     (Information cost)
+F[q] = E_q[E] - TS          (Variational free energy)
+```
+
+---
+
+## 12. Further Reading
+
+**Classical Thermodynamics**:
+- Callen, *Thermodynamics and an Introduction to Thermostatistics*
+- Chandler, *Introduction to Modern Statistical Mechanics*
+
+**Information Theory**:
+- Cover & Thomas, *Elements of Information Theory*
+- MacKay, *Information Theory, Inference, and Learning Algorithms*
+
+**Information Thermodynamics**:
+- Sagawa & Ueda, "Minimal energy cost for thermodynamic information processing"
+- Parrondo et al., "Thermodynamics of information," *Nature Physics* (2015)
+
+**Free Energy Principle**:
+- Friston, "The free-energy principle: a unified brain theory?" (2010)
+- Parr, Pezzulo, Friston, *Active Inference: The Free Energy Principle in Mind, Brain, and Behavior* (MIT Press, 2022)
+
+**Energy-Based Learning**:
+- Scellier & Bengio, "Equilibrium Propagation" (2017)
+- Hinton, "Training Products of Experts by Minimizing Contrastive Divergence" (2002)
+
+---
+
+**Status**: Comprehensive mathematical foundation for thermodynamic learning
+**Last Updated**: December 2025
+**Prerequisites**: Statistical mechanics, information theory, calculus
+**Next**: Apply these principles to implement Landauer-optimal learning systems