# Physics Foundations of Thermodynamic Learning

## Mathematical Foundations and Physical Principles

---

## Table of Contents

1. [Statistical Mechanics Primer](#1-statistical-mechanics-primer)
2. [Information Theory and Physics](#2-information-theory-and-physics)
3. [Landauer's Principle: Detailed Derivation](#3-landauers-principle-detailed-derivation)
4. [Non-Equilibrium Thermodynamics](#4-non-equilibrium-thermodynamics)
5. [Stochastic Thermodynamics](#5-stochastic-thermodynamics)
6. [Free Energy and Variational Inference](#6-free-energy-and-variational-inference)
7. [Energy-Based Models: Physical Interpretation](#7-energy-based-models-physical-interpretation)
8. [Thermodynamic Bounds on Computation](#8-thermodynamic-bounds-on-computation)
9. [Thermodynamic Cost of Learning](#9-thermodynamic-cost-of-learning)
10. [Mathematical Toolbox](#10-mathematical-toolbox)
11. [Summary: Key Equations](#11-summary-key-equations)
12. [Further Reading](#12-further-reading)

---

## 1. Statistical Mechanics Primer

### 1.1 Microcanonical Ensemble

For an isolated system with energy E:

```
Ω(E) = number of microstates with energy E
S = k ln Ω(E)    (Boltzmann entropy)
```

**Physical Meaning**: Entropy measures the logarithm of the number of accessible microstates.

### 1.2 Canonical Ensemble

For a system in thermal contact with a reservoir at temperature T:

```
p(E_i) = (1/Z) exp(-E_i / kT)
Z = Σ_i exp(-E_i / kT)    (partition function)
```

**Thermodynamic quantities**:

```
Free Energy:    F = -kT ln Z = ⟨E⟩ - TS
Entropy:        S = -k Σ_i p_i ln p_i = -k⟨ln p⟩
Average E:      ⟨E⟩ = Σ_i p_i E_i
Heat Capacity:  C = d⟨E⟩/dT
```

### 1.3 Boltzmann Distribution

The probability of a state with energy E at temperature T:

```
p(E) ∝ exp(-E/kT) = exp(-βE)
```

where β = 1/(kT) is the **inverse temperature** (coldness).

**Key Insight**: Physical systems naturally sample from probability distributions weighted by exp(-energy).

### 1.4 Fluctuation-Dissipation Theorem

Thermal fluctuations and dissipation are linked. For a free Brownian particle of mass m with friction coefficient γ, the velocity autocorrelation decays at the dissipative rate:

```
⟨δv(t) δv(0)⟩ = (kT/m) exp(-γt/m)
```

**Implication**: A system cannot be low-noise without being dissipative; at temperature T, thermal noise is unavoidable.

---

## 2. Information Theory and Physics

### 2.1 Shannon Entropy

For a discrete probability distribution p(x):

```
H[p] = -Σ_x p(x) log₂ p(x)      (bits)
S[p] = -k Σ_x p(x) ln p(x)      (thermodynamic units)
```

**Connection to Thermodynamics**: Shannon entropy has the same mathematical form as the Boltzmann/Gibbs entropy; they differ only by the constant k and the base of the logarithm.

### 2.2 Mutual Information

Information shared between variables X and Y:

```
I(X; Y) = H[X] + H[Y] - H[X,Y]
        = Σ p(x,y) log[p(x,y) / (p(x)p(y))]
```

**Physical Meaning**: Mutual information quantifies correlations: how much knowing X tells you about Y.

### 2.3 Kullback-Leibler Divergence

The "distance" from distribution q to distribution p:

```
D_KL[q || p] = Σ q(x) log[q(x)/p(x)] = ⟨log q - log p⟩_q
```

**Properties**:

- Always non-negative: D_KL ≥ 0
- Zero iff q = p almost everywhere
- Not symmetric: D_KL[q||p] ≠ D_KL[p||q]

**Physical Interpretation**: The excess entropy incurred by using the wrong distribution.

### 2.4 Relative Entropy and Free Energy

For the canonical ensemble p_β(x) = exp(-βE(x))/Z:

```
D_KL[q || p_β] = Σ q(x) ln q(x) - Σ q(x) ln[exp(-βE(x))/Z]
              = -S[q]/k + β⟨E⟩_q + ln Z
              = β(F[q] - F_eq)
```

where F[q] = ⟨E⟩_q - TS[q] and F_eq = -kT ln Z.

**Key Insight**: The KL divergence to the Boltzmann distribution equals the free energy difference in units of kT.
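The identity in Section 2.4 is easy to verify numerically. Below is a minimal NumPy sketch in units where k = 1 (so S[q]/k is just the Shannon entropy in nats); the eight-state energies and the distribution q are arbitrary illustrative choices, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
kT = 1.0                                   # work in units where k = 1
beta = 1.0 / kT

# Arbitrary energies for a small 8-state system (illustrative values)
E = rng.uniform(0.0, 5.0, size=8)

# Boltzmann distribution p_beta and equilibrium free energy F_eq = -kT ln Z
Z = np.sum(np.exp(-beta * E))
p = np.exp(-beta * E) / Z
F_eq = -kT * np.log(Z)

# An arbitrary non-equilibrium distribution q
q = rng.uniform(size=8)
q /= q.sum()

# Variational free energy F[q] = <E>_q - T S[q]
S_q = -np.sum(q * np.log(q))
F_q = np.sum(q * E) - kT * S_q

# Both sides of the identity D_KL[q || p_beta] = beta (F[q] - F_eq)
D_kl = np.sum(q * np.log(q / p))
print(D_kl, beta * (F_q - F_eq))           # agree to machine precision
```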
---

## 3. Landauer's Principle: Detailed Derivation

### 3.1 Setup: Bit Erasure

Consider a 1-bit memory:

- **Initial state**: Unknown (0 or 1 with probabilities p₀, p₁)
- **Final state**: Known (forced to 0)

### 3.2 Information-Theoretic Analysis

Initial entropy:

```
S_initial = -k[p₀ ln p₀ + p₁ ln p₁]
```

Final entropy:

```
S_final = 0    (definite state)
```

Change in information:

```
ΔI = S_initial - S_final = -k[p₀ ln p₀ + p₁ ln p₁]
```

For maximum erasure (p₀ = p₁ = 1/2):

```
ΔI = k ln 2
```

### 3.3 Thermodynamic Analysis

**Second Law**: Total entropy (system + environment) cannot decrease:

```
ΔS_total = ΔS_system + ΔS_environment ≥ 0
```

For an isothermal process:

```
ΔS_environment = Q/T
```

where Q is the heat dissipated to the environment.

**Combining** (erasing an initially uniform bit gives ΔS_system = -k ln 2):

```
ΔS_system + Q/T ≥ 0
-k ln 2 + Q/T ≥ 0
Q ≥ kT ln 2
```

**Landauer's Principle**: Erasing 1 bit of information requires dissipating at least kT ln 2 of heat.

### 3.4 Physical Implementation: Szilard Engine

**1-Molecule Gas Engine**:

1. Single molecule in a box (unknown side)
2. Insert a partition (still 0 bits of information about position)
3. Measure which side the molecule is on (gain 1 bit)
4. Attach a piston to the occupied side
5. Extract work kT ln 2 via isothermal expansion
6. Remove the partition
7. **Erase the measurement record** → dissipate at least kT ln 2

**Cycle**: Extract work using information, then pay the thermodynamic cost of erasing the memory. The ledger balances: no net work is extracted over a full cycle.

### 3.5 Generalization: Arbitrary Distribution

Erasing a memory whose state has probability distribution p(x) requires:

```
Q ≥ kT H[p] = -kT Σ p(x) ln p(x)
```

where H[p] is measured in nats.

**More uncertain initial state → more heat dissipated.**

---

## 4. Non-Equilibrium Thermodynamics

### 4.1 Entropy Production

For a system driven out of equilibrium:

```
dS/dt = d_iS/dt + d_eS/dt
```

- d_iS/dt = internal entropy production (≥ 0)
- d_eS/dt = entropy flow from the environment (can be negative)

**Second Law**: d_iS/dt ≥ 0 always.

### 4.2 Jarzynski Equality

For a system driven from equilibrium by varying a control parameter from λ = 0 to λ = 1:

```
⟨exp(-βW)⟩ = exp(-βΔF)
```

where:

- W = work performed on the system
- ΔF = equilibrium free energy difference between the endpoint ensembles
- ⟨·⟩ = average over many realizations of the protocol

**Implication**: Equilibrium free energy differences can be extracted from non-equilibrium processes.

### 4.3 Crooks Fluctuation Theorem

Ratio of forward to reverse work distributions:

```
P_F(W) / P_R(-W) = exp(β(W - ΔF))
```

**Special case (Jarzynski)**: Rearrange to exp(-βW) P_F(W) = exp(-βΔF) P_R(-W) and integrate over W to recover ⟨exp(-βW)⟩ = exp(-βΔF).

### 4.4 Entropy Production Rate

For a driven system:

```
Σ̇ = Σ_i J_i X_i ≥ 0
```

where:

- J_i = thermodynamic flux (current)
- X_i = conjugate thermodynamic force (the 1/T factors live in the forces)

**Examples**:

- Heat flow: J = heat current, X = ∇(1/T)
- Particle flow: J = particle current, X = -∇(μ/T)
- Chemical reactions: J = reaction rate, X = -ΔG/T

---

## 5. Stochastic Thermodynamics

### 5.1 Langevin Equation

For a particle in potential V(x) with friction γ and thermal noise:

```
m(d²x/dt²) = -γ(dx/dt) - dV/dx + ξ(t)
```

where the noise satisfies:

```
⟨ξ(t)⟩ = 0
⟨ξ(t)ξ(t')⟩ = 2γkT δ(t-t')    (fluctuation-dissipation)
```

**Overdamped limit** (negligible inertia):

```
γ(dx/dt) = -dV/dx + ξ(t)
dx/dt = -(1/γ) dV/dx + √(2D) η(t)
```

where D = kT/γ (Einstein relation) and η is unit-variance white noise.

### 5.2 Fokker-Planck Equation

Evolution of the probability distribution p(x,t), with drift v(x) = -(1/γ) dV/dx:

```
∂p/∂t = -∂/∂x[v(x) p] + D ∂²p/∂x²
```

- First term: deterministic drift
- Second term: diffusion

**Steady state**: Setting ∂p/∂t = 0 recovers the Boltzmann distribution p(x) ∝ exp(-V(x)/kT).
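To make Sections 5.1–5.2 concrete, here is a minimal sketch that integrates the overdamped Langevin equation with the Euler-Maruyama scheme (a standard discretization chosen here for illustration; all parameter values are arbitrary). For the harmonic potential V(x) = x²/2, the stationary Boltzmann distribution is Gaussian with variance kT, which the long-run samples should reproduce:

```python
import numpy as np

# Euler-Maruyama integration of gamma dx/dt = -dV/dx + xi(t)
# for the harmonic potential V(x) = x^2 / 2 (so dV/dx = x).
rng = np.random.default_rng(1)
kT, gamma, dt = 0.5, 1.0, 1e-3
D = kT / gamma                              # Einstein relation D = kT/gamma

x = 0.0
samples = []
for step in range(500_000):
    x += -(x / gamma) * dt + np.sqrt(2.0 * D * dt) * rng.standard_normal()
    if step > 50_000 and step % 20 == 0:    # discard transient, subsample
        samples.append(x)

# Stationary distribution should be Boltzmann: p(x) ∝ exp(-V(x)/kT),
# i.e., Gaussian with variance kT = 0.5
print("empirical variance:", np.var(samples))
```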
### 5.3 Stochastic Entropy Production

Along a single trajectory (entropy in units of k):

```
Δs_tot = Δs_system + Δs_environment
       = ln[p(x_initial)/p(x_final)] + βQ
```

**Average**: ⟨Δs_tot⟩ ≥ 0 (second law).

### 5.4 Information-Theoretic Formulation

Under feedback control (Maxwell's demon), the entropy of system plus environment may decrease, but by no more than the mutual information I (in nats) acquired between system and controller:

```
⟨Δs_system⟩ + ⟨Δs_environment⟩ + I ≥ 0
```

**Sagawa-Ueda Generalized Second Law**:

```
⟨W⟩ ≥ ΔF - kT × I
```

Up to kT×I of extra work can be extracted by exploiting the information.

---

## 6. Free Energy and Variational Inference

### 6.1 Helmholtz Free Energy

For a system at temperature T:

```
F = ⟨E⟩ - TS = U - TS
```

**Equilibrium condition**: F is minimized.

**Physical meaning**:

- U = ⟨E⟩ = average energy (favors low-energy states)
- -TS = entropy contribution (favors high entropy)
- F balances energy against entropy

### 6.2 Variational Free Energy (Friston)

For a generative model p(x,s) and observations s:

```
F[q] = E_q[E(x,s)] - H[q(x|s)]
     = -E_q[log p(x,s)] + E_q[log q(x|s)]
     = -log p(s) + D_KL[q(x|s) || p(x|s)]
```

where:

- x = hidden states
- s = sensory observations
- q(x|s) = approximate posterior (beliefs)
- p(x|s) = true posterior

**Key Properties**:

1. F ≥ -log p(s), with equality when q = p
2. Minimizing F ⟺ maximizing the evidence p(s)
3. F decomposes into an energy term and an entropy term

### 6.3 Free Energy Principle

**Biological systems minimize variational free energy:**

```
dF/dt ≤ 0
```

**Mechanisms**:

1. **Perception**: Update beliefs q to minimize F (descend ∂F/∂q)
2. **Action**: Change sensory input s to minimize F (descend ∂F/∂s)

**Connection to Thermodynamics**:

- Variational free energy ↔ Helmholtz free energy
- Minimizing surprise ↔ resisting disorder
- Living systems are non-equilibrium steady states

### 6.4 Active Inference

The expected free energy of a policy π decomposes into risk and ambiguity:

```
G[π] = D_KL[q(s|π) || p(s)]    (risk)
     + E_q[H[p(s|x)]]          (ambiguity)
```

**Interpretation**:

- Pragmatic value: minimize risk by steering expected outcomes q(s|π) toward preferred outcomes p(s)
- Epistemic value: minimize ambiguity by seeking observations that resolve uncertainty about hidden states

---

## 7. Energy-Based Models: Physical Interpretation

### 7.1 Boltzmann Machines

Probability distribution over binary variables s_i ∈ {0,1}:

```
p(s) = (1/Z) exp(-E(s)/T)
```

Energy function:

```
E(s) = -Σ_ij W_ij s_i s_j - Σ_i b_i s_i
```

**Physical interpretation**:

- W_ij = coupling strength (interaction energy)
- b_i = external field (bias)
- T = temperature (controls randomness)

### 7.2 Hopfield Networks

Binary states s_i ∈ {-1,+1}, symmetric weights, energy function:

```
E = -(1/2) Σ_ij W_ij s_i s_j - Σ_i b_i s_i
```

**Dynamics** (asynchronous update):

```
s_i(t+1) = sign(Σ_j W_ij s_j(t) + b_i)
```

**Energy decreases** (or stays constant) with each update:

```
ΔE = E(t+1) - E(t) ≤ 0
```

**Attractor dynamics**: The system settles into local energy minima (stored memories).

### 7.3 Equilibrium Propagation

**Free phase**:

```
τ ds/dt = -∂E(s,y)/∂s
```

The network settles to an equilibrium s* where ∂E/∂s = 0.

**Nudged phase** (output units y gently pulled toward the target):

```
τ ds/dt = -∂E(s,y)/∂s - β(y - y_target)
```

**Learning rule**:

```
dW/dt ∝ (1/β) [⟨s_i s_j⟩_nudged - ⟨s_i s_j⟩_free]
```

**Physical interpretation**:

- Free phase: thermodynamic equilibration
- Nudged phase: weakly perturbed equilibrium
- Learning: adjust weights to make the nudge smaller

A toy numerical sketch of this two-phase procedure follows below.
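The sketch below is a deliberately tiny, hypothetical instance of the two-phase recipe: a single linear layer with energy E(s; W) = ½‖s‖² − s·(Wx) and quadratic cost C(s) = ½‖s − y‖², settled by gradient descent in each phase. It is not Scellier & Bengio's full network, just the contrastive rule in miniature; the sizes and parameter values are illustrative, and β here is the nudging strength of Section 7.3, not an inverse temperature. As β → 0 the update approaches gradient descent on C:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(scale=0.1, size=(3, 5))      # weights (illustrative sizes)
x = rng.normal(size=5)                      # one input pattern
y = rng.normal(size=3)                      # its target
beta, lr, tau = 0.05, 0.02, 0.1

def settle(W, beta, steps=200):
    """Relax tau ds/dt = -dE/ds - beta dC/ds to its fixed point."""
    s = np.zeros(3)
    for _ in range(steps):
        s -= tau * ((s - W @ x) + beta * (s - y))
    return s

for epoch in range(300):
    s_free = settle(W, beta=0.0)            # free phase: s* = W x
    s_nudged = settle(W, beta=beta)         # weakly nudged phase
    # Contrast -dE/dW = s x^T between the two equilibria, scaled by 1/beta
    W += (lr / beta) * (np.outer(s_nudged, x) - np.outer(s_free, x))

print("final cost:", 0.5 * np.sum((W @ x - y) ** 2))   # should be near 0
```

Note that the weight update never references the cost gradient directly: it only compares local correlations at two equilibria, which is what makes the scheme physically plausible.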
### 7.4 Connection to Contrastive Divergence

Gradient of the log-likelihood for a Boltzmann machine:

```
∂log p(s_data)/∂W_ij = ⟨s_i s_j⟩_data - ⟨s_i s_j⟩_model
```

**Positive phase**: ⟨·⟩_data from observations
**Negative phase**: ⟨·⟩_model from sampling at equilibrium

**Equilibrium propagation** is a continuous-time, deterministic version of the same contrast.

---

## 8. Thermodynamic Bounds on Computation

### 8.1 Landauer Bound

Already derived: erasing n bits dissipates at least

```
Q ≥ n × kT ln 2
```

### 8.2 Margolus-Levitin Bound

Minimum time to evolve between orthogonal quantum states:

```
τ ≥ πℏ / (2E)
```

where E is the system's average energy above its ground state.

**Interpretation**: A fundamental tradeoff between speed and energy. More energy → faster computation.

### 8.3 Bekenstein Bound

Maximum information in a region of space:

```
I ≤ 2πRE / (ℏc ln 2)
```

where R is the radius and E the enclosed energy.

**Holographic limit** (saturated by a black hole, whose entropy is S = kA/(4 L_P²)):

```
I ≤ A / (4 L_P² ln 2)
```

where A is the surface area and L_P = √(ℏG/c³) is the Planck length.

**Interpretation**: Holographic bound: information scales with area, not volume.

### 8.4 Lloyd's Bound

Ultimate speed of computation, following from Margolus-Levitin:

```
Operations/sec ≤ 2E / (πℏ)
```

**Example**: Taking E = mc², 1 kg of matter supports at most ~5 × 10⁵⁰ ops/sec.

### 8.5 Synthesis: Multi-Dimensional Limits

Computation is bounded by:

| Resource | Bound | Limiting Constant |
|----------|-------|-------------------|
| Energy per bit erased | E ≥ kT ln 2 | Boltzmann constant k |
| Time per operation | τ ≥ πℏ/(2E) | Planck constant ℏ |
| Bits erasable per energy | I ≤ E/(kT ln 2) | kT ln 2 |
| Operations per second | Rate ≤ 2E/(πℏ) | ℏ |
| Information per region | I ≤ A/(4 L_P² ln 2) | Planck area |

**Key Insight**: All fundamental limits trace back to ℏ, k, c, G: the fundamental constants of physics.

---

## 9. Thermodynamic Cost of Learning

### 9.1 Information-Theoretic View

**Learning**: extracting model parameters θ from data D.

**Information gained**:

```
I(D; θ) = H[θ] - H[θ|D]
```

**Minimum thermodynamic cost** (with I in nats):

```
Q ≥ kT × I(D; θ)
```

**Interpretation**: Heat must be dissipated in proportion to the information extracted from the data.

### 9.2 PAC Learning Bounds

Probably Approximately Correct (PAC) learning requires on the order of

```
m ≥ (1/ε²) × [d log(1/ε) + log(1/δ)]
```

samples, where d is the VC dimension, ε the accuracy, and δ the failure probability.

**Thermodynamic cost** of processing these samples:

```
Q ≥ kT × m × (log |X| + log |Y|)
```

**Implication**: Harder learning problems (larger d, smaller ε) carry a higher minimum energy cost.

### 9.3 Generalization and Thermodynamics

**Hypothesis**: The thermodynamic cost of learning is related to the generalization gap.

**Intuition**:

- Memorization: high mutual information I(D; θ)
- Generalization: low mutual information (compressed representation)

**Possible bound**:

```
Generalization gap ∝ I(D; θ) / |D|
```

**Thermodynamic consequence**:

- Overparameterized models: high I(D; θ) → high energy cost
- Regularized models: low I(D; θ) → low energy cost

**Prediction**: Energy-efficient learning favors generalizable models.
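As a sanity check on the magnitudes in Sections 8.1 and 9.1, the sketch below evaluates the Landauer minimum at room temperature. The constants are standard physical values; the bit counts and the mutual-information figure are made-up illustrative workloads, not estimates for any real system:

```python
import math

k_B = 1.380649e-23                   # Boltzmann constant, J/K
T = 300.0                            # room temperature, K
per_bit = k_B * T * math.log(2)      # Landauer minimum per bit, ~2.87e-21 J

# Q >= n kT ln 2 (Section 8.1): illustrative erasure workloads
for label, n_bits in [("1 bit", 1.0), ("1 GB erased", 8e9)]:
    print(f"{label:>12s}: Q >= {n_bits * per_bit:.3e} J")

# Q >= kT * I(D; theta) (Section 9.1), with I in nats; the value of I
# is a hypothetical illustration, not a measurement of any model
I_nats = 1e9
print(f"I = 1e9 nats: Q >= {k_B * T * I_nats:.3e} J")
```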
---

## 10. Mathematical Toolbox

### 10.1 Useful Inequalities

**Jensen's Inequality**: For a convex function f:

```
f(E[X]) ≤ E[f(X)]
```

**Gibbs' Inequality**: D_KL[p||q] ≥ 0

**Log-Sum Inequality**:

```
Σ a_i log(a_i/b_i) ≥ (Σ a_i) log[(Σ a_i)/(Σ b_i)]
```

### 10.2 Variational Principles

**ELBO (Evidence Lower Bound)**:

```
log p(x) ≥ E_q[log p(x,z)] - E_q[log q(z)] = -F[q]
```

**Variational inference**: Maximizing the ELBO ⟺ minimizing the free energy.

### 10.3 Calculus of Variations

To minimize a functional F[q], set the functional derivative to zero:

```
δF/δq = 0
```

**Example**: Minimizing F = E_q[E] - TS[q] subject to normalization (enforced with a Lagrange multiplier) yields

```
q(x) = (1/Z) exp(-E(x)/T)    (Boltzmann distribution)
```

---

## 11. Summary: Key Equations

### Fundamental Constants

```
k = 1.381 × 10⁻²³ J/K    (Boltzmann)
ℏ = 1.055 × 10⁻³⁴ J·s    (reduced Planck)
c = 3 × 10⁸ m/s          (speed of light)
```

### Thermodynamic Relations

```
F = U - TS                      (Helmholtz free energy)
dF = -S dT - P dV               (fundamental relation)
S = -k Σ p_i ln p_i             (Gibbs entropy)
p_i = (1/Z) exp(-E_i/kT)        (Boltzmann distribution)
```

### Information Theory

```
H[p] = -Σ p(x) log p(x)                 (Shannon entropy)
I(X;Y) = H[X] - H[X|Y]                  (mutual information)
D_KL[q||p] = Σ q(x) log[q(x)/p(x)]      (KL divergence)
```

### Landauer and Computation

```
E_erase ≥ kT ln 2            (Landauer bound)
τ ≥ πℏ/(2E)                  (Margolus-Levitin)
I ≤ 2πRE/(ℏc ln 2)           (Bekenstein)
```

### Learning Bounds

```
E_learn ≥ kT × I(D; θ)       (information cost)
F[q] = E_q[E] - TS[q]        (variational free energy)
```

---

## 12. Further Reading

**Classical Thermodynamics**:

- Callen, *Thermodynamics and an Introduction to Thermostatistics*
- Chandler, *Introduction to Modern Statistical Mechanics*

**Information Theory**:

- Cover & Thomas, *Elements of Information Theory*
- MacKay, *Information Theory, Inference, and Learning Algorithms*

**Information Thermodynamics**:

- Sagawa & Ueda, "Minimal energy cost for thermodynamic information processing"
- Parrondo, Horowitz & Sagawa, "Thermodynamics of information," *Nature Physics* (2015)

**Free Energy Principle**:

- Friston, "The free-energy principle: a unified brain theory?" (2010)
- Parr, Pezzulo & Friston, *Active Inference: The Free Energy Principle in Mind, Brain, and Behavior* (MIT Press, 2022)

**Energy-Based Learning**:

- Scellier & Bengio, "Equilibrium Propagation" (2017)
- Hinton, "Training Products of Experts by Minimizing Contrastive Divergence" (2002)

---

**Status**: Comprehensive mathematical foundation for thermodynamic learning
**Last Updated**: December 2025
**Prerequisites**: Statistical mechanics, information theory, calculus
**Next**: Apply these principles to implement Landauer-optimal learning systems