
Physics Foundations of Thermodynamic Learning

Mathematical Foundations and Physical Principles


Table of Contents

  1. Statistical Mechanics Primer
  2. Information Theory and Physics
  3. Landauer's Principle: Detailed Derivation
  4. Non-Equilibrium Thermodynamics
  5. Stochastic Thermodynamics
  6. Free Energy and Variational Inference
  7. Energy-Based Models: Physical Interpretation
  8. Thermodynamic Bounds on Computation
  9. Thermodynamic Cost of Learning
  10. Mathematical Toolbox
  11. Summary: Key Equations
  12. Further Reading

1. Statistical Mechanics Primer

1.1 Microcanonical Ensemble

For an isolated system with energy E:

Ω(E) = number of microstates with energy E
S = k ln Ω(E)  (Boltzmann entropy)

Physical Meaning: Entropy measures the logarithm of accessible microstates.

1.2 Canonical Ensemble

For a system in thermal contact with reservoir at temperature T:

p(E_i) = (1/Z) exp(-E_i / kT)
Z = Σ_i exp(-E_i / kT)  (partition function)

Thermodynamic quantities:

Free Energy:  F = -kT ln Z = ⟨E⟩ - TS
Entropy:      S = -k Σ_i p_i ln p_i = -k⟨ln p⟩
Average E:    ⟨E⟩ = Σ_i p_i E_i
Heat Capacity: C = d⟨E⟩/dT

1.3 Boltzmann Distribution

The probability of state with energy E at temperature T:

p(E) ∝ exp(-E/kT) = exp(-βE)

where β = 1/(kT) is the inverse temperature (coldness).

Key Insight: Physical systems naturally sample from probability distributions weighted by exp(-energy).
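As a concrete sanity check of these relations, here is a minimal sketch in Python (assuming natural units with k = 1 and an arbitrary toy energy spectrum) that builds the partition function and verifies F = -kT ln Z = ⟨E⟩ - TS:

```python
import numpy as np

k, T = 1.0, 0.5                        # Boltzmann constant and temperature (natural units)
E = np.array([0.0, 1.0, 1.0, 2.5])     # toy energy levels

beta = 1.0 / (k * T)
Z = np.sum(np.exp(-beta * E))          # partition function
p = np.exp(-beta * E) / Z              # Boltzmann probabilities

E_avg = np.sum(p * E)                  # ⟨E⟩
S = -k * np.sum(p * np.log(p))         # Gibbs entropy
print(-k * T * np.log(Z), E_avg - T * S)   # F = -kT ln Z  equals  ⟨E⟩ - TS
```

The same enumeration works for any finite state space; the Boltzmann weights are the only model input.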

1.4 Fluctuation-Dissipation Theorem

Thermal fluctuations and dissipation are related. For a Brownian particle of mass m with friction coefficient γ, the velocity autocorrelation decays as:

⟨δv(t) δv(0)⟩ = (kT/m) exp(-γt/m)

Implication: Any mechanism that dissipates energy also injects thermal noise whose magnitude is set by T; one cannot have dissipation without fluctuations at finite temperature.


2. Information Theory and Physics

2.1 Shannon Entropy

For discrete probability distribution p(x):

H[p] = -Σ_x p(x) log₂ p(x)  (bits)
     = -k Σ_x p(x) ln p(x)   (thermodynamic units)

Connection to Thermodynamics: Shannon entropy has same mathematical form as Boltzmann/Gibbs entropy.

2.2 Mutual Information

Information shared between variables X and Y:

I(X; Y) = H[X] + H[Y] - H[X,Y]
        = Σ p(x,y) log[p(x,y) / (p(x)p(y))]

Physical Meaning: Mutual information quantifies correlations—how much knowing X tells you about Y.
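A small numeric illustration (Python; the 2×2 joint distribution is hypothetical) of the identity I(X;Y) = H[X] + H[Y] - H[X,Y]:

```python
import numpy as np

# Hypothetical joint distribution p(x, y) over two binary variables
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

def H(p):
    """Shannon entropy in bits (zero-probability entries ignored)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_x = p_xy.sum(axis=1)                 # marginal p(x)
p_y = p_xy.sum(axis=0)                 # marginal p(y)

print(H(p_x) + H(p_y) - H(p_xy))       # I(X;Y) ≈ 0.28 bits of shared information
```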

2.3 Kullback-Leibler Divergence

"Distance" from distribution q to distribution p:

D_KL[q || p] = Σ q(x) log[q(x)/p(x)]
             = ⟨log q - log p⟩_q

Properties:

  • Always non-negative: D_KL ≥ 0
  • Zero iff q = p almost everywhere
  • Not symmetric: D_KL[q||p] ≠ D_KL[p||q]

Physical Interpretation: The excess entropy (or description length) incurred by assuming distribution p when samples are actually drawn from q.

2.4 Relative Entropy and Free Energy

For canonical ensemble:

D_KL[q || p_β] = Σ q(x) log[q(x)] - Σ q(x) log[exp(-βE(x))/Z]
                = -S[q]/k + β⟨E⟩_q + log Z
                = β(F[q] - F[p])

Key Insight: KL divergence to Boltzmann distribution = free energy difference (in units of kT).
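This identity is easy to verify numerically; the sketch below (Python, k = 1, toy energy levels and an arbitrary non-equilibrium distribution q) computes both sides:

```python
import numpy as np

k, T = 1.0, 1.0
beta = 1.0 / (k * T)
E = np.array([0.0, 0.5, 1.0, 2.0])         # toy energy levels

Z = np.sum(np.exp(-beta * E))
p_eq = np.exp(-beta * E) / Z               # Boltzmann distribution p_β
F_eq = -k * T * np.log(Z)                  # equilibrium free energy

q = np.array([0.4, 0.3, 0.2, 0.1])         # arbitrary non-equilibrium distribution
F_q = np.sum(q * E) - T * (-k * np.sum(q * np.log(q)))   # F[q] = ⟨E⟩_q - T S[q]

print(np.sum(q * np.log(q / p_eq)),        # D_KL[q || p_β] in nats
      beta * (F_q - F_eq))                 # β (F[q] - F[p]) — identical
```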


3. Landauer's Principle: Detailed Derivation

3.1 Setup: Bit Erasure

Consider a 1-bit memory:

  • Initial state: Unknown (0 or 1 with probabilities p₀, p₁)
  • Final state: Known (forced to 0)

3.2 Information-Theoretic Analysis

Initial entropy:

S_initial = -k[p₀ ln p₀ + p₁ ln p₁]

Final entropy:

S_final = 0  (definite state)

Change in information:

ΔI = S_initial - S_final = -k[p₀ ln p₀ + p₁ ln p₁]

For maximum erasure (p₀ = p₁ = 1/2):

ΔI = k ln 2

3.3 Thermodynamic Analysis

Second Law: Total entropy (system + environment) cannot decrease:

ΔS_total = ΔS_system + ΔS_environment ≥ 0

For isothermal process:

ΔS_environment = Q/T

where Q is heat dissipated to environment.

Combining:

ΔS_system + Q/T ≥ 0
-k ln 2 + Q/T ≥ 0
Q ≥ kT ln 2

Landauer's Principle: Erasing 1 bit of information requires dissipating at least kT ln 2 of heat.
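For scale, the bound is minuscule per bit at room temperature; a quick calculation in SI units (Python):

```python
import math

k = 1.380649e-23                     # Boltzmann constant, J/K
T = 300.0                            # room temperature, K

q_min = k * T * math.log(2)          # minimum heat per erased bit
print(q_min)                         # ≈ 2.9e-21 J (≈ 0.018 eV) per bit

print(8e9 * q_min)                   # erasing 1 GB (8e9 bits): ≈ 2.3e-11 J at the limit
```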

3.4 Physical Implementation: Szilard Engine

1-Molecule Gas Engine:

  1. Single molecule in box (unknown side)
  2. Insert partition (0 information about position)
  3. Measure which side (gain 1 bit)
  4. Attach piston to occupied side
  5. Extract work kT ln 2 via isothermal expansion
  6. Remove partition
  7. Erase measurement record → Dissipate kT ln 2

Cycle: Extract work using information, pay thermodynamic cost to erase memory.

3.5 Generalization: Arbitrary Distribution

For erasing memory in state with probability distribution p(x):

Q ≥ kT × H[p] = -kT Σ p(x) ln p(x)

More uncertain initial state → More heat dissipated.


4. Non-Equilibrium Thermodynamics

4.1 Entropy Production

For a system driven out of equilibrium:

dS/dt = d_iS/dt + d_eS/dt
  • d_iS/dt = internal entropy production (≥ 0)
  • d_eS/dt = entropy flow from environment (can be negative)

Second Law: d_iS/dt ≥ 0 always.

4.2 Jarzynski Equality

For a system prepared in equilibrium and driven by a control parameter λ from λ=0 to λ=1:

⟨exp(-βW)⟩ = exp(-βΔF)

Where:

  • W = work performed on system
  • ΔF = free energy difference
  • ⟨⟩ = average over many realizations

Implication: Can extract equilibrium free energy from non-equilibrium processes.
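One way to see the structure of the equality without simulating dynamics is a toy Gaussian work distribution: if W ~ N(ΔF + βσ²/2, σ²), then ⟨exp(-βW)⟩ = exp(-βΔF) holds exactly. A sketch (Python, illustrative parameters in units of kT):

```python
import numpy as np

rng = np.random.default_rng(0)
beta, dF, sigma = 1.0, 2.0, 1.5            # illustrative parameters (units of kT)

# Gaussian work distribution with mean ΔF + βσ²/2 satisfies Jarzynski exactly
W = rng.normal(loc=dF + beta * sigma**2 / 2, scale=sigma, size=200_000)

print(np.mean(np.exp(-beta * W)))          # ⟨exp(-βW)⟩ over realizations
print(np.exp(-beta * dF))                  # exp(-βΔF): matches up to sampling noise
print(np.mean(W) >= dF)                    # ⟨W⟩ ≥ ΔF (second law holds on average)
```

Note that the exponential average is dominated by rare low-work trajectories, which is why Jarzynski estimates converge slowly in practice.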

4.3 Crooks Fluctuation Theorem

Ratio of the work distributions for the forward and time-reversed protocols:

P_F(W) / P_R(-W) = exp(β(W - ΔF))

Special case (Jarzynski): rearranging gives P_F(W) exp(-βW) = P_R(-W) exp(-βΔF); integrating over W yields ⟨exp(-βW)⟩ = exp(-βΔF).

4.4 Entropy Production Rate

For driven system:

Σ̇ = Σ_i J_i X_i ≥ 0

Where:

  • J_i = thermodynamic flux (current)
  • X_i = thermodynamic force (gradient)

Examples:

  • Heat flux: J = heat current, X = ∇(1/T)
  • Particle flux: J = particle current, X = -∇(μ/T)
  • Chemical reactions: J = reaction rate, X = -ΔG/T

5. Stochastic Thermodynamics

5.1 Langevin Equation

For a particle in potential V(x) with friction γ and thermal noise:

m(d²x/dt²) = -γ(dx/dt) - dV/dx + ξ(t)

Where noise satisfies:

⟨ξ(t)⟩ = 0
⟨ξ(t)ξ(t')⟩ = 2γkT δ(t-t')  (fluctuation-dissipation)

Overdamped limit (low inertia):

γ(dx/dt) = -dV/dx + ξ(t)
dx/dt = -(1/γ)dV/dx + √(2D) η(t)

where D = kT/γ (Einstein relation).
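A minimal overdamped Langevin simulation (Python, Euler-Maruyama discretization, natural units, harmonic potential V(x) = ½κx² as an assumed example) shows the long-time samples settling into the Boltzmann distribution, whose variance is kT/κ:

```python
import numpy as np

rng = np.random.default_rng(1)
kT, gamma, kappa = 1.0, 1.0, 2.0           # temperature, friction, trap stiffness
D = kT / gamma                              # Einstein relation
dt, n_steps = 1e-3, 500_000

x, samples = 0.0, np.empty(n_steps)
for i in range(n_steps):
    force = -kappa * x                      # -dV/dx for V(x) = ½ κ x²
    x += (force / gamma) * dt + np.sqrt(2 * D * dt) * rng.standard_normal()
    samples[i] = x

print(np.var(samples), kT / kappa)          # empirical variance ≈ kT/κ = 0.5
```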

5.2 Fokker-Planck Equation

Evolution of probability distribution p(x,t):

∂p/∂t = -∂/∂x[v(x)p] + D ∂²p/∂x²
  • First term: deterministic drift, with v(x) = -(1/γ) dV/dx
  • Second term: diffusion

Steady state: ∂p/∂t = 0 gives the Boltzmann distribution p(x) ∝ exp(-V(x)/kT).

5.3 Stochastic Entropy Production

Along a single trajectory:

Δs_tot = Δs_system + Δs_environment
       = ln[p(x_initial)/p(x_final)] + βQ

Average: ⟨Δs_tot⟩ ≥ 0 (second law)

5.4 Information-Theoretic Formulation

For feedback control (Maxwell's demon):

⟨Δs_system⟩ + ⟨Δs_environment⟩ ≥ -I

Where I = mutual information between system and controller. Feedback can drive the total entropy production negative, but by no more than the information acquired.

Sagawa-Ueda Generalized Second Law:

⟨W⟩ ≥ ΔF - kT × I

Can extract up to kT×I extra work using information.


6. Free Energy and Variational Inference

6.1 Helmholtz Free Energy

For system at temperature T:

F = ⟨E⟩ - TS = U - TS

Equilibrium condition: F is minimized.

Physical meaning:

  • U = ⟨E⟩ = average energy (favors low energy states)
  • -TS = entropy contribution (favors high entropy)
  • F balances energy and entropy

6.2 Variational Free Energy (Friston)

For generative model p(x,s) and observations s:

F[q] = E_q[E(x,s)] - H[q(x|s)]
     = -E_q[log p(x,s)] + E_q[log q(x|s)]
     = -log p(s) + D_KL[q(x|s) || p(x|s)]

Where:

  • x = hidden states
  • s = sensory observations
  • q(x|s) = approximate posterior (beliefs)
  • p(x|s) = true posterior

Key Properties:

  1. F ≥ -log p(s) with equality when q = p
  2. Minimizing F ⟺ maximizing evidence p(s)
  3. F decomposes into energy and entropy
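The decomposition above can be checked directly for a discrete model; a sketch (Python, hypothetical 3-state latent variable and a single fixed observation s):

```python
import numpy as np

# Hypothetical generative model: 3 hidden states x, a single fixed observation s
p_x = np.array([0.5, 0.3, 0.2])            # prior p(x)
p_s_given_x = np.array([0.9, 0.2, 0.4])    # likelihood p(s | x) for the observed s

p_xs = p_x * p_s_given_x                   # joint p(x, s)
p_s = p_xs.sum()                           # evidence p(s)
posterior = p_xs / p_s                     # true posterior p(x | s)

q = np.array([0.6, 0.3, 0.1])              # approximate posterior (beliefs)

F = np.sum(q * -np.log(p_xs)) - (-np.sum(q * np.log(q)))   # E_q[-log p(x,s)] - H[q]
kl = np.sum(q * np.log(q / posterior))

print(F, -np.log(p_s) + kl)                # identical: F = -log p(s) + D_KL
print(F >= -np.log(p_s))                   # True: F bounds surprise from above
```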

6.3 Free Energy Principle

Biological systems minimize variational free energy:

dF/dt ≤ 0

Mechanisms:

  1. Perception: Update beliefs q to minimize F (∂F/∂q)
  2. Action: Change sensory input s to minimize F (∂F/∂s)

Connection to Thermodynamics:

  • Variational free energy ↔ Helmholtz free energy
  • Minimizing surprise ↔ Resisting disorder
  • Living systems are non-equilibrium steady states

6.4 Active Inference

Expected free energy for policy π:

G[π] = D_KL[q(s|π) || p(s)]          (risk: predicted vs. preferred outcomes)
     + E_q(x|π)[ H[p(s|x)] ]          (ambiguity: expected observation uncertainty)

Equivalent decomposition:

G[π] = -E_q[ D_KL[q(x|s,π) || q(x|π)] ]   (epistemic value: expected information gain)
       - E_q[ log p(s) ]                   (pragmatic value: expected log-preference)

Interpretation:

  • Pragmatic: Achieve preferred outcomes
  • Epistemic: Resolve uncertainty about the world

7. Energy-Based Models: Physical Interpretation

7.1 Boltzmann Machines

Probability distribution over binary variables s_i ∈ {0,1}:

p(s) = (1/Z) exp(-E(s)/T)

Energy function:

E(s) = -Σ_ij W_ij s_i s_j - Σ_i b_i s_i

Physical interpretation:

  • W_ij = coupling strength (interaction energy)
  • b_i = external field (bias)
  • T = temperature (controls randomness)
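For a handful of units the distribution can be enumerated exactly; a sketch (Python, hypothetical 3-unit network, using the energy convention above with symmetric W so each pair is counted twice):

```python
import itertools
import numpy as np

# Hypothetical 3-unit Boltzmann machine
W = np.array([[ 0.0, 1.0, -0.5],
              [ 1.0, 0.0,  0.8],
              [-0.5, 0.8,  0.0]])          # symmetric couplings, zero diagonal
b = np.array([0.1, -0.2, 0.0])             # biases
T = 1.0

def energy(s):
    s = np.asarray(s, dtype=float)
    return -(s @ W @ s) - b @ s            # -Σ_ij W_ij s_i s_j - Σ_i b_i s_i

states = list(itertools.product([0, 1], repeat=3))
E = np.array([energy(s) for s in states])
p = np.exp(-E / T) / np.sum(np.exp(-E / T))   # p(s) = exp(-E(s)/T) / Z

for s, prob in zip(states, p):
    print(s, round(prob, 3))               # low-energy configurations dominate
```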

7.2 Hopfield Networks

Symmetric weights, energy function:

E = -(1/2) Σ_ij W_ij s_i s_j - Σ_i b_i s_i

Dynamics (asynchronous update):

s_i(t+1) = sign(Σ_j W_ij s_j(t) + b_i)    (spins s_i ∈ {-1, +1})

Energy decreases (or stays constant) with each update:

ΔE = E(t+1) - E(t) ≤ 0

Attractor dynamics: System settles to local energy minima (memories).
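A short sketch (Python, ±1 spins, random hypothetical couplings) of asynchronous updates, checking that the energy is non-increasing along the trajectory:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20
W = rng.normal(size=(n, n))
W = (W + W.T) / 2                          # symmetric couplings
np.fill_diagonal(W, 0.0)                   # no self-connections
b = np.zeros(n)

def energy(s):
    return -0.5 * s @ W @ s - b @ s

s = rng.choice([-1.0, 1.0], size=n)
energies = [energy(s)]
for _ in range(200):
    i = rng.integers(n)                    # asynchronous: update one random unit
    s[i] = 1.0 if W[i] @ s + b[i] >= 0 else -1.0
    energies.append(energy(s))

# Energy is monotonically non-increasing along the trajectory
print(all(e1 >= e2 - 1e-12 for e1, e2 in zip(energies, energies[1:])))
```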

7.3 Equilibrium Propagation

Free phase:

τ ds/dt = -∂E(s,y)/∂s

Settles to equilibrium s* where ∂E/∂s = 0.

Nudged phase:

τ ds/dt = -∂E(s,y)/∂s - β(y - y_target)

Gently pushes toward target.

Learning rule:

dW/dt ∝ ⟨s_i s_j⟩_nudged - ⟨s_i s_j⟩_free

Physical interpretation:

  • Free phase: Thermodynamic equilibration
  • Nudged phase: Perturbed equilibrium
  • Learning: Adjust weights to make nudge smaller
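The free/nudged/update cycle can be shown in a deliberately minimal scalar toy model (Python). This assumes a quadratic energy E(s; w) = ½s² - w·x·s with a single free unit and a quadratic nudge toward the target; it illustrates the mechanics of the learning rule, not the full network formulation of Scellier & Bengio:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy model: clamped input x, one free unit s, energy E(s; w) = ½ s² - w·x·s,
# so the free-phase equilibrium (∂E/∂s = 0) is s* = w·x.
x, target = 1.5, 2.0
w = float(rng.normal())                    # single weight (hypothetical)
beta, lr = 0.01, 0.1

def free_equilibrium(w):
    return w * x                           # argmin_s E(s; w)

def nudged_equilibrium(w):
    # argmin_s [ E(s; w) + β · ½ (s - target)² ]
    return (w * x + beta * target) / (1 + beta)

for _ in range(50):
    s_free = free_equilibrium(w)
    s_nudged = nudged_equilibrium(w)
    # dW ∝ (1/β) (⟨s_i s_j⟩_nudged - ⟨s_i s_j⟩_free); here the "correlation"
    # is x·s because ∂E/∂w = -x·s.
    w += lr * (x * s_nudged - x * s_free) / beta

print(free_equilibrium(w), target)         # free-phase output approaches the target
```

For small β the weight update approximates gradient descent on the output error, which is the essence of the method.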

7.4 Connection to Contrastive Divergence

Gradient of log-likelihood for Boltzmann machine:

∂log p(s_data)/∂W_ij = ⟨s_i s_j⟩_data - ⟨s_i s_j⟩_model

Positive phase: ⟨·⟩_data is evaluated on observed data.
Negative phase: ⟨·⟩_model is evaluated on samples from the model's equilibrium distribution.

Equilibrium propagation is a continuous-time, deterministic version of this two-phase contrast.


8. Thermodynamic Bounds on Computation

8.1 Landauer Bound

Already derived: Erasing n bits dissipates at least:

Q ≥ n × kT ln 2

8.2 Margolus-Levitin Bound

Maximum speed of computation (orthogonal quantum states):

τ ≥ πℏ / (2E)

Where E is the system's mean energy above its ground state.

Interpretation: Fundamental tradeoff between speed and energy. More energy → faster computation.

8.3 Bekenstein Bound

Maximum information in region of space:

I ≤ 2πRE / (ℏc ln 2)

Where R is radius, E is energy.

For a region bounded by a sphere of surface area A:

I ≤ A c³ / (4 ℏ G ln 2) = A / (4 L_P² ln 2)

Where A is the surface area and L_P = √(ℏG/c³) is the Planck length.

Interpretation: Holographic bound—information scales with area, not volume.

8.4 Lloyd's Bound

Ultimate speed of computation:

Operations/sec ≤ 2E / (πℏ) ≈ 5 × 10⁵⁰ × (m / 1 kg)

Example: 1 kg of matter (E = mc² ≈ 9 × 10¹⁶ J) can perform at most roughly 5 × 10⁵⁰ ops/sec.
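The bounds in this section can be evaluated for concrete inputs; a rough calculator sketch (Python, SI units, illustrative values for T, E, and R):

```python
import math

k, hbar, c, G = 1.380649e-23, 1.054572e-34, 2.998e8, 6.674e-11   # SI constants

T = 300.0        # K
E = 9e16         # J  (roughly mc² for m = 1 kg)
R = 0.1          # m  (illustrative region radius)

print(k * T * math.log(2))                               # Landauer: J per erased bit
print(math.pi * hbar / (2 * E))                          # Margolus-Levitin: s per operation
print(2 * math.pi * R * E / (hbar * c * math.log(2)))    # Bekenstein: bits in region
print(2 * E / (math.pi * hbar))                          # Lloyd: ops per second
```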

8.5 Synthesis: Multi-Dimensional Limits

Computation is bounded by:

Resource                  Bound                  Limiting constant
Energy per bit erased     Q ≥ kT ln 2            Boltzmann constant k
Speed vs. energy          τ ≥ πℏ/(2E)            Planck constant ℏ
Bits per unit energy      I ≤ E/(kT ln 2)        kT ln 2
Operations per second     Ṅ ≤ 2E/(πℏ)            Planck constant ℏ
Information per region    I ≤ A/(4 L_P² ln 2)    Planck area L_P²

Key Insight: All fundamental limits trace back to h, k, c, G—the fundamental constants of physics.


9. Thermodynamic Cost of Learning

9.1 Information-Theoretic View

Learning: Extracting model θ from data D.

Information gained:

I(D; θ) = H[θ] - H[θ|D]

Minimum thermodynamic cost:

Q ≥ kT × I(D; θ)

Interpretation: Must dissipate heat proportional to information extracted from data.

9.2 PAC Learning Bounds

Probably Approximately Correct (PAC) learning requires:

m = O( (1/ε²) × [d + log(1/δ)] )

samples (agnostic case), where d = VC dimension.

Thermodynamic cost:

Q ≥ kT × m × (log |X| + log |Y|)

Implication: Harder learning problems (larger d, smaller ε) have higher energy cost.

9.3 Generalization and Thermodynamics

Hypothesis: Thermodynamic cost of learning is related to generalization gap.

Intuition:

  • Memorization: High mutual information I(D; θ)
  • Generalization: Low mutual information (compressed representation)

Possible bound:

Generalization gap ∝ I(D; θ) / |D|

Thermodynamic consequence:

  • Overparameterized models: High I(D; θ) → High energy cost
  • Regularized models: Low I(D; θ) → Low energy cost

Prediction: Energy-efficient learning favors generalizable models.


10. Mathematical Toolbox

10.1 Useful Inequalities

Jensen's Inequality: For convex function f:

f(E[X]) ≤ E[f(X)]

Gibbs Inequality: D_KL[p||q] ≥ 0

Log-Sum Inequality:

Σ a_i log(a_i/b_i) ≥ (Σ a_i) log[(Σ a_i)/(Σ b_i)]

10.2 Variational Principles

ELBO (Evidence Lower Bound):

log p(x) ≥ E_q[log p(x,z)] - E_q[log q(z)]
         = -F[q]

Variational inference: Maximize ELBO ⟺ Minimize free energy.

10.3 Calculus of Variations

To minimize functional F[q]:

δF/δq = 0    (subject to the normalization constraint Σ_x q(x) = 1)

Example: Find q that minimizes F = E_q[E] - TS[q]:

q(x) = (1/Z) exp(-E(x)/T)  (Boltzmann distribution)

11. Summary: Key Equations

Fundamental Constants

k = 1.381 × 10⁻²³ J/K      (Boltzmann)
ℏ = 1.055 × 10⁻³⁴ J·s      (Planck)
c = 3 × 10⁸ m/s             (Speed of light)

Thermodynamic Relations

F = U - TS                  (Helmholtz free energy)
dF = -SdT - PdV             (Fundamental relation)
S = -k Σ p_i ln p_i         (Entropy)
p_i = (1/Z) exp(-E_i/kT)    (Boltzmann distribution)

Information Theory

H[p] = -Σ p(x) log p(x)                    (Shannon entropy)
I(X;Y) = H[X] - H[X|Y]                     (Mutual information)
D_KL[q||p] = Σ q(x) log[q(x)/p(x)]         (KL divergence)

Landauer and Computation

E_erase ≥ kT ln 2           (Landauer bound)
τ_min ≥ πℏ/(2E)            (Margolus-Levitin)
I_max ≤ 2πRE/(ℏc ln 2)     (Bekenstein)

Learning Bounds

E_learn ≥ kT × I(D; θ)     (Information cost)
F[q] = E_q[E] - TS          (Variational free energy)

12. Further Reading

Classical Thermodynamics:

  • Callen, Thermodynamics and an Introduction to Thermostatistics
  • Chandler, Introduction to Modern Statistical Mechanics

Information Theory:

  • Cover & Thomas, Elements of Information Theory
  • MacKay, Information Theory, Inference, and Learning Algorithms

Information Thermodynamics:

  • Sagawa & Ueda, "Minimal energy cost for thermodynamic information processing"
  • Parrondo et al., "Thermodynamics of information," Nature Physics (2015)

Free Energy Principle:

  • Friston, "The free-energy principle: a unified brain theory?" (2010)
  • Parr, Pezzulo, Friston, Active Inference: The Free Energy Principle in Mind, Brain, and Behavior (MIT Press, 2022)

Energy-Based Learning:

  • Scellier & Bengio, "Equilibrium Propagation" (2017)
  • Hinton, "Training Products of Experts by Minimizing Contrastive Divergence" (2002)

Status: Comprehensive mathematical foundation for thermodynamic learning
Last Updated: December 2025
Prerequisites: Statistical mechanics, information theory, calculus
Next: Apply these principles to implement Landauer-optimal learning systems