Physics Foundations of Thermodynamic Learning
Mathematical Foundations and Physical Principles
Table of Contents
- Statistical Mechanics Primer
- Information Theory and Physics
- Landauer's Principle: Detailed Derivation
- Non-Equilibrium Thermodynamics
- Stochastic Thermodynamics
- Free Energy and Variational Inference
- Energy-Based Models: Physical Interpretation
- Thermodynamic Bounds on Computation
1. Statistical Mechanics Primer
1.1 Microcanonical Ensemble
For an isolated system with energy E:
Ω(E) = number of microstates with energy E
S = k ln Ω(E) (Boltzmann entropy)
Physical Meaning: Entropy measures the logarithm of accessible microstates.
1.2 Canonical Ensemble
For a system in thermal contact with reservoir at temperature T:
p(E_i) = (1/Z) exp(-E_i / kT)
Z = Σ_i exp(-E_i / kT) (partition function)
Thermodynamic quantities:
Free Energy: F = -kT ln Z = ⟨E⟩ - TS
Entropy: S = -k Σ_i p_i ln p_i = -k⟨ln p⟩
Average E: ⟨E⟩ = Σ_i p_i E_i
Heat Capacity: C = d⟨E⟩/dT
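These quantities are straightforward to evaluate numerically. A minimal NumPy sketch (illustrative, not from the source) for a discrete spectrum, using the fluctuation formula C = Var(E)/(kT²) for the heat capacity:

```python
import numpy as np

k = 1.380649e-23  # Boltzmann constant, J/K

def canonical_quantities(energies, T):
    """Z, F, S, <E>, C for a discrete energy spectrum at temperature T."""
    beta = 1.0 / (k * T)
    shifted = energies - energies.min()      # shift for numerical stability
    w = np.exp(-beta * shifted)
    p = w / w.sum()                          # Boltzmann probabilities
    E_avg = np.sum(p * energies)
    S = -k * np.sum(p * np.log(p))           # Gibbs entropy
    F = E_avg - T * S                        # Helmholtz free energy
    C = (np.sum(p * energies**2) - E_avg**2) / (k * T**2)  # heat capacity
    return F, S, E_avg, C

# Two-level system with a gap of 1e-21 J at room temperature.
print(canonical_quantities(np.array([0.0, 1e-21]), 300.0))
```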
1.3 Boltzmann Distribution
The probability of state with energy E at temperature T:
p(E) ∝ exp(-E/kT) = exp(-βE)
where β = 1/(kT) is the inverse temperature (coldness).
Key Insight: Physical systems naturally sample from probability distributions weighted by exp(-energy).
1.4 Fluctuation-Dissipation Theorem
Thermal fluctuations and dissipation are related:
⟨δv(t) δv(0)⟩ = (kT/m) exp(-γt/m) (velocity autocorrelation of a Brownian particle)
Implication: A system cannot be low-noise without dissipation; at temperature T, thermal noise is fundamental and its magnitude is tied to the friction γ.
2. Information Theory and Physics
2.1 Shannon Entropy
For discrete probability distribution p(x):
H[p] = -Σ_x p(x) log₂ p(x) (bits)
= -k Σ_x p(x) ln p(x) (thermodynamic units)
Connection to Thermodynamics: Shannon entropy has same mathematical form as Boltzmann/Gibbs entropy.
2.2 Mutual Information
Information shared between variables X and Y:
I(X; Y) = H[X] + H[Y] - H[X,Y]
= Σ p(x,y) log[p(x,y) / (p(x)p(y))]
Physical Meaning: Mutual information quantifies correlations—how much knowing X tells you about Y.
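A small sketch computing I(X; Y) directly from a joint probability table (the function name and the example table are illustrative):

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in bits from a joint probability table p_xy[i, j]."""
    p_x = p_xy.sum(axis=1, keepdims=True)    # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)    # marginal p(y)
    mask = p_xy > 0                           # skip zero-probability cells
    return np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask]))

# Perfectly correlated bits share exactly 1 bit.
p = np.array([[0.5, 0.0],
              [0.0, 0.5]])
print(mutual_information(p))  # -> 1.0
```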
2.3 Kullback-Leibler Divergence
"Distance" from distribution q to distribution p:
D_KL[q || p] = Σ q(x) log[q(x)/p(x)]
= ⟨log q - log p⟩_q
Properties:
- Always non-negative: D_KL ≥ 0
- Zero iff q = p almost everywhere
- Not symmetric: D_KL[q||p] ≠ D_KL[p||q]
Physical Interpretation: Excess entropy when using wrong distribution.
2.4 Relative Entropy and Free Energy
For canonical ensemble:
D_KL[q || p_β] = Σ q(x) ln q(x) - Σ q(x) ln[exp(-βE(x))/Z]
= -S[q]/k + β⟨E⟩_q + ln Z
= β(F[q] - F_eq)
Key Insight: The KL divergence from q to the Boltzmann distribution equals the excess free energy F[q] - F_eq in units of kT, where F_eq = -kT ln Z is the equilibrium free energy.
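A quick numerical check of this identity (an illustrative sketch in natural units kT = 1, with an arbitrary 8-level spectrum and a random comparison distribution q):

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.uniform(0.0, 2.0, size=8)       # arbitrary energy levels (kT = 1)
beta = 1.0

p_eq = np.exp(-beta * E)
Z = p_eq.sum()
p_eq /= Z                               # Boltzmann distribution

q = rng.dirichlet(np.ones(8))           # arbitrary normalized distribution

D_kl = np.sum(q * np.log(q / p_eq))
F_q  = np.sum(q * E) + (1 / beta) * np.sum(q * np.log(q))  # ⟨E⟩_q - T·S[q]
F_eq = -np.log(Z) / beta                                   # -kT ln Z
print(D_kl, beta * (F_q - F_eq))        # the two values coincide
```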
3. Landauer's Principle: Detailed Derivation
3.1 Setup: Bit Erasure
Consider a 1-bit memory:
- Initial state: Unknown (0 or 1 with probabilities p₀, p₁)
- Final state: Known (forced to 0)
3.2 Information-Theoretic Analysis
Initial entropy:
S_initial = -k[p₀ ln p₀ + p₁ ln p₁]
Final entropy:
S_final = 0 (definite state)
Change in information:
ΔI = S_initial - S_final = -k[p₀ ln p₀ + p₁ ln p₁]
For maximum erasure (p₀ = p₁ = 1/2):
ΔI = k ln 2
3.3 Thermodynamic Analysis
Second Law: Total entropy (system + environment) cannot decrease:
ΔS_total = ΔS_system + ΔS_environment ≥ 0
For isothermal process:
ΔS_environment = Q/T
where Q is heat dissipated to environment.
Combining:
ΔS_system + Q/T ≥ 0
-k ln 2 + Q/T ≥ 0
Q ≥ kT ln 2
Landauer's Principle: Erasing 1 bit of information requires dissipating at least kT ln 2 of heat.
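Plugging in numbers at room temperature:

```python
import math

k = 1.380649e-23            # Boltzmann constant, J/K
T = 300.0                   # room temperature, K
print(k * T * math.log(2))  # ≈ 2.87e-21 J: minimum heat to erase one bit
```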
3.4 Physical Implementation: Szilard Engine
1-Molecule Gas Engine:
- Single molecule in box (unknown side)
- Insert partition (no information yet about which side)
- Measure which side (gain 1 bit)
- Attach piston to occupied side
- Extract work kT ln 2 via isothermal expansion
- Remove partition
- Erase measurement record → Dissipate kT ln 2
Cycle: Extract work using information, pay thermodynamic cost to erase memory.
3.5 Generalization: Arbitrary Distribution
For erasing memory in state with probability distribution p(x):
Q ≥ kT × H[p] = -kT Σ p(x) ln p(x)
More uncertain initial state → More heat dissipated.
4. Non-Equilibrium Thermodynamics
4.1 Entropy Production
For a system driven out of equilibrium:
dS/dt = d_iS/dt + d_eS/dt
- d_iS/dt = internal entropy production (≥ 0)
- d_eS/dt = entropy flow from environment (can be negative)
Second Law: d_iS/dt ≥ 0 always.
4.2 Jarzynski Equality
For a system driven from equilibrium at λ=0 to λ=1:
⟨exp(-βW)⟩ = exp(-βΔF)
Where:
- W = work performed on system
- ΔF = free energy difference
- ⟨⟩ = average over many realizations
Implication: Can extract equilibrium free energy from non-equilibrium processes.
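The equality can be verified in simulation. A hedged sketch: an overdamped particle in a harmonic trap dragged at constant speed, a standard exactly-solvable example in which pure translation leaves the partition function unchanged, so ΔF = 0; all parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
kT, gamma, kappa = 1.0, 1.0, 1.0                  # natural units
v, dt, steps, ntraj = 1.0, 1e-3, 2000, 20000      # drag speed, timestep, ...

x = rng.normal(0.0, np.sqrt(kT / kappa), ntraj)   # equilibrium initial states
W = np.zeros(ntraj)
lam = 0.0                                          # trap center λ(t)
for _ in range(steps):
    W += -kappa * (x - lam) * v * dt               # dW = (∂H/∂λ) dλ
    noise = rng.normal(0.0, np.sqrt(2 * kT * dt / gamma), ntraj)
    x += -(kappa / gamma) * (x - lam) * dt + noise # Euler-Maruyama step
    lam += v * dt

print(np.mean(W))                  # > 0: work is dissipated on average
print(np.mean(np.exp(-W / kT)))    # ≈ 1 = exp(-βΔF), despite ⟨W⟩ > 0
```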
4.3 Crooks Fluctuation Theorem
Ratio of forward to reverse process probabilities:
P(W_forward) / P(-W_reverse) = exp(β(W - ΔF))
Special case: multiplying by exp(-βW) and integrating over W recovers the Jarzynski equality.
4.4 Entropy Production Rate
For a driven system, the entropy production rate is a sum of flux-force products:
σ = Σ_i J_i X_i ≥ 0
Where:
- J_i = thermodynamic flux (current)
- X_i = conjugate thermodynamic force (gradient)
Examples:
- Heat conduction: J = heat current, X = ∇(1/T)
- Particle transport: J = particle current, X = -∇(μ/T)
- Chemical reactions: J = reaction rate, X = A/T, with affinity A = -ΔG
5. Stochastic Thermodynamics
5.1 Langevin Equation
For a particle in potential V(x) with friction γ and thermal noise:
m(d²x/dt²) = -γ(dx/dt) - dV/dx + ξ(t)
Where noise satisfies:
⟨ξ(t)⟩ = 0
⟨ξ(t)ξ(t')⟩ = 2γkT δ(t-t') (fluctuation-dissipation)
Overdamped limit (low inertia):
γ(dx/dt) = -dV/dx + ξ(t)
dx/dt = -(1/γ)dV/dx + √(2D) η(t)
where D = kT/γ (Einstein relation).
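A minimal Euler-Maruyama simulation of the overdamped dynamics, checking that the ensemble relaxes to the Boltzmann distribution (the double-well potential and all parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
kT, gamma, dt = 1.0, 1.0, 1e-3
V  = lambda x: (x**2 - 1.0)**2            # double-well potential
dV = lambda x: 4.0 * x * (x**2 - 1.0)

x = np.zeros(5000)                         # ensemble of walkers
for _ in range(20000):                     # overdamped Langevin updates
    x += -(dV(x) / gamma) * dt \
         + rng.normal(0.0, np.sqrt(2 * kT * dt / gamma), x.size)

# The empirical histogram should match p(x) ∝ exp(-V(x)/kT).
hist, edges = np.histogram(x, bins=50, range=(-2.5, 2.5), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
boltz = np.exp(-V(centers) / kT)
boltz /= boltz.sum() * (centers[1] - centers[0])   # normalize on the grid
print(np.max(np.abs(hist - boltz)))                # small residual
```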
5.2 Fokker-Planck Equation
Evolution of the probability distribution p(x,t):
∂p/∂t = -∂/∂x[v(x)p] + D ∂²p/∂x²
- First term: deterministic drift, with v(x) = -(1/γ)dV/dx
- Second term: diffusion
Steady state: setting ∂p/∂t = 0 recovers the Boltzmann distribution p(x) ∝ exp(-V(x)/kT).
5.3 Stochastic Entropy Production
Along a single trajectory:
Δs_tot = Δs_system + Δs_environment
= ln[p(x_initial)/p(x_final)] + βQ
Average: ⟨Δs_tot⟩ ≥ 0 (second law)
5.4 Information-Theoretic Formulation
For feedback control (Maxwell's demon), measurement information relaxes the usual bound:
⟨Δs_system⟩ + ⟨Δs_environment⟩ ≥ -I
Where I = mutual information between system and controller.
Sagawa-Ueda Generalized Second Law:
⟨W⟩ ≥ ΔF - kT × I
Can extract up to kT×I extra work using information.
6. Free Energy and Variational Inference
6.1 Helmholtz Free Energy
For system at temperature T:
F = ⟨E⟩ - TS = U - TS
Equilibrium condition: F is minimized.
Physical meaning:
- U = ⟨E⟩ = average energy (favors low energy states)
- -TS = entropy contribution (favors high entropy)
- F balances energy and entropy
6.2 Variational Free Energy (Friston)
For generative model p(x,s) and observations s:
F[q] = E_q[E(x,s)] - H[q(x|s)]
= -E_q[log p(x,s)] + E_q[log q(x|s)]
= -log p(s) + D_KL[q(x|s) || p(x|s)]
Where:
- x = hidden states
- s = sensory observations
- q(x|s) = approximate posterior (beliefs)
- p(x|s) = true posterior
Key Properties:
- F ≥ -log p(s) with equality when q = p
- Minimizing F ⟺ maximizing evidence p(s)
- F decomposes into energy and entropy
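These properties can be checked exactly on a tiny discrete model (a sketch; the prior and likelihood values are arbitrary illustrations):

```python
import numpy as np

# Tiny generative model: hidden x ∈ {0,1}, binary observation s = 1 observed.
p_x = np.array([0.7, 0.3])                 # prior p(x)
p_s_given_x = np.array([0.9, 0.2])         # likelihood p(s=1 | x)

p_joint = p_x * p_s_given_x                # p(x, s=1)
p_s = p_joint.sum()                        # evidence p(s=1)
posterior = p_joint / p_s                  # true posterior p(x | s=1)

def free_energy(q):
    """F[q] = E_q[log q - log p(x,s)]."""
    return np.sum(q * (np.log(q) - np.log(p_joint)))

q = np.array([0.5, 0.5])                   # arbitrary beliefs
print(free_energy(q), -np.log(p_s))        # F[q] > -log p(s)
print(free_energy(posterior))              # equals -log p(s): bound is tight
```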
6.3 Free Energy Principle
Biological systems minimize variational free energy:
dF/dt ≤ 0
Mechanisms:
- Perception: Update beliefs q to minimize F (∂F/∂q)
- Action: Change sensory input s to minimize F (∂F/∂s)
Connection to Thermodynamics:
- Variational free energy ↔ Helmholtz free energy
- Minimizing surprise ↔ Resisting disorder
- Living systems are non-equilibrium steady states
6.4 Active Inference
Expected free energy for policy π:
G[π] = D_KL[q(s|π) || p(s)] + E_q[H[p(s|x)]]
Decomposition:
G = Risk + Ambiguity
- Risk (pragmatic term): divergence between predicted outcomes q(s|π) and preferred outcomes p(s)
- Ambiguity (epistemic term): expected uncertainty in the mapping from hidden states to outcomes
Interpretation:
- Pragmatic: Achieve preferred outcomes
- Epistemic: Resolve uncertainty about the world
7. Energy-Based Models: Physical Interpretation
7.1 Boltzmann Machines
Probability distribution over binary variables s_i ∈ {0,1}:
p(s) = (1/Z) exp(-E(s)/T)
Energy function:
E(s) = -Σ_ij W_ij s_i s_j - Σ_i b_i s_i
Physical interpretation:
- W_ij = coupling strength (interaction energy)
- b_i = external field (bias)
- T = temperature (controls randomness)
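For a handful of units the distribution can be computed exactly by enumeration. A sketch (using the symmetric-weight convention with a factor of 1/2, equivalent up to a rescaling of W; weights and biases are random illustrations):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, T = 4, 1.0
W = rng.normal(0, 0.5, (n, n))
W = (W + W.T) / 2                          # symmetric couplings
np.fill_diagonal(W, 0)                     # no self-connections
b = rng.normal(0, 0.5, n)

def energy(s):
    return -0.5 * s @ W @ s - b @ s

states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
E = np.array([energy(s) for s in states])
p = np.exp(-E / T)
p /= p.sum()                               # exact Boltzmann distribution
print(states[np.argmax(p)], p.max())       # most probable configuration
```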
7.2 Hopfield Networks
For spins s_i ∈ {-1,+1} with symmetric weights (W_ij = W_ji, W_ii = 0), the energy function is:
E = -(1/2) Σ_ij W_ij s_i s_j - Σ_i b_i s_i
Dynamics (asynchronous update):
s_i(t+1) = sign(Σ_j W_ij s_j(t) + b_i)
Energy decreases (or stays constant) with each update:
ΔE = E(t+1) - E(t) ≤ 0
Attractor dynamics: System settles to local energy minima (memories).
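A compact demonstration of attractor dynamics: store one pattern with a Hebbian outer-product rule, corrupt it, and recall it, asserting that the energy never increases (biases are omitted, and all sizes and seeds are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
pattern = rng.choice([-1, 1], size=16)           # one stored memory, spins ±1
W = np.outer(pattern, pattern).astype(float)     # Hebbian weights
np.fill_diagonal(W, 0)

def energy(s):
    return -0.5 * s @ W @ s

s = pattern.copy()
s[:4] *= -1                                      # corrupt 4 of 16 spins
for _ in range(3):                               # asynchronous sweeps
    for i in rng.permutation(16):
        e_before = energy(s)
        s[i] = 1 if W[i] @ s >= 0 else -1        # sign update
        assert energy(s) <= e_before             # energy never increases
print(np.array_equal(s, pattern))                # True: memory recovered
```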
7.3 Equilibrium Propagation
Free phase:
τ ds/dt = -∂E(s,y)/∂s
Settles to equilibrium s* where ∂E/∂s = 0.
Nudged phase (applied to output units y):
τ dy/dt = -∂E(s,y)/∂y - β(y - y_target)
A weak nudge (small β) gently pushes the outputs toward the target.
Learning rule:
dW/dt ∝ ⟨s_i s_j⟩_nudged - ⟨s_i s_j⟩_free
Physical interpretation:
- Free phase: Thermodynamic equilibration
- Nudged phase: Perturbed equilibrium
- Learning: Adjust weights to make nudge smaller
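A toy sketch of the two-phase procedure on a single unit with quadratic energy E(s) = ½s² - Wxs and cost C = ½(s - y_target)²; this is a deliberately minimal illustration, not the full network of Scellier & Bengio:

```python
import numpy as np

W, lr, beta = 0.1, 0.5, 0.01
x, y_target = 1.0, 2.0

def settle(W, x, beta, y_t, steps=2000, dt=0.1):
    """Relax s under ds/dt = -∂E/∂s - β ∂C/∂s to its fixed point."""
    s = 0.0
    for _ in range(steps):
        s += dt * (-(s - W * x) - beta * (s - y_t))
    return s

for epoch in range(50):
    s_free  = settle(W, x, 0.0, y_target)    # free phase
    s_nudge = settle(W, x, beta, y_target)   # weakly nudged phase
    # Contrastive update: (1/β)[∂E/∂W(free) - ∂E/∂W(nudged)], with ∂E/∂W = -xs
    W += lr * (x * s_nudge - x * s_free) / beta
print(W * x)   # ≈ y_target: the free equilibrium now matches the target
```

As β → 0, the contrastive update converges to exact gradient descent on the cost evaluated at the free equilibrium.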
7.4 Connection to Contrastive Divergence
Gradient of log-likelihood for Boltzmann machine:
∂log p(s_data)/∂W_ij = ⟨s_i s_j⟩_data - ⟨s_i s_j⟩_model
Positive phase: ⟨⟩_data, estimated from observations.
Negative phase: ⟨⟩_model, estimated by sampling the model at equilibrium.
Equilibrium propagation is a continuous-time, deterministic version of this procedure.
8. Thermodynamic Bounds on Computation
8.1 Landauer Bound
Already derived: Erasing n bits dissipates at least:
Q ≥ n × kT ln 2
8.2 Margolus-Levitin Bound
Maximum speed of computation (orthogonal quantum states):
τ ≥ πℏ / (2E)
Where E is the system's average energy above its ground state.
Interpretation: Fundamental tradeoff between speed and energy. More energy → faster computation.
8.3 Bekenstein Bound
Maximum information in region of space:
I ≤ 2πRE / (ℏc ln 2)
Where R is radius, E is energy.
For the maximal case, a black hole (substituting R = 2GM/c² and E = Mc²):
I ≤ A/(4 L_P² ln 2) bits, equivalently S ≤ kA/(4 L_P²)
Where A is the surface area and L_P = √(ℏG/c³) is the Planck length.
Interpretation: Holographic bound—information scales with area, not volume.
8.4 Lloyd's Bound
Ultimate speed of computation for a system of energy E (from the Margolus-Levitin bound):
Operations/sec ≤ 2E/(πℏ)
Example: 1 kg of matter (E = mc² ≈ 9 × 10¹⁶ J) supports at most ≈ 5 × 10⁵⁰ ops/sec.
8.5 Synthesis: Multi-Dimensional Limits
Computation is bounded by:
| Resource | Bound | Limiting Constant |
|---|---|---|
| Energy per bit erased | Q ≥ kT ln 2 | Boltzmann constant k |
| Time per operation | τ ≥ πℏ/(2E) | Planck constant ℏ |
| Bits erased per energy | n ≤ E/(kT ln 2) | kT ln 2 |
| Operations per second | rate ≤ 2E/(πℏ) | ℏ |
| Information in a region | I ≤ A/(4 L_P² ln 2) | Planck area L_P² |
Key Insight: All fundamental limits trace back to ℏ, k, c, and G, the fundamental constants of physics.
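A tiny calculator for the first rows of this table (constants rounded; the function name is illustrative):

```python
import math

hbar, k = 1.055e-34, 1.381e-23   # J·s, J/K

def fundamental_limits(E_joules, T_kelvin):
    """Evaluate the table's bounds for a given energy budget and temperature."""
    return {
        "max bits erasable":   E_joules / (k * T_kelvin * math.log(2)),
        "min time per op (s)": math.pi * hbar / (2 * E_joules),
        "max ops per second":  2 * E_joules / (math.pi * hbar),
    }

# One joule at room temperature:
print(fundamental_limits(1.0, 300.0))
```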
9. Thermodynamic Cost of Learning
9.1 Information-Theoretic View
Learning: Extracting model θ from data D.
Information gained:
I(D; θ) = H[θ] - H[θ|D]
Minimum thermodynamic cost:
Q ≥ kT × I(D; θ)
Interpretation: Must dissipate heat proportional to information extracted from data.
9.2 PAC Learning Bounds
Probably Approximately Correct (PAC) learning requires:
m ≥ (1/ε²) × [d log(1/ε) + log(1/δ)]
samples, where d = VC dimension.
Thermodynamic cost:
Q ≥ kT × m × (log |X| + log |Y|)
Implication: Harder learning problems (larger d, smaller ε) have higher energy cost.
9.3 Generalization and Thermodynamics
Hypothesis: Thermodynamic cost of learning is related to generalization gap.
Intuition:
- Memorization: High mutual information I(D; θ)
- Generalization: Low mutual information (compressed representation)
Possible bound:
Generalization gap ∝ I(D; θ) / |D|
Thermodynamic consequence:
- Overparameterized models: High I(D; θ) → High energy cost
- Regularized models: Low I(D; θ) → Low energy cost
Prediction: Energy-efficient learning favors generalizable models.
10. Mathematical Toolbox
10.1 Useful Inequalities
Jensen's Inequality: For convex function f:
f(E[X]) ≤ E[f(X)]
Gibbs Inequality: D_KL[p||q] ≥ 0
Log-Sum Inequality:
Σ a_i log(a_i/b_i) ≥ (Σ a_i) log[(Σ a_i)/(Σ b_i)]
10.2 Variational Principles
ELBO (Evidence Lower Bound):
log p(x) ≥ E_q[log p(x,z)] - E_q[log q(z)]
= -F[q]
Variational inference: Maximize ELBO ⟺ Minimize free energy.
10.3 Calculus of Variations
To minimize functional F[q]:
δF/δq = 0
Example: Find q that minimizes F = E_q[E] - TS[q]:
q(x) = (1/Z) exp(-E(x)/T) (Boltzmann distribution)
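The constrained minimization can be written out explicitly. A short LaTeX sketch using a Lagrange multiplier for normalization (with k = 1, matching the convention here):

```latex
% Minimize F[q] = \sum_x q(x)E(x) + T\sum_x q(x)\ln q(x)
% subject to \sum_x q(x) = 1, via a Lagrange multiplier \lambda:
\mathcal{L}[q] = \sum_x q(x)\,E(x) + T\sum_x q(x)\ln q(x)
               + \lambda\Big(\sum_x q(x) - 1\Big)
% Stationarity, \delta\mathcal{L}/\delta q(x) = 0, gives:
E(x) + T\big(\ln q(x) + 1\big) + \lambda = 0
% Solving for q(x) and fixing \lambda by normalization:
q(x) = \frac{e^{-E(x)/T}}{Z}, \qquad Z = \sum_x e^{-E(x)/T}
```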
11. Summary: Key Equations
Fundamental Constants
k = 1.381 × 10⁻²³ J/K (Boltzmann)
ℏ = 1.055 × 10⁻³⁴ J·s (Planck)
c = 3 × 10⁸ m/s (Speed of light)
Thermodynamic Relations
F = U - TS (Helmholtz free energy)
dF = -SdT - PdV (Fundamental relation)
S = -k Σ p_i ln p_i (Entropy)
p_i = (1/Z) exp(-E_i/kT) (Boltzmann distribution)
Information Theory
H[p] = -Σ p(x) log p(x) (Shannon entropy)
I(X;Y) = H[X] - H[X|Y] (Mutual information)
D_KL[q||p] = Σ q(x) log[q(x)/p(x)] (KL divergence)
Landauer and Computation
E_erase ≥ kT ln 2 (Landauer bound)
τ_min ≥ πℏ/(2E) (Margolus-Levitin)
I_max ≤ 2πRE/(ℏc ln 2) (Bekenstein)
Learning Bounds
E_learn ≥ kT × I(D; θ) (Information cost)
F[q] = E_q[E] - TS (Variational free energy)
12. Further Reading
Classical Thermodynamics:
- Callen, Thermodynamics and an Introduction to Thermostatistics
- Chandler, Introduction to Modern Statistical Mechanics
Information Theory:
- Cover & Thomas, Elements of Information Theory
- MacKay, Information Theory, Inference, and Learning Algorithms
Information Thermodynamics:
- Sagawa & Ueda, "Minimal energy cost for thermodynamic information processing"
- Parrondo et al., "Thermodynamics of information," Nature Physics (2015)
Free Energy Principle:
- Friston, "The free-energy principle: a unified brain theory?" (2010)
- Parr, Pezzulo, Friston, Active Inference: The Free Energy Principle in Mind, Brain, and Behavior (MIT Press, 2022)
Energy-Based Learning:
- Scellier & Bengio, "Equilibrium Propagation" (2017)
- Hinton, "Training Products of Experts by Minimizing Contrastive Divergence" (2002)
Status: Comprehensive mathematical foundation for thermodynamic learning
Last Updated: December 2025
Prerequisites: Statistical mechanics, information theory, calculus
Next: Apply these principles to implement Landauer-optimal learning systems