People usually talk about artificial neural networks in terms of how well they perform. But if you want to be a bit more precise, you can think of them as carefully structured stacks of parameterized functions, all optimized under certain constraints. So instead of focusing on what they do in practice, it's useful to look at what's going on mathematically underneath: function composition, the geometry of learning, invariants, scaling behavior, and how the network responds when its structure is changed in controlled ways.

Neural Networks as Compositions of Functions

At the end of the day, a neural network is just a bunch of functions composed together. A feedforward network looks like this:

$$F(x; \theta) = f_L \circ f_{L-1} \circ \cdots \circ f_1 (x)$$

Each layer has the form:

$$f_i(x) = \sigma_i(W_i x + b_i)$$

where the parameters are

$$\theta = \{(W_i, b_i)\}_{i=1}^L$$

with weight matrices $W_i \in \mathbb{R}^{n_i \times n_{i-1}}$ and nonlinearities $\sigma_i : \mathbb{R}^{n_i} \to \mathbb{R}^{n_i}$.
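To make the composition concrete, here is a minimal sketch in plain Python (no external libraries). The helper names (`make_layer`, `compose`) and the 2 → 3 → 1 weights are made up for illustration; the only thing taken from the text is the layer form $f_i(x) = \sigma_i(W_i x + b_i)$ and the order of composition.

```python
from functools import reduce

def relu(v):
    return [max(0.0, t) for t in v]

def identity(v):
    return list(v)

def affine(W, b, x):
    # W is a list of rows; returns W x + b
    return [sum(w * t for w, t in zip(row, x)) + bi for row, bi in zip(W, b)]

def make_layer(W, b, sigma):
    # One layer: f_i(x) = sigma_i(W_i x + b_i)
    return lambda x: sigma(affine(W, b, x))

def compose(layers):
    # F = f_L o ... o f_1, so f_1 is applied first
    return lambda x: reduce(lambda h, f: f(h), layers, x)

# Hypothetical 2 -> 3 -> 1 network; the weights are arbitrary illustrations
f1 = make_layer([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]], [0.0, -1.0, 0.5], relu)
f2 = make_layer([[1.0, 2.0, -1.0]], [0.1], identity)
F = compose([f1, f2])

output = F([1.0, 2.0])
print(output)
```

Changing the list passed to `compose` changes the architecture, and changing the numbers changes $\theta$: that is the whole parameterization.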

From a math point of view, the whole network defines a set of possible functions:

$$\mathcal{H} = \{F(\cdot; \theta) : \theta \in \Theta\}$$

Then some natural questions pop up:

  • Approximation: When can $\mathcal{H}$ approximate any continuous function on a compact set $K \subset \mathbb{R}^n$? (this is the universal approximation idea)
  • Depth vs. expressivity: What does adding more layers actually buy you? For example, with ReLU networks, the maximum number of linear regions can grow exponentially with depth $L$.
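  • Information flow: Writing $x_0 = x$ and $x_i = f_i(x_{i-1})$, the chain rule gives the Jacobian of the full network as

$$J_F(x) = \prod_{i=L}^{1} Df_i(x_{i-1}) = Df_L(x_{L-1}) \cdots Df_1(x_0)$$

This tells you how sensitive the output is to input changes. If its rank drops, some input directions no longer affect the output, so you're basically losing information about them.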
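The chain-rule product can be checked numerically. The sketch below assumes tanh layers (chosen here only because they are smooth, so finite differences behave well) and arbitrary illustrative weights. It multiplies the per-layer Jacobians, each evaluated at the input to that layer, and compares the result with a central finite-difference estimate.

```python
import math

def matvec(W, x):
    return [sum(w * t for w, t in zip(row, x)) for row in W]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def layer(W, b, x):
    # f_i(x) = tanh(W_i x + b_i)
    return [math.tanh(z + bi) for z, bi in zip(matvec(W, x), b)]

def layer_jacobian(W, b, x):
    # Df_i(x) = diag(1 - tanh(z)^2) W_i, with z = W_i x + b_i
    out = layer(W, b, x)
    return [[(1.0 - o * o) * w for w in row] for o, row in zip(out, W)]

def forward(params, x):
    for W, b in params:
        x = layer(W, b, x)
    return x

def network_jacobian(params, x):
    # Left-multiply each new layer Jacobian, evaluated at that layer's input
    J = None
    for W, b in params:
        D = layer_jacobian(W, b, x)
        J = D if J is None else matmul(D, J)
        x = layer(W, b, x)
    return J

def finite_difference_jacobian(params, x, h=1e-6):
    cols = []
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        fp, fm = forward(params, xp), forward(params, xm)
        cols.append([(p - m) / (2 * h) for p, m in zip(fp, fm)])
    return [list(row) for row in zip(*cols)]  # columns -> rows

# Hypothetical 2 -> 3 -> 2 network
params = [
    ([[0.5, -0.3], [0.8, 0.1], [-0.4, 0.9]], [0.1, -0.2, 0.0]),
    ([[1.0, -1.0, 0.5], [0.2, 0.3, -0.7]], [0.0, 0.1]),
]
x0 = [0.3, -0.6]
J = network_jacobian(params, x0)
J_fd = finite_difference_jacobian(params, x0)
max_err = max(abs(a - b) for ra, rb in zip(J, J_fd) for a, b in zip(ra, rb))
print(max_err)
```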

So yeah, not really black boxes: more like structured objects you can analyze with algebra and calculus.

Learning as a Geometric Process

Training is just an optimization problem:

$$\min_{\theta \in \Theta} \; \mathcal{L}(\theta) = \mathbb{E}_{(x,y) \sim \mathcal{D}} \, \ell(F(x;\theta), y)$$

If you use gradient descent, you get an update rule:

$$\theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}(\theta_t)$$

If you imagine the step size going to zero, this becomes a continuous-time gradient flow:

$$\frac{d\theta}{dt} = -\nabla \mathcal{L}(\theta)$$
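A quick way to see the discrete update approach the continuous flow is a one-dimensional quadratic loss, where the flow has a closed-form solution. The toy loss $\mathcal{L}(\theta) = \tfrac{1}{2}\lambda\theta^2$ and the names below are assumptions made for this sketch; they are not tied to any particular network.

```python
import math

def grad(theta, lam):
    # Gradient of the toy loss L(theta) = 0.5 * lam * theta^2
    return lam * theta

def gradient_descent(theta0, eta, total_time, lam):
    # Run enough steps of size eta to cover the same "time" as the flow
    theta = theta0
    for _ in range(round(total_time / eta)):
        theta = theta - eta * grad(theta, lam)
    return theta

theta0, total_time, lam = 1.0, 1.0, 2.0
# Exact solution of d(theta)/dt = -lam * theta at t = total_time
flow = theta0 * math.exp(-lam * total_time)

errors = []
for eta in (0.1, 0.01, 0.001):
    gd = gradient_descent(theta0, eta, total_time, lam)
    errors.append(abs(gd - flow))
    print(eta, gd, abs(gd - flow))
```

The gap to the flow solution shrinks roughly in proportion to $\eta$, which is what you'd expect from a first-order (Euler) discretization.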

The shape of the loss function is captured by the Hessian:

$$H(\theta) = \nabla^2 \mathcal{L}(\theta)$$

  • If the gradient vanishes at a point and the Hessian there is positive definite, then that point is a strict local minimum of the loss function.
  • If the gradient vanishes and the Hessian is indefinite (it has both positive and negative eigenvalues), then the point is a saddle point.
  • If the Hessian has near-zero eigenvalues, then the loss surface is locally flat in the directions associated with those eigenvalues.

In big (overparameterized) models, you often get:

$$\dim(\ker H(\theta^*)) \gg 0$$

which basically means there are tons of equally good solutions sitting in flat regions.
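Here is a tiny example of that flatness, as a sketch: a made-up "model" with two parameters $a, b$ fitting a single target through $ab = 1$. Every point on the curve $ab = 1$ is a global minimum, and the Hessian there has a zero eigenvalue along the curve. The loss and the classification helper are illustrative assumptions, not a general-purpose tool.

```python
import math

def loss(a, b):
    # Toy overparameterized loss: two parameters, one constraint a*b = 1
    return (a * b - 1.0) ** 2

def hessian(a, b):
    # Analytic second derivatives of (a*b - 1)^2
    h_aa = 2.0 * b * b
    h_bb = 2.0 * a * a
    h_ab = 4.0 * a * b - 2.0
    return [[h_aa, h_ab], [h_ab, h_bb]]

def eigenvalues_sym2(H):
    # Closed-form eigenvalues of a symmetric 2x2 matrix, smallest first
    (p, q), (_, r) = H
    mean = 0.5 * (p + r)
    radius = math.sqrt((0.5 * (p - r)) ** 2 + q * q)
    return mean - radius, mean + radius

def classify(eigs, tol=1e-9):
    # Second-order test; only meaningful at points where the gradient is zero
    lo, hi = eigs
    if lo > tol:
        return "positive definite: strict local minimum"
    if hi < -tol:
        return "negative definite: strict local maximum"
    if lo < -tol and hi > tol:
        return "indefinite: saddle"
    return "degenerate: flat direction, test inconclusive"

# All three points are critical points of the toy loss
for point in [(1.0, 1.0), (2.0, 0.5), (0.0, 0.0)]:
    eigs = eigenvalues_sym2(hessian(*point))
    print(point, eigs, classify(eigs))
```

The two minima on $ab = 1$ come out degenerate (one zero eigenvalue), while the origin is a saddle.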

There's also a neat approximation behind the neural tangent kernel (NTK): linearize the network in its parameters around the initialization $\theta_0$:

$$F(x;\theta) \approx F(x;\theta_0) + \nabla_\theta F(x;\theta_0) \cdot (\theta - \theta_0)$$

This gives you a kernel:

$$K(x,x') = \nabla_\theta F(x;\theta_0) \cdot \nabla_\theta F(x';\theta_0)$$

and in the infinite-width limit this kernel stays essentially fixed during training, so training starts to look like a kernel method with kernel $K$.
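As a sketch of both formulas, take a hypothetical one-hidden-layer scalar network $F(x;\theta) = \sum_j a_j \tanh(w_j x)$ with $\theta = (a, w)$. The architecture, width, and initialization scale are assumptions chosen only to keep the gradients easy to write by hand. The code builds the empirical kernel at initialization and checks that the first-order expansion is accurate for a small parameter change.

```python
import math
import random

def f(x, a, w):
    # F(x; theta) = sum_j a_j * tanh(w_j * x)
    return sum(aj * math.tanh(wj * x) for aj, wj in zip(a, w))

def grad_theta(x, a, w):
    # Gradient with respect to theta = (a_1..a_m, w_1..w_m)
    d_a = [math.tanh(wj * x) for wj in w]
    d_w = [aj * x * (1.0 - math.tanh(wj * x) ** 2) for aj, wj in zip(a, w)]
    return d_a + d_w

def ntk(x1, x2, a, w):
    # K(x, x') = grad F(x; theta_0) . grad F(x'; theta_0)
    g1, g2 = grad_theta(x1, a, w), grad_theta(x2, a, w)
    return sum(u * v for u, v in zip(g1, g2))

rng = random.Random(0)
m = 50  # hidden width (illustrative)
a0 = [rng.gauss(0, 1) / math.sqrt(m) for _ in range(m)]
w0 = [rng.gauss(0, 1) for _ in range(m)]

xs = [-1.0, 0.0, 0.5, 2.0]
K = [[ntk(p, q, a0, w0) for q in xs] for p in xs]

# Compare the network with its linearization after a small parameter step
eps = 1e-3
da = [eps * rng.gauss(0, 1) for _ in range(m)]
dw = [eps * rng.gauss(0, 1) for _ in range(m)]
x = 0.5
exact = f(x, [u + v for u, v in zip(a0, da)], [u + v for u, v in zip(w0, dw)])
linear = f(x, a0, w0) + sum(g * d for g, d in zip(grad_theta(x, a0, w0), da + dw))
lin_err = abs(exact - linear)
print(lin_err)
```

This only shows the kernel at a finite width and a single initialization; the infinite-width statement is a limit that this toy does not verify.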

Invariants in Representation and Training

There are built-in symmetries in neural networks:

  • Permutation symmetry: You can reorder hidden units without changing the function.
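  • Scaling symmetry (ReLU): for any $\alpha > 0$,

$$(W_i, b_i, W_{i+1}) \mapsto (\alpha W_i, \alpha b_i, \alpha^{-1} W_{i+1})$$

doesn't change the output, because $\mathrm{ReLU}(\alpha z) = \alpha \, \mathrm{ReLU}(z)$ when $\alpha > 0$.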

So different parameter values can represent the same function:

$$\theta \sim \theta' \quad \text{if} \quad F(\cdot;\theta) = F(\cdot;\theta')$$

Which means training is really happening over equivalence classes, not raw parameters.
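Both symmetries are easy to check numerically. The sketch below uses a hypothetical 2 → 3 → 1 ReLU network with arbitrary weights. It rescales the first layer (weights and bias together) by a positive factor while dividing the next layer's weights by the same factor, then separately permutes the hidden units, and compares outputs on a few inputs.

```python
def relu(v):
    return [max(0.0, t) for t in v]

def affine(W, b, x):
    return [sum(w * t for w, t in zip(row, x)) + bi for row, bi in zip(W, b)]

def net(W1, b1, W2, b2, x):
    # Two-layer ReLU network with a linear output layer
    return affine(W2, b2, relu(affine(W1, b1, x)))

W1 = [[1.0, -2.0], [0.5, 1.5], [-1.0, 0.3]]
b1 = [0.2, -0.4, 0.1]
W2 = [[0.7, -1.1, 0.9]]
b2 = [0.05]

# Scaling symmetry: needs alpha > 0, and the hidden bias scales with W1
alpha = 3.0
W1s = [[alpha * w for w in row] for row in W1]
b1s = [alpha * b for b in b1]
W2s = [[w / alpha for w in row] for row in W2]

# Permutation symmetry: reorder hidden units consistently in both layers
perm = [2, 0, 1]
W1p = [W1[j] for j in perm]
b1p = [b1[j] for j in perm]
W2p = [[row[j] for j in perm] for row in W2]

inputs = [[0.3, -0.8], [1.5, 2.0], [-1.0, 0.4]]
base = [net(W1, b1, W2, b2, x)[0] for x in inputs]
scaled = [net(W1s, b1s, W2s, b2, x)[0] for x in inputs]
permuted = [net(W1p, b1p, W2p, b2, x)[0] for x in inputs]
print(base, scaled, permuted)
```

Three different parameter vectors, one function: they all sit in the same equivalence class.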

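On the representation side, you can track the covariance (second-moment matrix) of the layer-$i$ activations $x_i \in \mathbb{R}^{n_i}$:

$$\Sigma_i = \mathbb{E}[x_i x_i^\top]$$

with (dropping biases for simplicity)

$$\Sigma_{i+1} = \mathbb{E}[\sigma(W_{i+1} x_i)\sigma(W_{i+1} x_i)^\top]$$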

Fixed points of this recursion tell you how representations settle. If the rank drops:

$$\mathrm{rank}(\Sigma_i) \ll n_i$$

then everything's collapsing onto a lower-dimensional structure.
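A rough illustration of rank collapse, under made-up assumptions: Gaussian inputs in $\mathbb{R}^3$, one ReLU layer, and two hand-picked weight matrices, one invertible and one of rank 1. The code estimates the second-moment matrix from samples and computes its numerical rank with a small Gaussian-elimination helper written for this sketch.

```python
import random

def relu(v):
    return [max(0.0, t) for t in v]

def matvec(W, x):
    return [sum(w * t for w, t in zip(row, x)) for row in W]

def second_moment(samples):
    # Empirical estimate of E[h h^T]
    n, m = len(samples[0]), len(samples)
    return [[sum(s[i] * s[j] for s in samples) / m for j in range(n)]
            for i in range(n)]

def numerical_rank(M, tol=1e-9):
    # Gaussian elimination with partial pivoting; counts pivots above tol
    A = [row[:] for row in M]
    rank = 0
    for c in range(len(A[0])):
        if rank == len(A):
            break
        pivot = max(range(rank, len(A)), key=lambda r: abs(A[r][c]))
        if abs(A[pivot][c]) < tol:
            continue
        A[rank], A[pivot] = A[pivot], A[rank]
        for r in range(rank + 1, len(A)):
            factor = A[r][c] / A[rank][c]
            A[r] = [a - factor * p for a, p in zip(A[r], A[rank])]
        rank += 1
    return rank

rng = random.Random(0)
inputs = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(2000)]

W_full = [[1.0, 0.5, -0.3], [-0.4, 1.0, 0.2], [0.3, -0.6, 1.0]]
u, v = [1.0, 2.0, 0.5], [0.6, -0.8, 0.3]
W_rank1 = [[ui * vj for vj in v] for ui in u]  # outer product u v^T

rank_full = numerical_rank(
    second_moment([relu(matvec(W_full, x)) for x in inputs]))
rank_collapsed = numerical_rank(
    second_moment([relu(matvec(W_rank1, x)) for x in inputs]))
print(rank_full, rank_collapsed)
```

With the rank-1 weights every activation vector is a nonnegative multiple of `u`, so the representation lives on a single line even though the layer has three units.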

The Role of Scale and Boundary Conditions

When you scale things up, the behavior simplifies:

  • Infinite width (mean-field): the distribution $\rho_t$ of per-neuron parameters evolves as a gradient flow:

$$\partial_t \rho_t = \nabla \cdot \left(\rho_t \nabla \frac{\delta \mathcal{L}}{\delta \rho_t}\right)$$

  • NTK regime: the network stays close to its linearization around initialization, so training becomes almost linear

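If you go the other way (small networks), you can talk about capacity. For a network with $W$ weights and threshold activations:

$$\mathrm{VCdim}(\mathcal{H}) = O(W \log W)$$

For piecewise-linear activations such as ReLU, the known bound picks up a factor of the depth $L$: $O(W L \log W)$.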

Adversarial examples are about pushing inputs slightly:

$$\sup_{\|\delta\| \leq \epsilon} \|F(x+\delta) - F(x)\|$$

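For small $\epsilon$, the first-order expansion $F(x+\delta) \approx F(x) + J_F(x)\,\delta$ shows that this is governed by the operator norm of the Jacobian:

$$\sup_{\|\delta\| \leq \epsilon} \|F(x+\delta) - F(x)\| \approx \epsilon \, \|J_F(x)\|$$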

So robustness is basically about controlling how big those derivatives get.
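For a linear map the first-order picture is exact, which makes it a convenient sanity check. The sketch below treats a fixed, made-up matrix `J` as the Jacobian, estimates its largest singular value by power iteration on $J^\top J$, and confirms that no perturbation of Euclidean norm $\epsilon$ moves the output by more than $\epsilon \|J\|$. The Euclidean norm is an assumption here; the text leaves the norm unspecified.

```python
import math
import random

def matvec(W, x):
    return [sum(w * t for w, t in zip(row, x)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def norm(v):
    return math.sqrt(sum(t * t for t in v))

def top_singular(W, iters=200):
    # Power iteration on W^T W; returns (sigma_max, right singular vector)
    Wt = transpose(W)
    v = [1.0] * len(W[0])
    for _ in range(iters):
        v = matvec(Wt, matvec(W, v))
        n = norm(v)
        v = [t / n for t in v]
    return norm(matvec(W, v)), v

# Hypothetical Jacobian of a map from R^3 to R^2
J = [[2.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
sigma, direction = top_singular(J)

eps = 0.01
rng = random.Random(0)
worst_random = 0.0
for _ in range(1000):
    d = [rng.gauss(0, 1) for _ in range(3)]
    n = norm(d)
    d = [eps * t / n for t in d]  # random perturbation with norm eps
    worst_random = max(worst_random, norm(matvec(J, d)))

# Perturbing along the top singular direction attains the bound
attained = norm(matvec(J, [eps * t for t in direction]))
print(sigma, worst_random, attained)
```

For a nonlinear network the same computation applies to $J_F(x)$ at a given input, but then the bound only holds approximately, for small enough $\epsilon$.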

Controlled Structural Deformation as a Method

One way to understand networks is to tweak them and see what breaks:

  • Linearize it: set $\sigma(x) = x$, then

$$F(x) = W_L \cdots W_1 x$$

  • Low-rank constraints:

$$\mathrm{rank}(W_i) \leq r$$

  • Regularization:

$$\min_\theta \mathcal{L}(\theta) + \lambda \|\theta\|^2$$

  • Perturb parameters:

$$\theta \mapsto \theta + \epsilon v$$

The sensitivity in direction $v$ is the directional derivative:

$$\frac{d}{d\epsilon} F(x;\theta + \epsilon v) \Big|_{\epsilon = 0} = \nabla_{\theta} F(x;\theta) \cdot v$$
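Two of these deformations are simple enough to try in a few lines. The sketch below first checks that a network with identity activations collapses to a single matrix, then checks the parameter-sensitivity formula by finite differences on a toy model $F(x;\theta) = w_2 \tanh(w_1 x + b_1)$. The weights and the toy model are assumptions made for illustration.

```python
import math

def matvec(W, x):
    return [sum(w * t for w, t in zip(row, x)) for row in W]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

# (1) Linearization: with sigma(x) = x, three layers act like one matrix
W1 = [[1.0, 2.0], [0.0, -1.0]]
W2 = [[0.5, 0.5], [1.0, -1.0]]
W3 = [[2.0, 1.0]]
x = [0.3, -0.7]
layered = matvec(W3, matvec(W2, matvec(W1, x)))
collapsed = matvec(matmul(W3, matmul(W2, W1)), x)

# (2) Parameter perturbation on a toy model with theta = (w1, b1, w2)
def F(x, theta):
    w1, b1, w2 = theta
    return w2 * math.tanh(w1 * x + b1)

def grad_theta(x, theta):
    # Analytic gradient of F with respect to (w1, b1, w2)
    w1, b1, w2 = theta
    t = math.tanh(w1 * x + b1)
    s = 1.0 - t * t
    return [w2 * s * x, w2 * s, t]

theta = [0.8, -0.2, 1.5]
v = [0.3, -1.0, 0.5]  # perturbation direction
x_in = 0.6
eps = 1e-6
plus = F(x_in, [p + eps * d for p, d in zip(theta, v)])
minus = F(x_in, [p - eps * d for p, d in zip(theta, v)])
finite_diff = (plus - minus) / (2 * eps)
analytic = sum(g * d for g, d in zip(grad_theta(x_in, theta), v))
print(layered, collapsed, finite_diff, analytic)
```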

If something stays true under these changes, it's probably fundamental.

Wrapping It Up

The big question isn't really "how powerful are neural networks?" It's more like: what's the minimum structure needed for learning to even work?

That boils down to figuring out:

  • Which function classes behave nicely under composition
  • What kinds of loss landscapes are easy to optimize
  • What symmetries define equivalent solutions
  • What happens in different scaling limits

Right now, a lot of the field works amazingly well in practice, but the theory is still catching up. A deeper understanding would tie together approximation theory, geometry, and dynamical systems into one clean picture explaining not just what neural networks do, but why they work at all.