Mathematics of Machine Learning
A comprehensive derivation-first guide to every mathematical tool powering modern ML: linear algebra, multivariable calculus, probability theory, optimization theory, and statistical learning theory. Equations are shown in full — nothing hand-waved.
1. Linear Algebra for Machine Learning
Machine learning is fundamentally a field of linear algebra. Datasets are matrices, model parameters are vectors, predictions are matrix-vector products, and almost every learning algorithm is phrased as an optimization problem over a vector space.
1.1 Matrix Operations and Notation
An m x n matrix A has m rows and n columns. The (i, j) entry is written A_ij or [A]_ij. Transposing swaps rows and columns: (A^T)_ij = A_ji. Key identities used constantly in ML: (A^T)^T = A, (A + B)^T = A^T + B^T, (AB)^T = B^T A^T, and (ABC)^T = C^T B^T A^T.
1.2 Inner Products, Norms, and Distance
The standard inner product on R^n is x . y = x^T y = sum_i x_i y_i. This gives the Euclidean norm ||x||_2 = sqrt(x^T x) and the cosine similarity cos(theta) = x^T y / (||x|| ||y||).
ML uses several norms regularly: the L1 norm ||x||_1 = sum_i |x_i| (promotes sparsity), the L2 norm ||x||_2 = sqrt(sum_i x_i^2) (the standard regularizer), the infinity norm ||x||_inf = max_i |x_i|, and for matrices the Frobenius norm ||A||_F = sqrt(sum_ij A_ij^2).
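A minimal numpy sketch of these quantities, using two small vectors of our own choosing:

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0])
y = np.array([2.0, 0.0, 1.0])

dot = x @ y                                  # x^T y = 1*2 + 2*0 + 2*1 = 4
l2 = np.sqrt(x @ x)                          # ||x||_2 = sqrt(1 + 4 + 4) = 3
l1 = np.abs(x).sum()                         # ||x||_1 = 1 + 2 + 2 = 5
linf = np.abs(x).max()                       # ||x||_inf = 2
cos_theta = dot / (l2 * np.linalg.norm(y))   # cosine similarity x^T y / (||x|| ||y||)
```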
1.3 Eigendecomposition
For a square matrix A, a scalar lambda and nonzero vector v satisfying Av = lambda v are called an eigenvalue and eigenvector. The characteristic polynomial det(A - lambda I) = 0 gives the eigenvalues. A symmetric matrix A = A^T has all real eigenvalues and an orthonormal eigenbasis: A = Q Lambda Q^T, where Q is orthogonal (Q^T Q = I) with eigenvectors as columns and Lambda = diag(lambda_1, ..., lambda_n).
A positive semi-definite (PSD) matrix has all eigenvalues lambda_i >= 0. Covariance matrices and kernel matrices are always PSD. This fact is exploited throughout ML — PSD guarantees convexity of quadratic forms x^T A x >= 0.
Worked Example: 2x2 Eigendecomposition
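A numpy sketch of the worked example, using a symmetric 2x2 matrix of our own choosing whose characteristic polynomial (2 - lambda)^2 - 1 = 0 gives eigenvalues 1 and 3:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])               # symmetric, so eigenvalues are real

eigvals, eigvecs = np.linalg.eigh(A)     # eigh is for symmetric matrices; ascending order

# Verify the defining equation A v = lambda v for each pair
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ v, lam * v)

# Reconstruct A = Q Lambda Q^T from the orthonormal eigenbasis
A_rebuilt = eigvecs @ np.diag(eigvals) @ eigvecs.T
```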
1.4 Matrix Calculus
Gradients of scalar functions with respect to vectors and matrices are the engine of optimization. The gradient of a scalar f with respect to vector x is a vector with the same shape as x. Key identities (using denominator-layout convention): grad_x (a^T x) = a, grad_x (x^T A x) = (A + A^T) x, grad_x ||x||^2 = 2x, and grad_A tr(AB) = B^T.
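The identity grad_x (x^T A x) = (A + A^T) x can be checked numerically with central finite differences; a sketch with a random matrix of our own choosing (exact for a quadratic, up to rounding):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
x = rng.standard_normal(4)

f = lambda v: v @ A @ v
grad_analytic = (A + A.T) @ x            # the matrix-calculus identity

# Central finite differences along each coordinate direction
eps = 1e-6
grad_numeric = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(4)
])
```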
2. Singular Value Decomposition (SVD)
SVD is the most important matrix factorization in applied mathematics. Every real m x n matrix A can be written as: A = U Sigma V^T
where U is m x m orthogonal (columns are left singular vectors), Sigma is m x n with non-negative diagonal entries sigma_1 >= sigma_2 >= ... >= sigma_r > 0 (singular values, r = rank(A)), and V is n x n orthogonal (columns are right singular vectors). Existence is guaranteed for any real matrix.
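The existence claim and the orthogonality properties are easy to verify numerically; a sketch on a random 5x3 matrix of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(A)              # full SVD: U is 5x5, Vt is 3x3
Sigma = np.zeros((5, 3))
Sigma[:3, :3] = np.diag(s)               # embed singular values in an m x n matrix

assert np.allclose(U @ Sigma @ Vt, A)    # A = U Sigma V^T
assert np.allclose(U.T @ U, np.eye(5))   # U orthogonal
assert np.allclose(Vt @ Vt.T, np.eye(3)) # V orthogonal
assert np.all(np.diff(s) <= 1e-12)       # singular values sorted descending

spectral_norm = s[0]                     # ||A||_2 = sigma_1
```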
2.1 Geometric Interpretation
Any linear map can be decomposed into three steps: V^T rotates/reflects the input, Sigma stretches along coordinate axes by the singular values, and U rotates/reflects into the output space. The singular values measure how much A stretches in each direction. ||A||_2 (spectral norm) = sigma_1 (largest singular value).
2.2 Eckart-Young Theorem (Best Low-Rank Approximation)
The truncation A_k = sum_(i=1)^k sigma_i u_i v_i^T keeps only the k largest singular values. Eckart-Young states that A_k is the best rank-k approximation: ||A - A_k||_F = sqrt(sigma_(k+1)^2 + ... + sigma_r^2) and ||A - A_k||_2 = sigma_(k+1). No rank-k matrix is closer to A in either the Frobenius or spectral norm. This theorem justifies PCA, latent semantic analysis, matrix completion, and any ML approach that approximates a full matrix with a low-rank model.
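The two error formulas for a truncated SVD can be confirmed directly; a sketch with a random matrix of our own choosing and k = 2:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = (U[:, :k] * s[:k]) @ Vt[:k]        # rank-k truncation sum_i sigma_i u_i v_i^T

# Frobenius error = sqrt of the sum of discarded squared singular values
assert np.isclose(np.linalg.norm(A - A_k, 'fro'), np.sqrt(np.sum(s[k:] ** 2)))
# Spectral error = the first discarded singular value
assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])
```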
2.3 SVD and Least Squares
When A is not invertible or not square, the pseudoinverse A^+ = V Sigma^+ U^T gives the minimum-norm least-squares solution to Ax = b: x* = A^+ b. Sigma^+ replaces each nonzero sigma_i with 1/sigma_i and leaves zeros as zeros.
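A sketch of the pseudoinverse solution on a deliberately rank-deficient system of our own construction, cross-checked against numpy's least-squares solver (which also returns the minimum-norm solution):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 4))   # 6x4 but rank 3
b = rng.standard_normal(6)

x_star = np.linalg.pinv(A) @ b                    # x* = A^+ b, minimum-norm LS solution
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]    # SVD-based solver, same answer

# x* satisfies the normal equations: the residual is orthogonal to range(A)
assert np.allclose(A.T @ (b - A @ x_star), 0.0, atol=1e-8)
```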
3. PCA and Dimensionality Reduction
3.1 PCA Derivation
Given n data points in R^d stored as rows of X (n x d), center the data: X_c = X - 1 mu^T where mu = (1/n) X^T 1. The sample covariance matrix is: C = (1/(n-1)) X_c^T X_c.
PCA seeks the unit vector w_1 maximizing variance of projected data: max_w w^T C w subject to ||w|| = 1.
The optimal w_1 is the eigenvector of C with the largest eigenvalue lambda_1. The objective value equals lambda_1. Each subsequent principal component is the eigenvector with the next largest eigenvalue, constrained to be orthogonal to all previous components. Retaining k components captures fraction (lambda_1 + ... + lambda_k) / (lambda_1 + ... + lambda_d) of total variance.
3.2 PCA via SVD
Writing the centered data as X_c = U Sigma V^T, the covariance becomes C = X_c^T X_c / (n-1) = V (Sigma^2 / (n-1)) V^T. The right singular vectors (columns of V) are the principal components, and the eigenvalues of C are lambda_i = sigma_i^2 / (n-1). Computing PCA through the SVD of X_c avoids forming C explicitly and is more numerically stable.
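The equivalence between the covariance-eigendecomposition route and the SVD route can be checked on synthetic data; a sketch with anisotropic toy data of our own choosing (eigenvectors may differ by a sign between the two routes):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 3)) @ np.diag([3.0, 1.0, 0.2])   # anisotropic toy data
X_c = X - X.mean(axis=0)
n = X_c.shape[0]

# Route 1: eigendecomposition of the sample covariance
C = X_c.T @ X_c / (n - 1)
eigvals, V_eig = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]            # eigh sorts ascending; we want descending
eigvals, V_eig = eigvals[order], V_eig[:, order]

# Route 2: SVD of the centered data, lambda_i = sigma_i^2 / (n-1)
U, s, Vt = np.linalg.svd(X_c, full_matrices=False)
eigvals_svd = s ** 2 / (n - 1)
```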
3.3 Linear Discriminant Analysis (LDA)
LDA is a supervised dimensionality reduction method. For C classes with means mu_c, it finds projections maximizing between-class variance relative to within-class variance. Define scatter matrices: within-class S_W = sum_c sum_(i: y_i = c) (x_i - mu_c)(x_i - mu_c)^T and between-class S_B = sum_c n_c (mu_c - mu)(mu_c - mu)^T, where mu is the overall mean. The projection w maximizes the Rayleigh quotient w^T S_B w / (w^T S_W w), solved by the top eigenvectors of S_W^(-1) S_B (at most C - 1 useful directions).
3.4 t-SNE
t-SNE (t-distributed Stochastic Neighbor Embedding) is a nonlinear method for 2D/3D visualization. For high-dimensional points x_i, it defines similarity as a conditional Gaussian probability p_j|i = exp(-||x_i - x_j||^2 / 2 sigma_i^2) / normalization. In the low-dimensional embedding y_i, it uses a Student-t distribution (heavier tails): q_ij = (1 + ||y_i - y_j||^2)^(-1) / sum_(k != l) (1 + ||y_k - y_l||^2)^(-1). The cost function is the KL divergence KL(P || Q) = sum_ij p_ij log(p_ij / q_ij), minimized by gradient descent. The t-distribution prevents crowding that would occur with a Gaussian in the low-dimensional space.
4. Calculus for Machine Learning
4.1 Gradients and Directional Derivatives
For f: R^n -> R, the gradient grad f(x) is the vector of partial derivatives. The directional derivative in direction v (||v|| = 1) is D_v f(x) = grad f(x) . v, maximized when v = grad f(x) / ||grad f(x)||. Hence the gradient points in the direction of steepest ascent.
The Hessian H(x) = d^2 f / dx^2 is the matrix of second partial derivatives: H_ij = d^2 f / (dx_i dx_j). A critical point where grad f = 0 is a local minimum if H is PSD (all eigenvalues >= 0), a local maximum if H is NSD, and a saddle point if H has both positive and negative eigenvalues.
4.2 Chain Rule in Multiple Dimensions
4.3 Backpropagation Derivation
Consider a fully-connected network with L layers. Forward pass for layer l: z^(l) = W^(l) a^(l-1) + b^(l), a^(l) = sigma(z^(l)), with a^(0) = x.
Define the error signal at layer l as delta^(l) = dL / dz^(l). For the output layer L: delta^(L) = (dL / da^(L)) * sigma'(z^(L)), with * denoting elementwise multiplication.
Backprop recurrence (chain rule applied layer by layer, moving backward): delta^(l) = (W^(l+1))^T delta^(l+1) * sigma'(z^(l)), with parameter gradients dL/dW^(l) = delta^(l) (a^(l-1))^T and dL/db^(l) = delta^(l).
This achieves O(parameters) computation by storing intermediate activations during the forward pass and reusing them during the backward pass — a classic dynamic programming argument. Memory cost is O(depth x width) for storing all activations.
4.4 Computational Graphs and Automatic Differentiation
Modern frameworks (PyTorch, JAX) build a computational graph of operations. Forward mode AD computes Jv (Jacobian-vector product) in one forward pass, efficient when inputs are few. Reverse mode AD (backprop) computes v^T J (vector-Jacobian product) in one backward pass, efficient when outputs are few (the loss is a scalar). For ML with many parameters and a scalar loss, reverse mode is the standard choice.
5. Probability Theory for ML
5.1 Bayes' Theorem
In the ML context: P(theta | data) = P(data | theta) P(theta) / P(data). The posterior P(theta | data) combines the likelihood P(data | theta) and the prior P(theta). P(data) = integral P(data | theta) P(theta) d theta is the marginal likelihood (evidence), which normalizes the posterior but is often intractable.
5.2 Common Distributions
| Distribution | PDF / PMF | Mean | Variance |
|---|---|---|---|
| Bernoulli(p) | p^x (1-p)^(1-x) | p | p(1-p) |
| Gaussian N(mu, sigma^2) | (1/sqrt(2pi sigma^2)) exp(-(x-mu)^2 / 2sigma^2) | mu | sigma^2 |
| Categorical(p) | prod_k p_k^(x_k) | p | diag(p)-pp^T |
| Beta(a, b) | x^(a-1)(1-x)^(b-1)/B(a,b) | a/(a+b) | ab / ((a+b)^2 (a+b+1)) |
| Dirichlet(alpha) | prod_k x_k^(alpha_k - 1) / B(alpha) | alpha_k / alpha_0 | alpha_k(alpha_0 - alpha_k) / (alpha_0^2 (alpha_0+1)) |
5.3 Gaussian Distribution in ML
The multivariate Gaussian N(mu, Sigma) has density: p(x) = (2 pi)^(-d/2) |Sigma|^(-1/2) exp(-(1/2)(x - mu)^T Sigma^(-1) (x - mu)).
The term (x-mu)^T Sigma^(-1) (x-mu) is the Mahalanobis distance squared. If Sigma = I, it reduces to squared Euclidean distance. The Mahalanobis distance accounts for correlations and different scales — crucial for Gaussian discriminant analysis, Gaussian processes, and Kalman filters.
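A sketch of how Mahalanobis distance differs from Euclidean distance, using a diagonal covariance of our own choosing where the first axis has four times the variance:

```python
import numpy as np

mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 0.0],
                  [0.0, 1.0]])       # first axis has 4x the variance
x = np.array([2.0, 2.0])

# Mahalanobis squared: deviation along the high-variance axis counts for less
d2_mahal = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)   # 4/4 + 4/1 = 5
d2_euclid = (x - mu) @ (x - mu)                          # 4 + 4 = 8

# With Sigma = I the two distances coincide
assert np.isclose((x - mu) @ np.linalg.inv(np.eye(2)) @ (x - mu), d2_euclid)
```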
5.4 Information Theory
Entropy, cross-entropy, and KL divergence appear throughout ML as loss functions and theoretical measures: entropy H(p) = -sum_x p(x) log p(x) (expected surprise), cross-entropy H(p, q) = -sum_x p(x) log q(x) (the standard classification loss), and KL divergence KL(p || q) = sum_x p(x) log(p(x) / q(x)) = H(p, q) - H(p) >= 0, with equality iff p = q.
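The relationship H(p, q) = H(p) + KL(p || q) is worth verifying once by hand; a sketch with two small distributions of our own choosing:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.4, 0.4, 0.2])

entropy = -np.sum(p * np.log(p))          # H(p)
cross_entropy = -np.sum(p * np.log(q))    # H(p, q)
kl = np.sum(p * np.log(p / q))            # KL(p || q)

assert np.isclose(cross_entropy, entropy + kl)   # H(p,q) = H(p) + KL(p||q)
assert kl >= 0                                   # Gibbs' inequality
```

Minimizing cross-entropy in q (with p fixed) is therefore the same as minimizing KL(p || q), which is why the two losses are used interchangeably in classification.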
6. MLE, MAP Estimation, and Conjugate Priors
6.1 Maximum Likelihood Estimation
Given i.i.d. data D = (x_1, ..., x_n) drawn from p(x | theta), the likelihood is: L(theta) = prod_i p(x_i | theta), and the MLE maximizes the log-likelihood: theta_MLE = argmax_theta sum_i log p(x_i | theta).
MLE is asymptotically efficient: it achieves the Cramer-Rao lower bound as n grows. For Gaussian data with unknown mean, MLE gives sample mean. For Bernoulli data, MLE gives the empirical proportion. MLE can overfit severely with small n.
Worked Example: MLE for Gaussian Mean
For x_i ~ N(mu, sigma^2) with sigma^2 known, the log-likelihood is l(mu) = -(1/(2 sigma^2)) sum_i (x_i - mu)^2 + const. Setting dl/dmu = (1/sigma^2) sum_i (x_i - mu) = 0 gives mu_MLE = (1/n) sum_i x_i, the sample mean.
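The claim that the sample mean maximizes the Gaussian likelihood can be cross-checked numerically; a sketch on synthetic data of our own choosing, comparing the closed form against a brute-force grid search over the negative log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.normal(loc=3.0, scale=1.5, size=10_000)   # true mu = 3.0

# Closed-form MLE: the sample mean
mu_mle = data.mean()

# Brute force: minimize the NLL (constants and the 1/(2 sigma^2) factor dropped)
nll = lambda mu: np.sum((data - mu) ** 2)
grid = np.linspace(2.5, 3.5, 1001)
mu_grid = grid[np.argmin([nll(m) for m in grid])]
```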
6.2 Maximum A Posteriori Estimation
MAP adds a prior over parameters: theta_MAP = argmax_theta [log p(D | theta) + log p(theta)].
6.3 Conjugate Priors
A prior is conjugate to a likelihood if the posterior is in the same family as the prior. Conjugate pairs allow closed-form Bayesian updates — no numerical integration required:
| Likelihood | Conjugate Prior | Posterior Update |
|---|---|---|
| Bernoulli(p) | Beta(a, b) | Beta(a + successes, b + failures) |
| Categorical(p) | Dirichlet(alpha) | Dirichlet(alpha + counts) |
| Gaussian(mu, sigma^2 known) | N(mu_0, tau_0^2) | N with precision-weighted mean |
| Poisson(lambda) | Gamma(a, b) | Gamma(a + sum(x_i), b + n) |
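A sketch of the first row of the table, the Beta-Bernoulli update, with a prior and data sequence of our own choosing:

```python
import numpy as np

# Prior Beta(a, b); observing Bernoulli data gives posterior Beta(a + successes, b + failures)
a, b = 2.0, 2.0                                     # weakly informative prior, mean 0.5
data = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0])    # 7 successes, 3 failures

a_post = a + data.sum()                 # 2 + 7 = 9
b_post = b + len(data) - data.sum()     # 2 + 3 = 5

posterior_mean = a_post / (a_post + b_post)   # 9/14, shrunk toward the prior mean 0.5
mle = data.mean()                             # 0.7, the unshrunk empirical proportion
```

The closed-form update is why conjugacy matters: no integration was needed, just counting.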
7. Optimization Theory
7.1 Convexity
A function f is convex if for all x, y and t in [0,1]: f(tx + (1-t)y) <= t f(x) + (1-t) f(y). Equivalently (for twice-differentiable f): the Hessian H(x) is PSD everywhere. For convex f, every local minimum is a global minimum. Logistic regression loss, SVM hinge loss, and L2-regularized objectives are all convex.
7.2 Gradient Descent and Convergence
For L-smooth convex f (||grad f(x) - grad f(y)|| <= L ||x - y||), gradient descent with step size alpha = 1/L converges as: f(theta_T) - f* <= L ||theta_0 - theta*||^2 / (2T), a sublinear O(1/T) rate.
For strongly convex f (mu-strongly convex: f(y) >= f(x) + grad f(x)^T (y-x) + (mu/2)||y-x||^2), gradient descent converges geometrically: f(theta_T) - f* <= (1 - mu/L)^T (f(theta_0) - f*). The condition number kappa = L/mu determines convergence speed.
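A sketch verifying the geometric rate on a convex quadratic of our own choosing, with mu = 1 and L = 10 read off the Hessian's eigenvalues:

```python
import numpy as np

A = np.diag([1.0, 10.0])            # Hessian eigenvalues: mu = 1, L = 10, kappa = 10
f = lambda x: 0.5 * x @ A @ x       # convex quadratic, minimum f* = 0 at x* = 0
grad = lambda x: A @ x

mu, L = 1.0, 10.0
alpha = 1.0 / L                     # step size from the smoothness analysis
x = np.array([5.0, 5.0])
f0 = f(x)

T = 100
for _ in range(T):
    x = x - alpha * grad(x)

# Geometric bound: f(x_T) - f* <= (1 - mu/L)^T (f(x_0) - f*)
bound = (1 - mu / L) ** T * f0
```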
7.3 Stochastic Gradient Descent (SGD)
SGD estimates the gradient using a random mini-batch B of size b: g_hat = (1/b) sum_(i in B) grad l_i(theta), an unbiased estimator (E[g_hat] = grad L) whose variance shrinks as 1/b. The update is theta <- theta - alpha g_hat.
SGD requires a decaying learning rate schedule (e.g., alpha_t = alpha_0 / sqrt(t)) for convergence guarantees. In practice, SGD with momentum and learning rate warmup-then-decay often reaches the same or better solutions than full-batch GD due to implicit regularization from gradient noise.
7.4 Momentum and Adam
Momentum accumulates an exponential moving average of past gradients, v_t = beta v_(t-1) + grad L(theta_t), and steps along it: theta_(t+1) = theta_t - alpha v_t. This damps oscillation across high-curvature directions and accelerates progress along consistently-signed ones. Adam additionally tracks a second-moment average of squared gradients and divides the step by its square root, adapting the learning rate per parameter; the full update appears in the pseudocode reference.
7.5 Saddle Points in Non-Convex Optimization
Deep network loss surfaces are non-convex and have many saddle points where grad L = 0 but the Hessian has both positive and negative eigenvalues. Empirically, most local minima in deep networks have similar loss to the global minimum — the "proliferation of local minima" problem is less severe than feared. Saddle points are the real obstacle: gradient descent can slow near them. Perturbation methods and stochastic noise in SGD help escape saddle points.
7.6 Second-Order Methods
Newton's method uses the Hessian for quadratic convergence: theta_(t+1) = theta_t - H^(-1) grad L(theta_t). Computing H^(-1) costs O(d^3) — infeasible for large models. Quasi-Newton methods (L-BFGS) approximate H^(-1) using a limited history of gradient differences: O(md) cost per step for m stored vectors. Natural gradient descent uses the Fisher information matrix (the Riemannian metric in parameter space) instead of the Euclidean Hessian — invariant to reparametrization.
8. Linear Regression
8.1 Ordinary Least Squares (OLS)
Given design matrix X (n x d) and targets y (n x 1), minimize: L(w) = ||y - Xw||^2. Setting the gradient -2 X^T (y - Xw) = 0 gives the normal equations X^T X w = X^T y, so w* = (X^T X)^(-1) X^T y when X^T X is invertible.
The hat matrix H = X(X^T X)^(-1) X^T projects y onto the column space of X: y_hat = Hy. The residuals e = y - y_hat lie in the null space of X^T and are orthogonal to all columns of X: X^T e = 0. OLS is BLUE (Best Linear Unbiased Estimator) by the Gauss-Markov theorem when errors are homoscedastic and uncorrelated.
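The orthogonality and projection properties are easy to confirm numerically; a sketch on synthetic data of our own choosing where X has full column rank:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(50)

w = np.linalg.solve(X.T @ X, X.T @ y)    # normal equations
y_hat = X @ w
e = y - y_hat

# Residuals are orthogonal to every column of X: X^T e = 0
assert np.allclose(X.T @ e, 0.0, atol=1e-10)

# The hat matrix is a symmetric projection: H = H^T, H H = H
H = X @ np.linalg.inv(X.T @ X) @ X.T
assert np.allclose(H, H.T)
assert np.allclose(H @ H, H)
```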
8.2 Ridge Regression (L2 Regularization)
Ridge minimizes ||y - Xw||^2 + lambda ||w||^2, with closed-form solution w_ridge = (X^T X + lambda I)^(-1) X^T y. Adding lambda to every diagonal entry of X^T X makes it always invertible (resolving multicollinearity). Ridge shrinks all coefficients toward zero. Via the Bayesian interpretation, w_ridge is the MAP estimate under the prior w ~ N(0, (1/lambda) I).
8.3 Lasso (L1 Regularization)
Lasso minimizes ||y - Xw||^2 + lambda ||w||_1. It has no closed-form solution because ||w||_1 is not differentiable at zero. Coordinate descent solves it: for each coordinate j, the solution has a "soft-thresholding" form: w_j = sign(rho_j) max(|rho_j| - lambda/2, 0), where rho_j = X_j^T (y - X_(-j) w_(-j)). The L1 penalty drives many coefficients to exactly zero, producing sparse models — automatic feature selection. Elastic Net combines L1 and L2: lambda_1 ||w||_1 + lambda_2 ||w||_2^2.
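A sketch of the soft-thresholding operator itself (here `lam` plays the role of lambda/2 in the coordinate update above); inputs inside the threshold are snapped to exactly zero, which is where lasso's sparsity comes from:

```python
import numpy as np

def soft_threshold(rho, lam):
    """Soft-thresholding: shrink toward zero by lam, clamp small values to exactly 0."""
    return np.sign(rho) * np.maximum(np.abs(rho) - lam, 0.0)

vals = soft_threshold(np.array([-3.0, -0.5, 0.2, 2.0]), lam=1.0)
# -3.0 -> -2.0, -0.5 -> 0.0, 0.2 -> 0.0, 2.0 -> 1.0
```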
8.4 Probabilistic Interpretation
Linear regression with Gaussian noise y = Xw + epsilon, epsilon ~ N(0, sigma^2 I) gives likelihood p(y | X, w) = N(Xw, sigma^2 I). Maximizing the log-likelihood is equivalent to minimizing ||y - Xw||^2 — OLS is the MLE under Gaussian noise. Adding a Gaussian prior on w gives ridge regression MAP. The posterior over w is also Gaussian, enabling full Bayesian linear regression with uncertainty quantification.
9. Logistic Regression
9.1 Sigmoid and Log-Odds
For binary classification y in {0, 1}, logistic regression models: P(y = 1 | x) = sigma(w^T x) = 1 / (1 + exp(-w^T x)), equivalently the log-odds are linear: log(p / (1 - p)) = w^T x.
9.2 Cross-Entropy Loss Derivation
The negative log-likelihood for n i.i.d. binary outcomes: L(w) = -sum_i [y_i log sigma(w^T x_i) + (1 - y_i) log(1 - sigma(w^T x_i))], with gradient grad L(w) = sum_i (sigma(w^T x_i) - y_i) x_i = X^T (p - y).
The gradient is elegantly the sum of (prediction - label) times features — same structure as linear regression. The Hessian is H = X^T S X where S = diag(sigma_i (1-sigma_i)), which is PSD, confirming the cross-entropy loss is convex in w. Newton's method (iteratively reweighted least squares) converges quadratically.
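A sketch of plain gradient descent on this convex loss, using synthetic labels sampled from a logistic model of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(7)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Synthetic binary data from a known logistic model
n, d = 200, 2
X = rng.standard_normal((n, d))
w_true = np.array([2.0, -1.0])
y = (rng.random(n) < sigmoid(X @ w_true)).astype(float)

def nll(w):
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Gradient descent on the mean cross-entropy loss
w = np.zeros(d)
for _ in range(500):
    p = sigmoid(X @ w)
    w -= 0.5 * X.T @ (p - y) / n     # gradient: X^T (prediction - label) / n
```

Convexity guarantees this converges to the global optimum; a Newton (IRLS) step using the Hessian X^T S X would get there in far fewer iterations.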
9.3 Multiclass: Softmax Regression
Softmax regression generalizes to K classes: P(y = k | x) = softmax(z)_k = exp(z_k) / sum_j exp(z_j) with z = Wx. The softmax is the gradient of the log-sum-exp: d/dz log sum_k exp(z_k) = softmax(z). A key numerical trick: subtract max(z) before exponentiating to prevent overflow — the result is mathematically identical but numerically stable.
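A sketch of the max-subtraction trick: the naive exponentials overflow for large logits, while the shifted version is exact:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)        # shift invariance: softmax(z) = softmax(z - c)
    e = np.exp(z)
    return e / e.sum()

z = np.array([1000.0, 1001.0])

with np.errstate(over='ignore'):
    naive = np.exp(z)        # overflows to inf, so naive softmax would give nan

stable = softmax(z)          # equals softmax([-1, 0]) = [1/(1+e), e/(1+e)]
```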
10. Neural Networks
10.1 Universal Approximation Theorem
The Universal Approximation Theorem (Cybenko 1989, Hornik 1991) states: a feedforward network with a single hidden layer of sufficient width can approximate any continuous function on a compact subset of R^n to arbitrary precision, for any non-constant, bounded, monotonically-increasing continuous activation function. The theorem guarantees existence but says nothing about learnability or required width. Modern variants show depth can exponentially reduce required width: there are functions a deep network represents with polynomially many units that any shallow network needs exponentially many units to approximate.
10.2 Activation Functions
| Activation | Formula | Derivative | Notes |
|---|---|---|---|
| Sigmoid | 1/(1+e^(-z)) | sigma(1-sigma) | Vanishing gradients for large |z| |
| Tanh | (e^z - e^(-z))/(e^z + e^(-z)) | 1 - tanh^2(z) | Zero-centered; still saturates |
| ReLU | max(0, z) | 1(z>0) | Dead neurons if z always < 0 |
| Leaky ReLU | max(alpha z, z) | 1(z>0) + alpha*1(z<=0) | Prevents dead neurons (alpha ~ 0.01) |
| GELU | z * Phi(z) | Phi(z) + z phi(z) | Smooth, used in Transformers |
| Swish | z * sigma(z) | sigma(z) + z sigma(z)(1-sigma(z)) | Non-monotone; often outperforms ReLU |
10.3 Vanishing and Exploding Gradients
In a network of depth L with weight matrices W^(1), ..., W^(L), the gradient at layer 1 involves the product of L Jacobians: dL/da^(1) = (prod_(l=2)^(L) J^(l)) dL/da^(L), where each factor J^(l) combines (W^(l))^T with diag(sigma'(z^(l))). If these factors typically have norm below 1, the product shrinks exponentially in L (vanishing gradients); above 1, it grows exponentially (exploding gradients).
Solutions: (1) ReLU activation (derivative 1 on positive half, no saturation); (2) Batch normalization normalizes z^(l) before activation to N(0,1) then applies learnable scale and shift; (3) Residual connections y = F(x) + x allow gradients to flow through identity path: dL/dx = dL/dy (1 + dF/dx) — the 1 prevents vanishing; (4) He initialization: W ~ N(0, 2/n_in) keeps activation variance stable through layers.
10.4 Batch Normalization
11. Support Vector Machines
11.1 Hard-Margin SVM: Margin Maximization
For linearly separable data (x_i, y_i) with y_i in {-1, +1}, the decision boundary is w^T x + b = 0. The signed distance from x_i to the boundary is y_i (w^T x_i + b) / ||w||. Under the canonical scaling min_i y_i (w^T x_i + b) = 1, the margin is 2/||w||. Maximizing the margin is equivalent to: min_(w,b) (1/2) ||w||^2 subject to y_i (w^T x_i + b) >= 1 for all i.
The support vectors are the training points where the constraint is tight: y_i (w^T x_i + b) = 1. The decision boundary depends only on the support vectors.
11.2 Soft-Margin SVM (C-SVM)
Slack variables xi_i allow misclassification with penalty C. Large C gives a narrow margin (low bias, high variance); small C gives a wide margin (high bias, low variance). The hinge loss formulation: min_w (1/2)||w||^2 + C sum_i max(0, 1 - y_i(w^T x_i + b)).
11.3 Dual Formulation
Forming the Lagrangian and applying KKT conditions converts the primal SVM to the dual: max_alpha sum_i alpha_i - (1/2) sum_(i,j) alpha_i alpha_j y_i y_j x_i^T x_j subject to 0 <= alpha_i <= C and sum_i alpha_i y_i = 0. The weights recover as w = sum_i alpha_i y_i x_i, with alpha_i > 0 only for support vectors.
11.4 Kernel Trick
The dual only uses x_i . x_j inner products. Replace with K(x_i, x_j) = phi(x_i) . phi(x_j) to implicitly map to feature space phi without computing phi: the decision function becomes f(x) = sign(sum_i alpha_i y_i K(x_i, x) + b).
Mercer's theorem: K is a valid kernel iff for any finite set (x_1, ..., x_n), the Gram matrix K_ij = K(x_i, x_j) is symmetric and positive semi-definite. The RBF kernel corresponds to an infinite-dimensional feature space — yet SVM with RBF kernel computes in O(n^2) kernel evaluations rather than infinite-dimensional dot products.
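A sketch checking Mercer's conditions empirically for the RBF kernel on random points of our own choosing: the Gram matrix comes out symmetric and PSD:

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # pairwise squared distances
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(8)
X = rng.standard_normal((20, 3))
K = rbf_kernel(X)

assert np.allclose(K, K.T)                         # symmetric
assert np.min(np.linalg.eigvalsh(K)) > -1e-10      # PSD up to rounding
```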
12. Statistical Learning Theory
12.1 Bias-Variance Decomposition
For regression with squared loss, taking expectation over training sets: E[(y - f_hat(x))^2] = (E[f_hat(x)] - f(x))^2 + E[(f_hat(x) - E[f_hat(x)])^2] + sigma^2 = Bias^2 + Variance + irreducible noise.
Regularization reduces variance at the cost of increased bias. Cross-validation selects model complexity to minimize the sum. Ensemble methods (bagging) reduce variance by averaging many high-variance models; boosting reduces bias by sequentially correcting residuals.
12.2 VC Dimension
The Vapnik-Chervonenkis (VC) dimension of a hypothesis class H is the size of the largest set of points that H can shatter (realize all 2^m labelings of m points). For linear classifiers in R^d, VC dim = d + 1.
A standard VC generalization bound: with probability at least 1 - delta, every h in H satisfies L(h) <= L_hat(h) + sqrt((d (log(2n/d) + 1) + log(4/delta)) / n). The bound grows with VC dimension d and shrinks with n. To generalize to within epsilon error with probability 1 - delta, you need roughly n = O(d log(d/epsilon) + log(1/delta)/epsilon) samples — the fundamental sample complexity result.
12.3 PAC Learning
Probably Approximately Correct (PAC) learning: a hypothesis class H is PAC-learnable if there exists an algorithm A such that for any target concept c in H, any distribution D, and any epsilon, delta > 0, if A receives n = poly(1/epsilon, 1/delta, size) samples, it outputs h with L(h) <= epsilon with probability at least 1 - delta.
Finite hypothesis class: n = O(log(|H|/delta) / epsilon) samples suffice (union bound argument). In agnostic PAC (the realizable assumption dropped), the learner competes with the best hypothesis in H, so the achievable error includes H's approximation error. The fundamental theorem of statistical learning: H is PAC-learnable iff H has finite VC dimension.
12.4 Rademacher Complexity
Rademacher complexity measures the ability of a hypothesis class to fit random noise: R_n(H) = E_(sigma, S) [ sup_(h in H) (1/n) sum_i sigma_i h(x_i) ], where the sigma_i are i.i.d. random signs (+1 or -1 with probability 1/2).
Rademacher complexity provides tighter bounds than VC theory for specific function classes (e.g., norm-bounded linear classifiers, neural networks). For linear classifiers with ||w|| <= B and ||x|| <= C: R_n <= BC/sqrt(n) — independent of dimension, explaining why high-dimensional linear classifiers can generalize well.
12.5 Double Descent and Modern Overparameterized Models
Classical bias-variance theory predicts a U-shaped test error curve as model size grows. Modern deep learning reveals double descent: test error decreases, then peaks at the interpolation threshold (enough parameters to fit training data exactly), then decreases again into the overparameterized regime. Implicit regularization from gradient descent — which finds minimum-norm solutions — explains why overparameterized models generalize. The minimum-norm interpolant for linear regression is w_MN = X^T (X X^T)^(-1) y (right pseudoinverse).
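A sketch of the minimum-norm interpolant in the overparameterized regime (d > n), with dimensions of our own choosing; it fits the training data exactly, and any other interpolating solution is strictly longer:

```python
import numpy as np

rng = np.random.default_rng(9)
n, d = 10, 50                       # overparameterized: more features than samples
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Right pseudoinverse (X has full row rank almost surely)
w_mn = X.T @ np.linalg.inv(X @ X.T) @ y

assert np.allclose(X @ w_mn, y)     # interpolates the training data exactly

# w_mn lies in the row space of X, so adding any null-space direction v
# (X v = 0) gives another interpolant with strictly larger norm
v = np.linalg.svd(X)[2][-1]         # a unit vector in null(X)
```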
Algorithm Pseudocode Reference
PCA Algorithm
Input: Data matrix X (n x d), number of components k
Output: k-dimensional projections Z (n x k), principal components V_k (d x k)
1. Compute mean: mu = mean(X, axis=0)
2. Center data: X_c = X - mu (broadcast)
3. Compute SVD: U, s, Vt = SVD(X_c, full_matrices=False)
   -- OR covariance eigendecomposition:
   --   C = X_c.T @ X_c / (n-1)
   --   eigenvalues, V = eigh(C) [sorted descending]
4. Top-k components: V_k = Vt.T[:, :k] (d x k matrix)
5. Project: Z = X_c @ V_k (n x k matrix)
6. Explained variance ratio: s[:k]^2 / sum(s^2)
Return Z, V_k
Backpropagation Algorithm
Input: Network params (W^(l), b^(l) for l=1..L), input x, target y
Output: Gradients dL/dW^(l), dL/db^(l) for all l
--- FORWARD PASS ---
a^(0) = x
for l = 1 to L:
    z^(l) = W^(l) @ a^(l-1) + b^(l)
    a^(l) = activation(z^(l)) [elementwise]
y_hat = a^(L)
Compute loss L = loss_fn(y_hat, y)
--- BACKWARD PASS ---
delta^(L) = dL/da^(L) * activation'(z^(L)) [elementwise multiply]
for l = L down to 1:
    dL/dW^(l) = delta^(l) @ a^(l-1).T
    dL/db^(l) = delta^(l)
    if l > 1:
        delta^(l-1) = W^(l).T @ delta^(l) * activation'(z^(l-1))
Return all gradients
Adam Optimizer
Input: Initial params theta_0, learning rate alpha=0.001,
beta_1=0.9, beta_2=0.999, epsilon=1e-8, num_steps T
Output: Optimized params theta_T
m_0 = 0 (initialize 1st moment vector)
v_0 = 0 (initialize 2nd moment vector)
for t = 1 to T:
    g_t = grad L(theta_{t-1}) [compute gradient]
    m_t = beta_1 * m_{t-1} + (1 - beta_1) * g_t
    v_t = beta_2 * v_{t-1} + (1 - beta_2) * g_t^2 [elementwise square]
    m_hat = m_t / (1 - beta_1^t) [bias correction]
    v_hat = v_t / (1 - beta_2^t) [bias correction]
    theta_t = theta_{t-1} - alpha * m_hat / (sqrt(v_hat) + epsilon)
Return theta_T
Frequently Asked Questions
What is the chain rule and why is it central to backpropagation?
The chain rule states that if y = f(g(x)), then dy/dx = (df/dg)(dg/dx). In a neural network the loss L is a composition of many functions — layer activations, weight matrices, bias additions. Backpropagation applies the chain rule recursively from the output layer back to the input layer to compute the gradient of L with respect to every weight. The error signal delta^(l) = dL/dz^(l) propagates backward as delta^(l) = (W^(l+1))^T delta^(l+1) * sigma'(z^(l)). This avoids recomputing repeated sub-expressions and runs in O(parameters) time — the same as the forward pass.
What is singular value decomposition (SVD) and how is it used in machine learning?
SVD decomposes any real matrix A (m x n) as A = U Sigma V^T, where U is m x m orthogonal (left singular vectors), Sigma is diagonal with non-negative singular values, and V is n x n orthogonal (right singular vectors). In ML, SVD powers PCA (top-k right singular vectors are principal components), latent semantic analysis, collaborative filtering, and low-rank approximation. The Eckart-Young theorem guarantees the rank-k truncated SVD is the best possible rank-k approximation in both Frobenius and spectral norm — no other rank-k matrix is closer to A.
What is the difference between MLE and MAP estimation?
MLE finds theta maximizing the likelihood P(data | theta) — treating parameters as fixed unknowns with no prior belief. MAP maximizes the posterior P(theta | data), which combines the likelihood with a prior P(theta). With a Gaussian prior N(0, tau^2 I) on parameters, MAP is exactly equivalent to L2 (ridge) regularized MLE. With a Laplace prior, MAP gives L1 (lasso) regularization. As the dataset grows, both converge because the likelihood dominates. MAP shrinks estimates toward the prior on small data, reducing overfitting.
How does gradient descent work and what makes Adam different?
Gradient descent updates parameters by moving opposite the gradient: theta_(t+1) = theta_t - alpha * grad L. Adam maintains exponential moving averages of gradients (m_t) and squared gradients (v_t), bias-corrects both, then divides the gradient update by sqrt(v_hat) — this adapts the effective learning rate per parameter. Parameters with historically large gradients get smaller updates; parameters with small gradients get relatively larger updates. Adam is robust to sparse gradients and works well without careful learning rate tuning, making it the default for deep learning.
What is the kernel trick in support vector machines?
The SVM dual problem only requires inner products x_i^T x_j between training points. Replacing x_i^T x_j with K(x_i, x_j) = phi(x_i)^T phi(x_j) implicitly computes dot products in a high- (or infinite-) dimensional feature space without ever computing phi explicitly. This allows SVMs to learn non-linear boundaries with the same computational cost as linear SVMs (in terms of kernel evaluations). The RBF kernel K(x,z) = exp(-||x-z||^2 / 2sigma^2) corresponds to infinite-dimensional feature space but can be evaluated in O(d) time.
What is the bias-variance tradeoff?
The expected mean squared error of any estimator decomposes as MSE = Bias^2 + Variance + irreducible noise. High bias (underfitting) means the model is systematically wrong regardless of training data — the model is too simple. High variance (overfitting) means predictions vary wildly across different training sets — the model is too sensitive to noise. As model complexity increases, bias decreases and variance increases. Regularization reduces variance at the cost of some bias. Ensemble methods like bagging reduce variance by averaging many models.
What is PCA and how is it derived mathematically?
PCA finds orthogonal directions of maximum variance in centered data. The first principal component maximizes w^T C w subject to ||w|| = 1, where C = X_c^T X_c / (n-1) is the sample covariance. By Lagrange multipliers, the solution satisfies Cw = lambda w, so w must be an eigenvector of C with the largest eigenvalue. Each subsequent PC is the eigenvector with the next largest eigenvalue. Equivalently via SVD: X_c = U Sigma V^T, and the principal components are the columns of V. PCA is unsupervised; for supervised dimensionality reduction, use LDA which maximizes between-class variance relative to within-class variance.
What is the vanishing gradient problem and how is it addressed?
During backpropagation, gradients involve a product of L Jacobians for a depth-L network. If activation derivatives are consistently less than 1 (e.g., sigmoid saturates near 0 or 1 with derivative approaching 0), the product decays exponentially, making early layers receive near-zero gradient and learn very slowly. Solutions include: ReLU activations (derivative is 1 for positive inputs); batch normalization (keeps pre-activation distributions stable); residual skip connections (gradient can bypass layers via identity path); and He initialization W ~ N(0, 2/n_in) which keeps forward activation variance constant.