Mathematical Optimization
From the critical points of calculus to the algorithms that train neural networks, optimization is the engine of modern mathematics, machine learning, and operations research. This guide covers the full spectrum — theory, algorithms, and applications.
1. Unconstrained Optimization
Unconstrained optimization seeks the minimum or maximum of a function f: R^n to R without any restrictions on the domain. The foundational tools are derivatives, critical points, and the second-order conditions that classify them.
1.1 Critical Points and First-Order Conditions
A point x* is a critical point of f if the gradient vanishes there: grad f(x*) = 0. In one dimension this reduces to f'(x*) = 0. Critical points are candidates for local minima, local maxima, or saddle points — the first-order condition alone cannot distinguish among them.
Definition: Critical Point
x* is a critical point of f if grad f(x*) = 0. For f: R to R, this means f'(x*) = 0. Every local extremum of a differentiable function is a critical point, but not every critical point is a local extremum.
1.2 Second Derivative Test
The second derivative (or its higher-dimensional analogue, the Hessian) classifies critical points. In one dimension: if f'(x*) = 0 and f''(x*) greater than 0, then x* is a local minimum; if f''(x*) less than 0, it is a local maximum; if f''(x*) = 0, the test is inconclusive.
Example
Let f(x) = x^3 - 3x. Then f'(x) = 3x^2 - 3 = 0 gives x = plus or minus 1.
- f''(x) = 6x, so f''(1) = 6 greater than 0: x = 1 is a local minimum.
- f''(-1) = -6 less than 0: x = -1 is a local maximum.
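The second derivative test above can be sketched numerically; `classify_critical_point` is an illustrative helper (not from the text) that estimates f'' with a central finite difference, so no symbolic library is needed.

```python
def f(x):
    return x**3 - 3*x

def classify_critical_point(g, x, h=1e-4, tol=1e-6):
    """Classify a point x assumed to satisfy g'(x) = 0 by the sign of a
    central finite-difference estimate of g''(x)."""
    second = (g(x + h) - 2*g(x) + g(x - h)) / h**2
    if second > tol:
        return "local minimum"
    if second < -tol:
        return "local maximum"
    return "inconclusive"

print(classify_critical_point(f, 1.0))   # local minimum (f''(1) = 6)
print(classify_critical_point(f, -1.0))  # local maximum (f''(-1) = -6)
```

For a cubic, the central difference reproduces f'' essentially exactly, so the classification matches the analytic result above.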
1.3 The Gradient and Hessian in Higher Dimensions
For f: R^n to R, the gradient grad f is the vector of partial derivatives. The Hessian H(f) is the n-by-n matrix of second partial derivatives, H_ij = d^2 f / (dx_i dx_j). The Hessian is symmetric when f has continuous second partials (Clairaut's theorem).
Theorem: Second-Order Optimality (Multivariate)
If grad f(x*) = 0 and H(f)(x*) is positive definite, then x* is a strict local minimum. If H(f)(x*) is negative definite, x* is a strict local maximum. If H(f)(x*) is indefinite, x* is a saddle point.
A matrix is positive definite if all its eigenvalues are strictly positive, which can also be verified via Sylvester's criterion (all leading principal minors are positive).
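Sylvester's criterion lends itself to a short check. The helpers below (`det` via Laplace expansion, `is_positive_definite` over leading principal minors) are illustrative and intended only for small matrices.

```python
def det(M):
    # Laplace expansion along the first row; fine for small matrices.
    n = len(M)
    if n == 1:
        return M[0][0]
    total = 0.0
    for j in range(n):
        minor = [row[:j] + row[j+1:] for row in M[1:]]
        total += ((-1) ** j) * M[0][j] * det(minor)
    return total

def is_positive_definite(M):
    """Sylvester's criterion: all leading principal minors strictly positive."""
    n = len(M)
    return all(det([row[:k] for row in M[:k]]) > 0 for k in range(1, n + 1))

H = [[2.0, 1.0],
     [1.0, 3.0]]   # minors: 2 > 0 and 2*3 - 1*1 = 5 > 0
print(is_positive_definite(H))                         # True
print(is_positive_definite([[1.0, 2.0], [2.0, 1.0]]))  # False: det = -3
```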
2. Constrained Optimization
Real problems rarely permit free choice of variables. Constrained optimization adds equality constraints g_i(x) = 0 and inequality constraints h_j(x) less than or equal to 0. The challenge is to find the best feasible point.
2.1 Lagrange Multipliers (Equality Constraints)
For the problem: minimize f(x) subject to g(x) = 0, introduce a scalar lambda (the Lagrange multiplier) and form the Lagrangian L(x, lambda) = f(x) + lambda * g(x). At a constrained optimum x*, the gradients of f and g are parallel: grad f(x*) = -lambda * grad g(x*), together with g(x*) = 0.
Theorem: Lagrange Multiplier Condition
If x* is a local minimum of f subject to g(x) = 0, and grad g(x*) is not zero, then there exists lambda* such that grad_x L(x*, lambda*) = 0 and g(x*) = 0. The multiplier lambda* has the economic interpretation: it is the marginal cost of tightening the constraint.
Example: Maximize on an Ellipse
Maximize f(x,y) = xy subject to g(x,y) = x^2/4 + y^2/9 - 1 = 0. The Lagrange conditions give y = lambda * x/2 and x = lambda * 2y/9. Eliminating lambda yields lambda^2 = 9, and combining with the constraint gives the candidates x = plus or minus sqrt(2), y = plus or minus 3/sqrt(2); the maximum value 3 is attained at the two candidates where x and y share the same sign.
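The candidate points can be verified numerically; this toy script (function names are mine) checks feasibility and evaluates the objective at each candidate.

```python
import math

def f(x, y):
    return x * y

def g(x, y):
    return x**2 / 4 + y**2 / 9 - 1   # feasible points satisfy g = 0

candidates = [(sx * math.sqrt(2), sy * 3 / math.sqrt(2))
              for sx in (1, -1) for sy in (1, -1)]

for x, y in candidates:
    print(f"x={x:+.4f} y={y:+.4f}  f={f(x, y):+.4f}  g={g(x, y):+.1e}")
# The maximum value 3 occurs at the two points where x and y share a sign.
```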
2.2 KKT Conditions (Inequality Constraints)
The Karush-Kuhn-Tucker (KKT) conditions generalize Lagrange multipliers to inequality constraints. For minimize f(x) subject to h_j(x) less than or equal to 0 and g_i(x) = 0, the KKT conditions at an optimum x* are:
- Stationarity: grad f(x*) + sum mu_j * grad h_j(x*) + sum lambda_i * grad g_i(x*) = 0
- Primal feasibility: h_j(x*) less than or equal to 0, g_i(x*) = 0
- Dual feasibility: mu_j greater than or equal to 0
- Complementary slackness: mu_j * h_j(x*) = 0 for all j
Complementary slackness encodes the insight that a multiplier is zero when the corresponding inequality is not active (slack), and nonzero only when the constraint is binding. KKT conditions are necessary for optimality under constraint qualification and sufficient for convex problems.
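A minimal sketch of checking the four conditions on a one-variable toy problem: minimize x^2 subject to 1 - x less than or equal to 0 (i.e. x greater than or equal to 1), whose optimum is x* = 1 with multiplier mu* = 2. The helper `check_kkt` is illustrative, not a general-purpose verifier.

```python
def check_kkt(x, mu, tol=1e-9):
    """KKT check for: minimize x^2 subject to h(x) = 1 - x <= 0."""
    grad_f = 2 * x        # gradient of the objective
    grad_h = -1.0         # gradient of h(x) = 1 - x
    h = 1 - x
    stationarity = abs(grad_f + mu * grad_h) < tol
    primal = h <= tol
    dual = mu >= -tol
    slackness = abs(mu * h) < tol
    return stationarity and primal and dual and slackness

print(check_kkt(1.0, 2.0))  # True: all four conditions hold
print(check_kkt(0.0, 0.0))  # False: x = 0 violates primal feasibility
```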
3. Linear Programming
Linear programming (LP) optimizes a linear objective function over a polyhedron defined by linear constraints. LP is one of the most practically important areas of optimization, with millions of variables solved routinely in logistics, finance, and scheduling.
3.1 Standard Form and Feasible Region
The standard form LP is: minimize c^T x subject to Ax = b, x greater than or equal to 0. Inequality constraints are converted to equalities by adding slack variables. For a constraint a^T x less than or equal to b, introduce s greater than or equal to 0 so that a^T x + s = b.
Definition: Feasible Region
The feasible region is the set of all x satisfying the constraints. For an LP, this is a convex polyhedron — the intersection of finitely many half-spaces (a polytope when bounded). The fundamental theorem of LP states that if an optimal solution exists and the polyhedron has at least one vertex (as in standard form), an optimum is attained at a vertex (extreme point).
3.2 The Simplex Method
Developed by George Dantzig in 1947, the simplex method exploits the vertex-optimality property. It maintains a basis — a set of m linearly independent columns of A defining a basic feasible solution — and pivots to adjacent bases that reduce the objective.
Simplex Algorithm Steps
1. Find an initial basic feasible solution (e.g., via a Phase I problem or the Big-M method).
2. Compute the reduced costs of the nonbasic variables.
3. If all reduced costs are non-negative (for minimization), the current BFS is optimal; stop.
4. Otherwise, select an entering variable with a negative reduced cost (e.g., the most negative reduced cost, or Bland's rule to prevent cycling).
5. Determine the leaving variable by the minimum ratio test, which preserves feasibility, and perform the pivot.
6. Update the basis and return to step 2.
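The steps above can be condensed into a toy dense-tableau simplex for maximization problems with non-negative right-hand sides. This is a pedagogical sketch under those assumptions (no degeneracy, cycling, or unboundedness handling), not production code.

```python
def simplex_max(c, A, b):
    """Tiny tableau simplex for: maximize c.x s.t. A x <= b, x >= 0, b >= 0."""
    m, n = len(A), len(c)
    # Tableau rows: constraints with slack columns appended, then objective row.
    T = [A[i] + [1.0 if j == i else 0.0 for j in range(m)] + [b[i]]
         for i in range(m)]
    T.append([-ci for ci in c] + [0.0] * m + [0.0])
    basis = list(range(n, n + m))          # slacks form the initial basis
    while True:
        # Entering variable: most negative objective-row coefficient.
        col = min(range(n + m), key=lambda j: T[-1][j])
        if T[-1][col] >= -1e-9:
            break                           # all reduced costs ok: optimal
        # Leaving variable: minimum ratio test over positive pivot entries.
        _, row = min((T[i][-1] / T[i][col], i)
                     for i in range(m) if T[i][col] > 1e-9)
        piv = T[row][col]
        T[row] = [v / piv for v in T[row]]
        for i in range(m + 1):              # eliminate col from other rows
            if i != row and abs(T[i][col]) > 1e-12:
                factor = T[i][col]
                T[i] = [vi - factor * vr for vi, vr in zip(T[i], T[row])]
        basis[row] = col
    x = [0.0] * n
    for i, bi in enumerate(basis):
        if bi < n:
            x[bi] = T[i][-1]
    return x, T[-1][-1]

# Maximize 3x + 2y s.t. x + y <= 4, x + 3y <= 6, x, y >= 0.
x, val = simplex_max([3.0, 2.0], [[1.0, 1.0], [1.0, 3.0]], [4.0, 6.0])
print(x, val)   # x = [4.0, 0.0], value 12.0
```

The optimum sits at the vertex (4, 0) of the feasible polygon, consistent with the fundamental theorem above.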
While the simplex method can cycle on degenerate problems (a pathology avoided by perturbation or Bland's rule), it is remarkably efficient in practice, typically requiring a number of pivots roughly proportional to the number of constraints m. Interior-point methods (Karmarkar, 1984) provide polynomial worst-case guarantees and are often preferred for very large problems.
3.3 Duality in Linear Programming
Every LP (primal) has a dual LP. For primal: minimize c^T x subject to Ax greater than or equal to b, x greater than or equal to 0, the dual is: maximize b^T y subject to A^T y less than or equal to c, y greater than or equal to 0.
Strong Duality Theorem
If the primal LP has a finite optimal value, the dual LP also has a finite optimal value, and the two optimal values are equal: min c^T x = max b^T y. This zero duality gap is a fundamental result absent in general nonlinear programming.
Duality has deep economic meaning: dual variables are shadow prices measuring the marginal value of relaxing each constraint. Complementary slackness (if a primal constraint is slack, its dual variable is zero, and symmetrically for dual constraints and primal variables) connects the LP dual to the KKT conditions of the general theory.
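Strong duality can be illustrated on a tiny primal/dual pair whose optima are found by inspection; the constants below are an invented toy instance.

```python
# Primal: minimize 2x + 3y  subject to  x + y >= 4, x >= 0, y >= 0.
# The objective grows in both variables, so the optimum lies on x + y = 4,
# a segment whose vertices are (4, 0) and (0, 4).
primal_vertices = [(4.0, 0.0), (0.0, 4.0)]
primal_opt = min(2 * x + 3 * y for x, y in primal_vertices)

# Dual (per the pairing above): maximize 4u  subject to  u <= 2, u <= 3, u >= 0.
dual_opt = max(4 * u for u in [0.0, 2.0, 3.0] if u <= 2 and u <= 3)

print(primal_opt, dual_opt)   # both 8.0: zero duality gap
```

The shared optimal value 8 is the zero duality gap promised by the strong duality theorem; the optimal dual variable u = 2 is the shadow price of the constraint x + y greater than or equal to 4.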
4. Convex Optimization
Convex optimization is the most tractable class of nonlinear optimization. The defining property — every local minimum is global — makes convex problems fundamentally different from general optimization, where local traps are ubiquitous.
4.1 Convex Sets
A set C is convex if for any two points x, y in C and any t in (0,1), the point t*x + (1-t)*y also lies in C — the line segment between x and y stays inside the set. Key convex sets include half-spaces, ellipsoids, polyhedra, and the positive semidefinite cone.
Convex Set Preservation
The intersection of any collection of convex sets is convex. The image of a convex set under a linear map is convex. These closure properties make it easy to verify that constraint sets in many practical problems are convex.
4.2 Convex Functions
A function f: C to R on a convex set C is convex if for all x, y in C and t in (0,1): f(t*x + (1-t)*y) less than or equal to t*f(x) + (1-t)*f(y). Geometrically, the chord between any two points on the graph lies above (or on) the graph.
Jensen's Inequality
For a convex function f and a random variable X: f(E[X]) less than or equal to E[f(X)]. Equality holds iff X is almost surely constant or f is affine. Jensen's inequality is a pillar of probability theory, information theory, and statistics, underpinning results from the AM-GM inequality to the EM algorithm.
A twice-differentiable function is convex iff its Hessian is positive semidefinite everywhere on its domain. Common convex functions: x^2, exp(x), -log(x) for x greater than 0, norms, and maximum of affine functions.
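The defining inequality can be probed numerically. `is_convex_on_samples` (an illustrative helper) samples random chords and reports a violation if one is found; passing is evidence of convexity on the interval, not a proof.

```python
import math
import random

def is_convex_on_samples(f, lo, hi, trials=1000, seed=0):
    """Probe f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y) at random points."""
    rng = random.Random(seed)
    for _ in range(trials):
        x, y = rng.uniform(lo, hi), rng.uniform(lo, hi)
        t = rng.random()
        if f(t*x + (1-t)*y) > t*f(x) + (1-t)*f(y) + 1e-9:
            return False   # found a chord below the graph
    return True

print(is_convex_on_samples(lambda x: x**2, -5, 5))   # True
print(is_convex_on_samples(math.exp, -3, 3))         # True
print(is_convex_on_samples(math.sin, 0, math.pi))    # False: concave there
```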
4.3 Convex Programming and Global Optimality
Theorem: Local = Global for Convex Problems
If f is convex and the feasible set is convex, then every local minimum is a global minimum. Furthermore, the set of global minimizers is convex. This theorem justifies the use of local algorithms (gradient descent, interior-point) for guaranteed global solutions in convex settings.
Important subclasses include quadratic programming (QP), second-order cone programming (SOCP), and semidefinite programming (SDP), each solvable by specialized algorithms with polynomial complexity. Tools such as CVXPY allow practitioners to specify and solve convex programs with minimal programming effort.
5. Gradient Descent and Variants
Gradient descent is the workhorse algorithm of modern machine learning. Its simplicity, scalability, and adaptability to large-scale problems have made it indispensable in training deep neural networks with billions of parameters.
5.1 Gradient Descent Basics
Starting from an initial point x_0, gradient descent iterates: x(k+1) = x(k) - alpha * grad f(x(k)), where alpha greater than 0 is the learning rate (step size). The update moves in the direction of steepest descent, which is the negative gradient.
Learning Rate Selection
- Too large: oscillation or divergence.
- Too small: slow convergence, impractical for large models.
- Optimal fixed rate for L-smooth convex f: alpha = 1/L where L is the Lipschitz constant of grad f.
- Adaptive methods (Adam, RMSProp) estimate a per-parameter effective learning rate automatically.
5.2 Convergence Analysis
For a convex, L-smooth function, gradient descent with alpha = 1/L satisfies: f(x_k) - f(x*) less than or equal to L * ||x_0 - x*||^2 / (2k). This is an O(1/k) convergence rate. For mu-strongly convex functions, the rate improves to linear (geometric) convergence: f(x_k) - f(x*) less than or equal to (1 - mu/L)^k times the initial suboptimality.
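A short experiment on an invented toy quadratic checks the stated bound for gradient descent with alpha = 1/L.

```python
# f(x, y) = x^2 + 10*y^2 is L-smooth with L = 20 (largest Hessian eigenvalue).
def f(p):
    x, y = p
    return x**2 + 10 * y**2

def grad(p):
    x, y = p
    return (2 * x, 20 * y)

L = 20.0
alpha = 1.0 / L
p = (1.0, 1.0)                        # x_0
dist0_sq = p[0]**2 + p[1]**2          # ||x_0 - x*||^2, since x* = (0, 0)

for k in range(1, 101):
    gx, gy = grad(p)
    p = (p[0] - alpha * gx, p[1] - alpha * gy)
    bound = L * dist0_sq / (2 * k)    # the O(1/k) guarantee
    assert f(p) <= bound + 1e-12, (k, f(p), bound)

print("bound holds for k = 1..100; f(x_100) =", f(p))
```

On this example the actual progress is far faster than the worst-case O(1/k) bound, which is typical: the bound is tight only for carefully constructed hard instances.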
Nesterov Acceleration
Nesterov's accelerated gradient method adds a momentum term and achieves the optimal O(1/k^2) rate for smooth convex functions, provably better than vanilla gradient descent at the same cost of one gradient evaluation per iteration.
5.3 Stochastic Gradient Descent (SGD)
In machine learning, the objective is often a sum: f(x) = (1/n) * sum f_i(x), where each f_i is the loss on a single training example. Computing the full gradient costs O(n) per step. Stochastic gradient descent replaces grad f with grad f_i for a randomly chosen i, reducing per-step cost to O(1) at the price of noisy updates.
Mini-batch SGD averages gradients over a small batch of examples (typically 32-512), balancing gradient noise against computational cost and enabling GPU parallelism. Despite the noise, SGD converges to a neighborhood of the optimum and often generalizes better than exact methods due to implicit regularization.
Adaptive Optimizers
- Momentum SGD: accumulates a velocity vector to dampen oscillations.
- RMSProp: divides the learning rate by a running average of squared gradients.
- Adam: combines momentum and RMSProp with bias-correction terms.
- AdaGrad: accumulates all past squared gradients, effective for sparse data.
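The momentum update in the first bullet can be sketched on a toy one-parameter regression; the data, hyperparameters, and variable names here are illustrative choices, not prescriptions.

```python
import random

# Mini-batch SGD with momentum fitting w in the model y = w * x.
# Synthetic data with true slope 3 plus small Gaussian noise.
rng = random.Random(42)
xs = [rng.uniform(-1, 1) for _ in range(200)]
data = [(x, 3.0 * x + rng.gauss(0, 0.1)) for x in xs]

w, velocity = 0.0, 0.0
lr, beta, batch = 0.05, 0.9, 32

for step in range(500):
    sample = rng.sample(data, batch)
    # Gradient of the mini-batch mean squared error with respect to w.
    g = sum(2 * (w * x - y) * x for x, y in sample) / batch
    velocity = beta * velocity + g   # momentum: exponentially weighted gradients
    w -= lr * velocity

print(w)   # close to the true slope 3
```

Despite the mini-batch noise, the iterates settle into a small neighborhood of the least-squares solution, illustrating the "converges to a neighborhood" behavior described above.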
6. Newton's Method for Optimization
Newton's method uses second-order (curvature) information to take more precise steps than first-order gradient methods. Near a minimum, it achieves quadratic convergence — the number of correct digits roughly doubles each iteration.
6.1 The Newton Step
The Newton update is: x(k+1) = x(k) - H(f)(x(k))^(-1) * grad f(x(k)), where H(f)(x(k)) is the Hessian at x(k). This arises by minimizing the second-order Taylor approximation of f around x(k). When the Hessian is positive definite, the Newton step is a descent direction.
Theorem: Quadratic Convergence
Near a non-degenerate local minimum x* where the Hessian is positive definite and Lipschitz continuous, Newton's method converges quadratically: ||x(k+1) - x*|| less than or equal to C * ||x(k) - x*||^2 for a constant C. This means that once the iterates are close enough, convergence is extremely rapid.
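The doubling of correct digits is easy to observe on a toy one-dimensional problem. For f(x) = x - ln(x), with f'(x) = 1 - 1/x and f''(x) = 1/x^2, the Newton step simplifies to x_next = 2x - x^2, and since 1 - x_next = (1 - x)^2 the error squares exactly at each iteration.

```python
def newton_step(x):
    # x - f'(x) / f''(x) for f(x) = x - ln(x); simplifies to 2x - x^2.
    return x - (1 - 1 / x) / (1 / x**2)

x = 0.5                     # minimizer is x* = 1
for k in range(5):
    x = newton_step(x)
    print(k + 1, abs(x - 1))   # errors: 0.25, 0.0625, 0.00390625, ...
```

After five steps the error is (0.5)^32, around 2e-10, matching the quadratic-convergence bound with C = 1.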
6.2 Quasi-Newton Methods
Computing and inverting the full Hessian costs O(n^3) per step, prohibitive for large n. Quasi-Newton methods approximate H^-1 using gradient differences. The BFGS algorithm maintains a positive-definite approximation updated after each step, achieving superlinear convergence with O(n^2) storage. L-BFGS reduces storage to O(n) by keeping only the m most recent update pairs, making it the standard optimizer for large-scale smooth unconstrained problems.
6.3 Newton's Method vs. Gradient Descent
| Property | Gradient Descent | Newton's Method |
|---|---|---|
| Order | First-order | Second-order |
| Convergence rate | Sublinear O(1/k) (convex); linear if strongly convex | Quadratic (local) |
| Per-step cost | O(n) | O(n^3) (exact), O(n^2) (BFGS) |
| Scale | Billions of parameters | Up to tens of thousands |
| Robustness | Global with proper step size | Local (needs good initialization) |
7. Nonlinear and Integer Programming
7.1 Nonlinear Programming (NLP)
Nonlinear programming is the general case: minimize a nonlinear f subject to nonlinear constraints. Unlike convex programming, NLP may have multiple local minima, and there is no polynomial-time algorithm guaranteed to find the global optimum. Practical approaches include:
- Sequential quadratic programming (SQP): solve a sequence of QP subproblems.
- Interior-point (barrier) methods: add a log-barrier to enforce constraints, solve a sequence of unconstrained problems.
- Penalty methods: add a penalty for constraint violation to the objective.
- Augmented Lagrangian methods: combine penalty and multiplier updates.
- Global methods: branch and bound can certify global optimality; genetic algorithms and simulated annealing are stochastic heuristics that search globally but provide no optimality certificate.
7.2 Integer Programming
Integer programming (IP) requires some or all variables to take integer values. Mixed-integer programming (MIP) combines continuous and integer variables. IP models discrete decisions: which routes to assign, which facilities to open, which jobs to schedule.
Branch and Bound
The standard IP algorithm relaxes integrality (solving the LP relaxation), then branches on a fractional variable to create two subproblems, and uses LP bounds to prune branches that cannot improve on the best integer solution found. Modern MIP solvers (Gurobi, CPLEX) add cutting planes (Gomory cuts, lift-and-project) that tighten the relaxation before branching.
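Branch and bound can be sketched compactly on a 0/1 knapsack instance, using the greedy fractional solution of the LP relaxation as the pruning bound. This is illustrative code with an invented instance, not a solver.

```python
def lp_relaxation_bound(items, capacity):
    """Optimal value of the fractional relaxation: greedy by value density."""
    total = 0.0
    for v, w in sorted(items, key=lambda it: it[0] / it[1], reverse=True):
        if w <= capacity:
            total, capacity = total + v, capacity - w
        else:
            return total + v * capacity / w   # take a fraction of this item
    return total

def branch_and_bound(items, capacity):
    best = 0.0
    def recurse(i, value, remaining):
        nonlocal best
        if remaining < 0:
            return                            # infeasible branch
        if i == len(items):
            best = max(best, value)
            return
        # Prune: LP bound on the remaining subproblem cannot beat incumbent.
        if value + lp_relaxation_bound(items[i:], remaining) <= best:
            return
        v, w = items[i]
        recurse(i + 1, value + v, remaining - w)   # branch: take item i
        recurse(i + 1, value, remaining)           # branch: skip item i
    recurse(0, 0.0, capacity)
    return best

items = [(60, 10), (100, 20), (120, 30)]   # (value, weight) pairs
print(branch_and_bound(items, 50))          # 220.0: take the last two items
```

The fractional relaxation always upper-bounds the integer optimum of a subproblem, so pruning subtrees whose bound cannot exceed the incumbent never discards the optimal value.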
Integer programming is NP-hard in general, but many practical instances are solved quickly by commercial solvers due to tight LP relaxations and effective branching heuristics. The TSP (traveling salesman problem), facility location, and portfolio optimization with cardinality constraints are classic IP applications.
8. Multi-Objective Optimization
Many real problems involve competing objectives — minimize cost while maximizing quality, minimize risk while maximizing return. Multi-objective optimization acknowledges that there is no single best solution but rather a frontier of trade-offs.
8.1 Pareto Dominance and the Pareto Frontier
Solution x dominates solution y if x is at least as good as y in every objective and strictly better in at least one. The Pareto frontier (Pareto front) is the set of non-dominated solutions — those for which no objective can be improved without degrading another.
Definition: Pareto Optimality
A feasible point x* is Pareto optimal if there is no feasible x such that f_i(x) less than or equal to f_i(x*) for all i and f_j(x) less than f_j(x*) for some j. The set of all Pareto optimal points is the Pareto frontier and represents the complete trade-off surface.
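Pareto dominance translates directly into code. This toy filter (minimization convention, names mine) extracts the non-dominated points from a finite set of objective vectors.

```python
def dominates(a, b):
    """a dominates b (minimizing): no worse everywhere, strictly better once."""
    return (all(ai <= bi for ai, bi in zip(a, b))
            and any(ai < bi for ai, bi in zip(a, b)))

def pareto_front(points):
    return [p for p in points
            if not any(dominates(q, p) for q in points)]

# Toy bi-objective points (cost, risk), both to be minimized.
pts = [(1, 9), (2, 7), (3, 8), (4, 4), (6, 3), (7, 5), (9, 1)]
print(pareto_front(pts))   # (3, 8) and (7, 5) are dominated and drop out
```

This quadratic-time filter is fine for small sets; dedicated non-dominated sorting (as in NSGA-II) scales better for large populations.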
8.2 Scalarization Methods
The simplest approach is weighted sum: minimize sum w_i * f_i(x) with w_i greater than or equal to 0, sum w_i = 1. Varying the weights traces out the Pareto frontier for convex problems. The epsilon-constraint method minimizes one objective while bounding the others, generating non-convex parts of the frontier inaccessible to weighted sum.
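Weighted-sum scalarization can be traced on an invented one-dimensional pair of convex objectives, where the scalarized minimizer has a closed form.

```python
# Objectives: f1(x) = x^2 and f2(x) = (x - 2)^2, both convex.
# Setting d/dx [w*x^2 + (1-w)*(x-2)^2] = 0 gives x = 2*(1 - w).
def scalarized_minimizer(w):
    return 2 * (1 - w)

for w in [0.0, 0.25, 0.5, 0.75, 1.0]:
    x = scalarized_minimizer(w)
    print(f"w={w:.2f}: x={x:.2f}, f1={x**2:.2f}, f2={(x - 2)**2:.2f}")
```

Sweeping w from 0 to 1 moves the solution from the minimizer of f2 (x = 2) to the minimizer of f1 (x = 0), tracing the trade-off curve between the two objectives.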
8.3 Evolutionary and Metaheuristic Approaches
For complex nonlinear multi-objective problems, evolutionary algorithms such as NSGA-II (Non-dominated Sorting Genetic Algorithm II) maintain a population of solutions and evolve them toward the Pareto frontier using selection based on Pareto rank and crowding distance. These methods require no gradient information and naturally produce a diverse set of Pareto-optimal solutions in a single run.
9. Real-World Applications
9.1 Machine Learning
Optimization is the core of machine learning. Training a neural network means minimizing a loss function (cross-entropy, MSE) over hundreds of millions of parameters using SGD variants. Support vector machines are solved as convex QPs. Lasso and ridge regression are convex optimization problems with closed-form or efficiently computed solutions. Hyperparameter tuning uses Bayesian optimization — a form of sequential, acquisition-function-guided optimization over a probabilistic surrogate model.
9.2 Operations Research
Airlines solve large LP/IP problems daily: crew scheduling (set partitioning), fleet assignment (network flow IP), revenue management (LP). Logistics companies solve vehicle routing problems (VRP) as integer programs. Manufacturing uses LP for production planning and mixed-integer programs for lot sizing. Supply chain optimization involves stochastic programming to hedge against demand uncertainty.
9.3 Economics and Finance
Portfolio optimization (Markowitz, 1952) is a QP: minimize portfolio variance subject to a return target and budget constraint. The efficient frontier in mean-variance space is directly analogous to the Pareto frontier. General equilibrium models solve large systems of optimization problems simultaneously. In incomplete markets, no-arbitrage bounds on option prices can be computed by linear programming.
9.4 Engineering Design
Structural topology optimization uses PDE-constrained optimization to find material distributions maximizing stiffness for a given weight. Aerodynamic shape optimization minimizes drag subject to lift and stress constraints. Control theory formulates optimal control as an infinite-dimensional optimization problem solved via the Pontryagin maximum principle or dynamic programming. Circuit design uses convex optimization for filter synthesis and power minimization.
Frequently Asked Questions
What is mathematical optimization?
Mathematical optimization is the process of finding the maximum or minimum value of an objective function, subject to constraints. It is the backbone of machine learning, operations research, economics, and engineering design.
What is the difference between unconstrained and constrained optimization?
Unconstrained optimization finds extrema of a function without restrictions on the variables. Constrained optimization requires that the solution satisfy equality or inequality constraints, and uses tools such as Lagrange multipliers and KKT conditions.
What are Lagrange multipliers?
Lagrange multipliers are scalar variables introduced to convert a constrained optimization problem into an unconstrained one. For minimize f(x) subject to g(x) = 0, the Lagrangian is L(x, lambda) = f(x) + lambda * g(x), and the solution satisfies grad f = -lambda * grad g together with the constraint. The multiplier has a natural economic interpretation as a shadow price.
What is the simplex method?
The simplex method is an algorithm for solving linear programming problems. It starts at a vertex of the feasible region and iteratively moves along edges to adjacent vertices that improve the objective function, stopping when no improving direction exists. Though exponential in worst case, it is highly efficient in practice.
What is convex optimization and why does it matter?
Convex optimization is optimization over a convex set with a convex objective function. It matters because any local minimum is also a global minimum, and efficient algorithms such as interior-point methods solve large-scale convex problems to certified global optimality. Most tractable optimization problems in practice are convex or are reformulated as convex.
How does gradient descent work?
Gradient descent iteratively updates the current solution x by moving in the direction of the negative gradient: x(k+1) = x(k) - alpha * grad f(x(k)), where alpha is the learning rate. It converges to a local minimum for smooth functions and to a global minimum for convex functions. Stochastic and mini-batch variants make it scalable to massive datasets.
What is the Pareto frontier in multi-objective optimization?
The Pareto frontier is the set of solutions in a multi-objective optimization problem such that no objective can be improved without worsening at least one other. It represents the complete trade-off surface between competing goals and is the fundamental output of multi-objective optimization algorithms.
Key Terms Quick Reference
Objective Function
The function to be minimized or maximized.
Feasible Region
The set of all points satisfying the constraints.
Critical Point
A point where the gradient of f is zero.
Hessian Matrix
Matrix of second partial derivatives; determines curvature.
Lagrange Multiplier
Scalar lambda encoding the constraint's shadow price.
KKT Conditions
First-order optimality conditions for constrained NLP.
Slack Variable
Variable converting an inequality constraint to equality.
Convex Function
Function where the chord between any two points lies on or above the graph.
Jensen's Inequality
f(E[X]) is less than or equal to E[f(X)] for convex f.
Gradient Descent
Iterative algorithm moving in the steepest descent direction.
Learning Rate
Step size alpha controlling gradient descent update magnitude.
Pareto Optimality
No objective can improve without degrading another.
Dual Problem
Companion LP obtained by Lagrangian relaxation of the primal.
Strong Duality
Primal and dual LP optimal values are equal.
Branch and Bound
IP algorithm using LP relaxation bounds to prune the search tree.